Commit 19a27d11 authored by Ingo Scholtes's avatar Ingo Scholtes

revising notebooks for week 01 and 02

parent 25526650
@@ -2,25 +2,48 @@
"cells": [
{
"cell_type": "markdown",
"source": [
"# 01: The `python` ecosystem for network science\r\n",
"\r\n",
"In this notebook, we explain how you can set up the data science environment that we will use in the practice lectures. The environment consists of a `python3` interpreter, the network analysis package `pathpy`, some additional packages for data analysis, and visualisation, the versioning system `git`, a `jupyter` notebook server, as well as - optionally - the development environment Visual Studio Code."
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
"source": [
"# 01: The `python` ecosystem for network science\n",
"\n",
"In this notebook, we explain how you can set up the data science environment that we will use in the practice lectures. The environment consists of a `python3` interpreter, the network analysis package `pathpy`, some additional packages for data analysis, and visualisation, the versioning system `git`, a `jupyter` notebook server, as well as - optionally - the development environment Visual Studio Code."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"42\n"
]
}
],
"source": [
"x = 42\n",
"print(x)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Setting up `python` and `jupyter`\n",
"\n",
"To run the practice lecture notebooks and work on the exercise sheets, you need a `python 3.7` environment running on an operating system of your choice. For Windows, MacOS, and Linux users we recommend to install the latest [Anaconda distribution](https://www.anaconda.com/download/), an OpenSource `python` distribution pre-configured for data science and machine learning tasks. \n",
"To run the practice lecture notebooks and work on the exercise sheets, you need a `python >= 3.7` environment running on an operating system of your choice. For Windows, MacOS, and Linux users we recommend to install the latest [Anaconda distribution](https://www.anaconda.com/download/), an OpenSource `python` distribution pre-configured for data science and machine learning tasks. \n",
"\n",
"Just download and run the installer and you should have almost everything you need for this course. Beware of alternative methods to install a barebone python distribution, as a careless installation may conflict with the python version already present on your system. We have had students that managed to wreck their Mac OS X or Linux operating system by accidentally removing the standard python runtime!\n",
"Just download and run the installer and you should have almost everything you need for this course. Beware of alternative methods to install a barebone python distribution, as a careless installation may conflict with the python version already present on your system. We have had students that managed to wreck their Mac OS by accidentally removing the standard python runtime!\n",
"\n",
"If you prefer starting from a barebone `python 3.x` installation, you will also need to manually install additional packages via the python package manager `pip`. To see a list of python packages that are already installed, you can open a terminal and run \n",
"\n",
@@ -47,15 +70,11 @@
"```\n",
"> python3.x -m pip install PACKAGENAME\n",
"``` "
],
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up `pathpy`\n",
"\n",
@@ -78,11 +97,15 @@
"`pathpy` requires `python 3.x`. If you have both ` python 2.x` and `python 3.x` installed, you can explicitly install a package for `python 3` by using the command `pip3` instead of `pip`. Note that `python 2` is deprecated since April 2020, so you should always use `python 3` anyway.\n",
"\n",
"If by any chance you had previously installed an offifical release version of pathpy via pip (e.g. for our data science course), you will need to manually uninstall it. You can check this via the command `pip list`. Uninstalling prior versions is necessary because all `pathpy` versions use the same namespace `pathpy` and for reasons of backwards compatibility we still provide older major versions in case someone needs them. If for some reason you want to quickly switch between different major versions, we recommend to use [virtual environments in Anaconda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)."
],
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Using `jupyter` notebooks\n",
"\n",
@@ -100,111 +123,144 @@
"As a first step, you can try this in the directory that contains this notebook (i.e. the corresponding `.ipynb` notebook file). A browser will start and you should see the notebook file in the file browser panel on the left of your screen. Click it to open an interactive document. Similarly, you can create a new notebook. Doubleclicking an input cell in this notebook will allow you to edit the underlying text or code. Pressing Shift+Enter will execute the code and either display the formatted textor the output of the underlying code:\n",
"\n",
"Let us try this in the following input cell, that contains `python` code generating a simple textual output:"
],
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
}
]
},
{
"cell_type": "code",
"execution_count": 1,
"source": [
"x = 2 * 21\n",
"print(x)"
],
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"output_type": "stream",
"text": [
"42\n"
]
}
],
"metadata": {}
"source": [
"x = 2 * 21\n",
"print(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can create a new input cell below the current cursor either by clicking the `+` button in the top left menu or by pressing `b` on the keyboard. If you are editing a cell, you can press `Esc` to enter the command mode, in which you can add, manipulate or delete a cell. To delete a cell press `D` twice in command mode. To change the cell type from `python` to `markdown` press `m` in command mode. Press `y` to change it back to `python` code. Let us try this with the following markdown cell, that contains a LaTeX formula as well as a chunk of HTML code:"
],
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: We can also use LaTeX formulas: $\\int_0^\\pi \\sin(t) dt$"
],
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All `python` code will be executed by the underlying `python` kernel, whose current status is displayed in the top right circle of the notebook window. An unfilled circle (i.e. white center) indicates that the `kernel` is currently idle. If the circle is filled with black, the `kernel` is busy computing. You can see this when you execute the following cell (press Shift+Enter):"
],
"metadata": {}
"All `python` code will be executed by the underlying `python` kernel, whose current status is displayed in the status bar of the notebook window. You can see this when you execute the following cell (press Shift+Enter):"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"50000000\n"
]
}
],
"source": [
"s = 0\n",
"for i in range(50000000):\n",
" s += 1\n",
"print(s)"
],
"outputs": [],
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, it is important to note that the `python` kernel is simply a single interpreter process that sequentially runs the code in all cells that you execute. This implies that the order of your execution determines the current state of the kernel, i.e. which variables exist and what the values of those variables are. In particular, state is maintained across multiple cell, as you can see in the following example:"
],
"metadata": {}
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"x = 42"
],
"outputs": [],
"metadata": {}
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"42\n"
]
}
],
"source": [
"print(x)"
],
"outputs": [],
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you happened to execute those two cells in reverse order, you would generate an error. This seems trivial at first, but for complex notebooks where you execute cells back and forth it can become quite difficult to understand what is the current state. You can always wipe the current state by killing the current kernel, starting a new interpreter process. You can do this either by selecting \"Restart kernel\" in the Kernel menu above, or you enter command mode (press ESC) and hit `0` twice. Try this and then try to execute the following cell, which should return an error as a variable with the name `x` has not been defined in the new kernel."
],
"metadata": {}
"If you happened to execute those two cells in reverse order, you would generate an error. This seems trivial at first, but for complex notebooks where you execute cells back and forth it can become quite difficult to understand what is the current state. You can always wipe the current state by killing the current kernel, starting a new interpreter process. You can do this by selecting \"Restart kernel\" in the Kernel menu above. Try it and then execute the following cell, which returns an error as a variable with the name `x` has not been defined in the new kernel."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'x' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-2-fc17d851ef81>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'x' is not defined"
]
}
],
"source": [
"print(x)"
],
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"metadata": {}
"source": []
}
],
"metadata": {
"celltoolbar": "Slideshow",
"interpreter": {
"hash": "52aa0e2b95504bbc17c43da6f291deef2c892ed740ca6dd5f795071c64d0eb86"
},
"kernelspec": {
"name": "python374jvsc74a57bd0179f2c9954461ddf657daf1ee3f9df7374d575c8403facec5648a064395b52ac",
"display_name": "Python 3.7.4 64-bit ('anaconda3': conda)"
"display_name": "Python 3.8.5 64-bit ('anaconda3': conda)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -216,9 +272,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4-final"
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
\ No newline at end of file
}
%% Cell type:markdown id: tags:
# 01: The `python` ecosystem for network science
In this notebook, we explain how you can set up the data science environment that we will use in the practice lectures. The environment consists of a `python3` interpreter, the network analysis package `pathpy`, some additional packages for data analysis and visualisation, the versioning system `git`, a `jupyter` notebook server, as well as - optionally - the development environment Visual Studio Code.
%% Cell type:code id: tags:
```
x = 42
print(x)
```
%%%% Output: stream
42
%% Cell type:markdown id: tags:
## Setting up `python` and `jupyter`
To run the practice lecture notebooks and work on the exercise sheets, you need a `python 3.7` environment running on an operating system of your choice. For Windows, MacOS, and Linux users we recommend to install the latest [Anaconda distribution](https://www.anaconda.com/download/), an OpenSource `python` distribution pre-configured for data science and machine learning tasks.
To run the practice lecture notebooks and work on the exercise sheets, you need a `python >= 3.7` environment running on an operating system of your choice. For Windows, MacOS, and Linux users we recommend to install the latest [Anaconda distribution](https://www.anaconda.com/download/), an OpenSource `python` distribution pre-configured for data science and machine learning tasks.
Just download and run the installer and you should have almost everything you need for this course. Beware of alternative methods to install a barebone python distribution, as a careless installation may conflict with the python version already present on your system. We have had students that managed to wreck their Mac OS X or Linux operating system by accidentally removing the standard python runtime!
Just download and run the installer and you should have almost everything you need for this course. Beware of alternative methods to install a barebone python distribution, as a careless installation may conflict with the python version already present on your system. We have had students that managed to wreck their Mac OS by accidentally removing the standard python runtime!
If you prefer starting from a barebone `python 3.x` installation, you will also need to manually install additional packages via the python package manager `pip`. To see a list of python packages that are already installed, you can open a terminal and run
```
> pip list
```
If you installed Anaconda on a Windows system you should use the `Anaconda prompt` terminal that has been installed by Anaconda. This will make sure that all environment variables are correctly set. Moreover, to install packages, it is best to open this command prompt as an administrator (or use `su` on a Unix-based system). To complete the practice lectures and group exercises, we will need the following packages:
`jupyter` - provides an environment for interactive data science projects in your browser. We will extensively use so-called `jupyter notebooks`, which are interactive computable documents that you can also use to compile reports.
`pathpy` - a network analysis and visualisation package that we will use throughout this course (installation instructions follow in the next section).
`scipy` - provides implementations of common scientific and statistical computing techniques for python.
`numpy` - provides support for multi-dimensional arrays and matrices as well as high-level mathematical functions. This project originated as a smaller core part of `scipy`.
`matplotlib` - provides advanced plotting functions based on the data types introduced in `numpy`. Visualisations can be directly integrated into `jupyter` notebooks.
`pandas` - popular package for the management, analysis, and manipulation of multi-dimensional **pan**el **da**ta (thus the name). Provides convenient interfaces for the import and export of data from files or databases.
To install the packages above (except for `pathpy`), just run the following command in the terminal for each package:
```
> pip install PACKAGENAME
```
If you see no error messages, you should be all set to continue with the next steps. As mentioned above, `pip` may be associated with a different python version on your system. To have full control over which python version a package is installed for, you can use
```
> python3.x -m pip install PACKAGENAME
```
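For example, all of the packages listed above (except `pathpy`, which is installed in the next section) can also be installed with a single command; this is just a convenience sketch that assumes `pip` points to your `python 3` installation:
```
> pip install jupyter scipy numpy matplotlib pandas
```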
%% Cell type:markdown id: tags:
## Setting up `pathpy`
In this course we will use `pathpy`, a network analysis and visualisation package that is currently being developed at my chair.
Compared to many other packages, `pathpy` has a couple of advantages. First, it is easy to install as it should have no dependencies not already included in a default `anaconda` installation. Second, `pathpy` has a user-friendly API making it easy to handle directed and undirected networks, networks with single and multiple edges, multi-layer networks or temporal networks. It also provides interactive HTML-based visualisations that can be directly displayed inside `jupyter` notebooks, which makes it particularly suitable for educational settings. Third, it supports the analysis and visualisation of time series data on networked systems, such as time-stamped edges or data on paths in networks.
Since `pathpy` is not included in the default Anaconda installation, we first need to install it. In previous iterations of the course, we used the stable version `pathpy 2.0`. Right now, we are in the process of finishing a heavily revised version 3.0, which comes with many advantages. It has a cleaner API, is more efficient, and provides advanced plotting functions. To benefit from those advantages, we use the development version of `pathpy3` from gitHub. The best way to install it is to (1) clone the git repository to a local directory, and (2) install an editable version of the `pathpy` module from this cloned repository. This approach will allow you to execute `git pull` from the command line to always update to the latest development version.
To install pathpy 3 open a command line as administrator and run:
```
> git clone https://github.com/pathpy/pathpy pathpy3
> cd pathpy3
> pip install -e .
```
This will create a new directory `pathpy3` on your machine. Changing to this directory and running `pip install -e .` will *install `pathpy` as an editable python module*. Rather than copying the code to a separate directory in your python module path, this creates a symbolic link to the current directory. This has the advantage that you can update your local installation of `pathpy` simply by entering the directory and running `git pull`. After this the new version will be immediately available in all `python` processes launched after the update. This allows us to update your local `pathpy` installations by means of a simple `git push`, without the need to uninstall and install again. Note that after updating `pathpy` you must always restart the python process (or jupyter kernel) before changes become effective!
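To quickly check that the editable install worked, you can start a fresh `python` interpreter or notebook cell and build a toy network. The snippet below is only a minimal sketch; the exact `pathpy 3` API may differ slightly between development versions, so treat the class and method names as assumptions to verify against the current documentation:
```
# minimal smoke test of the pathpy installation
# (class/method names assume the pathpy 3 development API)
import pathpy as pp

n = pp.Network(directed=False)  # create an empty undirected network
n.add_edge('a', 'b')            # add two edges ...
n.add_edge('b', 'c')
print(n)                        # ... and print a summary of the network
```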
`pathpy` requires `python 3.x`. If you have both `python 2.x` and `python 3.x` installed, you can explicitly install a package for `python 3` by using the command `pip3` instead of `pip`. Note that `python 2` has been deprecated since April 2020, so you should always use `python 3` anyway.
If by any chance you had previously installed an official release version of pathpy via pip (e.g. for our data science course), you will need to manually uninstall it. You can check this via the command `pip list`. Uninstalling prior versions is necessary because all `pathpy` versions use the same namespace `pathpy` and for reasons of backwards compatibility we still provide older major versions in case someone needs them. If for some reason you want to quickly switch between different major versions, we recommend using [virtual environments in Anaconda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
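A minimal command-line sketch of this clean-up, with an optional dedicated conda environment (the environment name `netsci` is just an illustrative placeholder):
```
> pip list                           # check whether an older pathpy release is installed
> pip uninstall pathpy               # if so, remove it before installing the development version
> conda create -n netsci python=3.8  # optional: create a separate environment ...
> conda activate netsci              # ... and activate it
```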
%% Cell type:markdown id: tags:
## Using `jupyter` notebooks
Now that all the packages have been installed, you should be able to start a `jupyter` server, a process that can execute chunks of `python` code in a so-called `kernel` and return the result. The `jupyter` server runs local `python kernels`, which are in fact local server processes that accept requests containing `python` code from clients, execute the code, and return the result. A popular, but not the only, way to interact with such `kernels` is through the browser-based clients `jupyter notebook` and `jupyter lab`.
Those two clients are quite similar in the sense that both allow you to open, edit, and run `jupyter` notebooks in your browser. Each notebook consists of multiple input and output cells, where the input contains the code being executed while the adjacent output cell displays the result of the computation. Importantly, these cells can contain code in multiple languages such as `julia`, `python`, or `R` (note the name: `jupyter`), as well as Markdown-formatted text, chunks of HTML or even LaTeX equations. This makes `jupyter` notebooks a great tool to compile interactive computable documents that can be directly exported to HTML or LaTeX/PDF reports.
While you can use either the `jupyter notebook` or the `jupyter lab` server for this course, the latter has recently been released as the next-generation `jupyter` interface. We will thus use it for our course. First install `jupyter lab` via pip (see the note below). To start a `jupyter lab` server, just navigate to the directory in your filesystem in which you wish to create or open a `jupyter` notebook and execute the following command in your terminal:
```
> jupyter lab
```
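If the `jupyter lab` command is not found on your system, the lab interface itself probably still needs to be installed; assuming the standard package name on PyPI, this can be done via:
```
> pip install jupyterlab
```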
As a first step, you can try this in the directory that contains this notebook (i.e. the corresponding `.ipynb` notebook file). A browser will start and you should see the notebook file in the file browser panel on the left of your screen. Click it to open an interactive document. Similarly, you can create a new notebook. Double-clicking an input cell in this notebook will allow you to edit the underlying text or code. Pressing Shift+Enter will execute the code and either display the formatted text or the output of the underlying code:
Let us try this in the following input cell, which contains `python` code generating a simple textual output:
%% Cell type:code id: tags:
```
x = 2 * 21
print(x)
```
%%%% Output: stream
42
%% Cell type:markdown id: tags:
You can create a new input cell below the current cursor either by clicking the `+` button in the top left menu or by pressing `b` on the keyboard. If you are editing a cell, you can press `Esc` to enter the command mode, in which you can add, manipulate or delete a cell. To delete a cell press `D` twice in command mode. To change the cell type from `python` to `markdown` press `m` in command mode. Press `y` to change it back to `python` code. Let us try this with the following markdown cell, which contains a LaTeX formula:
%% Cell type:markdown id: tags:
**Note**: We can also use LaTeX formulas: $\int_0^\pi \sin(t) dt$
%% Cell type:markdown id: tags:
All `python` code will be executed by the underlying `python` kernel, whose current status is displayed in the top right circle of the notebook window. An unfilled circle (i.e. white center) indicates that the `kernel` is currently idle. If the circle is filled with black, the `kernel` is busy computing. You can see this when you execute the following cell (press Shift+Enter):
All `python` code will be executed by the underlying `python` kernel, whose current status is displayed in the status bar of the notebook window. You can see this when you execute the following cell (press Shift+Enter):
%% Cell type:code id: tags:
```
s = 0
for i in range(50000000):
s += 1
print(s)
```
%%%% Output: stream
50000000
%% Cell type:markdown id: tags:
Finally, it is important to note that the `python` kernel is simply a single interpreter process that sequentially runs the code in all cells that you execute. This implies that the order in which you execute cells determines the current state of the kernel, i.e. which variables exist and what their values are. In particular, state is maintained across multiple cells, as you can see in the following example:
%% Cell type:code id: tags:
```
x = 42
```
%% Cell type:code id: tags:
```
print(x)
```
%%%% Output: stream
42
%% Cell type:markdown id: tags:
If you happened to execute those two cells in reverse order, you would generate an error. This seems trivial at first, but for complex notebooks where you execute cells back and forth it can become quite difficult to understand what is the current state. You can always wipe the current state by killing the current kernel, starting a new interpreter process. You can do this either by selecting "Restart kernel" in the Kernel menu above, or you enter command mode (press ESC) and hit `0` twice. Try this and then try to execute the following cell, which should return an error as a variable with the name `x` has not been defined in the new kernel.
If you happened to execute those two cells in reverse order, you would generate an error. This seems trivial at first, but for complex notebooks where you execute cells back and forth it can become quite difficult to understand what is the current state. You can always wipe the current state by killing the current kernel, starting a new interpreter process. You can do this by selecting "Restart kernel" in the Kernel menu above. Try it and then execute the following cell, which returns an error as a variable with the name `x` has not been defined in the new kernel.
%% Cell type:code id: tags:
```
print(x)
```
%%%% Output: error
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-fc17d851ef81> in <module>
----> 1 print(x)
NameError: name 'x' is not defined
%% Cell type:code id: tags:
```
```
@@ -2,19 +2,24 @@
"cells": [
{
"cell_type": "markdown",
"source": [
"# 02: Working with `git` and Visual Studio Code\r\n",
"\r\n",
"Now that we have set up our data science environment, we introduce the distributed versioning system `git` as well as the Open Source development environment Visual Studio Code, which you can optionally use to interact with a `jupyter` kernel and collaborate via `git`."
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
"source": [
"# 02: Working with `git` and Visual Studio Code\n",
"\n",
"Now that we have set up our data science environment, we introduce the distributed versioning system `git` as well as the Open Source development environment Visual Studio Code, which you can optionally use to interact with a `jupyter` kernel and collaborate via `git`."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Using the distributed versioning system `git`\n",
"\n",
@@ -104,15 +109,11 @@
"```\n",
"\n",
"and you will find that the file is at the latest state again. These simple commands can help you a lot when you develop code or write a report or thesis. It also allows you to archive the whole history of a project simply by archiving the folder that includes the git repository data. Ay repository meta data containing previous versions of files etc. is automatically stored in a hidden directory `.git`."
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Working with remote `git` repositories\n",
@@ -130,7 +131,7 @@
"1.) We could clone an already existing remote repository to obtain a local copy (that includes the whole history) which is automatically linked to the remote repository.\n",
"2.) We can use a local working repository and push it to a a newly created so-called bare repository at a remote location, that can then be cloned by our collaborators.\n",
"\n",
"Let's start with the second approach and then show how the first approach works later. We first need a bare `git` repository hosted at a remote location that can be accessed by our collaborators. The Web provider [github](https://www.github.com) provides free access to such repositories. Similarly, we can use a privately hosted instance of [gitLab](https://www.gitlab.com). Both are very similar in terms of handling. Once you have created an account, you can simply create a new repository. Do not check the box to initialise the repository, as we want a bare repository to which we can push our already existing local repository.\n",
"Let's start with the second approach and then show how the first approach works later. We first need a bare `git` repository hosted at a remote location that can be accessed by our collaborators. The Web provider [github](https://www.github.com) provides free access to such repositories. Similarly, we can use the university-hosted instance of [gitLab](https://gitlab.informatik.uni-wuerzburg.de). Both are similar in terms of handling. Once you have created an account, you can create a new repository. Do not check the box to initialise the repository, as we want a bare repository to which we can push our already existing local repository.\n",
"\n",
"Once this repository has been created, you will have to check the URL of your repository (e.g. using the HTTPS protocol). We then add this URL as a new remote location that we will call `origin` to our local repository.\n",
"\n",
@@ -170,11 +171,15 @@
"```\n",
"\n",
"This is probably enough to digest for now and for a more in-depth introduction to `git` I refer you to the [official documentation](https://git-scm.com/doc). Anyway for the group exercises you will hardly need more than what we introduced above. Moreover, if you are using Visual Studio Code (see below) syncing a repository (i.e. pulling remote changes and pushing local changes) is even easier: Just click the **sync** symbol in the status bar of Visual Studio Code to update your local copy of the repository and pushing any pending commits."
],
"metadata": {}
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Introducing Visual Studio Code\n",
"\n",
@@ -187,54 +192,54 @@
"``` \n",
"\n",
"in the terminal of your operating system."
],
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"To conveniently work with `python` and `jupyter` notebooks, we need the official [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) extension, which adds `python` code editing, debugging, and linting functionality (linting referring to static code analysis that can highlight syntax and identify issues in python code as you type). You can install it from the Extension Manager that is integrated in Visual Studio Code. Just click the \"module\" icon in the bottom of the left menu bar or press `Ctrl+Shift+X`. This will bring up the Extensions window. Type `python` and click the top-most search result [Python 2021.3.680753044](https://marketplace.visualstudio.com/items?itemName=ms-python.python). In the window on the right, click install. A restart of Visual Studio Code completes the installation.\n",
"\n",
"Once the installation is complete, open Visual Studio Code, click `File -> Open Folder` and navigate to a folder that contains your `python` code files. Create a new file and save it as `test.py`. As you type code, you shall see the syntax being highlighted. You can also run the code by running the debugger (press F5).\n",
"\n",
"Apart from editing and running python files with the standard python interpreter, we can also use `jupyter` to run code directly within VS Code in a regular python file. To see how this works, copy the following code (including the comments `#%%`) into your `python` file (the last line is wrong on purpose):"
],
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
}
]
},
{
"cell_type": "code",
"execution_count": 2,
"source": [
"x = 'Hello World'\n",
"\n",
"#%%\n",
"print(x)\n",
"\n",
"#%% \n",
"x:= 42"
],
"metadata": {},
"outputs": [
{
"output_type": "error",
"ename": "SyntaxError",
"evalue": "invalid syntax (<ipython-input-2-e2e288d370b7>, line 8)",
"output_type": "error",
"traceback": [
"\u001b[1;36m File \u001b[1;32m\"<ipython-input-2-e2e288d370b7>\"\u001b[1;36m, line \u001b[1;32m8\u001b[0m\n\u001b[1;33m x:= 42\u001b[0m\n\u001b[1;37m ^\u001b[0m\n\u001b[1;31mSyntaxError\u001b[0m\u001b[1;31m:\u001b[0m invalid syntax\n"
]
}
],
"metadata": {}
"source": [
"x = 'Hello World'\n",
"\n",
"#%%\n",
"print(x)\n",
"\n",
"#%% \n",
"x:= 42"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"In this code, we have added some lines `#%%`, i.e. the comment character `#` followed by `%%`. For Visual Studio Code this marks the beginning of a computable `jupyter` cell and a so-called \"Code Lens\" `Run Cell` will appear above those lines. We can now execute the code in a `jupyter cell` (delimited by two adjaceny `#%%` tags) by clicking the `Run Cell` command. \n",
"\n",
@@ -243,19 +248,14 @@
"As you can see in the third cell, Visual Studio Code will automatically highlight syntax errors (and, to a certain extent, type errors) as you write your code. Moreover, Visual Studio Code automatically shows documentation extracted from the `docstring` of python classes and methods. You can try this by typing `print(`, which will bring up the documentation of parameters of the `print` function. You can bring up this context-sensitive help manually by hitting CTRL-SPACE. \n",
"\n",
"Finally, VS Code comes with integrated support for collaboration via git. As soon as you open a file or directory that is located within a `git` repository, the bottom left corner of the status bar shows a branch symbol \"master\". Next to the symbol a small sync icon is displayed. Whenever you click this icon, you will automatically pull changes from the remote repository, and push any pending local commits. Via the `git` item in the left navigation bar (the third symbol from the top), you can open list of local changes and untracked files. You can now add those changes that you wish to commit, eneter a commit message and press CTRL-Enter to create a local commit. Moreover, you can manually push or pull changes."
],
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"source": [],
"metadata": {},
"outputs": [],
"metadata": {}
"source": []
}
],
"metadata": {
@@ -280,4 +280,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
\ No newline at end of file
}
%% Cell type:markdown id: tags:
# 02: Working with `git` and Visual Studio Code
Now that we have set up our data science environment, we introduce the distributed versioning system `git` as well as the Open Source development environment Visual Studio Code, which you can optionally use to interact with a `jupyter` kernel and collaborate via `git`.
%% Cell type:markdown id: tags:
## Using the distributed versioning system `git`
While you can download all code and data necessary to complete the exercises from [Moodle](https://moodle.uni-wuppertal.de/course/view.php?id=25405), we recommend using the distributed version control system `git` to (i) clone the code of the practice lectures from gitHub, and (ii) collaborate on the group exercises. I generally cannot recommend `git` enough for any files or directories for which versioning is useful, even if you are working alone on your local machine. It can save you if you accidentally delete a file or a text section in a report, allowing you to switch back and forth between different versions of your work.
In the following, we will give a brief introduction to `git`. `git` is very powerful and while it is simple to use for basic tasks, performing more advanced tasks in a collaborative environment with multiple parties requires some practice. `git` is widely used for software development, data science projects and other types of collaborative editing tasks. In the past years, it has largely superseded other versioning systems like CVS or SVN.
We first make sure that you have installed `git`. Some Linux systems come with a pre-installed `git` distribution, or offer to install `git` via the official package repository (e.g. `apt` on Ubuntu). For Mac and Windows, you can download the `git` command line tools [here](https://git-scm.com/download/). Assuming you have a working installation of `git`, you can run all of the commands shown below directly in a terminal; a quick check is sketched after this paragraph. Note that the `git` distribution above comes with a special `git bash` terminal that makes it particularly easy to work with `git`.
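As a minimal sanity check (assuming `git` is on your `PATH`), the following command prints the installed version and fails if `git` cannot be found:
```
> git --version
```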
The basic unit in `git` (as well as in CVS and SVN) is a repository, which you can think of as a versioned directory that contains files as well as subdirectories. Just like in CVS or SVN, we first need a local copy of a repository, in which we can then edit files locally. When we have reached a state that we wish to save as a new revision, or that we wish to share with our collaborators, we perform a `commit` operation, which permanently records the changes that we have made, while keeping the versions of all files from the previous commit.
If you are familiar with the versioning systems CVS and SVN, you will know that those are based on a server that maintains a central copy of the repository. Thus, a commit in CVS and SVN will actually transfer the local changes to the remote repository. A major difference in `git` is that repositories are completely decentralised, i.e. no server is needed to perform a commit. Instead, a `git` repository can locally record the full history of commits independent of any central server. Nevertheless, we can easily collaborate with others, simply by creating a local clone of their repository. The local commits of other developers and our own commits can then be exchanged by explicitly pulling remote changes (i.e. multiple commits) to our local repository, or pushing local commits to the remote repository.
If we are dealing with multiple collaborators, we can actually host a remote repository on a server, which then acts much like the central server in CVS/SVN. However, it is important to note that all clones of this repository nevertheless store the full history, i.e. the central repository can later be deleted without losing any information. A popular way to centrally host such `git` repositories is `github` or an institutional `gitlab` server, like the one installed at [git.uni-wuppertal.de](https://git.uni-wuppertal.de).
Let's now get our hands dirty and play with git. We first create a local repository in our filesystem. For this, we simply open a `git` terminal, enter the directory that shall contain the files we wish to version and execute:
```
> git init
```
Whenever we are in a subdirectory of a `git` repository, we can execute
```
> git status
```
to check the current status of the repository. This will indicate whether there are any (i) newly created untracked files, or (ii) changes to tracked files that have not been committed. Let us try this by creating a new text file `file1.txt` and running `git status` again. We now see that `git` has discovered an untracked file, i.e. a file that exists locally but that has not been added to the repository yet. By default files are untracked unless we specifically add them, which is useful because we often have local temporary files that we do not really want to include in the versioning. We can use the command `git add` to explicitly add a file to the repository, i.e.
```
> git add file1.txt
```
With this command, you basically tell `git` that you would like to track a file, but we have not yet committed the file. We can do this by executing:
```
> git commit -am 'added a file'
```
If `git` hasn't been used before, you will be asked to provide your username and email address.
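You can also do this one-time configuration up front; in the following sketch, the name and email address are placeholders that you should replace with your own:
```
> git config --global user.name "Your Name"
> git config --global user.email "you@example.com"
```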
The option `-a` automatically commits all changes to tracked files but it will not touch any new files that you haven't explicitly told `git` to track via the `git add` command. The `-m` option allows you to specify a commit message on the command line. If you now check the status again using
```
> git status
```
you will see that there are no changes left that have not been committed. Moreover, executing
```
> git log
```
shows the history of the repository, which in our case consists of a single commit. Let us now make a change in the file `file1.txt`, commit the change and then check the log again.
```
> git commit -am 'changed file'
> git log
```
It is not surprising that the log now contains two entries, one representing the state of the file before our change and one representing the state of the file after the change. Using the `git show` and `git checkout` command, we can show the changes made in a specific commit and we can easily move back and forth between different versions. To uniquely identify a specific commit, `git` uses hash values that are represented as a long sequence of letters and numbers. You can see those hash values in an entry like this in the log:
```
commit d50bedf3304bafc865ff3cc4a73f91a423aeb02c (HEAD -> master)
Author: Ingo Scholtes <scholtes@uni-wuppertal.de>
Date: Wed Oct 9 09:15:41 2019 +0200
changed file
```
Don't worry, you won't have to type the whole hash value to display the contents of a commit. It is enough to give the first few characters that uniquely identify a commit within your repository. Let's see what was changed in that file in our last commit:
```
> git show d50bed
```
This will print lines that have been deleted (red) and added (green) compared to the previous version. If we want to move to the previous version of the file, we can simply check out this version using the `git checkout` command, specifying the hash of the previous commit that we want to check out:
```
> git checkout 5126
```
You will find that the file is now at the state before our last commit. To go back to the latest version in the main (master) branch of our repository, we can simply execute:
```
> git checkout master
```
and you will find that the file is at the latest state again. These simple commands can help you a lot when you develop code or write a report or thesis. It also allows you to archive the whole history of a project simply by archiving the folder that includes the git repository data. All repository metadata, including previous versions of files, is automatically stored in a hidden directory `.git`.
%% Cell type:markdown id: tags:
## Working with remote `git` repositories
While working with `git` repositories locally is helpful, the real power of `git` only becomes apparent when collaborating with others via **remote repositories**. Again, `git` uses a fully distributed approach, so technically there is no difference between your local repository and a remote repository. Indeed, from the perspective of the other party, your local repository is actually the remote repository. In any case, we can push and pull commits between the repositories to synchronize them, while automatically or manually handling merges or possible conflicting changes.
The first question we have to answer is how to link a repository to a remote repository. To see all existing links of a repository to remote repositories we can open a `git` terminal and run the command
```
git remote -v
```
within the local repository created above. It is not surprising that this shows an empty list, because we have not linked it to a remote repository yet. There are two basic ways to establish such a link:
1.) We could clone an already existing remote repository to obtain a local copy (that includes the whole history) which is automatically linked to the remote repository.
2.) We can use a local working repository and push it to a newly created so-called bare repository at a remote location, which can then be cloned by our collaborators.
Let's start with the second approach and then show how the first approach works later. We first need a bare `git` repository hosted at a remote location that can be accessed by our collaborators. The Web provider [github](https://www.github.com) provides free access to such repositories. Similarly, we can use a privately hosted instance of [gitLab](https://www.gitlab.com). Both are very similar in terms of handling. Once you have created an account, you can simply create a new repository. Do not check the box to initialise the repository, as we want a bare repository to which we can push our already existing local repository.
Let's start with the second approach and then show how the first approach works later. We first need a bare `git` repository hosted at a remote location that can be accessed by our collaborators. The Web provider [github](https://www.github.com) provides free access to such repositories. Similarly, we can use the university-hosted instance of [gitLab](https://gitlab.informatik.uni-wuerzburg.de). Both are similar in terms of handling. Once you have created an account, you can create a new repository. Do not check the box to initialise the repository, as we want a bare repository to which we can push our already existing local repository.
Once this repository has been created, you will have to check the URL of your repository (e.g. using the HTTPS protocol). We then add this URL as a new remote location that we will call `origin` to our local repository.
```
> git remote add origin https://YOUR_URL.git
```
We then push all the commits in the `master` branch of our local repository to the remote location `origin`. The parameter `-u` sets this remote location as the default `upstream` location of this branch, i.e. in the future you can simply run `git push` to send your changes in this branch to the remote repository.
```
> git push -u origin master
```
That's all. Now the remote repository hosted on gitHub or gitLab contains the same history as your local `master` branch. You can now point a collaborator to this URL to start collaborating on the project. They will need to make a local copy of the repository that is linked to the remote location. This is easy: all they have to do is open a terminal and run:
```
> git clone HTTPS://YOUR_URL.git LOCAL_DIRECTORY
```
which will create a local working copy of the remote repository. Running the `git remote` command again, i.e.
```
git remote -v
```
shows that this local working copy is automatically linked to the remote repository (named `origin` by default) from which it was cloned. Your collaborators can now work on files, commit changes, etc. The latest version from the remote repository can be pulled by executing
```
> git pull
```
while any locally committed changes can be pushed to the remote repository via
```
> git push
```
This is probably enough to digest for now; for a more in-depth introduction to `git`, I refer you to the [official documentation](https://git-scm.com/doc). In any case, for the group exercises you will hardly need more than what we introduced above. Moreover, if you are using Visual Studio Code (see below), syncing a repository (i.e. pulling remote changes and pushing local changes) is even easier: just click the **sync** symbol in the status bar of Visual Studio Code to update your local copy of the repository and push any pending commits.
%% Cell type:markdown id: tags: