Commit 7414ead4 authored by Ingo Scholtes
parent 600807f8
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# P01 - 01: The `python` ecosystem for network science\n",
"\n",
"**April 15, 2021** \n",
"*Ingo Scholtes* \n",
"\n",
"In this notebook, we explain how you can set up the data science environment that we will use in the practice lectures. The environment consists of a `python3` interpreter, the network analysis package `pathpy`, some additional packages for data analysis, and visualisation, the versioning system `git`, a `jupyter` notebook server, as well as - optionally - the development environment Visual Studio Code."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Setting up `python` and `jupyter`\n",
"\n",
"To run the practice lecture notebooks and work on the exercise sheets, you need a `python 3.7` environment running on an operating system of your choice. For Windows, MacOS, and Linux users we recommend to install the latest [Anaconda distribution](https://www.anaconda.com/download/), an OpenSource `python` distribution pre-configured for data science and machine learning tasks. \n",
"\n",
"Just download and run the installer and you should have almost everything you need for this course. Beware of alternative methods to install a barebone python distribution, as a careless installation may conflict with the python version already present on your system. We have had students that managed to wreck their Mac OS X or Linux operating system by accidentally removing the standard python runtime!\n",
"\n",
"If you prefer starting from a barebone `python 3.x` installation, you will also need to manually install additional packages via the python package manager `pip`. To see a list of python packages that are already installed, you can open a terminal and run \n",
"\n",
"```\n",
"> pip list\n",
"```\n",
"\n",
"If you installed Anaconda on a Windows system you should use the `Anaconda prompt` terminal that has been installed by Anaconda. This will make sure that all environment variables are correctly set. Moreover, to install packages, it is best to open this command prompt as an administrator (or use `su` on a Unix-based system). To complete the practice lectures and group exercises, we will need the following packages: \n",
"\n",
"`jupyter` - provides an environment for interactive data science projects in your browser. We will extensively use so-called `jupyter notebooks`, which are interactive computable documents that you can also use to compile reports. \n",
"`pathpy` - provides implementations of common scientific and statistical computing techniques for python. \n",
"`scipy` - provides implementations of common scientific and statistical computing techniques for python. \n",
"`numpy` - provides support for multi-dimensional arrays an matrices as well as high-level mathematical functions. This project originated as a smaller core part of `scipy`. \n",
"`matplotlib` - provides advanced plotting functions based on the data types introduced in `numpy`. Visualisations can be directly integrated into `jupyter` notebooks. \n",
"`pandas` - popular package for the management, analysis, and manipulation of multi-dimensional **pan**el **da**ta (thus the name). Provides convenient interfaces for the import and export of data from files or databases. \n",
"\n",
"To install the packages above, except for `pathpy` just run the following command in the terminal for each of the packages above:\n",
"\n",
"```\n",
"> pip install PACKAGENAME\n",
"```\n",
"\n",
"If you see no error messages, you should be all set to continue with the next steps. For the same reason presented above pip may be associated to the python version of the system. In order to have a full control on the version in which the packages should be installed you can use\n",
"```\n",
"> python3.x -m pip install PACKAGENAME\n",
"``` "
]
},
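{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check which `python` installation a given `pip` executable is associated with, you can run the following command (a minimal sketch, assuming `pip` is on your `PATH`): `pip --version` prints the `pip` version together with the path of the `python` installation it belongs to.\n",
"\n",
"```\n",
"> pip --version\n",
"```"
]
},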
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up `pathpy`\n",
"\n",
"In this course we will use `pathpy`, a network analysis and visualisation package that is currently being developed at my chair.\n",
"\n",
"Compared to many other packages, `pathpy` has a couple of advantages. First, it is easy to install as it should have no dependencies not already included in a default `anaconda` installation. Second, `pathpy` has a user-friendly API making it easy to handle directed and undirected networks, networks with single and multiple edges, multi-layer networks or temporal networks. It also provides interactive HTML-based visualisations that can be directly displayed inside `jupyter` notebooks, which makes it particularly suitable for educational settings. Third, it supports the analysis and visualisation of time series data on networked systems, such as time-stamped edges or data on paths in networks.\n",
"\n",
"Since `pathpy` is not included in the default Anaconda installation, we first need to install it. In previous iterations of the course, we used the stable version `pathy 2.0`. Right now, we are in the process of finishing a heavily revised version 3.0, which comes with many advantages. It has a cleaner API, is more efficient, and provides advanced plotting functions. To benefit from those advantages, we use the development version of `pathpy3` from gitHub. The best way to install it is to (1) clone the git repository to a local directory, and (2) install an editable version of the `pathpy` module from this cloned repository. This approach will allow you to execute `git pull` from the commandline to always update to the latest development version. \n",
"\n",
"To install pathpy 3 open a command line as administrator and run: \n",
"\n",
"```\n",
"> git clone https://github.com/pathpy/pathpy pathpy3\n",
"> cd pathpy3\n",
"> pip install -e .\n",
"``` \n",
"\n",
"This will create a new directory `pathpy3` on your machine. Changing to this directory and running `pip install -e .` will *install `pathpy` as an editable python module*. Rather than copying the code to a separate directory in your python module path, this creates a symbolic link to the current directory. This has the advantage that you can update your local installation of `pathpy` simply by entering the directory and running `git pull`. After this the new version will be immediately available in all `python` processes launched after the update. This allows us to update your local `pathpy` installations by means of a simple `git push`, without the need to uninstall and install again. Note that after updating `pathpy` you must always restart the python process (or jupyter kernel) before changes become effective!\n",
"\n",
"`pathpy` requires `python 3.x`. If you have both ` python 2.x` and `python 3.x` installed, you can explicitly install a package for `python 3` by using the command `pip3` instead of `pip`. Note that `python 2` is deprecated since April 2020, so you should always use `python 3` anyway.\n",
"\n",
"If by any chance you had previously installed an offifical release version of pathpy via pip (e.g. for our data science course), you will need to manually uninstall it. You can check this via the command `pip list`. Uninstalling prior versions is necessary because all `pathpy` versions use the same namespace `pathpy` and for reasons of backwards compatibility we still provide older major versions in case someone needs them. If for some reason you want to quickly switch between different major versions, we recommend to use [virtual environments in Anaconda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)."
]
},
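{
"cell_type": "markdown",
"metadata": {},
"source": [
"To verify that the editable installation worked, you can try to import `pathpy` in a fresh `python` process. The following cell is a minimal sketch of such a check; it assumes that the development version exposes a `__version__` attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# minimal sketch: check that the editable pathpy install is picked up\n",
"# (assumes the package exposes a __version__ attribute)\n",
"import pathpy as pp\n",
"print(pp.__version__)"
]
},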
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Using `jupyter` notebooks\n",
"\n",
"Now that all the packages have been installed, you should be able to start a `jupyter` server, a process that can execute chunks of `python code` in a so-called `kernel` and return the result. This `jupyter` server can run local `python kernels`, which are actually local servers that accept requests with `python` code from clients, execute the code and then return the result. A popular but not the only way to interact with such `kernels` is through the browser-based clients `jupyter notebook` and `jupyter lab`. \n",
"\n",
"Those two clients are quite similar in the sense that both allow you to open, edit, and run `jupyter` notebooks in your browser. Each notebook consists of multiple input and output cells, where the input contains the code being executed while the adjacent output cell displays the result of the computation. Importantly, these cells can contain code in multiple languages such as `julia`, `python`, or `R` (note the name: `jupyter`), as well as Markdown-formatted text, chunks of HTML or even LaTeX equations. This makes `jupyter` notebooks a great tool to compile interactive computable documents, that can directly be exported to HTML or LaTex/PDF reports.\n",
"\n",
"\n",
"While you can use both the `jupyter notebook` or the `jupyter lab` server for this course, the latter as recently been released as the next-generation `jupyter` interface. We will thus use this for our course. First install `jupyter lab` via pip. To start a `jupyter lab` server, just navigate to the directory in your filesystem in which you wish to create or open a `jupyter` notebook and execute the following command in your terminal:\n",
"\n",
"```\n",
"> jupyter lab\n",
"```\n",
"\n",
"As a first step, you can try this in the directory that contains this notebook (i.e. the corresponding `.ipynb` notebook file). A browser will start and you should see the notebook file in the file browser panel on the left of your screen. Click it to open an interactive document. Similarly, you can create a new notebook. Doubleclicking an input cell in this notebook will allow you to edit the underlying text or code. Pressing Shift+Enter will execute the code and either display the formatted textor the output of the underlying code:\n",
"\n",
"Let us try this in the following input cell, that contains `python` code generating a simple textual output:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"42\n"
]
}
],
"source": [
"x = 2 * 21\n",
"print(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can create a new input cell below the current cursor either by clicking the `+` button in the top left menu or by pressing `b` on the keyboard. If you are editing a cell, you can press `Esc` to enter the command mode, in which you can add, manipulate or delete a cell. To delete a cell press `D` twice in command mode. To change the cell type from `python` to `markdown` press `m` in command mode. Press `y` to change it back to `python` code. Let us try this with the following markdown cell, that contains a LaTeX formula as well as a chunk of HTML code:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: We can also use LaTeX formulas: $\\int_0^\\pi \\sin(t) dt$"
]
},
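{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: Raw HTML is rendered too (a small illustrative snippet, not part of the original text): <span style=\"color:red\">this text is styled via an HTML tag</span>"
]
},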
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All `python` code will be executed by the underlying `python` kernel, whose current status is displayed in the top right circle of the notebook window. An unfilled circle (i.e. white center) indicates that the `kernel` is currently idle. If the circle is filled with black, the `kernel` is busy computing. You can see this when you execute the following cell (press Shift+Enter):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = 0\n",
"for i in range(50000000):\n",
" s += 1\n",
"print(s)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, it is important to note that the `python` kernel is simply a single interpreter process that sequentially runs the code in all cells that you execute. This implies that the order of your execution determines the current state of the kernel, i.e. which variables exist and what the values of those variables are. In particular, state is maintained across multiple cell, as you can see in the following example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = 42"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you happened to execute those two cells in reverse order, you would generate an error. This seems trivial at first, but for complex notebooks where you execute cells back and forth it can become quite difficult to understand what is the current state. You can always wipe the current state by killing the current kernel, starting a new interpreter process. You can do this either by selecting \"Restart kernel\" in the Kernel menu above, or you enter command mode (press ESC) and hit `0` twice. Try this and then try to execute the following cell, which should return an error as a variable with the name `x` has not been defined in the new kernel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(x)"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"name": "python374jvsc74a57bd0179f2c9954461ddf657daf1ee3f9df7374d575c8403facec5648a064395b52ac",
"display_name": "Python 3.7.4 64-bit ('anaconda3': conda)"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4-final"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
\ No newline at end of file
%% Cell type:markdown id: tags:
# P01 - 01: The `python` ecosystem for network science
**April 15, 2021**
*Ingo Scholtes*
In this notebook, we explain how you can set up the data science environment that we will use in the practice lectures. The environment consists of a `python3` interpreter, the network analysis package `pathpy`, some additional packages for data analysis and visualisation, the versioning system `git`, a `jupyter` notebook server, as well as - optionally - the development environment Visual Studio Code.
%% Cell type:markdown id: tags:
## Setting up `python` and `jupyter`
To run the practice lecture notebooks and work on the exercise sheets, you need a `python 3.7` environment running on an operating system of your choice. For Windows, MacOS, and Linux users we recommend installing the latest [Anaconda distribution](https://www.anaconda.com/download/), an open-source `python` distribution pre-configured for data science and machine learning tasks.
Just download and run the installer and you should have almost everything you need for this course. Beware of alternative methods to install a barebone python distribution, as a careless installation may conflict with the python version already present on your system. We have had students who managed to wreck their Mac OS X or Linux operating system by accidentally removing the standard python runtime!
If you prefer starting from a barebone `python 3.x` installation, you will also need to manually install additional packages via the python package manager `pip`. To see a list of python packages that are already installed, you can open a terminal and run
```
> pip list
```
If you installed Anaconda on a Windows system, you should use the `Anaconda prompt` terminal that has been installed by Anaconda. This will make sure that all environment variables are correctly set. Moreover, to install packages, it is best to open this command prompt as an administrator (or use `sudo` on a Unix-based system). To complete the practice lectures and group exercises, we will need the following packages:
`jupyter` - provides an environment for interactive data science projects in your browser. We will extensively use so-called `jupyter notebooks`, which are interactive computable documents that you can also use to compile reports.
`pathpy` - provides methods for the analysis and interactive visualisation of networks, as well as of time series data on networks. We explain how to install it in the next section.
`scipy` - provides implementations of common scientific and statistical computing techniques for python.
`numpy` - provides support for multi-dimensional arrays and matrices as well as high-level mathematical functions. This project originated as a smaller core part of `scipy`.
`matplotlib` - provides advanced plotting functions based on the data types introduced in `numpy`. Visualisations can be directly integrated into `jupyter` notebooks.
`pandas` - popular package for the management, analysis, and manipulation of multi-dimensional **pan**el **da**ta (thus the name). Provides convenient interfaces for the import and export of data from files or databases.
To install the packages above (except for `pathpy`, whose installation we cover in the next section), just run the following command in the terminal for each package:
```
> pip install PACKAGENAME
```
If you see no error messages, you should be all set to continue with the next steps. As explained above, `pip` may be associated with your system's default python version. To have full control over the python version for which a package is installed, you can instead run
```
> python3.x -m pip install PACKAGENAME
```
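%% Cell type:markdown id: tags:
To check which `python` installation a given `pip` executable is associated with, you can run the following command (a minimal sketch, assuming `pip` is on your `PATH`): `pip --version` prints the `pip` version together with the path of the `python` installation it belongs to.
```
> pip --version
```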
%% Cell type:markdown id: tags:
## Setting up `pathpy`
In this course we will use `pathpy`, a network analysis and visualisation package that is currently being developed at my chair.
Compared to many other packages, `pathpy` has a couple of advantages. First, it is easy to install, as it should have no dependencies that are not already included in a default `anaconda` installation. Second, `pathpy` has a user-friendly API that makes it easy to handle directed and undirected networks, networks with single and multiple edges, multi-layer networks, or temporal networks. It also provides interactive HTML-based visualisations that can be directly displayed inside `jupyter` notebooks, which makes it particularly suitable for educational settings. Third, it supports the analysis and visualisation of time series data on networked systems, such as time-stamped edges or data on paths in networks.
Since `pathpy` is not included in the default Anaconda installation, we first need to install it. In previous iterations of the course, we used the stable version `pathpy 2.0`. Right now, we are in the process of finishing a heavily revised version 3.0, which comes with many advantages: it has a cleaner API, is more efficient, and provides advanced plotting functions. To benefit from those advantages, we use the development version of `pathpy3` from GitHub. The best way to install it is to (1) clone the git repository to a local directory, and (2) install an editable version of the `pathpy` module from this cloned repository. This approach will allow you to execute `git pull` from the command line to always update to the latest development version.
To install `pathpy 3`, open a command line as administrator and run:
```
> git clone https://github.com/pathpy/pathpy pathpy3
> cd pathpy3
> pip install -e .
```
This will create a new directory `pathpy3` on your machine. Changing to this directory and running `pip install -e .` will *install `pathpy` as an editable python module*. Rather than copying the code to a separate directory in your python module path, this creates a symbolic link to the current directory. This has the advantage that you can update your local installation of `pathpy` simply by entering the directory and running `git pull`. The new version will then be available in all `python` processes launched after the update. This allows us to ship updates to your local `pathpy` installation by means of a simple `git push` on our side, without you having to uninstall and reinstall the package. Note that after updating `pathpy` you must always restart the python process (or jupyter kernel) before the changes become effective!
`pathpy` requires `python 3.x`. If you have both `python 2.x` and `python 3.x` installed, you can explicitly install a package for `python 3` by using the command `pip3` instead of `pip`. Note that `python 2` has been deprecated since April 2020, so you should always use `python 3` anyway.
If by any chance you had previously installed an official release version of pathpy via pip (e.g. for our data science course), you will need to manually uninstall it. You can check this via the command `pip list`. Uninstalling prior versions is necessary because all `pathpy` versions use the same namespace `pathpy`, and for reasons of backwards compatibility we still provide older major versions in case someone needs them. If for some reason you want to quickly switch between different major versions, we recommend using [virtual environments in Anaconda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html).
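%% Cell type:markdown id: tags:
To verify that the editable installation worked, you can try to import `pathpy` in a fresh `python` process. The following cell is a minimal sketch of such a check; it assumes that the development version exposes a `__version__` attribute:
%% Cell type:code id: tags:
```
# minimal sketch: check that the editable pathpy install is picked up
# (assumes the package exposes a __version__ attribute)
import pathpy as pp
print(pp.__version__)
```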
%% Cell type:markdown id: tags:
## Using `jupyter` notebooks
Now that all the packages have been installed, you should be able to start a `jupyter` server, a process that can execute chunks of `python` code in a so-called `kernel` and return the result. This `jupyter` server can run local `python` kernels, which are actually local servers that accept requests with `python` code from clients, execute the code, and then return the result. A popular, but not the only, way to interact with such kernels is through the browser-based clients `jupyter notebook` and `jupyter lab`.
Those two clients are quite similar in the sense that both allow you to open, edit, and run `jupyter` notebooks in your browser. Each notebook consists of multiple input and output cells, where the input contains the code being executed while the adjacent output cell displays the result of the computation. Importantly, these cells can contain code in multiple languages such as `julia`, `python`, or `R` (note the name: `jupyter`), as well as Markdown-formatted text, chunks of HTML, or even LaTeX equations. This makes `jupyter` notebooks a great tool to compile interactive computable documents that can be directly exported to HTML or LaTeX/PDF reports.
While you can use either the `jupyter notebook` or the `jupyter lab` server for this course, the latter has recently been released as the next-generation `jupyter` interface. We will thus use it for our course. First install `jupyter lab` via pip (the package is named `jupyterlab`, i.e. run `pip install jupyterlab`). To start a `jupyter lab` server, just navigate to the directory in your filesystem in which you wish to create or open a `jupyter` notebook and execute the following command in your terminal:
```
> jupyter lab
```
As a first step, you can try this in the directory that contains this notebook (i.e. the corresponding `.ipynb` notebook file). A browser will start and you should see the notebook file in the file browser panel on the left of your screen. Click it to open an interactive document. Similarly, you can create a new notebook. Double-clicking an input cell in this notebook will allow you to edit the underlying text or code. Pressing Shift+Enter will execute the code and either display the formatted text or the output of the underlying code:
Let us try this in the following input cell, which contains `python` code generating a simple textual output:
%% Cell type:code id: tags:
```
x = 2 * 21
print(x)
```
%%%% Output: stream
42
%% Cell type:markdown id: tags:
You can create a new input cell below the current cursor either by clicking the `+` button in the top left menu or by pressing `b` on the keyboard. If you are editing a cell, you can press `Esc` to enter the command mode, in which you can add, manipulate, or delete a cell. To delete a cell, press `D` twice in command mode. To change the cell type from `python` to `markdown`, press `m` in command mode. Press `y` to change it back to `python` code. Let us try this with the following markdown cell, which contains a LaTeX formula as well as a chunk of HTML code:
%% Cell type:markdown id: tags:
**Note**: We can also use LaTeX formulas: $\int_0^\pi \sin(t) dt$
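%% Cell type:markdown id: tags:
**Note**: Raw HTML is rendered too (a small illustrative snippet, not part of the original text): <span style="color:red">this text is styled via an HTML tag</span>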
%% Cell type:markdown id: tags:
All `python` code will be executed by the underlying `python` kernel, whose current status is displayed in the top right circle of the notebook window. An unfilled circle (i.e. white center) indicates that the `kernel` is currently idle. If the circle is filled with black, the `kernel` is busy computing. You can see this when you execute the following cell (press Shift+Enter):
%% Cell type:code id: tags:
```
s = 0
for i in range(50000000):
s += 1
print(s)
```
%% Cell type:markdown id: tags:
Finally, it is important to note that the `python` kernel is simply a single interpreter process that sequentially runs the code in all cells that you execute. This implies that the order in which you execute cells determines the current state of the kernel, i.e. which variables exist and what the values of those variables are. In particular, state is maintained across multiple cells, as you can see in the following example:
%% Cell type:code id: tags:
```
x = 42
```
%% Cell type:code id: tags:
```
print(x)
```
%% Cell type:markdown id: tags:
If you happened to execute those two cells in reverse order, you would generate an error. This seems trivial at first, but for complex notebooks where you execute cells back and forth it can become quite difficult to understand what the current state is. You can always wipe the current state by killing the current kernel and starting a new interpreter process. You can do this either by selecting "Restart kernel" in the Kernel menu above, or by entering command mode (press `Esc`) and hitting `0` twice. Try this and then try to execute the following cell, which should return an error as a variable with the name `x` has not been defined in the new kernel.
%% Cell type:code id: tags:
```
print(x)
```
{"cells":[{"cell_type":"markdown","metadata":{"slideshow":{"slide_type":"slide"}},"source":["# P01 - 05: Reading and writing data\n","\n","**April 15, 2021** \n","*Ingo Scholtes* \n","\n","In the fifth unit, we show how we can read and write data from files or databases."]},{"cell_type":"markdown","metadata":{},"source":["## Reading and writing data from and to files\n","\n","Without data there is no network science, so we better learn how to read and write data in different formats in `python`. The basic, low-level interface to read and write data from/to the filesystem is provided by the function `open`. It returns a handle to a file that can be used to read/write text or binary data. If we only want to read data, we can pass the path to an existing file and open the file in read mode by specifying the access mode `r` as the second argument. In general, we would have to manually close a file that we opened by calling `f.close`. If we fail to do so, the file might remain locked or the contents we intended to write may actually not be fully written when the process exits. To save us the hazzle of remembering to manually close the file, we can use `python`'s `with` construct. It allows us to group statements and couple them to a so-called context manager, which will automatically close the file `f` for us as soon as we leave the scope of the compound statements.\n","\n","Let us read a text file in (default) text mode by calling the read function. This will return a single string object that contains the whole contents of the file."]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["<class 'str'>\n193158\nP\n"]}],"source":["with open('data/posterior_analytics.txt', 'r') as f:\n"," text = f.read()\n","print(type(text))\n","print(len(text))\n","print(text[0])"]},{"cell_type":"markdown","metadata":{},"source":["Apart from reading the file contents into a single string, it is often convenient to read individual lines as separate string, returning a list of strings that contain the lines of the file. We can do this by using the function `readlines`:"]},{"cell_type":"code","execution_count":2,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["<class 'list'>\n3192\nPosterior Analytics\n\n"]}],"source":["with open('data/posterior_analytics.txt', 'r') as f:\n"," lines = f.readlines()\n","print(type(lines))\n","print(len(lines))\n","print(lines[0])"]},{"cell_type":"markdown","metadata":{},"source":["Above, we have opened the file in default text mode, which assumes a default character encoding that can be changed via the `encoding` argument. Alternatively, we can open a file in binary mode, in which case the function read will return a stream of bytes. Each entry in the iterable `bytes` object is a single byte, represented by an integer value in the range from 0 to 255. Looking up the value `80` in an ASCII encoding table confirms that the first character in the file is a `P`."]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["<class 'bytes'>\n196349\n80\n"]}],"source":["with open('data/posterior_analytics.txt', 'rb') as f:\n"," binary = f.read()\n","print(type(binary))\n","print(len(binary))\n","print(binary[0])"]},{"cell_type":"markdown","metadata":{},"source":["Let us perform the simplest possible data analytics on the text file, i.e. counting the frequency of words. 
Since this lecture is not about text mining and natural language processing, here we take a maximally simple approach to split the text into words. We simply split the string based on a regular expression that contains newline, tab and whitespace characters. We can specify this using `python`'s regular expression module `re`. We can then use the `Counter` class introduced in the previous unit to efficiently count word frequencies and display the ten most common words:"]},{"cell_type":"code","execution_count":4,"metadata":{},"outputs":[{"output_type":"execute_result","data":{"text/plain":["[('the', 2280),\n"," ('of', 1411),\n"," ('is', 1249),\n"," ('a', 886),\n"," ('and', 841),\n"," ('to', 707),\n"," ('in', 659),\n"," ('that', 594),\n"," ('', 576),\n"," ('be', 559)]"]},"metadata":{},"execution_count":4}],"source":["from collections import Counter\n","import re\n","\n","freq = Counter([ x.strip() for x in re.split(\" |\\n|\\t|-\", text.lower())])\n","freq.most_common(10)"]},{"cell_type":"markdown","metadata":{},"source":["Let us assume we wish to store these statistics for later use. A low-level way to do this is to format the word-frequency pairs as strings, and write individual lines to a file. For this, we can simply use the `open` function again to obtain a file handle to a new file with write (`w`) permission. In this case, the file does not need to exist beforehand and it will be overwritten whenever we open it again. If we instead wanted to append to an already existing file, we need to specify the access string `a`. Each call to the write function will add the string to the end of the current file stream. Moreover, we use `python`'s `format` function to format the string and integer values into a comma-separated line."]},{"cell_type":"code","execution_count":5,"metadata":{},"outputs":[],"source":["with open('data/word_frequencies.dat', 'w') as f:\n"," for word in freq:\n"," f.write('{0},{1}\\n'.format(word, freq[word]))"]},{"cell_type":"markdown","metadata":{},"source":["We have manually created a file of comma-separated values (CSV), which is an important general text-based format to exchange data between applications and systems. In fact, we should not do this manually, as it leaves a lot of room for errors. Even our simple file will not be easy to parse, because some of the words actually contain commas. We failed to escape those special characters. A better idea is to use the support for the import and export of `CSV` data integrated in `python`. For this, we can simply use the `reader` and `writer` classes in the module `csv`. We can set the value delimiter as well as the (OS-dependent) newline format, and the writer class will automatically take care of values that contain the delimiter character upon writing and reading. We can even export data in formats that are easy to interpret by third-party applications like Excel."]},{"cell_type":"code","execution_count":6,"metadata":{},"outputs":[],"source":["import csv\n","\n","with open(\"data/word_frequencies.csv\", 'w', newline='') as f:\n"," writer = csv.writer(f, delimiter=',')\n"," for word in freq:\n"," writer.writerow([word, freq[word]])"]},{"cell_type":"markdown","metadata":{},"source":["In the following, we show how we can use the `reader` class in the `csv` module to read CSV data with a given `delimiter` character. 
We can directly access data in the different columns of a row by indexing the `row` entries of the iterable `reader`."]},{"cell_type":"code","execution_count":7,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["2280\n"]}],"source":["counts = {}\n","\n","with open(\"data/word_frequencies.csv\", 'r', newline='') as f:\n"," reader = csv.reader(f, delimiter=',')\n"," for row in reader:\n"," counts[row[0]] = row[1]\n","\n","print(counts['the'])"]},{"cell_type":"markdown","metadata":{},"source":["## Serialising `python` objects\n","\n","In the examples above, we have manually taken care of the format in which we want to store our data. Another approach is to directly write `python` objects into files, a process that is called serialization because it requires us to convert a potentially nested object structure to a sequence of bytes. Due to the importance of the World Wide Web and JavaScript, over the past years `JSON` (short for JavaScript Object Notation) has become a universal format to exchange arbitrary objects with nested data fields. You can basically think about this as a simple way to store a nested dictionary structure in an easily interpretable and human-readable text file. In fact, you are using `JSON` right now, since `jupyter` notebooks are just `JSON` files. `python` provides an easy way to store arbitrary objects as a `JSON` file. We just need to import the `json` module and call the `dumps` function to retrieve a `JSON`-formatted string representation of an object (and `loads` to turn such a string back into an object). We can then read or write this string using the functions above."]},{"cell_type":"code","execution_count":8,"metadata":{},"outputs":[],"source":["import json\n","\n","j = json.dumps(freq)\n","\n","with open('data/word_frequencies.json', 'w') as f:\n"," f.write(j)"]},{"cell_type":"code","execution_count":9,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["<class 'dict'>\n"]}],"source":["with open('data/word_frequencies.json', 'r') as f:\n"," j = f.read()\n","\n","loaded_freq = json.loads(j)\n","print(type(loaded_freq))"]},{"cell_type":"markdown","metadata":{},"source":["Rather than calling the read/write function ourselves, we can also use the functions without the `s` suffix (which stands for `string`) to directly operate on a file handle:"]},{"cell_type":"code","execution_count":10,"metadata":{},"outputs":[{"output_type":"execute_result","data":{"text/plain":["dict"]},"metadata":{},"execution_count":10}],"source":["with open('data/word_frequencies.json', 'r') as f:\n"," loaded_freq = json.load(f)\n","type(loaded_freq)"]},{"cell_type":"markdown","metadata":{},"source":["In any case, you see that we retrieve a `dict` object that only contains the words and frequencies stored in the original `Counter` object. This is due to the fact that the specific `python` classes and data types cannot (and should not!) be stored in the language-independent `JSON` format. If we want to store a full copy of an arbitrary object or data type while maintaining this information, we can use the `pickle` module. As the name indicates, this `preserves` an object in a file for later use. It also uses a binary file format, which makes it more efficient to store large objects. Using `pickle` is very similar to the `json` module. 
We can simply `dump` an object into a (binary) file and read it again using the `load` function."]},{"cell_type":"code","execution_count":11,"metadata":{},"outputs":[],"source":["import pickle\n","\n","with open('data/word_frequencies.pickled', 'wb') as f:\n"," pickle.dump(freq, f)"]},{"cell_type":"markdown","metadata":{},"source":["Different from `JSON` files, here we retain all information about the object, so if we load the object we obtain an instance of the `Counter` class and we can directly use its function `most_common` to extract the most common words."]},{"cell_type":"code","execution_count":12,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["<class 'collections.Counter'>\n"]},{"output_type":"execute_result","data":{"text/plain":["[('the', 2280),\n"," ('of', 1411),\n"," ('is', 1249),\n"," ('a', 886),\n"," ('and', 841),\n"," ('to', 707),\n"," ('in', 659),\n"," ('that', 594),\n"," ('', 576),\n"," ('be', 559)]"]},"metadata":{},"execution_count":12}],"source":["with open('data/word_frequencies.pickled', 'rb') as f:\n"," loaded_freq = pickle.load(f)\n"," \n","print(type(loaded_freq))\n","loaded_freq.most_common(10)"]},{"cell_type":"markdown","metadata":{},"source":["As a general rule, using `pickle` is preferable if you do not care about interoperability and a future-proof archival of data, while using `JSON` is a good idea for data that you intend to use in environments different from `python`, or across different `python` versions."]},{"cell_type":"markdown","metadata":{},"source":["## Basic data management with sqlite\n","\n","We now introduce some basics on advanced data management in `python` using `SQL`-based relational database management systems. Since you had an introductory course on databases in your BSc, I will not discuss general aspects of relational databases, but I will rather show how we can conveniently use them in `python`. Here we focus on the simplest possible setup, using the in-process, file-based database `sqlite`. This means you will neither have to install a database system and start a server, nor do you need to set up users or privileges. We can simply create databases with multiple tables that are stored in a single stand-alone file, and that can easily be exchanged between systems. For simple data analytics applications that often do not require concurrent write-access, authentication, or transaction support, `sqlite` is a very efficient open-source solution for data management that is available on any machine that can run `python`. As a side note, `sqlite` files are also used as the base data format in a number of commercial applications, such as for the catalogue files of `Adobe Lightroom`, the local database of `Evernote`, or as data store within `Google Chrome` and `Mozilla Thunderbird`.\n","\n","Creating a new or connecting to an existing database is extremely easy. All we have to do is (1) import the standard `python` module `sqlite3`, (2) connect to a local (possibly existing) database file, and (3) obtain a `cursor` object that we can use to create tables, manipulate or query data."]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[],"source":["import sqlite3\n","\n","con = sqlite3.connect('data/example.db')\n","c = con.cursor()"]},{"cell_type":"markdown","metadata":{},"source":["Let us first create a table that can store our word frequency data. We then create a list of tuples consisting of the words and their frequencies and add all of them at once by calling the `executemany` function. 
As you see in the query below, we use two placeholder characters `?` that will be automatically filled by the corresponding values in the (ordered) tuple. While we could also manually create and execute string SQL queries that contain the data, I strongly advise against this approach as it makes your code vulnerable to SQL injection. Moreover, you will need to be careful to escape any special characters in your data. The placeholder mechanism of `execute` and `executemany` automatically handles such situations, making sure that your code is both robust and secure!\n","\n","Once we have executed the queries, we need to issue a `commit` command to actually write the changes to the underlying database."]},{"cell_type":"code","execution_count":18,"metadata":{},"outputs":[],"source":["c.execute('CREATE TABLE word_freq (word text, count real)')\n","\n","data = [ (word, freq[word]) for word in freq ]\n","c.executemany('INSERT INTO word_freq VALUES (?,?)', data)\n","con.commit()"]},{"cell_type":"markdown","metadata":{},"source":["This structured approach to store and manage data in a relational database allows us to use the full power of `SQL` queries, i.e. rather than efficiently searching the data ourselves we can leave this to the database system. The following `SQL` query returns the ten most frequent words:"]},{"cell_type":"code","execution_count":19,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[('the', 2280.0), ('of', 1411.0), ('is', 1249.0), ('a', 886.0), ('and', 841.0), ('to', 707.0), ('in', 659.0), ('that', 594.0), ('', 576.0), ('be', 559.0)]\n"]}],"source":["c.execute('SELECT word, count FROM word_freq ORDER BY count DESC LIMIT 10')\n","print(c.fetchall())"]},{"cell_type":"markdown","metadata":{},"source":["The following query returns the count of the word `end`, again using the placeholder mechanism of the `execute` function."]},{"cell_type":"code","execution_count":20,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["('end', 11.0)\n"]}],"source":["c.execute('SELECT * FROM word_freq WHERE word=?', ('end',))\n","print(c.fetchone())\n","con.close()"]}],"metadata":{"language_info":{"name":"python","codemirror_mode":{"name":"ipython","version":3},"version":"3.7.4-final"},"orig_nbformat":2,"file_extension":".py","mimetype":"text/x-python","name":"python","npconvert_exporter":"python","pygments_lexer":"ipython3","version":3,"kernelspec":{"name":"python374jvsc74a57bd0179f2c9954461ddf657daf1ee3f9df7374d575c8403facec5648a064395b52ac","display_name":"Python 3.7.4 64-bit ('anaconda3': conda)"}},"nbformat":4,"nbformat_minor":2}
\ No newline at end of file
{"cells":[{"cell_type":"markdown","metadata":{},"source":["# P01 - 07: Introducing `numpy`\n","\n","**April 15, 2021** \n","*Ingo Scholtes* \n","\n","This notebook introduces the numerical package `numpy`, which allows us to peform mathematical operations on vectors and matrices.\n","\n","## `numpy` arrays\n","\n","One of the reasons why `python` has become one of the most popular language for data science and scientific computing is the package `numpy`, which provides classes and functions to perform advanced mathematical and statistical analyses. One of the key features is its support for vector and matrix algebra that is built on the concept of `numpy` arrays. \n","\n","Let us first understand how the standard `list` class in python differs from a `numpy` array. Consider the following example of two lists containing numerical values. If we perform an addition operation on those two lists we get:"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[{"output_type":"execute_result","data":{"text/plain":["[0, 1, 0, 1, 0, 1, 2, 1]"]},"metadata":{},"execution_count":1}],"source":["python_arr_1 = [0,1,0,1]\n","python_arr_2 = [0,1,2,1]\n","\n","python_arr_1 + python_arr_2"]},{"cell_type":"markdown","metadata":{},"source":["For `python` lists, the mathematical operators are overloaded in such a way, that they help us to create or merge lists, but there is no implementation of mathematical operations that would allow us to use them to perform vetor, or matrix-based operations. For instance, if we multiply a list with a scalar value, we get:\n"]},{"cell_type":"code","execution_count":2,"metadata":{},"outputs":[{"output_type":"execute_result","data":{"text/plain":["[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]"]},"metadata":{},"execution_count":2}],"source":["[0, 1, 2]*5"]},{"cell_type":"markdown","metadata":{},"source":["A `numpy` array is fundamentally different from a `python` list in the sense that it represents a mathematical vector on which we can perform arithmetic operations. \n","\n","To use numpy arrays, we first have to import the package. We can then directly initialise them from a python list as follows:"]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([0, 2, 2, 2])"]},"metadata":{},"execution_count":3}],"source":["import numpy as np\n","\n","np_arr_1 = np.array(python_arr_1)\n","np_arr_2 = np.array(python_arr_2)\n","\n","np_arr_1 + np_arr_2"]},{"cell_type":"markdown","metadata":{},"source":["If we pass a simple list, we get a one-dimensional vector whose number of dimensions matches the number of elements in the list. To change the structure or dimensionality of an numpy array we can use the function `reshape`. We can pass a list of integers, which specify the length of the array in multiple dimensions. A value of -1 means that the length of that dimension will be automatically determined based on the entries in the array that are remaining. In fact, it can become pretty complicated to understand the consequence of a `reshape` call, so let us make some examples:\n","\n","The array above is a one-dimensional array with 4 elements. 
We can turn this into a 2x2 array as follows:"]},{"cell_type":"code","execution_count":4,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[0 1]\n [0 1]]\n"]}],"source":["x = np_arr_1.reshape(2, 2)\n","print(x)"]},{"cell_type":"markdown","metadata":{},"source":["We can also change the one-dimensional array to a two-dimensional array, where the first dimension contains 4 elements, each being an array with a single element. Rather than specifying a value of four for the first dimension, we can pass a value of -1, which automatically uses the number of all remaining values (in this case four) as an argument. The following call effectively transposes a row vector to a matrix that consists of a single column vector."]},{"cell_type":"code","execution_count":5,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[0]\n [1]\n [0]\n [1]]\n"]}],"source":["x = np_arr_1.reshape(-1, 1)\n","print(x)"]},{"cell_type":"markdown","metadata":{},"source":["If we want to reshape this column vector into a one-dimensional row vector, we can call:"]},{"cell_type":"code","execution_count":6,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[0 1 0 1]\n"]}],"source":["x = x.reshape(-1)\n","print(x)"]},{"cell_type":"markdown","metadata":{},"source":["Instead of passing a one-dimensional list and then reshaping the resulting array, we can also initialise numpy arrays from nested lists. The dimensions of the resulting array will be automatically inferred from the lengths of the lists."]},{"cell_type":"code","execution_count":7,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[1 0 1]\n [0 1 0]\n [0 0 1]]\n"]}],"source":["matrix = np.array([[1,0,1], [0,1,0], [0,0,1]])\n","print(matrix)"]},{"cell_type":"markdown","metadata":{},"source":["If we are unsure about the shape of an array, we can look at the `shape` property, which returns a tuple that contains the number of elements in all dimensions:"]},{"cell_type":"code","execution_count":8,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["(3, 3)\n"]}],"source":["print(matrix.shape)"]},{"cell_type":"markdown","metadata":{},"source":["Naturally, this works with arbitrary levels of nesting, i.e. the following initialisation generates a 3x3x3 tensor, which we can visualise as a cube consisting of three layers of two-dimensional matrices."]},{"cell_type":"code","execution_count":9,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[[0 1 0]\n [1 0 1]\n [1 1 1]]\n\n [[0 0 0]\n [1 0 1]\n [1 1 0]]\n\n [[1 0 0]\n [0 0 0]\n [1 1 1]]]\n(3, 3, 3)\n"]}],"source":["tensor = np.array([[[0,1,0],[1,0,1],[1,1,1]], [[0,0,0],[1,0,1],[1,1,0]], [[1,0,0],[0,0,0],[1,1,1]]])\n","print(tensor)\n","print(tensor.shape)"]},{"cell_type":"markdown","metadata":{},"source":["## Indexing and slicing `numpy` arrays\n","\n","An important `numpy` feature that you need to understand if you want to work with arrays is so-called **array slicing**, a special type of indexing that we can use to access certain subsets of values in different dimensions. First, we can directly use standard python indexing to access elements in a numpy array. 
For instance, we can access the top-left element (zero) in the first matrix in the tensor above as follows:"]},{"cell_type":"code","execution_count":10,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["0\n"]}],"source":["print(tensor[0][0][0])"]},{"cell_type":"markdown","metadata":{},"source":["Alternatively, we can also use a single bracket, separating indices in multiple dimensions through commas, i.e. if we want to access the bottom-right element in the third matrix, we could write:"]},{"cell_type":"code","execution_count":11,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["1\n"]}],"source":["print(tensor[2,2,2])"]},{"cell_type":"markdown","metadata":{},"source":["We can use slicing to specify a sequence of indices that we seek to return using the notation `start:stop:step`, where `start` is the inclusive start index, `stop` is the exclusive stop index, and `step` is the increment. For example, we can use the following to retrieve five elements in the middle of a one-dimensional array with seven elements:"]},{"cell_type":"code","execution_count":12,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[2 3 4 5 6]\n"]}],"source":["x = np.array([1,2,3,4,5,6,7])\n","print(x[1:6:1])"]},{"cell_type":"markdown","metadata":{},"source":["We can select every second element by changing the step:"]},{"cell_type":"code","execution_count":13,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[2 4 6]\n"]}],"source":["print(x[1:6:2])"]},{"cell_type":"markdown","metadata":{},"source":["If we omit the `step` parameter, a default step value of 1 is assumed, i.e. we can write:"]},{"cell_type":"code","execution_count":14,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[2 3 4 5 6]\n"]}],"source":["print(x[1:6])"]},{"cell_type":"markdown","metadata":{},"source":["If we additionally omit the `start` or `stop` index, they default to zero or the last index in the array, respectively, i.e. we have:"]},{"cell_type":"code","execution_count":15,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[1 2 3 4 5 6]\n"]}],"source":["print(x[:6])"]},{"cell_type":"code","execution_count":16,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[2 3 4 5 6 7]\n"]}],"source":["print(x[1:])"]},{"cell_type":"markdown","metadata":{},"source":["This implies that, if we omit the `start`, `stop` and `step` parameters altogether and simply write a colon `:`, we retrieve all elements of the array:"]},{"cell_type":"code","execution_count":17,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[1 2 3 4 5 6 7]\n"]}],"source":["print(x[:])"]},{"cell_type":"markdown","metadata":{},"source":["In fact, slicing is a standard feature that is provided by `python`, so we can do the same in a `python` list:"]},{"cell_type":"code","execution_count":18,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[1, 2, 3, 4, 5, 6, 7]\n"]}],"source":["y = [1,2,3,4,5,6,7]\n","print(y[:])"]},{"cell_type":"markdown","metadata":{},"source":["However, `numpy` takes this powerful concept one step further by generalising it to arrays with arbitrary dimensions, where we can simply separate the slicing expression for individual dimensions by commas.\n","\n","Let's play with this in the matrix example from above. 
First, we can get the full matrix by using an empty slicing operator on both dimensions:"]},{"cell_type":"code","execution_count":19,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[1 0 1]\n [0 1 0]\n [0 0 1]]\n"]}],"source":["print(matrix[:,:])"]},{"cell_type":"markdown","metadata":{},"source":["What if we want to extract the first row of the matrix, i.e. [1 0 1]? For the first dimension, we can pass the index 0, which selects the first row. For the second dimension, we specify an empty slice, which returns all values in the rows selected in the first dimension:"]},{"cell_type":"code","execution_count":20,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[1 0 1]\n"]}],"source":["print(matrix[0,:])"]},{"cell_type":"markdown","metadata":{},"source":["Clearly, this is just a complicated way to write:"]},{"cell_type":"code","execution_count":21,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[1 0 1]\n"]}],"source":["print(matrix[0])"]},{"cell_type":"markdown","metadata":{},"source":["However, using the slicing notation we can also efficiently extract elements that would otherwise require us to write a loop. For instance, we can get the first column vector in the matrix above by (i) specifying an empty slice in the first dimension (this retrieves all rows) and (ii) specifying index zero for the second dimension, which selects the first element in each of the rows selected by the first dimension:"]},{"cell_type":"code","execution_count":22,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[1 0 0]\n"]}],"source":["print(matrix[:,0])"]},{"cell_type":"markdown","metadata":{},"source":["Finally, we can also do more complicated things, like extracting the top left and bottom left 2x2 block matrices from the matrix above:"]},{"cell_type":"code","execution_count":23,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[1 0]\n [0 1]]\n[[0 1]\n [0 0]]\n"]}],"source":["print(matrix[:2,:2])\n","print(matrix[1:,:2])"]},{"cell_type":"markdown","metadata":{},"source":["Working with multi-dimensional arrays, slicing can turn into a brain twister and thus requires a bit of practice. But once you have mastered slicing in `numpy`, you will not want to miss it in the analysis of multi-variate data.\n","\n","## Vector and matrix algebra\n","\n","Apart from storing multi-dimensional data, we can perform advanced algebraic operations like powers, matrix multiplications, or eigenvector calculations. In numpy, these operations are implemented in the module `linalg`. To compute the k-th power of the matrix above (i.e. we multiply the matrix k times with itself), we can use the `matrix_power` function:"]},{"cell_type":"code","execution_count":24,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[1 0 2]\n [0 1 0]\n [0 0 1]]\n"]}],"source":["print(np.linalg.matrix_power(matrix, 2))"]},{"cell_type":"markdown","metadata":{},"source":["The (dot) product of two matrices (as well as a matrix with a vector) is implemented in the `dot` function, i.e. the following yields the same result as the `matrix_power` function:"]},{"cell_type":"code","execution_count":25,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[1 0 2]\n [0 1 0]\n [0 0 1]]\n"]}],"source":["print(matrix.dot(matrix))"]},{"cell_type":"markdown","metadata":{},"source":["If you recall the definition of matrix multiplication (i.e. 
the dot product), you will remember that the product A*B is only defined if the number of **columns** in A equals the number of **rows** in B. The resulting product will then have the same number of rows as A and the same number of columns as B. Let us try this in `numpy`:"]},{"cell_type":"code","execution_count":26,"metadata":{},"outputs":[],"source":["v = np.array([2,3,1])\n","M = np.array([[1,2,3], [2,4,2], [1,1,0]])"]},{"cell_type":"markdown","metadata":{},"source":["Let us consider `v` as a row vector (i.e. we have three columns and one row). `M` is a 3x3 matrix, i.e. the number of columns in `v` matches the number of rows in M and the result is again a row vector with three columns."]},{"cell_type":"code","execution_count":27,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[ 9 17 12]\n"]}],"source":["print(v.dot(M))"]},{"cell_type":"markdown","metadata":{},"source":["If we change the order of the multiplication, the result should not be defined. Let us try this:"]},{"cell_type":"code","execution_count":28,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[11 18 5]\n"]}],"source":["print(M.dot(v))"]},{"cell_type":"markdown","metadata":{},"source":["That may come as a surprise! The reason for this is that `numpy` actually does not distinguish between a row and a column vector. That is, in this case it has interpreted `v` as a column vector, and the result is a (different) vector with three rows and a single column. If we explicitly reshape the vector as described above, we get the same result, but the result vector will now be in the same shape as `v`:"]},{"cell_type":"code","execution_count":29,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[11]\n [18]\n [ 5]]\n"]}],"source":["print(M.dot(v.reshape(-1, 1)))"]},{"cell_type":"markdown","metadata":{},"source":["A number of important methods in data analytics are based on the calculation of eigenvectors and eigenvalues, i.e. for a given matrix A we are interested in solutions to the eigenvalue equation $\\mathbf{A} v = \\lambda v$, where $v$ is called an eigenvector and $\\lambda$ is called an eigenvalue of $\\mathbf{A}$. We will explain eigenvectors in a later lecture, but in a nutshell we are interested in vectors $v_i$ that are simply scaled by a factor of $\\lambda_i$ if we multiply them with $M$.\n","\n","A prominent example of an eigenvector based method is the PageRank value of a web page, which is one factor used by Google to rank search results. This is actually an entry of the leading eigenvector of a matrix that encodes the hyperlink structure of web pages, i.e. entries in the matrix capture which web pages refer to each other.\n","\n","To calculate eigenvalues and eigenvectors, we can use the function `eig` in the module `linalg`:"]},{"cell_type":"code","execution_count":30,"metadata":{},"outputs":[{"output_type":"execute_result","data":{"text/plain":["(array([ 5.74872177, -1.2886655 , 0.53994373]),\n"," array([[-0.49838336, -0.82177082, 0.75029071],\n"," [-0.83533743, 0.09852589, -0.59738112],\n"," [-0.23200302, 0.56123558, 0.28319543]]))"]},"metadata":{},"execution_count":30}],"source":["np.linalg.eig(M)"]},{"cell_type":"markdown","metadata":{},"source":["You see that in the example above the output of the function consists of two arrays that contain all of the eigenvectors and the eigenvalues of the matrix. It is important to highlight that those are the **right** eigenvalues or eigenvectors, i.e. 
{"cell_type":"markdown","metadata":{},"source":["To calculate eigenvalues and eigenvectors exactly, we can use the function `eig` in the module `linalg`:"]},{"cell_type":"code","execution_count":30,"metadata":{},"outputs":[{"output_type":"execute_result","data":{"text/plain":["(array([ 5.74872177, -1.2886655 , 0.53994373]),\n"," array([[-0.49838336, -0.82177082, 0.75029071],\n"," [-0.83533743, 0.09852589, -0.59738112],\n"," [-0.23200302, 0.56123558, 0.28319543]]))"]},"metadata":{},"execution_count":30}],"source":["np.linalg.eig(M)"]},{"cell_type":"markdown","metadata":{},"source":["You see that the output of the function is a tuple of two arrays: the first contains the eigenvalues of the matrix, the second its eigenvectors. It is important to highlight that those are the **right** eigenvalues and eigenvectors, i.e. these are the solutions of the equation $\\mathbf{A} v = \\lambda v$ in which the vector $v$ is multiplied from the right.\n","\n","An easier way to access the eigenvalues and eigenvectors is to unpack the tuple-valued return value into two `numpy` arrays as follows:"]},{"cell_type":"code","execution_count":31,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[ 5.74872177 -1.2886655 0.53994373]\n[[-0.49838336 -0.82177082 0.75029071]\n [-0.83533743 0.09852589 -0.59738112]\n [-0.23200302 0.56123558 0.28319543]]\n"]}],"source":["w, v = np.linalg.eig(M)\n","print(w)\n","print(v)"]},{"cell_type":"markdown","metadata":{},"source":["The `numpy` array `w` contains the three eigenvalues of matrix `M`, and the two-dimensional `numpy` array `v` contains the three eigenvectors. It is important to note that the eigenvectors are stored as the **columns** of `v`: the first components of the three inner arrays together form the first eigenvector, the second components form the second eigenvector, and so on. That is, we can use the following `numpy` slicing to access the i-th eigenvector:"]},{"cell_type":"code","execution_count":32,"metadata":{},"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([-0.49838336, -0.83533743, -0.23200302])"]},"metadata":{},"execution_count":32}],"source":["i = 0\n","v[:,i]"]},{"cell_type":"markdown","metadata":{},"source":["We can easily confirm that `v[:,i]` and `w[i]` are indeed the i-th eigenvector and eigenvalue of `M`. By the eigenvalue equation above, both sides must be equal, and we find:"]},{"cell_type":"code","execution_count":33,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[-2.86506728 -4.80212248 -1.33372079]\n[-2.86506728 -4.80212248 -1.33372079]\n---\n[ 1.0589877 -0.12696692 -0.72324493]\n[ 1.0589877 -0.12696692 -0.72324493]\n---\n[ 0.40511476 -0.32255219 0.15290959]\n[ 0.40511476 -0.32255219 0.15290959]\n---\n"]}],"source":["for i in range(3):\n","    print(np.dot(M, v[:,i]))\n","    print(np.dot(w[i], v[:,i]))\n","    print('---')"]},{"cell_type":"markdown","metadata":{},"source":["Looking at the output above, it is easy to confirm visually that the multiplication with `M` has simply scaled each eigenvector by a factor that corresponds to its associated eigenvalue."]}],"metadata":{"language_info":{"name":"python","codemirror_mode":{"name":"ipython","version":3},"version":"3.7.3-final"},"orig_nbformat":2,"file_extension":".py","mimetype":"text/x-python","name":"python","npconvert_exporter":"python","pygments_lexer":"ipython3","version":3,"kernelspec":{"name":"python373jvsc74a57bd082db51cffef479cc4d0f53089378e5a2925f9e7adca31d741132ceba61ecca6f","display_name":"Python 3.7.3 64-bit (conda)"}},"nbformat":4,"nbformat_minor":2}
\ No newline at end of file
{"cells":[{"cell_type":"markdown","metadata":{},"source":["# P01 - 08: Introducing `scipy`\n","\n","**April 15, 2021** \n","*Ingo Scholtes* \n","\n","This notebook introduces basic statistical concepts in the package `scipy`, which we will use for statistical data analysis and hypothesis in the following practice lectures.\n","\n","## Sparse matrix algebra\n","\n","Naturally extending the introduction of vector and matrix operations in `numpy` in the last unit, we first highlight `scipy`'s support for efficient operations on sparse matrices and vector, i.e. large matrices where most entries are actually zeros. In data science, we often deal with such matrices, be it in the context of network analysis and graph mining (where each non-existing link leads to a zero entry in the adjacency matrix) or when we we deal with matrices capturing similarity scores (where zero entries indicate objects that are either not similar or for which the similarty is not known).\n","\n","Clearly, it is not efficient to perform mathematical operations on matrices where the majority of entries is zero anyway. The module `scipy.sparse` thus offers classes to represent such matrices. The class `csr_matrix` (CSR = Compressed Sparse Row) for instance can be initialised from a list of `python` lists or, conveniently, from a `numpy` array. As we see below, different from a `numpy` array, the class `csr_matrix` only stores the non-zero entries of the matrix:"]},{"cell_type":"code","execution_count":1,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[1. 0. 1.]\n [0. 1. 0.]\n [0. 0. 1.]]\n (0, 0)\t1.0\n (0, 2)\t1.0\n (1, 1)\t1.0\n (2, 2)\t1.0\n"]}],"source":["import numpy as np\n","import scipy.sparse\n","\n","matrix = np.array([[1., 0., 1.], [0., 1., 0.], [0., 0., 1.]])\n","print(matrix)\n","\n","sparse_matrix = scipy.sparse.csr_matrix(matrix)\n","sparse_matrix\n","print(sparse_matrix)"]},{"cell_type":"markdown","metadata":{},"source":["If we wish to recover a representation with zeros, we can call the `to_dense` method on an instance of `csr_matrix`. This function returns an instance of the class `numpy.matrix`:"]},{"cell_type":"code","execution_count":2,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[1. 0. 1.]\n [0. 1. 0.]\n [0. 0. 1.]]\n<class 'numpy.matrix'>\n"]}],"source":["print(sparse_matrix.todense())\n","print(type(sparse_matrix.todense()))"]},{"cell_type":"markdown","metadata":{},"source":["We can directly perform vector and matrix operations on sparse matrices, i.e. we can do the following:"]},{"cell_type":"code","execution_count":3,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[[2. 3. 
3.]]\n"]}],"source":["v = scipy.sparse.csr_matrix([2,3,1])\n","w = v.dot(sparse_matrix)\n","print(w.todense())"]},{"cell_type":"markdown","metadata":{},"source":["To compute eigenvalues and eigenvectors, we can use the `eigs` function:"]},{"cell_type":"code","execution_count":4,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[1.00000001+0.j]\n[[-1.00000000e+00+0.j]\n [ 9.87713611e-12+0.j]\n [-1.05367098e-08+0.j]]\n"]}],"source":["import scipy.sparse.linalg\n","\n","w, v = scipy.sparse.linalg.eigs(sparse_matrix, k=1)\n","print(w)\n","print(v)"]},{"cell_type":"code","execution_count":5,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[-1.00000001e+00+0.j 9.87713611e-12+0.j -1.05367098e-08+0.j]\n[-1.00000001e+00+0.j 9.87713621e-12+0.j -1.05367099e-08+0.j]\n"]}],"source":["print(sparse_matrix.dot(v[:,0]))\n","print(w[0]* v[:,0])"]},{"cell_type":"markdown","metadata":{},"source":["## Statistical computing with `scipy.stats`\n","\n","In our course, we will extensively use `scipy`'s implementation of statistical functions, random variables, and probability distributions to generate random multi-variate data, perform linear regression, or test hypotheses. These functions are implemented in the statistics module `scipy.stats`, which contains an [extensive list of classes](https://docs.scipy.org/doc/scipy/reference/stats.html) that implements random variables distributed according various well-known probability distributions. We can, for instance, create a continuous random variable distributed according to a standard normal distribution (with mean 0 and standard deviation 1) as follows:\n","\n","\n","\n"]},{"cell_type":"code","execution_count":6,"metadata":{},"outputs":[],"source":["import scipy.stats\n","norm = scipy.stats.norm()"]},{"cell_type":"markdown","metadata":{},"source":["Depending on the domain of the distribution, `scipy` returns an instance of class [`rv_continuous`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous) or [`rv_discrete`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_discrete.html#scipy.stats.rv_discrete). Both classes provide functions that -- among other things -- allow us to (i) generate random realisations of the random variable, (ii) calculate moments of the distribution (i.e. 
{"cell_type":"markdown","metadata":{},"source":["## Statistical computing with `scipy.stats`\n","\n","In our course, we will extensively use `scipy`'s implementation of statistical functions, random variables, and probability distributions to generate random multi-variate data, perform linear regression, or test hypotheses. These functions are implemented in the statistics module `scipy.stats`, which contains an [extensive list of classes](https://docs.scipy.org/doc/scipy/reference/stats.html) that implement random variables distributed according to various well-known probability distributions. We can, for instance, create a continuous random variable distributed according to a standard normal distribution (with mean 0 and standard deviation 1) as follows:"]},{"cell_type":"code","execution_count":6,"metadata":{},"outputs":[],"source":["import scipy.stats\n","norm = scipy.stats.norm()"]},{"cell_type":"markdown","metadata":{},"source":["Depending on the domain of the distribution, `scipy` returns an instance of class [`rv_continuous`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous) or [`rv_discrete`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_discrete.html#scipy.stats.rv_discrete). Both classes provide functions that -- among other things -- allow us to (i) generate random realisations of the random variable, (ii) calculate moments of the distribution (i.e. mean, variance, skewness, etc.), and (iii) evaluate the underlying probability density (or mass) function and the cumulative distribution function.\n","\n","To generate a `numpy` array of 100 random realisations of the normally distributed random variable, we can do the following:"]},{"cell_type":"code","execution_count":7,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[-0.66975642 2.3454461 -0.35237725 -0.1670733 -0.87966226 -0.38989923\n -0.59144888 0.00863451 -1.29340677 0.86032947 -0.45119352 -1.3246982\n -0.13923478 0.97937667 1.74619828 1.64851046 -2.2726483 -0.5757284\n 0.07019521 -0.10978854 1.34373141 -1.00358464 0.10846409 1.11189358\n 1.2050748 -0.13237762 -0.24718771 0.05720575 -0.9684534 -0.98316235\n 1.27002382 0.81642015 0.26818308 0.55335742 0.80155249 0.51939531\n 0.90715355 -0.54481638 0.8632171 -0.8252128 -0.09435516 -0.23077331\n -0.85391092 -0.84396604 -0.74426311 0.14946551 -0.00328898 0.95716492\n 0.49708613 0.14056808 0.26113865 0.44852727 -1.06257444 -1.42335139\n -1.30966333 -1.00645247 -1.1652416 0.25869848 -0.76929135 0.13122104\n -1.34567788 -0.64779052 2.20057194 1.17834609 -0.33878343 -0.0054512\n -0.74165003 0.50979959 -0.05086295 1.35997614 0.41600819 -0.00696718\n -0.26240877 -0.05735918 -0.09091828 -1.24183265 0.47160152 -0.45736\n 0.93465271 -0.97745945 -0.59695997 -0.67940726 1.2923594 -0.99536717\n 0.51276665 -0.32565864 0.94359985 -0.45879989 -1.61030584 -0.14212145\n 0.63491997 0.36689207 0.60641272 0.23619071 0.78970451 1.24504474\n 0.56845321 0.07162936 0.7702343 -0.05911627]\n"]}],"source":["x = scipy.stats.norm.rvs(size=100)\n","print(x)"]},{"cell_type":"markdown","metadata":{},"source":["The above call generates 100 realisations of a normally distributed random variable with mean $\\mu=0$ and standard deviation $\\sigma=1$. A simple way to create realisations drawn from a normal distribution with arbitrary $\\mu$ and $\\sigma$ is to perform an element-wise multiplication and addition on the resulting `numpy` array. To obtain realisations from a normal distribution with mean $\\mu=5$ and standard deviation $\\sigma=2$, we can thus write:"]},{"cell_type":"code","execution_count":8,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["Sample mean = 4.9876854554473296\nSample standard deviation = 1.9720629148459246\n"]}],"source":["x = 5 + 2 * scipy.stats.norm.rvs(size=5000)\n","print('Sample mean = {0}'.format(np.mean(x)))\n","print('Sample standard deviation = {0}'.format(np.std(x)))"]},{"cell_type":"markdown","metadata":{},"source":["Let us try this with a discrete random variable that is distributed according to the Poisson distribution. We can directly pass the single parameter $\\mu$ of the distribution to the `rvs` function:"]},{"cell_type":"code","execution_count":9,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[4 6 8 4 7 6 7 1 8 6 2 4 2 1 3 6 1 5 4 3 4 3 4 9 4 8 8 3 4 6 4 3 4 3 2 6 5\n 5 4 6 4 5 5 2 1 4 3 7 3 7 4 7 6 5 4 9 4 2 4 5 5 7 2 4 7 3 3 4 7 8 7 9 2 7\n 6 6 6 5 2 3 4 1 0 4 3 1 6 7 7 9 9 4 9 2 8 4 2 6 4 1]\n"]}],"source":["x = scipy.stats.poisson.rvs(mu=5, size=100)\n","print(x)"]},
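{"cell_type":"markdown","metadata":{},"source":["A useful property of the Poisson distribution is that both its mean and its variance equal the parameter $\\mu$. As a quick sanity check (an illustrative addition, shown without recorded output), we can draw a larger sample and compare the sample mean and sample variance, which should both be close to 5:"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["x = scipy.stats.poisson.rvs(mu=5, size=10000)\n","# for a Poisson distribution, mean and variance should both be close to mu=5\n","print(np.mean(x))\n","print(np.var(x))"]},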
{"cell_type":"markdown","metadata":{},"source":["The following generates 100 realisations of a random variable distributed according to a Binomial distribution with parameters $n$ and $p$. For $n=1$ we obtain a sequence of Bernoulli trials whose outcomes can be 0 or 1:"]},{"cell_type":"code","execution_count":10,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["[1 0 1 1 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 0\n 0 0 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 0 0\n 0 1 0 1 1 1 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1 1 1 0 1 1]\n"]}],"source":["x = scipy.stats.binom.rvs(p=0.5, n=1, size=100)\n","print(x)"]},{"cell_type":"markdown","metadata":{},"source":["For continuous random variables, we can obtain the value of the probability density function or the cumulative distribution function at any point $x$ as follows:"]},{"cell_type":"code","execution_count":11,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["0.3969525474770118\n0.539827837277029\n"]}],"source":["x = scipy.stats.norm.pdf(x=0.1)\n","print(x)\n","\n","x = scipy.stats.norm.cdf(x=0.1)\n","print(x)"]},{"cell_type":"markdown","metadata":{},"source":["For a discrete random variable, we can directly calculate the probability of any specific value based on the probability mass function:"]},{"cell_type":"code","execution_count":12,"metadata":{},"outputs":[{"output_type":"stream","name":"stdout","text":["0.00976562500000001\n"]}],"source":["x = scipy.stats.binom.pmf(p=0.5, n=10, k=9)\n","print(x)"]}],"metadata":{"language_info":{"name":"python","codemirror_mode":{"name":"ipython","version":3},"version":"3.7.3-final"},"orig_nbformat":2,"file_extension":".py","mimetype":"text/x-python","name":"python","npconvert_exporter":"python","pygments_lexer":"ipython3","version":3,"kernelspec":{"name":"python373jvsc74a57bd082db51cffef479cc4d0f53089378e5a2925f9e7adca31d741132ceba61ecca6f","display_name":"Python 3.7.3 64-bit (conda)"}},"nbformat":4,"nbformat_minor":2}
\ No newline at end of file
v,w,nodes
a,c,True
a,b,True
v,w,type,weight
a,c,friendship,
a,b,,2.0