This material requires at least a basic understanding of programming concepts. Whilst Python is not strictly required, it is recommended. If you are a biologist with little to no programming experience (e.g. less than a year), then you are strongly recommended to get some programming experience first.
There are two potential target audiences for this course:
This is a data science course based on bioinformatics. It has the motivational advantage of being completely hands-on and based on a real, concrete problem. The downside is that it will only cover material that is relevant to biological analysis; fortunately, that still allows us to cover a lot of terrain. But do not expect to find, for example, natural language processing techniques here.
This tutorial is composed of
The fundamental content is this text and the Notebooks. In the text you will find the explanations and in the notebooks the running code.
The videos give you an introduction to the material, and the presentations are mostly used by myself when teaching this content. You might want to have a look at them, especially at the very beginning.
Online systems for notebooks (mybinder or SageMathCloud) are probably not a good idea because our notebooks need a lot of disk space and compute power.
Now let's look at the more realistic options...
Allocate 100GB of disk space for the data that you will need to download.
Have a good Internet connection. You will download ~50GB of data.
We will be using Python 3. Legacy versions are not supported. Part of the code will require at least 3.5.
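The requirements above (at least Python 3.5 and roughly 100GB of free disk space) can be checked before downloading anything. The following is only a minimal sketch: the names `python_ok` and `free_gb` are made up for this illustration, and it assumes that the data will be stored on the disk that holds the current directory.

```python
import shutil
import sys

MIN_PYTHON = (3, 5)   # from the text: part of the code requires at least 3.5
MIN_FREE_GB = 100     # from the text: allocate 100GB for the data


def python_ok(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def free_gb(path="."):
    """Return the free space, in GB (10**9 bytes), on the disk holding `path`."""
    return shutil.disk_usage(path).free / 10**9


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Enough free disk space:", free_gb() >= MIN_FREE_GB)
```

If the data will live on another disk, pass that location to `free_gb` instead of the default `"."`.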
All the text assumes that you used one of the options below to access the material. While you can just read as specified above, this material is intended for you to hack and tweak.
Finally, this is not introductory material.
The easiest way to install the notebooks is via Docker.
docker pull tiagoantao/data-science-teaching
docker run -p 8888:8888 tiagoantao/data-science-teaching
And then point your browser to
Install Kitematic from the Docker toolbox, find tiagoantao/data-science-teaching and run it. Point your browser to the exposed HTTP port.
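If the browser shows nothing, it can help to confirm that something is actually listening on the published port. The sketch below makes several assumptions: the container is reachable on `localhost`, the port is 8888 as in the `docker run -p 8888:8888` command (Kitematic may expose a different one), and the helper name `port_is_open` is invented for this example. It only tests that a TCP connection succeeds, not that the notebook server itself is healthy.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises an OSError subclass on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open()` after starting the container should return `True` once the server is up; with Kitematic, pass the port that it reports.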
The “manual” installation procedure is to get the notebooks from GitHub onto a local installation. The use of Anaconda Python is strongly recommended. Not only does it include all the Python packages, but also all the R content that we will be using here. You can get an idea of the necessary packages by looking at our Dockerfile (check the conda install lines).
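Reading the conda install lines by eye works fine, but the same idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function name `conda_packages` is made up, the `SAMPLE` text is an invented Dockerfile (not the course's real one), and the parsing is deliberately naive (whitespace-separated tokens, with only a few common conda options recognised).

```python
OPTIONS_WITH_VALUE = {"-c", "--channel", "-n", "--name"}  # take one argument
SEPARATORS = {"&&", "||", ";"}                            # end of a shell command


def conda_packages(dockerfile_text):
    """Collect package names that follow `conda install` in a Dockerfile."""
    packages = []
    # Join backslash-continued lines so each command sits on a single line.
    joined = dockerfile_text.replace("\\\n", " ")
    for line in joined.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # blank line or comment
        for i in range(len(tokens) - 1):
            if tokens[i:i + 2] != ["conda", "install"]:
                continue
            skip_next = False
            for tok in tokens[i + 2:]:
                if skip_next:
                    skip_next = False      # value of an option such as -c
                elif tok in SEPARATORS:
                    break                  # the conda command ends here
                elif tok in OPTIONS_WITH_VALUE:
                    skip_next = True
                elif not tok.startswith("-"):
                    packages.append(tok)   # anything else is a package
    return packages


# Invented example input, NOT the real Dockerfile of this course.
SAMPLE = """\
FROM continuumio/anaconda3
# hypothetical example
RUN conda install -y -c bioconda pysam && conda clean -a
RUN conda install --yes numpy \\
    pandas
"""

if __name__ == "__main__":
    print(conda_packages(SAMPLE))
```

To try it on the real file, read the Dockerfile with `open(...).read()` and pass the text to `conda_packages`; expect to adjust the parsing if the file uses a different layout.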
It goes without saying that many choices underlying this course are open for discussion, from the programming language to the selected material and its organization. There are plenty of alternatives in terms of technologies and course structure that are worth considering. But there is one that I feel is worthwhile to talk about.
While there are plenty of amazing Python-based charting libraries (Matplotlib, Bokeh...) that interact well with the browser, they cannot give you the flexibility of in-browser programming for visualization.
I will be providing some links to external reading. If you want to go deeper into some concepts where I do not provide links, then your first port of call should be Wikipedia. Be aware that while the English version of Wikipedia provides high-quality articles, other language versions might be lacking. Read the English version first.