Scientific Programming

Instructors: 

Davide Crepaldi

Amount of frontal teaching: 

42 hours

Description of the course

Programming is quickly becoming a necessary skill for any scientist, especially with regard to data collection and data analysis. This course aims to build a framework for understanding programming and to develop concrete skills in the main areas of data science: collection, organization, visualization, analysis and modelling. During the course we will answer questions such as these:

- What are the strengths and weaknesses of the programming languages most used in science: Python, R and Matlab? Which language should you use for which task?

- How can we take advantage of programming languages to create nicely controlled stimulus delivery and data collection?

- How to store and transform your data efficiently? Is there a better way of organizing data, or does it not matter? (Spoiler alert: it does)

- How to easily visualize data for exploration and for publication? Which plots are better to use for each type of data? How to create descriptive, but easy to understand figures?

- How to do statistical testing, and what are the best practices? When and how to use resampling methods and the bootstrap? How can you use simulations to understand whether a certain statistical test is a good fit for your data?

- What is machine learning, when to use it and how? What are the pitfalls of machine learning and how to avoid them?

Given hard work, after taking the course you will be able to:

- Write cleaner and more understandable code

- Catch errors in your code in a systematic way

- Write well-controlled scripts for experiment delivery and data collection

- Load and manipulate any type of data

- Transform and organize your data in the most convenient way for analysis, visualization and storage

- Plot any kind of data and create clear figures for reports, theses and publications

- Apply statistical tests to your data

- Use resampling methods to create statistical tests

- Run a simulation test on a statistical procedure to better understand if the test is good for your data

- Understand the basics of machine learning and its core principles: supervision, searching parameter space, overfitting, cross-validation, pipelining, etc.

- Use many basic machine learning algorithms for modelling, predicting or transforming your data: linear and logistic regression, support vector machines, k-means clustering, principal component analysis, etc.

Given time, we will also touch upon more advanced topics, such as taking advantage of object-oriented languages, handling errors, multithreading (parallelization), optimization for computational speed vs optimization for memory, working with big data (larger-than-RAM datasets) and version control systems.

The primary language of the course is going to be Python; however, you can also use R or Matlab for class work and assignments. The core message of the course is that language is secondary to the understanding of concepts: one should pick a language to fit a certain task, not the opposite. That said, on the practical side you will have the opportunity to learn more about Python than about the other languages.

The course will be taught by Sergey Antopolskiy and Davide Crepaldi (experiment delivery and data collection part) and will include lectures and hands-on tutorials. Everybody is welcome to attend. For the lectures, and especially for the tutorials, it is best to bring your own laptop, but it will also be possible to use SISSA terminals.

The course will run from the 10th of April to the 9th of May in the afternoon from 15:00 to 17:00 (first week) and in the morning from 10:00 to 12:00 (other weeks). On some days there will be lectures in the morning and exercises in the afternoon from 15:00 to 17:30.

The exact schedule can be found here.

If you're interested in the Scientific Programming course, please send a message to Sergey at s.antopolsky@gmail.com so that he can add you to the mailing list for further communication and gauge the number of potential students. If you have any questions regarding the course and its organization, use the same email address. Learning will depend critically on the amount of work you put into the assignments between classes. Sergey will be available to help you with this during office hours 2-3 times a week for 2 hours (time and day to be determined later).

Please find the detailed syllabus below. Keep in mind that topics are split into lessons thematically, and each lesson might take up several lectures and practical tutorials. The exact pacing of the course will be determined on the go, because it will depend on the composition of the class in terms of students' backgrounds.

 

Syllabus

Lesson 1. Programming basics

In the first lesson we will review the main programming languages, briefly revisit basic programming concepts, and look at different data types and why we need them. We will take a detailed look at NumPy arrays in Python as an example of a data type highly optimized for computation, and learn how to work with them efficiently. We will end by discussing best coding practices and taking a more general look at programming beyond data analysis.
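
As a small illustration of why NumPy arrays are described here as optimized for computation, the sketch below compares an explicit Python loop with the equivalent vectorized array expression. The reaction-time values and variable names are made up for this example.

```python
import numpy as np

# Hypothetical data: reaction times in milliseconds (invented for this sketch)
reaction_times_ms = [512.0, 430.5, 610.2, 389.9, 455.0]

# Plain Python: convert to seconds one element at a time
seconds_loop = []
for rt in reaction_times_ms:
    seconds_loop.append(rt / 1000.0)

# NumPy: a single vectorized operation over the whole array
rt_array = np.array(reaction_times_ms)
seconds_vec = rt_array / 1000.0

# A boolean mask selects elements without an explicit loop
fast_trials = rt_array[rt_array < 450.0]

print(seconds_vec.dtype)  # every element of an array shares one numeric type
print(fast_trials)
print(rt_array.mean(), rt_array.std())
```

Both approaches give the same numbers; the array version is shorter and, for large datasets, typically much faster because the loop runs in compiled code.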

Why we learn to program

Strengths and weaknesses of the main programming languages used in science

Scripting basics

- Variables and assignment

- Loops: for and while

- If statements

- Errors

Data types overview

- Why different data types are needed

- Number types: int, float

- Strings

- Mapping types: dictionaries

- Arrays

- Dataframes

Arrays (NumPy)

Working with your code

Where to go from here

Lesson 2. Data organization

In the second lesson we will look at data organization practices and why it is important to keep your data tidy. We will learn how to read data from several sources and store it in a dataframe (table). Next we will cover data transformation ("munging") and the now-standard Split-Apply-Combine approach to data analysis. We will consider several scenarios in which the data is very complex and learn some rules of thumb for dealing with it. Throughout the lesson we will work through the organization of several different datasets.

Why data organization is important

Why we want to be able to transform data

Reading data from files and other sources

Series and Dataframes

- Index and columns

- Indexing dataframes: loc, iloc (and the older ix, which was removed in pandas 1.0)

Data transformations

- Long data vs wide data

- Melting and pivoting

- Split-apply-combine
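
The transformations listed above can be sketched with pandas as follows. This is a minimal example under assumed data: the table, column names and values are invented for illustration and are not course material.

```python
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per condition
wide = pd.DataFrame({
    "subject": ["s1", "s2", "s3"],
    "congruent": [420.0, 455.0, 390.0],
    "incongruent": [510.0, 530.0, 470.0],
})

# Melting: wide -> long (one row per observation)
long_df = wide.melt(id_vars="subject", var_name="condition", value_name="rt_ms")

# Split-apply-combine: split by condition, apply the mean, combine into a Series
mean_rt = long_df.groupby("condition")["rt_ms"].mean()

# Pivoting: long -> wide again
wide_again = long_df.pivot(index="subject", columns="condition", values="rt_ms")

print(long_df)
print(mean_rt)
print(wide_again)
```

Long data is usually the more convenient shape for grouping and plotting, while wide data is often easier to read by eye.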

Complicated scenarios

- Missing data

- When data is HUGE

- When data is really complicated and diverse

Where to go from here

Lesson 3. Data collection

In this lesson Davide Crepaldi will cover creating and controlling psychophysical stimuli and collecting data in experimental settings.

Lesson 4. Data visualization

The focus of this lesson is data visualization for exploration, communication and publication, and in particular how to visualize different types of data effectively. We will start with a brief overview of the simplest plots, then move on to distribution plots (probably the most important type for publications) to see what our options are and when each type is best used. We will review how to show uncertainty on distribution plots. Next we will look at colormaps and 3D plots and move on to customizing figures and making compound figures. Then we will consider the more advanced and specialized plotting tools currently available. Finally we will talk through saving figures and which formats to use.
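
To give a flavour of what a compound figure with several distribution plots might look like, here is a minimal Matplotlib sketch. The two groups are simulated, and the choice of Matplotlib and of the SVG format are assumptions made for this example rather than a statement of what the lesson will use.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Simulated data for two hypothetical groups
rng = np.random.default_rng(seed=1)
group_a = rng.normal(loc=500.0, scale=50.0, size=100)
group_b = rng.normal(loc=540.0, scale=80.0, size=100)

# Compound figure: three panels showing the same data in different ways
fig, axes = plt.subplots(1, 3, figsize=(9, 3))

axes[0].hist(group_a, bins=20, alpha=0.6, label="A")
axes[0].hist(group_b, bins=20, alpha=0.6, label="B")
axes[0].legend()
axes[0].set_title("Histogram")

axes[1].boxplot([group_a, group_b])
axes[1].set_title("Boxplot")

axes[2].violinplot([group_a, group_b], showmedians=True)
axes[2].set_title("Violin plot")

# Boxplots and violin plots are drawn at x positions 1 and 2 by default
for ax in axes[1:]:
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["A", "B"])

fig.tight_layout()

# Save as SVG (a vector format); an in-memory buffer stands in for a file here
buffer = io.BytesIO()
fig.savefig(buffer, format="svg")
plt.close(fig)
```

Passing a file name such as a hypothetical "figure.svg" to savefig instead of the buffer would write the figure to disk.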

Use cases for data visualizations

Line, scatter

Distribution plots

- Histograms

- Barplots (why you shouldn't be using them)

- Boxplots

- Violin plots

- Swarm plots

- Showing uncertainty

Grid, mesh and colormaps

Customizing plots

Interactive figures

Compound figures

Other plotting tools

Saving figures

Where to go from here

Lesson 5. Data analysis

In this lesson we will cover principles of statistical analysis using frequentist inference and resampling. We will take a very pragmatic approach: instead of discussing theory, we will discuss the implications of the theory for practical use. We will also learn about and demonstrate statistical simulations and how they can help us gain intuition and insight into the inner workings of statistics. The lesson will feature Romain Brasselet and show how to put into practice the concepts he covered in his lectures on statistical inference.

Reminder of statistics basics

- Why we use statistics and when not to use it

- Limitations of statistics

Descriptive statistics

- Central limit theorem

- Mean

- Median

- Variance/STD

- Quartiles/percentiles

Frequentist inference

- Student's t-test

- Welch's t-test

- U-test

- Welch's U-test

- ANOVA

Bootstrap and resampling

- Theory behind bootstrap

- Implementations
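
One possible implementation of the bootstrap is sketched below: a percentile confidence interval for the median. The sample is simulated, and the number of resamples and the 95% level are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical sample: simulated, skewed data
sample = rng.exponential(scale=2.0, size=50)

# Resample with replacement many times and recompute the statistic each time
n_resamples = 5000
boot_medians = np.empty(n_resamples)
for i in range(n_resamples):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_medians[i] = np.median(resample)

# Percentile bootstrap: the middle 95% of the resampled medians
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

print("sample median:", np.median(sample))
print("95% bootstrap CI:", ci_low, ci_high)
```

The same loop works for any statistic: replacing np.median with another function changes what the interval describes, not how it is built.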

Statistical simulations

- Test exploration with simulation

- Exploration of alpha-beta-N-d' relationship
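
As an example of exploring a test by simulation, the sketch below estimates how often Student's and Welch's t-tests reject a true null hypothesis when the two groups differ in size and variance. The scenario (group sizes, standard deviations, number of simulations) is an assumption chosen for illustration, and SciPy is assumed as the statistics library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n_sims = 2000
alpha = 0.05

# Both groups have the same true mean, so every "significant" result is a false positive.
# The smaller group has the larger spread, which violates the equal-variance assumption.
group_a = rng.normal(loc=0.0, scale=3.0, size=(n_sims, 10))
group_b = rng.normal(loc=0.0, scale=1.0, size=(n_sims, 40))

# One test per simulated experiment (each row is one experiment)
p_student = stats.ttest_ind(group_a, group_b, axis=1, equal_var=True).pvalue
p_welch = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False).pvalue

# Fraction of simulated experiments that were wrongly declared significant
fp_student = np.mean(p_student < alpha)
fp_welch = np.mean(p_welch < alpha)

print("Student's t-test false positive rate:", fp_student)
print("Welch's t-test false positive rate:", fp_welch)
```

In this setup Student's test should reject well above the nominal 5%, while Welch's test should stay close to it. Changing the group sizes, variances or the true difference in means turns the same script into a way of exploring power.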

Where to go from here

Lesson 6. Machine learning

Here we will focus on conceptual and practical understanding of machine learning. We will almost entirely ignore the mathematical aspects. We will consider use cases and general rules of thumb. First we will cover the basic concepts everyone should know before running any machine learning algorithm: model selection, overfitting, cross-validation, etc. Then we will discuss basic algorithms, such as linear and logistic regression, principal component analysis, K-means and K-neighbors. Later we will see uses for more advanced algorithms, such as support vector machines and stochastic gradient descent. This lesson will be loaded with extensive examples.

What is machine learning and why we use it

Key concepts

- Types of machine learning algorithms

- Model selection

- Parameter space and gridsearch

- Overfitting and double dipping

- Cross-validation

- Pipelining
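
Several of the key concepts above (parameter search, cross-validation and pipelining) can be seen together in a short sketch. It assumes scikit-learn and its bundled iris dataset; the choice of model, the grid of values and the step names "scale" and "clf" are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pipelining: scaling is fitted inside each training fold only,
# so no information from the held-out fold leaks into preprocessing
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter space: candidate values for the regularization strength C
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Grid search with 5-fold cross-validation: every candidate is scored on held-out data
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
```

Note that the best cross-validated score is itself slightly optimistic, because the same folds were used to pick the parameter; a separate held-out test set is one way to avoid this form of double dipping.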

Basic algorithms

- Linear regression

- Logistic regression

- Principal component analysis

- K-means

- K-neighbors

More advanced algorithms

- Linear discriminant analysis

- Support vector machines

- Stochastic gradient descent

Where to go from here

Extra topics

This is a list of topics we might consider during the last part of the class or during office hours. Whether we cover these topics will depend on the pace of the course, which, in turn, will depend on students' backgrounds and the time they can devote to studying and processing the course materials.

- Object-oriented programming

- Creating and modifying objects — why and how

- Sanity checks

- Handling errors

- Multithreading (parallelization)

- Working with big data (larger than RAM)

- Optimizing for memory vs optimizing for speed (memory-speed trade-off)

- Version control systems

- Contributing to a project