Data Modelling


Davide Crepaldi

Amount of frontal teaching: 

16 hours (plus around 48h of individual work)



In this course we will explore the data analysis process from beginning to end: from the moment a set of data comes out of a subject's brain and/or behaviour, to the moment we make an inference about how the brain and/or cognition works. In doing this, we will cover data treatment and management (e.g., tidy datasets), graphical data exploration, data manipulation (e.g., standardisation, centering) and, of course, data modelling.
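To give a quick, hedged flavour of the kind of data manipulation mentioned above (the variable name x is invented purely for this example), centering and standardisation in base R look like this:

```r
# A minimal sketch of centering and standardisation in base R.
# The variable x is a made-up toy vector, not a course dataset.
x <- c(2, 4, 6, 8)

x_centered     <- x - mean(x)     # centering: subtract the mean
x_standardised <- scale(x)[, 1]   # standardisation: (x - mean(x)) / sd(x)

x_centered            # -3 -1  1  3
mean(x_standardised)  # ~0 after standardisation
sd(x_standardised)    # 1 after standardisation
```

Nothing more than this is assumed at this stage; we will build such transformations into the analysis pipeline step by step.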

If this all sounds very general, that's exactly it: you can think of this course as a series of hands-on meetings on what is our everyday task as scientists, that is, transforming experimental data into solid science (which is somewhat correlated with having a publishable paper, or getting a PhD 😜). This approach will allow a better grasp of the statistical concepts in context, and give you guys knowledge that is more immediately transferable to your everyday routine. In line with this philosophy, the course will feature classes of three hours, roughly divided into two halves: a more frontal-teaching part where we'll go together through some illustrative code, followed by an individual session where you'll try to apply the code to new data (or, potentially, your own data).

As for data modelling, we will focus on mixed-effect models. This is a very general technique, which covers essentially all kinds of datasets and statistical tests a neuroscientist might ever want to tackle, from cell biology, to systems neuroscience, to psychophysics.
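As a hedged preview of what such a model looks like in R (here using the lme4 package; the dataset and the variable names rt, condition, and subject are invented for illustration):

```r
# A minimal sketch of a mixed-effect model with lme4.
# The data are simulated; variable names are made up for the example.
library(lme4)

# toy dataset: 20 subjects, 2 conditions, 10 trials each
set.seed(1)
d <- expand.grid(subject = factor(1:20), condition = c("a", "b"), trial = 1:10)
d$rt <- 500 + 30 * (d$condition == "b") + rnorm(nrow(d), sd = 50)

# a fixed effect of condition, plus by-subject random intercepts
m <- lmer(rt ~ condition + (1 | subject), data = d)

summary(m)  # fixed-effect estimates and random-effect variances
fixef(m)    # the fixed effects only
ranef(m)    # the by-subject adjustments
```

The functions in the last three lines are exactly the model-evaluation tools listed in the syllabus for Class 4.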

Much attention will be given to monitoring our own work on issues such as data overstretch, p-hacking, and other bad research practices. More generally, we'll very much work with an Open Science approach (e.g., with a view to sharing our data and code with the community), which implies extra care over things like data curation, tidiness, code comments and, more generally, a structured approach to data analysis.

A preliminary (but very representative) syllabus is as follows:

  • Class 1: data import, clean-up, variable formats, a quick introduction to git.
  • Class 2: merging, variable transformation, tables and distribution plots.
  • Class 3: advanced graphical data exploration and theoretical introduction to mixed models.
  • Class 4: model fitting and model evaluation (summary, anova, ranef, fixef, fitted and Effects).
  • Class 5: strategies for model fitting (all-in, fully factorial, model simplification, handling of covariates).
  • Class 6: coding schemes, reference levels, predicted values.
  • Class 7: random slopes.
  • Class 8: binomial Ys, non-linear models.
  • Bonus tracks (unlikely we'll cover this, but you never know): bootstrap, out-of-sample validation, GAM (generalised additive models).

The software I'll use for the course is R. There's no commitment in this statement, of course, and students with good knowledge of other coding languages (e.g., Python) are most welcome to use them. I'll also make use of a visual interface to R called RStudio. Again, this doesn't mean you'll have to use RStudio yourself, although seeing the same thing on your computer and mine will surely help, particularly for students with a rather basic background in these things. (Please bring your laptop to class, with a working installation of whatever software you intend to use.)

I designed the course so that it is flexible enough to challenge the advanced student, but also gently accompany the beginner. I'll introduce every new concept from scratch, without assuming much about your background. I will also provide illustrative code myself. However, there will be much focus on individual work, so that everyone can settle on their own level and pace (I'll have data and exercises for you guys, but if you could work on your own datasets, that would obviously be even better).

The course will focus on stats, not software. I will not cover things such as software installation and basic software functionalities; these are considered pre-requisites for the course. You should spend a substantial amount of time before the course begins getting familiar with concepts like objects (numbers, vectors, matrices, lists, data frames); modes and attributes (character, factor, logical, numeric); basic functions (concatenation, indexing); import/export of data; probability distributions; grouping, loops, and conditional statements; and graphs. A good way to familiarise yourself with R specifically is studying and practising chapters 1 to 10, and 12 of "An Introduction to R" (available here). Depending on a student's skills/attitude/commitment, this can require anything from a few hours to several weeks of work (so, the message is: start well in advance). If you're not sure what kind of extra work your background would require, please email me here.
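To check where you stand, here is a tiny, hedged tour of the prerequisite concepts just listed, in R (all object names are invented for the example):

```r
# A minimal tour of the prerequisite R concepts: objects, modes,
# indexing, data frames, grouping, loops, conditionals.
v <- c(10, 20, 30)             # a numeric vector, built by concatenation
v[2]                           # indexing: the second element, 20
is.numeric(v)                  # checking the mode: TRUE

f <- factor(c("a", "b", "a"))  # a factor (a categorical variable)
d <- data.frame(g = f, y = v)  # a data frame combining the two

tapply(d$y, d$g, mean)         # grouping: mean of y for each level of g

for (i in seq_along(v)) {      # a loop with a conditional inside
  if (v[i] > 15) message(v[i], " is large")
}
```

If every line here reads naturally to you, you are in good shape for the course; if not, the chapters of "An Introduction to R" mentioned above cover all of it.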

The course will be rather intensive: in addition to the meetings, please consider that effective learning will require at least 5-6 hours of individual, extra-class work per week.

Finally, the course is not mandatory for students in the SISSA Cognitive Neuroscience PhD; therefore, there will be no final exam/grades by default. However, (i) SISSA students who would like to obtain a grade for the course, and (ii) Trento Master students who included this course in their "piano di studi", may ask for/need an exam, which I will be happy to organise given some advance notice. The exam will likely consist of a short essay (commented R/Python code) to be produced on a data frame provided by the teacher, and will be done remotely.

The course calendar is now out on the Calendar page of this website. We will start on February 7, 2023, and hold two meetings per week, for four weeks. The course will be held in person; I will probably record classes, but won't stream them live online. It is not mandatory to sign up for the course, but it would surely help me if you let me know of your intention to attend, so that I know in advance how many people to expect (just send me an email).