Data Modelling


Davide Crepaldi

Amount of frontal teaching: 

16 hours (plus around 25 hours of individual work)



In this course we will go together over a set of data from a fairly standard behavioural experiment in human cognition research, trying to extract reliable inferences from it. If this sounds very general, that's exactly the point: you can think of this course as a series of hands-on meetings on our everyday task as scientists, that is, transforming experimental data into solid science (which is somewhat, but not perfectly, be advised, correlated with having a publishable paper, or getting a PhD).

In doing this, we'll go over a number of basic and not-so-basic statistical issues, including graphical data exploration, data manipulation (e.g., standardisation, centering), regression, mixed-effects modelling, predictor collinearity, non-linearity, and (perhaps) bootstrapping. We'll do our best to touch on all of these issues, but the pace of the course will be determined by you: your proficiency on the one hand and your curiosity on the other. It is much better to cover less, but in more depth, and to get closer to your interests and/or needs.
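To give a flavour of the data manipulation step, here is a minimal sketch of centering and standardisation. It is written in plain Python purely for illustration (in R, `scale()` does the same job); the function names and the response times are made up for this example and are not taken from the course dataset.

```python
from statistics import mean, stdev


def centre(values):
    """Subtract the sample mean, so that the centred values average to zero."""
    m = mean(values)
    return [v - m for v in values]


def standardise(values):
    """Centre, then divide by the sample standard deviation (z-scores)."""
    s = stdev(values)  # sample SD, with n - 1 in the denominator
    return [c / s for c in centre(values)]


# Hypothetical response times in milliseconds, invented for illustration.
rts = [520.0, 610.0, 480.0, 700.0, 590.0]

centred = centre(rts)
z = standardise(rts)

print(centred)                    # [-60.0, 30.0, -100.0, 120.0, 10.0]
print([round(v, 2) for v in z])   # z-scores: mean 0, standard deviation 1
```

Centering only shifts the variable so that zero means "average"; standardisation also rescales it, so that one unit corresponds to one standard deviation.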

Much attention will be given to monitoring our own work for issues such as overstretching the data, p-hacking, and other bad research practices (no external system will ever replace rigorous self-inquiry in this respect).

We won't cover the subject systematically, from a theoretical standpoint. Rather, we'll go through different concepts as they arise from our practical analysis work on the course dataset. This will allow a better grasp of the statistical concepts in context, and give you guys knowledge that is more immediately transferable to your everyday routine.

The software I'll use for the course is R. There's no obligation in this statement, of course, and students with an advanced background are most welcome to explore other options (Python is a very good one). I'll also make use of a visual interface to R called RStudio. Same as above: seeing the same thing on your computer and mine will probably help, particularly for students with a rather basic background in stats/R, but there are surely several valid alternatives that students may like better.

Please bring your laptop with you to class, with a working installation of whatever software you intend to use. This will be the best possible learning environment for you, regardless of whether classes are held remotely (let's hope not) or in rooms with other computers available (which may or may not happen).

This is a fairly advanced course, and it is on stats, not on software. I will not cover things such as software installation and basic software functionalities. These are considered prerequisites for the course. You should spend a substantial amount of time before the course begins familiarising yourself with things like R objects (numbers, vectors, matrices, lists, data frames); modes and attributes (character, factor, logical, numeric); basic functions (concatenation, tapply, indexing, merge, table); import/export of data (read.table, sink, write.table); probability distributions; grouping, loops, and conditional statements; and graphs. This knowledge/familiarity can be obtained in several ways, primarily by attending the Coding course offered by our PhD programme. If you decide to focus on R, a good entry point is studying and practicing chapters 1 to 10, and 12, of "An Introduction to R" (available here). Depending on a student's skills/attitude/commitment, this can require anything from a few hours to several weeks of work (that is, start well in advance).

The course will be rather intensive: in addition to the meetings, please consider that effective learning will require at least 3-4 hours of individual work between meetings.

Finally, the course is NOT mandatory for students in the SISSA Cognitive Neuroscience PhD; therefore, there will be no final exam/grades by default. However, (i) SISSA students who would like to obtain a grade for the course and (ii) Trento Master students who included this course in their "piano di studi" (study plan) may ask for/need an exam, which I will be happy to organize if given some notice. The exam will likely consist of a short essay (commented R/Python code) to be produced on a data frame provided by the teacher. The essay will be done remotely.

The course will be held in April. A more specific calendar will be posted soon on the Calendar page of this website.