As researchers and practitioners working with administrative data, we are often given data sets without knowing their full provenance: how the data set was captured, what processing has been applied to it, and whether it has been linked or merged with data from other sources.
Complete and up-to-date metadata are not always available. Not fully understanding the provenance of a data set can lead to incorrect assumptions and misconceptions about its content and quality. This can result in incorrect processing and/or analysis of the data set, which in turn can lead to poor outcomes and poor decision making.
This course will provide an introduction to data quality and how it can affect all aspects of working with administrative data. The course will cover data quality dimensions, which include technical, social, and legal aspects; discuss frameworks that aim to quantify data quality; and provide examples and case studies showing how (the lack of) quality data can lead to bad outcomes in data science projects. This course will not focus on technical aspects of data cleaning, data processing, or data linkage, but will rather highlight the issues researchers and practitioners need to be aware of when working with administrative data. The course will provide and discuss a set of recommendations, and through interactive sessions the participants will be able to share their own experiences of how data quality aspects have led to unexpected outcomes in projects they have worked on.
Course audience:
This one-day course is aimed at researchers and practitioners who work with administrative data, as well as those who are involved in the management of data-centric systems in organisations that act as data custodians, or who are involved in the capture, processing, and linkage of data that may be used for administrative data research. The course requires little technical knowledge, and all technical background will be introduced during the course.
The course will be a mixture of four hours of interactive presentations (containing small practical exercises) plus two one-hour sessions with group discussions.
Please note: [as of Aug 13, preparatory activity of the participants has been cancelled; struck-through text is kept so as not to confuse former registrants] Prior to the course, the participants are expected to view two pre-recorded presentations and complete a short homework document with a few questions, which is intended to guide discussions during the workshop.
Course presenter: Prof Peter Christen, University of Edinburgh
Contact: peter.christen@ed.ac.uk
About the presenter:
Peter Christen is the Research Lead on the Scottish Historic Population Platform (SHiPP) project, run at the Scottish Centre for Administrative Data Research (SCADR) at the University of Edinburgh. He is also a Professor at the School of Computing at the Australian National University in Canberra. Peter is a world-leading expert in record linkage with over 20 years of experience in working with administrative data. He has over 200 publications in the area of data science, including the two books "Data Matching" in 2012 and "Linking Sensitive Data" (co-authored with Thilina Ranbaduge and Rainer Schnell) in 2020. As of February 2025, his work has attracted nearly 18,000 citations on Google Scholar.