Fostering data literacy by engaging in data cleaning

Jakim Eckert; Sarah Schönbrodt; Martin Frank

doi:10.52041/iase25.146

Authors

Jakim Eckert, Karlsruhe Institute of Technology (no ORCID)
Sarah Schönbrodt, Paris Lodron University Salzburg, https://orcid.org/0000-0003-2383-6081
Martin Frank, Karlsruhe Institute of Technology, https://orcid.org/0000-0001-8562-6982

DOI:

https://doi.org/10.52041/iase25.146

Abstract

The increasing societal relevance of data-driven technologies highlights the importance of fostering data literacy in education. One important component of data literacy is data cleaning, which plays a crucial role in data-driven technologies and offers authentic opportunities for critical engagement with real-world data. Despite its mathematical richness, data cleaning, and outlier detection in particular, remains underrepresented in school curricula and educational research. This paper presents a design-based research project focusing on the mathematical foundations of outlier detection methods. Using the four-level approach by Hußmann and Prediger (2016), we specify and structure the mathematical topic of boxplots for outlier detection. We explore how these concepts can be meaningfully embedded in intended learning trajectories to promote students’ understanding of variability, robustness, and the impact of assumptions. The material is based on real datasets and aims to support critical reflection on data-driven decision-making.
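As an illustration of the kind of method the abstract refers to, the following minimal sketch flags outliers using the whisker rule commonly attached to boxplots (Tukey, 1977): values lying more than 1.5 interquartile ranges beyond the quartiles. The abstract does not say which variant of the rule the project uses, so the factor `k = 1.5`, the "inclusive" quartile definition, the function name `tukey_outliers`, and the sample readings are all assumptions made for this example.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Return the values lying outside the boxplot fences.

    Assumption: fences are placed k * IQR beyond the first and third
    quartile, with k = 1.5 as in the common boxplot convention.
    """
    # Quartile definitions differ between tools; "inclusive" is one choice.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements; 42.0 is a deliberately implausible reading.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 42.0]
print(tukey_outliers(readings))  # → [42.0]
```

Changing `k` or the quartile method can change which points are flagged, which is one concrete way to see the "impact of assumptions" mentioned above.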

References

Abt, M., Leuders, T., Loibl, K. & Reinhold, F. (2024). Ein verstehensorientierter Zugang zu Boxplots: Mit digitalen Explorationen zur Variabilität in Daten [A comprehension-oriented approach to box plots: Using digital explorations to variability in data]. mathematik lehren, (243), 41–45.

Aggarwal, C. C. (2017). Outlier Analysis. Springer. https://doi.org/10.1007/978-3-319-47578-3

Anaconda (2022). State of Data Science 2022: Paving the Way for Innovation. Anaconda Inc. https://www.anaconda.com/state-of-data-science-report-2022

Bakker, A., Biehler, R. & Konold, C. (2005). Should young students learn about box plots? Curricular development in statistics education. In G. Burrill & M. Camden (Eds.), Curricular development in statistics education. International Association for Statistical Education (IASE) Roundtable, Lund, Sweden, 28 June–3 July 2004 (pp. 163–173). Voorburg, The Netherlands: International Statistical Institute.

Biehler, R. & Steinbring, H. (1991). Entdeckende Statistik, Stengel-und-Blätter, Boxplots: Konzepte, Begründungen und Erfahrungen eines Unterrichtsversuches [Exploratory statistics, stem-and-leaf, boxplots: Concepts, justifications and experiences of a teaching experiment]. Der Mathematikunterricht, 37(6), 5–32.

Chu, X. (2019). Data Cleaning. Encyclopedia of Big Data Technologies. Springer. https://doi.org/10.1007/978-3-319-77525-8

Daniel, B. K. (2017). Big data and data science: A critical review of issues for educational research. British Journal of Educational Technology, 50(1), 101–113. https://doi.org/10.1111/bjet.12595

Dean, R. B. & Dixon, W. J. (1951). Simplified statistics for small numbers of observations. Analytical Chemistry, 23(4), 636–638. https://doi.org/10.1021/ac60052a025

Dixon, W. J. (1950). Analysis of Extreme Values. The Annals of Mathematical Statistics, 21(4), 488–506. https://doi.org/10.1214/aoms/1177729747

Eichler, A. & Vogel, M. (2013). Leitidee Daten und Zufall: Von konkreten Beispielen zur Didaktik der Stochastik [Key idea data and chance: From concrete examples to the didactics of stochastics]. Springer. https://doi.org/10.1007/978-3-658-00118-6

Erickson, T., Wilkerson, M., Finzer, W. & Reichsman, F. (2019). Data moves. Technology Innovations in Statistics Education, 12(1). https://doi.org/10.5070/t5121038001

Fergusson, A., Pfannkuch, M. & Budgett, S. (2025). Data cleaning doesn’t happen in a vacuum: An initial exploration of high school statistics teachers’ data practices with messy data. In J. Kaplan & K. Luebke (Eds.), Connecting data and people for inclusive statistics and data science education. Proceedings of the Roundtable conference of the International Association for Statistical Education (IASE), July 2024, Auckland, New Zealand. ISI/IASE. https://doi.org/10.52041/iase24.301

Fleischer, Y., Biehler, R. & Schulte, C. (2022). Teaching and Learning Data-Driven Machine Learning with Educationally Designed Jupyter Notebooks. Statistics Education Research Journal, 21(2), Article 7. https://doi.org/10.52041/serj.v21i2

Grubbs, F. E. (1950). Sample Criteria for Testing Outlying Observations. The Annals of Mathematical Statistics, 21(1), 27–58. https://www.jstor.org/stable/2236553

Grubbs, F. E. (1969). Procedures for Detecting Outlying Observations in Samples. Technometrics, 11(1), 1–21. https://doi.org/10.2307/1266761

Hawkins, D. M. (1980). Identification of outliers. Springer. https://doi.org/10.1007/978-94-015-3994-4

Hußmann, S. & Prediger, S. (2016). Specifying and structuring mathematical topics. Journal für Mathematik-Didaktik, 37(S1), 33–67. https://doi.org/10.1007/s13138-016-0102-8

Lache, J., da Costa Silva, N. & Rolka, K. (2023). Individuelles Feedback und vielfältige Repräsentationen: Einsatz digitaler Mathematikaufgaben in der Schule [Individual feedback and diverse representations: Using digital mathematics tasks in school]. In Digitaler Mathematikunterricht in Forschung und Praxis. Tagungsband zur Vernetzungstagung 2022 in Siegen (pp. 113–123). https://d-nb.info/128784488X/34

Lee, H., Mojica, G., Thrasher, E. & Baumgartner, P. (2022). Investigating data like a data scientist: Key practices and processes. Statistics Education Research Journal, 21(2), Article 3. https://doi.org/10.52041/serj.v21i2

Lem, S., Onghena, P., Verschaffel, L. & Van Dooren, W. (2013). The heuristic interpretation of box plots. Learning and Instruction, 26, 22–35. http://dx.doi.org/10.1016/j.learninstruc.2013.01.001

Lohr, S. (2014). For Big-Data Scientists, “Janitor Work” is key hurdle to insights. The New York Times. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Markulin, K., Bosch, M., Florensa, I. & Montañola, C. (2022). The evolution of a study and research path in Statistics. epiDEMES, 1. https://doi.org/10.46298/epidemes-7584

Msweli, N. T., Mawela, T. & Twinomurinzi, H. (2023). Data Science Education – A scoping review. Journal of Information Technology Education: Research, 22, 263–294. https://doi.org/10.28945/5173

Ossadnik, H. (2022). Boxplots – einfach zu erstellen, schwer zu interpretieren: Interpretation mit Simulationen üben [Boxplots – easy to create, difficult to interpret: Practicing interpretation with simulations]. digital unterrichten: Mathematik, 2022(1), 6–7.

Schönbrodt, S., Wohak, K. & Frank, M. (2022). Digital Tools to Enable Collaborative Mathematical Modeling Online. Modelling in Science Education and Learning, 15(1), 151–174. https://doi.org/10.4995/msel.2022.16269

Schüller, K. (2022). Data and AI literacy for everyone. Statistical Journal of the IAOS, 38, 1–14. https://doi.org/10.3233/SJI-220941

Siebert, J., Schroth, C. & Groß, J. (2022). Time Traveling with Data Science: Outlier Detection (Part 3). Fraunhofer IESE. https://www.iese.fraunhofer.de/blog/outlier-detection/

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

Wang, H., Bah, M. J. & Hammad, M. (2019). Progress in outlier detection techniques: A survey. IEEE Access, 7, 107964–108000. https://doi.org/10.1109/access.2019.2932769

Wilkerson, M. H., Lanouette, K. & Shareff, R. L. (2021). Exploring variability during data preparation: a way to connect data, chance, and context when working with complex public datasets. Mathematical Thinking and Learning, 24(4), 312–330. https://doi.org/10.1080/10986065.2021.1922838

Witte, V., Schwering, A. & Frischemeier, D. (2025). Strengthening Data Literacy in K-12 Education: A Scoping Review. Education Sciences, 15(1), 25. https://doi.org/10.3390/educsci15010025
