Fostering data literacy by engaging in data cleaning
DOI:
https://doi.org/10.52041/iase25.146
Abstract
The increasing societal relevance of data-driven technologies highlights the importance of fostering data literacy in education. One important part is data cleaning, which plays a crucial role in data-driven technologies and offers authentic opportunities to foster data literacy through critical engagement with real-world data. Despite its mathematical richness, data cleaning – particularly outlier detection – remains underrepresented in school curricula and educational research. This paper presents a design-based research project focusing on the mathematical foundations of outlier detection methods. Using the four-level approach by Hußmann and Prediger (2016), we specify and structure the mathematical topic of boxplots for outlier detection. We explore how these concepts can be meaningfully embedded in intended learning trajectories to promote students’ understanding of variability, robustness, and the impact of assumptions. The material is based on real datasets and aims to support critical reflection on data-driven decision-making.
References
Abt, M., Leuders, T., Loibl, K. & Reinhold, F. (2024). Ein verstehensorientierter Zugang zu Boxplots: Mit digitalen Explorationen zur Variabilität in Daten [A comprehension-oriented approach to box plots: Using digital explorations to variability in data]. mathematik lehren, (243), 41–45.
Aggarwal, C. C. (2017). Outlier Analysis. Springer. https://doi.org/10.1007/978-3-319-47578-3
Anaconda (2022). State of Data Science 2022: Paving the Way for Innovation. Anaconda Inc. https://www.anaconda.com/state-of-data-science-report-2022
Bakker, A., Biehler, R. & Konold, C. (2005). Should young students learn about box plots? Curricular development in statistics education. In G. Burrill & M. Camden (Eds.), Curricular development in statistics education. International Association for Statistical Education (IASE) Roundtable, Lund, Sweden, 28 June–3 July 2004 (pp. 163–173). Voorburg, The Netherlands: International Statistical Institute.
Biehler, R. & Steinbring, H. (1991). Entdeckende Statistik, Stengel-und-Blätter, Boxplots: Konzepte, Begründungen und Erfahrungen eines Unterrichtsversuches [Exploratory statistics, stem-and-leaf, boxplots: Concepts, justifications and experiences of a teaching experiment]. Der Mathematikunterricht, 37(6), 5–32.
Chu, X. (2019). Data Cleaning. Encyclopedia of Big Data Technologies. Springer. https://doi.org/10.1007/978-3-319-77525-8
Daniel, B. K. (2017). Big data and data science: A critical review of issues for educational research. British Journal of Educational Technology, 50(1), 101–113. https://doi.org/10.1111/bjet.12595
Dean, R. B. & Dixon, W. J. (1951). Simplified statistics for small numbers of observations. Analytical Chemistry, 23(4), 636–638. https://doi.org/10.1021/ac60052a025
Dixon, W. J. (1950). Analysis of Extreme Values. The Annals of Mathematical Statistics, 21(4), 488–506. https://doi.org/10.1214/aoms/1177729747
Eichler, A. & Vogel, M. (2013). Leitidee Daten und Zufall: Von konkreten Beispielen zur Didaktik der Stochastik [Key idea data and chance: From concrete examples to the didactics of stochastics]. Springer. https://doi.org/10.1007/978-3-658-00118-6
Erickson, T., Wilkerson, M., Finzer, W. & Reichsman, F. (2019). Data moves. Technology Innovations in Statistics Education, 12(1). https://doi.org/10.5070/t5121038001
Fergusson, A., Pfannkuch, M. & Budgett, S. (2025). Data cleaning doesn’t happen in a vacuum: An initial exploration of high school statistics teachers’ data practices with messy data. In J. Kaplan & K. Luebke (Eds.), Connecting data and people for inclusive statistics and data science education. Proceedings of the Roundtable conference of the International Association for Statistics Education (IASE), July 2024, Auckland, New Zealand. ISI/IASE. https://doi.org/10.52041/iase24.301
Fleischer, Y., Biehler, R. & Schulte, C. (2022). Teaching and Learning Data-Driven Machine Learning with Educationally Designed Jupyter Notebooks. Statistics Education Research Journal, 21(2), Article 7. https://doi.org/10.52041/serj.v21i2
Grubbs, F. E. (1950). Sample Criteria for Testing Outlying Observations. The Annals of Mathematical Statistics, 21(1), 27–58. https://www.jstor.org/stable/2236553
Grubbs, F. E. (1969). Procedures for Detecting Outlying Observations in Samples. Technometrics, 11(1), 1–21. https://doi.org/10.2307/1266761
Hawkins, D. M. (1980). Identification of outliers. Springer. https://doi.org/10.1007/978-94-015-3994-4
Hußmann, S. & Prediger, S. (2016). Specifying and structuring mathematical topics. Journal für Mathematik-Didaktik, 37(S1), 33–67. https://doi.org/10.1007/s13138-016-0102-8
Lache, J., da Costa Silva, N. & Rolka, K. (2023). Individuelles Feedback und vielfältige Repräsentationen: Einsatz digitaler Mathematikaufgaben in der Schule [Individual feedback and diverse representations: Using digital mathematics tasks in school]. In Digitaler Mathematikunterricht in Forschung und Praxis. Tagungsband zur Vernetzungstagung 2022 in Siegen (pp. 113–123). https://d-nb.info/128784488X/34
Lee, H., Mojica, G., Thrasher, E., & Baumgartner, P. (2022). Investigating data like a data scientist: Key practices and processes. Statistics Education Research Journal, 21(2), Article 3. https://doi.org/10.52041/serj.v21i2
Lem, S., Onghena, P., Verschaffel, L. & Van Dooren, W. (2013). The heuristic interpretation of box plots. Learning and Instruction, 26, 22–35. http://dx.doi.org/10.1016/j.learninstruc.2013.01.001
Lohr, S. (2014). For Big-Data Scientists, “Janitor Work” is key hurdle to insights. The New York Times. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Markulin, K., Bosch, M., Florensa, I. & Montañola, C. (2022). The evolution of a study and research path in Statistics. epiDEMES, 1. https://doi.org/10.46298/epidemes-7584
Msweli, N. T., Mawela, T. & Twinomurinzi, H. (2023). Data Science Education – A scoping review. Journal of Information Technology Education Research, 22, 263–294. https://doi.org/10.28945/5173
Ossadnik, H. (2022). Boxplots – einfach zu erstellen, schwer zu interpretieren: Interpretation mit Simulationen üben [Boxplots – easy to create, difficult to interpret: Practicing interpretation with simulations]. digital unterrichten: Mathematik, 2022(1), 6–7.
Schönbrodt, S., Wohak, K. & Frank, M. (2022). Digital Tools to Enable Collaborative Mathematical Modeling Online. Modelling in Science Education and Learning, 15(1), 151–174. https://doi.org/10.4995/msel.2022.16269
Schüller, K. (2022). Data and AI literacy for everyone. Statistical Journal of the IAOS, 38, 1–14. https://doi.org/10.3233/SJI-220941
Siebert, J., Schroth, C. & Groß, J. (2022). Time Traveling with Data Science: Outlier Detection (Part 3). Fraunhofer IESE. https://www.iese.fraunhofer.de/blog/outlier-detection/
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
Wang, H., Bah, M. J. & Hammad, M. (2019). Progress in outlier detection techniques: A survey. IEEE Access, 7, 107964–108000. https://doi.org/10.1109/access.2019.2932769
Wilkerson, M. H., Lanouette, K. & Shareff, R. L. (2021). Exploring variability during data preparation: a way to connect data, chance, and context when working with complex public datasets. Mathematical Thinking and Learning, 24(4), 312–330. https://doi.org/10.1080/10986065.2021.1922838
Witte, V., Schwering, A., & Frischemeier, D. (2025). Strengthening Data Literacy in K-12 Education: A Scoping Review. Education Sciences, 15(1), 25. https://doi.org/10.3390/educsci15010025