Data cleaning doesn’t happen in a vacuum: An initial exploration of high school statistics teachers’ data practices with messy data

Authors

  • Anna Fergusson, Waipapa Taumata Rau | University of Auckland
  • Maxine Pfannkuch, Waipapa Taumata Rau | University of Auckland
  • Stephanie Budgett, Waipapa Taumata Rau | University of Auckland

DOI:

https://doi.org/10.52041/iase24.301

Abstract

Cleaning data is an important facet of statistical practice. Research examining learners' data practices when dealing with messy data that needs cleaning, however, is scarce. As part of a larger study, six Grade 12 high school statistics teachers engaged with a height estimation task, for which the data were drawn from a publicly available website containing 39,195 rows of text entries in a variety of measurement systems. The teachers’ observed data practices were characterised as inspecting, ideating, sorting, sampling, converting, visualising, creating, and describing. The implications of the findings with regard to statistical enquiry pathways are discussed.

Published

2025-02-06

Conference Proceedings Volume

Section

Topic 3: Drawing from multiple ways of knowing in the teaching and learning of statistics and data science