SwissRN - Computational Reproducibility Seminar

Plan for tomorrow today: a model for data stewardship

May 15, 2025

library(ghedata)
library(ggthemes)
library(tidyverse)
library(washopenresearch)
library(ggtext)
library(gt)
# data preparation

washdev_das_type <- washdev |> 
    # flag whether an article was published before or after the 2020 DAS policy
    mutate(das_policy = case_when(
        published_year < 2020 ~ "pre-2020",
        TRUE ~ "2020 or later"
    )) |> 
    # harmonise DAS labels and make missing statements explicit
    mutate(das_type = case_when(
        is.na(das_type) ~ "missing",
        das_type == "in paper" ~ "available in paper",
        das_type == "on request" ~ "available on request",
        TRUE ~ das_type
    )) 
## summary for data availability statement (DAS) type and policy year

washdev_das_type_n <- washdev_das_type |> 
    count(das_policy, das_type) 


fig_das_type <- washdev_das_type_n |> 
    ggplot(aes(x = reorder(das_type, n), y = n, fill = das_policy)) +
    geom_col(position = position_dodge(), width = 0.6) +
    geom_text(aes(label = n), 
              vjust = 0.5, 
              hjust = -0.5,  
              size = 3,
              # dodge width matches the column width (0.6) so labels align with their bars
              position = position_dodge(width = 0.6)
    ) +
    coord_flip() +
    annotate("text", 
             x = 3.77, 
             y = 150, 
             size = 3, 
             label = "after introducing policy\nfor data availability statement", 
             color = "gray20") +
    # annotate() draws the arrow once instead of once per data row
    annotate("curve",
             x = 3.95, y = 142, xend = 3.95, yend = 70,
             curvature = 0.5, 
             arrow = arrow(type = "closed", length = unit(0.1, "inches")),
             color = "gray20") +
    labs(
        title = "Data availability statement",
        subtitle = "Analysis of 924 articles in Journal of Water, Sanitation & Hygiene for Development (2011 to 2023)",
        fill = "published year",
        y = "number of publications",
        x = "data availability statement") +
    scale_y_continuous(breaks = seq(0, 600, 100), limits = c(0,600)) +
    scale_fill_colorblind() 
# https://www.iwapublishing.com/news/iwa-publishing-2020-annual-review
## summary for data availability statement (DAS) type and supp file type

washdev_supp_file_type_n <- washdev_das_type |> 
    filter(das_policy == "2020 or later") |> 
    select(paperid, das_type, supp_file_type) |> 
    unnest(supp_file_type) |> 
    mutate(supp_file_type = case_when(
        is.na(supp_file_type) ~ "missing",
        TRUE ~ supp_file_type
    )) |>
    count(das_type, supp_file_type) 

tbl_supp_type <- washdev_supp_file_type_n |> 
    group_by(supp_file_type) |> 
    summarise(n = sum(n)) |> 
    arrange(desc(n)) |> 
    mutate(perc = n / sum(n) * 100) 

Meet a data steward

I have:

  • 10+ years work experience (5 in research)
  • empathy, compassion, patience, persistence
  • an affinity for IT
  • teaching experience
  • learned how people learn

I don’t have:

  • a doctoral degree
  • a qualification in computer science
  • a qualification in statistics
  • a lot of time

10 learnings from 3 years

#1 Technology is not on our side

Meet a Professor

The Modern Academic’s Challenges

  • Overflowing email inboxes
  • Browsers with hundreds of tabs
  • Files stored on Desktops
  • MS Teams, Slack, Element, NAS, Google Drive, …
  • Credentials, Passwords, OTPs, 2FAs, PATs, …

#2 ETH wants reproducibility

ETH RDM Guidelines

FAIR data sharing principles

  • Technical in nature
  • Require data management strategy to establish workflows
  • Not a checkbox, but a process

Findable
Accessible
Interoperable
Reusable

#3 Data management is project management

undergrad_students <- people |> 
  filter(b_m_student == "yes") |>
  filter(!is.na(title)) 

undergrad_students |> 
  count(degree, year) |> 
  mutate(degree = case_when(
    degree == "bsc" ~ "BSc thesis",
    degree == "msc" ~ "MSc thesis"
  )) |>
  ggplot(aes(x = year, y = n, label = n, fill = degree)) +
  geom_col(position = "dodge") +
  geom_label(position = position_dodge(width = 0.9),
            show.legend = FALSE,
            color = "white",
            fontface = "bold",
            size = 6) +
  labs(x = "",
       y = "Number of thesis projects", 
       fill = "Project:") +
  scale_fill_colorblind() +
  scale_color_colorblind() +
  # apply the base theme first so the tweaks below are not overridden by it
  statR::theme_stat(base_size = 16) +
  theme(panel.grid = element_blank(),
        axis.text.y = element_blank()) 

GHE Student Wiki (public)

  • Grading criteria
  • Communication expectations
  • Data storage and data management guidelines
  • Presentation standards
  • Proposal and thesis writing requirements
knitr::include_graphics(here::here("slides/img/eth-kolloquium/ghe-student-wiki.png"))

Grading rubric & data publication

Four areas of evaluation with 31 sub-areas

  • 40/100: Research competence
  • 40/100: Thesis report
  • 10/100: Colloquium
  • 10/100: Examination

‘Data Management’ under ‘Research Competence’

6: Data is fully documented, organized, easy to reproduce, and publication ready. Everything is stored on Google Drive.

But, data publication requirement

Obtaining a 6 from all sub-areas but not publishing the data in the form of a repository will result in a maximum allowed grade of 5.75.
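
The weighting and the publication cap can be sketched in a few lines of R. This is a minimal illustration only: the helper `final_grade()` is hypothetical, and it assumes that each of the four areas receives one grade and that the final grade is their weighted mean. The slides do not state how the 31 sub-areas are aggregated or how grades are rounded.

```r
# Area weights as listed above (points out of 100).
weights <- c(research = 40, report = 40, colloquium = 10, examination = 10)

# Hypothetical helper (assumption): weighted mean of the four area grades,
# capped at 5.75 when the data was not published as a repository.
final_grade <- function(grades, data_published) {
  stopifnot(setequal(names(grades), names(weights)))
  raw <- sum(grades[names(weights)] * weights) / sum(weights)
  if (data_published) raw else min(raw, 5.75)
}

all_sixes <- c(research = 6, report = 6, colloquium = 6, examination = 6)
final_grade(all_sixes, data_published = TRUE)   # 6
final_grade(all_sixes, data_published = FALSE)  # 5.75
```

Under these assumptions the cap only bites for grades above 5.75, so it mainly affects the very best theses.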

ETH Board Open Research Data position

#4 Predictability wins

Structure & naming conventions

GHE Google Shared Drive

  • ghe-supervision
    • archive
    • bachelors
    • masters
      • msc-sem-proj
      • msc-thesis
        • 2024-msc-thesis-lschoebitz
    • phds

Convention

  • YYYY-degree-type-ethzid

A unique identifier for each student (and staff) that is used in several places.
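
A convention like this is easy to generate and check in code. The sketch below is one possible way to do so. Both helper names are hypothetical, and the validation rule (a four-digit year followed by at least three lowercase, hyphen-separated parts) is an assumption derived from the single example above, not a documented rule.

```r
# Hypothetical helper: build an identifier following YYYY-degree-type-ethzid.
make_project_id <- function(year, degree, type, ethz_id) {
  parts <- c(as.character(year), degree, type, ethz_id)
  paste(tolower(parts), collapse = "-")
}

# Hypothetical check (assumption): four-digit year, then at least three
# lowercase alphanumeric parts, each separated by a single hyphen.
is_valid_project_id <- function(id) {
  grepl("^[0-9]{4}(-[a-z0-9]+){3,}$", id)
}

make_project_id(2024, "MSc", "thesis", "lschoebitz")  # "2024-msc-thesis-lschoebitz"
is_valid_project_id("2024-msc-thesis-lschoebitz")     # TRUE
is_valid_project_id("MSc Thesis 2024 final")          # FALSE
```

Because the identifier is predictable, the same string can name the shared-drive folder, a repository, and other places where the student's work is referenced.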

#5 Low IT affinity is not a lack of aptitude

Safe learning environments

Growth-mindset for better learning outcomes

  • Fixed mindset: ‘I’m not good’
  • Growth mindset: ‘I can learn’

Create safe learner environments

  • Regular 1:1 research data management meetings
  • Bi-monthly half-day team events
  • Yearly retreat

#6 Data != Data

Disclaimer: Data at GHE

  • small (few MBs)
  • tabular
  • non-sensitive
  • topics
    • waste management
    • sanitation
    • air quality
    • etc.

Three terms for three stages

term | explanation | file format
unprocessed raw data | data that is not processed and remains in its original form and file type | often XLSX, also CSV and others
processed analysis-ready data | data that is processed to prepare for an analysis and is exported in its new form as a new file | CSV, R data package
final data underlying a publication | data that is the result of an analysis (e.g. descriptive statistics or data visualization) and shown in a publication, but then also exported in its new form as a new file | CSV
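
The three stages can be illustrated with a small base R sketch in which each stage ends up as its own file. The example data, column names, and file names are invented for illustration, and the cleaning step (dropping non-numeric entries) is an assumption rather than the workflow used at GHE.

```r
# Hypothetical raw data; in practice this is the untouched XLSX or CSV file.
raw <- data.frame(
  site = c("A", "A", "B", "B"),
  waste_kg = c("12.5", "n/a", "8.0", "9.5"),
  stringsAsFactors = FALSE
)
out_dir <- tempdir()

# Stage 1: unprocessed raw data, kept in its original form.
raw_path <- file.path(out_dir, "raw-data.csv")
write.csv(raw, raw_path, row.names = FALSE)

# Stage 2: processed analysis-ready data, exported as a new file.
processed <- raw
processed$waste_kg <- suppressWarnings(as.numeric(processed$waste_kg))  # "n/a" becomes NA
processed <- processed[!is.na(processed$waste_kg), ]
processed_path <- file.path(out_dir, "analysis-ready-data.csv")
write.csv(processed, processed_path, row.names = FALSE)

# Stage 3: final data underlying a publication (here a per-site mean),
# again exported as its own file.
final <- aggregate(waste_kg ~ site, data = processed, FUN = mean)
final_path <- file.path(out_dir, "final-data.csv")
write.csv(final, final_path, row.names = FALSE)
```

Keeping one file per stage means the raw file is never overwritten, and every number shown in a publication can be traced back to an exported table.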

#7 Data management is a process, not a checkbox

#8 Findable: Publish for humans and computers

Automation from ETH Research Collection

Automation from Zenodo

Automation from GitHub

Open Source

made for collaboration

Automation from GitHub

made for humans

#9 9 to 5 is possible

Meet a professor

  • Plans for ‘tomorrow’
  • Is ready for increasing requirements
  • Dedicates financial resources to data stewardship

#10 Funding for Open Research Data existed

Funding schemes

swissuniversities

Funding schemes

Open Research Data Program of the ETH Board

  • 2021 - 2024: ~ 96 projects funded (~ CHF 15 million budget in total)
  • Global Health Engineering was awarded 2 Contribute and 3 Explore projects worth CHF 500’000
  • 2021 - 2024: ~ 100 projects funded (~ CHF 10 million in total)
  • All 96 projects and newsletter sign-up: https://open-research-data-portal.ch/ (bottom of page)

10 take-aways from 30 minutes

  • #1 Technology is not on our side
  • #2 ETH wants reproducibility
  • #3 Data management is project management
  • #4 Predictability wins
  • #5 Low IT affinity is not a lack of aptitude
  • #6 Data != Data
  • #7 Data management is a process, not a checkbox
  • #8 Findable: Publish for humans and computers
  • #9 9 to 5 is possible
  • #10 Funding for Open Research Data existed

Thanks! 🌻

Slides created via revealjs and Quarto: https://quarto.org/docs/presentations/revealjs/

Slide background image taken from Danielle Navarro

Access slides as PDF on GitHub

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.