R data package - the GHE way

Why?

This tutorial was used as a workshop for team members of the Global Health Engineering group. It is part of a greater data management strategy and covers the aspired research data publication and archiving method.

The objectives are to:

  • make data cite-able
  • appropriately document data
  • publish analysis-ready data prior to scientific article
  • reduce workload at the point of submission of a scientific article
  • provide opportunities for showcasing data
  • give credit to contributors that would not be a co-author of a scientific article
  • enhance reproducibility of analysis-ready data (from unprocessed raw data to tidy data)
  • allow for reproducibility of scientific article by providing data alongside code
  • provide opportunities for iterations on exploratory analysis and iterations
  • increase number of group relevant datasets that can be used for teaching computational tools

The “Why?” is also covered in a talk prepared for the launch meeting of the data stewardship network at ETH Zurich. It can be accessed via the Slides page.

Part 1 - Create GitHub repo

  1. Decide on a name for the repository and corresponding R data package
    • have all small letters
    • have no spaces or dashes
    • be a combination of two to three words
    • identify location and/or theme/topic
  2. Open https://github.com/ and start a new repo with the following settings
    • Public
    • Do not add a README
    • Do not add a .gitignore
    • Do not add a LICENSE

Part 2 - Write the package

  1. Open RStudio IDE

  2. Create a new project using the R Package devtools and the same name you used for the repository on GitHub

    • File -> New Project -> New Directory -> R Package using devtools -> Choose directory name and location of sub-directory
  3. Add git version control to local directory

```{r}
usethis::use_git()
```
  • yes, commit
  • yes, restart
  1. Connect local repo with remote repo
  • open your repository site on GitHub
  • copy the commands under “…or push an existing repository from the command line”
  • open the Terminal within RStudio IDE (tab next to Console)
  • paste and execute the commands you copied from GitHub
```{}         
git remote add origin URL
git branch -M main
git push -u origin main
```
  1. Add data-raw/directory to package
```{r}
usethis::use_data_raw()
```

This will create a data-raw/ directory.

  • contains a DATASET.R
  • rename to data_processing.R
  1. Add, commit and push all changes to GitHub

  2. If you have an external data donation, open issue 1 on the GitHub repository to communicate about raw data submission (example for issue template)

  3. If you have the raw data available, add to the data-raw directory using your file management system

  4. Prepare initial import of and export of data in data_processing.R

  • replace dataset with a short name in small letters that describes your data
    • if the package will consist of single dataset, chose the same as the name for your package
    • if you will have several datasets in your package, chose a short name (one word) in small letters
```{r}
# description -------------------------------------------------------------

# R script to process uploaded raw data into a tidy dataframe

# load packages -----------------------------------------------------------

library(readr)

# read data ---------------------------------------------------------------

dataset <- read_csv("data-raw/data.csv")

# tidy data ---------------------------------------------------------------

## code to prepare a tidy, analysis-ready dataset goes here

# write data --------------------------------------------------------------

usethis::use_data(dataset, overwrite = TRUE)

```
  1. Add, commit and push all changes to GitHub

  2. Prepare data dictionary to document the variables of the exported dataset

  • create a new text file named dictionary.csv
  • add the following columns as the first line of the file: directory, file_name, variable_name, variable_type, description
  • in column variable_name list all variables of your dataset
  • in column description provide a one line description for each variable
  1. On GitHub, open issue to cross-check in with data donator for correct understanding of variables in dictionary.csv

  2. Install openwashdata R package

```{r}
devtools::install_github("openwashdata/openwashdata")
```
  1. Initiate documentation folder for writing up metadata and documentation for objects
  • replace dataset with the same name that you used for the object you exported with usethis::use_data() in data_processing.R
```{r}
usethis::use_r("dataset")
```
  • opens a new R script
  1. Add documentation from data dictionary to script as roxygen comments
```{r}
openwashdata::generate_roxygen_docs("data-raw/dictionary.csv", output_file_path = "R/dataset.R")
```
  1. Add, commit and push all changes to GitHub

  2. Document the package run checks for errors, warnings, and notes

```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check().   # keyboard shortcut: Cmd/Ctrl + Shift + E
```
  • make sure all errors are fixed
  • ignore warnings for now
  1. Open the DESCRIPTION to document the package
  • edit title field (keep this short)
  • edit description field (keep it short and to the point describing the data)
  1. Add everyone that has contributed to the Authors@R field in DESCRIPTION
  • ensure you have the ORCID id of everyone (e.g. by opening an issue on GitHub and ask people to share their details)

For each person, replace details and run the following code in your Console:

  1. Add CC-BY or CC0 license

    usethis::use_ccby_license()
    usethis::use_cc0_license()

In your Console, run:

  1. Document the package run checks for errors, warnings, and notes
```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check().   # keyboard shortcut: Cmd/Ctrl + Shift + E
```
  1. Add CITATION.cff file to repo
```{r}
cffr::cff_write()
```
  1. Add, commit and push all changes to GitHub

  2. Create a rmd README for package

```{r}
usethis::use_readme_rmd()
```
  • remove everything starting from line 41 “What is special about using README.Rmd instead of …”
  • add table with variable descriptions
```{r}
readr::read_csv("data-raw/dictionary.csv") |> 
  dplyr::select(variable_name:description) |> 
  gt::gt()
```
  • add the first ten rows of the data
```{r}
dataset |> 
  head() |> 
  gt::gt()
```
  1. Build the README.rmd to output README.md
```{r}
devtools::build_readme()
```
  1. Add, commit and push all changes to GitHub

  2. Create an examples article for the package

```{r}
usethis::use_article("examples")
```
  • prepare some exploratory analysis (e.g. a plot, a map, a table)
  1. Document the package run checks for errors, warnings, and notes
```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check().   # keyboard shortcut: Cmd/Ctrl + Shift + E
```

Part 3 - Setup pkgdown website

  1. Setup pkgdown configuration and github actions
```{r}
usethis::use_pkgdown()
```
  1. Remove docs from .gitignore file
```{r}
.Rproj.user
.Rhistory
.Rdata
.httr-oauth
.DS_Store
.quarto
```
  1. Build website with pkgdown
```{r}
pkgdown::build_site()
```
  1. Add, commit and push all changes to GitHub

Part 4 - Host with GitHub Pages

  1. Deploy to GitHub Pages
  • open your repository site on GitHub
  • click the Settings (gear icon) option
  • on the left sidebar under “Code and automation” click “Pages”
  • under “Build and deployment”, in “Branch” section
    • select “main” from Dropdown menu that states “None”
    • select “docs” from Dropdown menu that states “/ (root)”
    • click “Save” button
  1. Edit repository details
  • go back to your main page of your repository site on GitHub
  • on the right side, next to “About” click on the gear icon
  • provide a short description (e.g. the same as you have used for describing your dataset)
  • click the box next to “Use your GitHub Pages website”
  • click “Save changes” button

Part 5 - Use web analytics

  1. Add plausible.io script and website URL to _pkdown.yml
  • open your repository in RStudio IDE
  • open _pkgdown.yml file
  • copy the content below and replace the two instances of PACKAGENAME with the name of your package
```{yaml}
url: https://global-health-engineering.github.io/PACKAGENAME/

template:
  bootstrap: 5
  includes:
    in_header: |
      <script defer data-domain="global-health-engineering.github.io/PACKAGENAME" src="https://plausible.io/js/script.js"></script>
```

Part 6 - Publish

TODO

  • publish to Zenodo
  • add DOI everywhere
  • add to research collection