R data package - the GHE way

Why?

This tutorial was used as a workshop for team members of the Global Health Engineering group. It is part of a greater data management strategy and covers the aspired research data publication and archiving method.

The objectives are to:

make data cite-able
appropriately document data
publish analysis-ready data prior to scientific article
reduce workload at the point of submission of a scientific article
provide opportunities for showcasing data
give credit to contributors that would not be a co-author of a scientific article
enhance reproducibility of analysis-ready data (from unprocessed raw data to tidy data)
allow for reproducibility of scientific article by providing data alongside code
provide opportunities for iterations on exploratory analysis and iterations
increase number of group relevant datasets that can be used for teaching computational tools
…

The “Why?” is also covered in a talk prepared for the launch meeting of the data stewardship network at ETH Zurich. It can be accessed via the Slides page.

Part 1 - Create GitHub repo

Decide on a name for the repository and corresponding R data package
- have all small letters
- have no spaces or dashes
- be a combination of two to three words
- identify location and/or theme/topic
Open https://github.com/ and start a new repo with the following settings
- Public
- Do not add a README
- Do not add a .gitignore
- Do not add a LICENSE

Part 2 - Write the package

Open RStudio IDE
Create a new project using the R Package devtools and the same name you used for the repository on GitHub
- File -> New Project -> New Directory -> R Package using devtools -> Choose directory name and location of sub-directory
Add git version control to local directory

```{r}
usethis::use_git()
```

yes, commit
yes, restart

Connect local repo with remote repo

open your repository site on GitHub
copy the commands under “…or push an existing repository from the command line”
open the Terminal within RStudio IDE (tab next to Console)
paste and execute the commands you copied from GitHub

```{}         
git remote add origin URL
git branch -M main
git push -u origin main
```

Add data-raw/directory to package

```{r}
usethis::use_data_raw()
```

This will create a data-raw/ directory.

contains a DATASET.R
rename to data_processing.R

Add, commit and push all changes to GitHub
If you have an external data donation, open issue 1 on the GitHub repository to communicate about raw data submission (example for issue template)
If you have the raw data available, add to the data-raw directory using your file management system
Prepare initial import of and export of data in data_processing.R

replace dataset with a short name in small letters that describes your data
- if the package will consist of single dataset, chose the same as the name for your package
- if you will have several datasets in your package, chose a short name (one word) in small letters

```{r}
# description -------------------------------------------------------------

# R script to process uploaded raw data into a tidy dataframe

# load packages -----------------------------------------------------------

library(readr)

# read data ---------------------------------------------------------------

dataset <- read_csv("data-raw/data.csv")

# tidy data ---------------------------------------------------------------

## code to prepare a tidy, analysis-ready dataset goes here

# write data --------------------------------------------------------------

usethis::use_data(dataset, overwrite = TRUE)

```

Add, commit and push all changes to GitHub
Prepare data dictionary to document the variables of the exported dataset

create a new text file named dictionary.csv
add the following columns as the first line of the file: directory, file_name, variable_name, variable_type, description
in column variable_name list all variables of your dataset
in column description provide a one line description for each variable

On GitHub, open issue to cross-check in with data donator for correct understanding of variables in dictionary.csv
Install openwashdata R package

```{r}
devtools::install_github("openwashdata/openwashdata")
```

Initiate documentation folder for writing up metadata and documentation for objects

replace dataset with the same name that you used for the object you exported with usethis::use_data() in data_processing.R

```{r}
usethis::use_r("dataset")
```

opens a new R script

Add documentation from data dictionary to script as roxygen comments

```{r}
openwashdata::generate_roxygen_docs("data-raw/dictionary.csv", output_file_path = "R/dataset.R")
```

example of durbanplasticwaste documentation:

Add, commit and push all changes to GitHub
Document the package run checks for errors, warnings, and notes

```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check().   # keyboard shortcut: Cmd/Ctrl + Shift + E
```

make sure all errors are fixed
ignore warnings for now

Open the DESCRIPTION to document the package

edit title field (keep this short)
edit description field (keep it short and to the point describing the data)

Add everyone that has contributed to the Authors@R field in DESCRIPTION

ensure you have the ORCID id of everyone (e.g. by opening an issue on GitHub and ask people to share their details)

For each person, replace details and run the following code in your Console:

Add CC-BY or CC0 license

usethis::use_ccby_license()
usethis::use_cc0_license()

In your Console, run:

Document the package run checks for errors, warnings, and notes

```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check().   # keyboard shortcut: Cmd/Ctrl + Shift + E
```

Add CITATION.cff file to repo

```{r}
cffr::cff_write()
```

Add, commit and push all changes to GitHub
Create a rmd README for package

```{r}
usethis::use_readme_rmd()
```

remove everything starting from line 41 “What is special about using README.Rmd instead of …”
add table with variable descriptions

```{r}
readr::read_csv("data-raw/dictionary.csv") |> 
  dplyr::select(variable_name:description) |> 
  gt::gt()
```

add the first ten rows of the data

```{r}
dataset |> 
  head() |> 
  gt::gt()
```

Build the README.rmd to output README.md

```{r}
devtools::build_readme()
```

Add, commit and push all changes to GitHub
Create an examples article for the package

```{r}
usethis::use_article("examples")
```

prepare some exploratory analysis (e.g. a plot, a map, a table)

Document the package run checks for errors, warnings, and notes

```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check().   # keyboard shortcut: Cmd/Ctrl + Shift + E
```

Part 3 - Setup pkgdown website

Setup pkgdown configuration and github actions

```{r}
usethis::use_pkgdown()
```

Remove docs from .gitignore file

```{r}
.Rproj.user
.Rhistory
.Rdata
.httr-oauth
.DS_Store
.quarto
```

Build website with pkgdown

```{r}
pkgdown::build_site()
```

Add, commit and push all changes to GitHub

Part 4 - Host with GitHub Pages

Deploy to GitHub Pages

open your repository site on GitHub
click the Settings (gear icon) option
on the left sidebar under “Code and automation” click “Pages”
under “Build and deployment”, in “Branch” section
- select “main” from Dropdown menu that states “None”
- select “docs” from Dropdown menu that states “/ (root)”
- click “Save” button

Edit repository details

go back to your main page of your repository site on GitHub
on the right side, next to “About” click on the gear icon
provide a short description (e.g. the same as you have used for describing your dataset)
click the box next to “Use your GitHub Pages website”
click “Save changes” button

Part 5 - Use web analytics

Add plausible.io script and website URL to _pkdown.yml

open your repository in RStudio IDE
open _pkgdown.yml file
copy the content below and replace the two instances of PACKAGENAME with the name of your package

```{yaml}
url: https://global-health-engineering.github.io/PACKAGENAME/

template:
  bootstrap: 5
  includes:
    in_header: |
      <script defer data-domain="global-health-engineering.github.io/PACKAGENAME" src="https://plausible.io/js/script.js"></script>
```

a more detailed example is available for the durbanplasticwaste package

Part 6 - Publish

TODO

publish to Zenodo
add DOI everywhere
add to research collection