R data package - the GHE way
Why?
This tutorial was used in a workshop for team members of the Global Health Engineering (GHE) group. It is part of a broader data management strategy and covers the method the group aims to use for publishing and archiving research data.
The objectives are to:
- make data cite-able
- appropriately document data
- publish analysis-ready data prior to scientific article
- reduce workload at the point of submission of a scientific article
- provide opportunities for showcasing data
- give credit to contributors who would not be co-authors of a scientific article
- enhance reproducibility of analysis-ready data (from unprocessed raw data to tidy data)
- allow for reproducibility of scientific article by providing data alongside code
- provide opportunities for iterations on exploratory analysis
- increase the number of group-relevant datasets that can be used for teaching computational tools
- …
The “Why?” is also covered in a talk prepared for the launch meeting of the data stewardship network at ETH Zurich. It can be accessed via the Slides page.
Part 1 - Create GitHub repo
- Decide on a name for the repository and corresponding R data package. The name should:
  - have all small letters
  - have no spaces or dashes
  - be a combination of two to three words
  - identify location and/or theme/topic
- Open https://github.com/ and start a new repo with the following settings
  - Public
  - Do not add a README
  - Do not add a .gitignore
  - Do not add a LICENSE
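The naming rules above can be checked mechanically before creating the repository. The following is a minimal sketch in base R; the helper name is_valid_name() is hypothetical, and the strict "lowercase letters only" pattern is one reading of the rules. It cannot check the "two to three words" or "location/theme" rules.

```r
# Hypothetical helper (not part of any package): TRUE if the candidate name
# consists only of lowercase letters, i.e. no spaces, dashes, or capitals.
is_valid_name <- function(name) {
  grepl("^[a-z]+$", name)
}

candidates <- c("durbanplasticwaste", "Durban-Waste", "durban waste")
data.frame(name = candidates, valid = is_valid_name(candidates))
```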
Part 2 - Write the package
Open RStudio IDE
Create a new project of the type R Package using devtools, with the same name you used for the repository on GitHub
- File -> New Project -> New Directory -> R Package using devtools -> Choose directory name and location of sub-directory
Add git version control to local directory
```{r}
usethis::use_git()
```
- answer yes when asked whether to commit the initial files
- answer yes when asked whether to restart RStudio
- Connect local repo with remote repo
- open your repository site on GitHub
- copy the commands under “…or push an existing repository from the command line”
- open the Terminal within RStudio IDE (tab next to Console)
- paste and execute the commands you copied from GitHub
```{}
git remote add origin URL
git branch -M main
git push -u origin main
```
- Add data-raw/ directory to package
```{r}
usethis::use_data_raw()
```
This will create a data-raw/ directory.
- it contains a DATASET.R script
- rename it to data_processing.R
Add, commit and push all changes to GitHub
If you have an external data donation, open issue 1 on the GitHub repository to communicate about raw data submission (example for issue template)
If you have the raw data available, add it to the data-raw directory using your file management system
Prepare the initial import and export of data in data_processing.R
- replace dataset with a short name in small letters that describes your data
- if the package will consist of a single dataset, choose the same name as for your package
- if you will have several datasets in your package, choose a short name (one word) in small letters for each
```{r}
# description -------------------------------------------------------------
# R script to process uploaded raw data into a tidy dataframe
# load packages -----------------------------------------------------------
library(readr)
# read data ---------------------------------------------------------------
dataset <- read_csv("data-raw/data.csv")
# tidy data ---------------------------------------------------------------
## code to prepare a tidy, analysis-ready dataset goes here
# write data --------------------------------------------------------------
usethis::use_data(dataset, overwrite = TRUE)
```
Add, commit and push all changes to GitHub
Prepare a data dictionary to document the variables of the exported dataset
- create a new text file named dictionary.csv in the data-raw directory
- add the following columns as the first line of the file: directory, file_name, variable_name, variable_type, description
- in column variable_name list all variables of your dataset
- in column description provide a one line description for each variable
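Typing the variable names by hand is error-prone, so the skeleton of the dictionary can also be generated from the tidy data frame. This is a minimal sketch in base R: the helper make_dictionary_skeleton() is hypothetical, the toy data frame stands in for your own dataset, and the values used for directory and file_name are assumptions rather than values prescribed by the tutorial.

```r
# Hypothetical helper (not part of openwashdata): build a dictionary skeleton
# with the five expected columns; descriptions are left empty to fill in.
make_dictionary_skeleton <- function(df, file_name, directory = "data") {
  data.frame(
    directory = directory,
    file_name = file_name,
    variable_name = names(df),
    variable_type = unname(vapply(df, function(x) class(x)[1], character(1))),
    description = NA_character_,
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for the tidy dataset exported in data_processing.R
dataset <- data.frame(
  location = c("A", "B"),
  weight_kg = c(1.5, 2.25),
  stringsAsFactors = FALSE
)

dictionary <- make_dictionary_skeleton(dataset, file_name = "dataset.rda")
print(dictionary)

# Uncomment inside the package project to write the skeleton to disk:
# write.csv(dictionary, "data-raw/dictionary.csv", row.names = FALSE, na = "")
```

The description column still has to be written by hand, ideally together with the data donor.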
On GitHub, open an issue to cross-check with the data donor that the variables in dictionary.csv are understood correctly
Install openwashdata R package
```{r}
devtools::install_github("openwashdata/openwashdata")
```
- Initiate documentation folder for writing up metadata and documentation for objects
- replace dataset with the same name that you used for the object you exported with usethis::use_data() in data_processing.R
```{r}
usethis::use_r("dataset")
```
- opens a new R script
- Add documentation from data dictionary to script as roxygen comments
```{r}
openwashdata::generate_roxygen_docs("data-raw/dictionary.csv", output_file_path = "R/dataset.R")
```
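To give a sense of what roxygen comments for a dataset look like, the sketch below assembles the standard roxygen2 layout for documenting data from two dictionary rows. It is not the openwashdata implementation, and the exact output of generate_roxygen_docs() may differ; the toy dictionary rows and the title and description text are placeholders.

```r
# Illustration only: standard roxygen2 layout for documenting a dataset,
# built from a toy dictionary. NOT the openwashdata implementation.
dictionary <- data.frame(
  variable_name = c("location", "weight_kg"),
  description = c("Name of the sampling location",
                  "Weight of the sample in kilograms"),
  stringsAsFactors = FALSE
)

# One \item{name}{description} line per variable
items <- sprintf("#'   \\item{%s}{%s}",
                 dictionary$variable_name, dictionary$description)

roxygen_block <- c(
  "#' dataset: short title of the dataset",
  "#'",
  "#' One or two sentences describing the dataset.",
  "#'",
  sprintf("#' @format A tibble with %d variables:", nrow(dictionary)),
  "#' \\describe{",
  items,
  "#' }",
  "\"dataset\""
)

cat(roxygen_block, sep = "\n")
```

The final quoted name must match the object exported with usethis::use_data().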
- example of durbanplasticwaste documentation:
Add, commit and push all changes to GitHub
Document the package and run checks for errors, warnings, and notes
```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check() # keyboard shortcut: Cmd/Ctrl + Shift + E
```
- make sure all errors are fixed
- ignore warnings for now
- Open the DESCRIPTION file to document the package
- edit title field (keep this short)
- edit description field (keep it short and to the point describing the data)
- Add everyone that has contributed to the Authors@R field in DESCRIPTION
- ensure you have the ORCID iD of everyone (e.g. by opening an issue on GitHub and asking people to share their details)
For each person, replace details and run the following code in your Console:
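A minimal sketch of such a call, assuming usethis::use_author() is available (usethis 2.2.0 or later). All person details are placeholders, and the ORCID iD shown is the fictional example iD used in ORCID's own documentation. The runnable part only builds the person entry with base R so the details can be checked before they are written to DESCRIPTION.

```r
# Placeholder details: replace with the real contributor's information.
contributor <- utils::person(
  given = "Jane",
  family = "Doe",
  email = "jane.doe@example.org",
  role = "aut",
  comment = c(ORCID = "0000-0002-1825-0097")
)
print(contributor)

# Inside the package project, the same details can be added to the
# Authors@R field of DESCRIPTION (assumes usethis >= 2.2.0):
# usethis::use_author(
#   given = "Jane",
#   family = "Doe",
#   email = "jane.doe@example.org",
#   role = "aut",
#   comment = c(ORCID = "0000-0002-1825-0097")
# )
```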
Add CC-BY or CC0 license
In your Console, run usethis::use_ccby_license() for CC-BY or usethis::use_cc0_license() for CC0
- Document the package and run checks for errors, warnings, and notes
```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check() # keyboard shortcut: Cmd/Ctrl + Shift + E
```
- Add CITATION.cff file to repo
```{r}
cffr::cff_write()
```
Add, commit and push all changes to GitHub
Create an R Markdown README for the package
```{r}
usethis::use_readme_rmd()
```
- remove everything starting from line 41 “What is special about using README.Rmd instead of …”
- add table with variable descriptions
```{r}
readr::read_csv("data-raw/dictionary.csv") |>
  dplyr::select(variable_name:description) |>
  gt::gt()
```
- add the first ten rows of the data
```{r}
dataset |>
  head(10) |>
  gt::gt()
```
- Build the README.Rmd to output README.md
```{r}
devtools::build_readme()
```
Add, commit and push all changes to GitHub
Create an examples article for the package
```{r}
usethis::use_article("examples")
```
- prepare some exploratory analysis (e.g. a plot, a map, a table)
- Document the package and run checks for errors, warnings, and notes
```{r}
devtools::document() # keyboard shortcut: Cmd/Ctrl + Shift + D
devtools::check() # keyboard shortcut: Cmd/Ctrl + Shift + E
```
Part 3 - Setup pkgdown website
- Set up the pkgdown configuration (this creates _pkgdown.yml and adds docs to .gitignore)
```{r}
usethis::use_pkgdown()
```
- Remove docs from the .gitignore file, so that it reads
```{}
.Rproj.user
.Rhistory
.Rdata
.httr-oauth
.DS_Store
.quarto
```
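The entry can be removed by hand in any text editor. As an alternative, here is a minimal sketch in base R; the helper remove_docs_entry() is hypothetical, and it is demonstrated on a temporary file rather than on a real .gitignore.

```r
# Hypothetical helper: drop the "docs" entry (in its common spellings)
# from a .gitignore file and write the remaining lines back.
remove_docs_entry <- function(path = ".gitignore") {
  lines <- readLines(path)
  kept <- lines[!trimws(lines) %in% c("docs", "docs/", "/docs", "/docs/")]
  writeLines(kept, path)
  invisible(kept)
}

# Demonstration on a temporary file
tmp <- tempfile()
writeLines(c(".Rproj.user", ".Rhistory", "docs", ".DS_Store"), tmp)
remove_docs_entry(tmp)
readLines(tmp)
```

Inside the package project, calling remove_docs_entry() without arguments would edit the .gitignore in the working directory.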
- Build website with pkgdown
```{r}
pkgdown::build_site()
```
- Add, commit and push all changes to GitHub
Part 4 - Host with GitHub Pages
- Deploy to GitHub Pages
- open your repository site on GitHub
- click the Settings (gear icon) option
- on the left sidebar under “Code and automation” click “Pages”
- under “Build and deployment”, in “Branch” section
- select “main” from Dropdown menu that states “None”
- select “docs” from Dropdown menu that states “/ (root)”
- click “Save” button
- Edit repository details
- go back to the main page of your repository on GitHub
- on the right side, next to “About” click on the gear icon
- provide a short description (e.g. the same as you have used for describing your dataset)
- click the box next to “Use your GitHub Pages website”
- click “Save changes” button
Part 5 - Use web analytics
- Add plausible.io script and website URL to _pkgdown.yml
- open your repository in RStudio IDE
- open the _pkgdown.yml file
- copy the content below and replace the two instances of PACKAGENAME with the name of your package
```{yaml}
url: https://global-health-engineering.github.io/PACKAGENAME/
template:
  bootstrap: 5
  includes:
    in_header: |
      <script defer data-domain="global-health-engineering.github.io/PACKAGENAME" src="https://plausible.io/js/script.js"></script>
```
- a more detailed example is available for the durbanplasticwaste package
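Replacing both instances of PACKAGENAME by hand is easy to get wrong, so the substitution can also be scripted. This is a minimal sketch in base R that fills in the template shown in this part; the package name is only an example, and the write step is commented out because it would overwrite an existing _pkgdown.yml.

```r
# Template from this part, with YAML indentation made explicit
template <- c(
  "url: https://global-health-engineering.github.io/PACKAGENAME/",
  "template:",
  "  bootstrap: 5",
  "  includes:",
  "    in_header: |",
  "      <script defer data-domain=\"global-health-engineering.github.io/PACKAGENAME\" src=\"https://plausible.io/js/script.js\"></script>"
)

# Example value: replace with the name of your own package
package_name <- "durbanplasticwaste"

config <- gsub("PACKAGENAME", package_name, template, fixed = TRUE)
cat(config, sep = "\n")

# Uncomment inside the package project (overwrites the existing file):
# writeLines(config, "_pkgdown.yml")
```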
Part 6 - Publish
TODO
- publish to Zenodo
- add DOI everywhere
- add to research collection