Data management in dplyr

In this section, I will discuss about data management process, basic data cleaning and management in dplyr.

Data can be defined as a collection of facts that can be used to draw a conclusion, make predictions and assist decision making.

Public health data may originate from various sources including:

Routine disease surveillance system such as hospital medical records (individual level information) or aggregated data captured through a reporting software - District Health Information Software (DHIS2)
Research and surveys - Demographic Health Survey, Malaria Indicator Survey, Service Provision Assessment
Administrative - logistic data
Vital statistics - birth, death and marriage
Literature and reports

- ls() - Lists all active files.
- rm(list = ls()) - Cleans your environment.
- Ctrl + L - Cleans the console.
- Assign function: Alt + -.
- Ctrl + 1 - shortcut to move to source pane
- Ctrl + 2 - shortcut to move to console pane

Step-by-Step Guide

1. Setting Up Your Environment

Check your working directory using:

getwd()

[1] "C:/Users/User/OneDrive/Documents/GitHub/yalemgelaw.github.io/posts/2024-06-10-Data-manipulation-in-dplyr"

2. Load the necessary packages

For data management, you will use the tidyverse package. Install the package if you haven’t installed before using install.packages(“tidyverse”) function.

# Install necessary packages if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("janitor", quietly = TRUE)) install.packages("janitor")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("lubridate", quietly = TRUE)) install.packages("lubridate")
if (!requireNamespace("haven", quietly = TRUE)) install.packages("haven")

# Load the packages
library(tidyverse)   # To read Excel files

Warning: package 'tidyverse' was built under R version 4.3.3

Warning: package 'ggplot2' was built under R version 4.3.3

Warning: package 'tibble' was built under R version 4.3.3

Warning: package 'tidyr' was built under R version 4.3.3

Warning: package 'readr' was built under R version 4.3.3

Warning: package 'purrr' was built under R version 4.3.3

Warning: package 'forcats' was built under R version 4.3.3

Warning: package 'lubridate' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)     # To clean column names

Warning: package 'janitor' was built under R version 4.3.3


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(dplyr)       # For data wrangling
library(lubridate)   # Date functions
library(haven)       # To read stata file

Warning: package 'haven' was built under R version 4.3.3

3. Reading and Cleaning Data

Read the malaria data: The data I used for this post is a sample of routine malaria surveillance data from Ethiopia. The data you’re collected monthly and collated at the district level (third administrative system) and stored in csv file format.

malaria_data <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/data/routine_data.csv")

Rows: 11604 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): region, zone, district, Month, date
dbl (9): Year, test_performed, confirmed_all, confirmed_u5, confirmed_5_14, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Safeguard your data: Make a backup of the data:

When dealing with large datasets, it’s essential to take precautions. Imagine your data disappearing into the digital abyss—scary, right? To prevent that, always make a backup. Think of it as a safety net for your precious data. Whether you’re crunching numbers, visualizing trends, or building models, follow this golden rule: Back it up before you hack it up🔒!

routine_data <- malaria_data #Make a backup

3. Skim the data

Knowing your dataset well from file size to data types is another crucial step prior to hands-on data cleaning.

To look at how the data looks like either by clicking on it in the global environment window or by typing the command View(routine_data) which opens up a window displaying the data.

View(routine_data)

Alternatively, you may just want to look at a few rows. You can do this by using the head() function, which shows us the first six rows of data and tail() function, which shows us the lass six rows of the data.

head(routine_data) # top six rows

# A tibble: 6 × 14
  region      zone       district Month  Year date  test_performed confirmed_all
  <chr>       <chr>      <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>
1 Addis Ababa Addis Ket… Addis K… July   2021 1/07…             97             3
2 Addis Ababa Addis Ket… Addis K… Augu…  2021 1/08…            148            70
3 Addis Ababa Addis Ket… Addis K… Sept…  2021 1/09…            142             5
4 Addis Ababa Addis Ket… Addis K… Octo…  2021 1/10…            190            93
5 Addis Ababa Addis Ket… Addis K… Nove…    21 1/11…            194            87
6 Addis Ababa Addis Ket… Addis K… Dece…  2021 1/12…            194            25
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

tail(routine_data) # bottom six rows

To understand the structure of the data you can use the str() command or glimpse().

glimpse(routine_data)

Rows: 11,604
Columns: 14
$ region         <chr> "Addis Ababa", "Addis Ababa", "Addis Ababa", "Addis Aba…
$ zone           <chr> "Addis Ketema", "Addis Ketema", "Addis Ketema", "Addis …
$ district       <chr> "Addis Ketema (AA)", "Addis Ketema (AA)", "Addis Ketema…
$ Month          <chr> "July", "August", "September", "October", "November", "…
$ Year           <dbl> 2021, 2021, 2021, 2021, 21, 2021, 2021, 2021, 2021, 202…
$ date           <chr> "1/07/2021", "1/08/2021", "1/09/2021", "1/10/2021", "1/…
$ test_performed <dbl> 97, 148, 142, 190, 194, 194, 146, 189, 243, 184, 166, 1…
$ confirmed_all  <dbl> 3, 70, 5, 93, 87, 25, 62, 72, 18, 23, 15, 43, 17, 3, 10…
$ confirmed_u5   <dbl> 1, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 1, 2…
$ confirmed_5_14 <dbl> 0, 1, 0, 6, 3, 1, 2, 2, 0, 2, 0, 4, 1, 0, 0, 0, 0, 0, 1…
$ confirmed_15   <dbl> 2, 69, 5, 87, 78, 24, 60, 70, 18, 21, 15, 37, 14, 3, 10…
$ pop_5          <dbl> 50741.83, 50741.83, 50741.83, 50741.83, 50741.83, 50741…
$ pop_514        <dbl> 88159.40, 88159.40, 88159.40, 88159.40, 88159.40, 88159…
$ pop_15         <dbl> 212193.8, 212193.8, 212193.8, 212193.8, 212193.8, 21219…

From the above results, you can see that the data frame consists of 11,604 observations (rows) and 14 variables (columns). Each variable’s name and data type is also listed. The dataset can be read as data.frame, lists, tbl_df, spatial. Please refer my previous post on Mastering RStudio: A Beginner’s Guide for detail note about data structure in R and supplementary YouTube videos .

class(routine_data)

For a data frame you can select the nth row, or the nth column using square brackets (note where the comma is paced).

head(routine_data[,2], 5) #second column

routine_data[1,] #firts row

To view all variable names with the names(), colnames() , and variable.names() function.

colnames(routine_data)
names(routine_data)
variable.names(routine_data)

To view variable with the position number:

names(routine_data)[5]

There are several incorrect data types in this dataset, but let’s continue using the “date” variable to demonstrate how to identify and update these errors:

typeof(routine_data$date)

“Character” is returned but the variable should in fact be a date. you can use the as.Date function of lubridate package to change the data type accordingly:

 #as.Date(routine_data$date) # to date

*Convert to date from a character variable*

String inconsistencies: This includes typos, capitalization errors, misplaced punctuation, or similar character data errors that might interfere with data analysis. Take for instance your “Year” column. As you can see there are wrongly entered year 21 and 3021.

unique(routine_data$Year)

[1] 2021   21 3021

Outliers:

The dataset has 8 continuous variables: test_performed, confirmed_all, confirmed_u5, confirmed_5_14, confirmed_15, population, pop_5, pop_514, pop_15. To get a feeling for how the data is distributed, you can plot histograms for case variables:

Under 5 confirmed malaria cases:

hist(routine_data$confirmed_u5)

All age confirmed malaria cases:

hist(routine_data$confirmed_all)

Data cleaning is one of the most important steps for analysis! No matter where your data comes from, always be sure the completeness, consistency, and trustworthiness of data before analyzing and utilizing the data. Once you get cleaned and organized, you can perform analysis to find clear and objective answers to any data question.

Data manipulation in dplyr()

In this section, you’ll explore the powerful dplyr package, part of the tidyverse ecosystem. Tidyverse is a collection of R packages for data science, designed to make cleaning and analyses of data easy and tidy.

dplyr streamlines your data wrangling process, making it easier to work with data frames.

Let’s dive into the key functions:

Note

select(): Choose specific variables (columns) from a data frame based on their names.

rename(): Transform and rename variables.

filter(): Select rows that meet specific criteria.

mutate(): Create new variables by applying functions to existing ones.

group_by(): Perform operations by group.

summarize(): Aggregate data.

arrange(): Order rows based on a specified column.

distinct(): Remove duplicate rows based on specified columns.

These functions simplify common data manipulation tasks.

The Pipe Operator `%>% OR "|>"`

This operator allows you to chain commands together. Instead of creating intermediate variables, you can directly link functions.

You use pipes when you creating intermediate variables to link commands together . For example, the idea is instead of using :

select(routine_data, Year)

# A tibble: 11,604 × 1
    Year
   <dbl>
 1  2021
 2  2021
 3  2021
 4  2021
 5    21
 6  2021
 7  2021
 8  2021
 9  2021
10  2021
# ℹ 11,594 more rows

Building blocks that code readability and reproducibility, making your data workflows more efficient.

routine_data %>% 
  select(Year)

# A tibble: 11,604 × 1
    Year
   <dbl>
 1  2021
 2  2021
 3  2021
 4  2021
 5    21
 6  2021
 7  2021
 8  2021
 9  2021
10  2021
# ℹ 11,594 more rows

Selecting Columns:

The select() function in R allows you to choose specific columns (variables) from a data frame. Specify the data frame name as the first argument, followed by the column names you want to keep. Example:

routine_data %>% 
  select(region,zone,district, 
       Month,date,confirmed_all,test_performed)

# A tibble: 11,604 × 7
   region      zone         district    Month date  confirmed_all test_performed
   <chr>       <chr>        <chr>       <chr> <chr>         <dbl>          <dbl>
 1 Addis Ababa Addis Ketema Addis Kete… July  1/07…             3             97
 2 Addis Ababa Addis Ketema Addis Kete… Augu… 1/08…            70            148
 3 Addis Ababa Addis Ketema Addis Kete… Sept… 1/09…             5            142
 4 Addis Ababa Addis Ketema Addis Kete… Octo… 1/10…            93            190
 5 Addis Ababa Addis Ketema Addis Kete… Nove… 1/11…            87            194
 6 Addis Ababa Addis Ketema Addis Kete… Dece… 1/12…            25            194
 7 Addis Ababa Addis Ketema Addis Kete… Janu… 1/01…            62            146
 8 Addis Ababa Addis Ketema Addis Kete… Febr… 1/02…            72            189
 9 Addis Ababa Addis Ketema Addis Kete… March 1/03…            18            243
10 Addis Ababa Addis Ketema Addis Kete… April 1/04…            23            184
# ℹ 11,594 more rows

To select columns that have common prefix or suffix, you can use the start_witth() contains(), or end_with() functions. Example to subset columns that captured confirmed cases

routine_data %>% 
  select(
    starts_with("confirmed")
  )

# A tibble: 11,604 × 4
   confirmed_all confirmed_u5 confirmed_5_14 confirmed_15
           <dbl>        <dbl>          <dbl>        <dbl>
 1             3            1              0            2
 2            70            0              1           69
 3             5            0              0            5
 4            93            0              6           87
 5            87            6              3           78
 6            25            0              1           24
 7            62            0              2           60
 8            72            0              2           70
 9            18            0              0           18
10            23            0              2           21
# ℹ 11,594 more rows

contains()

routine_data %>% 
  select(
    contains("confirmed")
  )

# A tibble: 11,604 × 4
   confirmed_all confirmed_u5 confirmed_5_14 confirmed_15
           <dbl>        <dbl>          <dbl>        <dbl>
 1             3            1              0            2
 2            70            0              1           69
 3             5            0              0            5
 4            93            0              6           87
 5            87            6              3           78
 6            25            0              1           24
 7            62            0              2           60
 8            72            0              2           70
 9            18            0              0           18
10            23            0              2           21
# ℹ 11,594 more rows

end_with()

routine_data %>% 
  select(
    ends_with("5")
  )

# A tibble: 11,604 × 4
   confirmed_u5 confirmed_15  pop_5  pop_15
          <dbl>        <dbl>  <dbl>   <dbl>
 1            1            2 50742. 212194.
 2            0           69 50742. 212194.
 3            0            5 50742. 212194.
 4            0           87 50742. 212194.
 5            6           78 50742. 212194.
 6            0           24 50742. 212194.
 7            0           60 50742. 212194.
 8            0           70 50742. 212194.
 9            0           18 50742. 212194.
10            0           21 50742. 212194.
# ℹ 11,594 more rows

Filtering Rows:

The filter() function helps you retain only the rows (observations) that meet specific conditions.

For instance: To select rows for the Tigray region:

routine_data%>% 
  filter(region == "Tigray")

# A tibble: 552 × 14
   region zone    district     Month     Year date  test_performed confirmed_all
   <chr>  <chr>   <chr>        <chr>    <dbl> <chr>          <dbl>         <dbl>
 1 Tigray Central Abi Adi town July      2021 1/07…            407           111
 2 Tigray Central Abi Adi town August    2021 1/08…            277            23
 3 Tigray Central Abi Adi town Septemb…  2021 1/09…            192            23
 4 Tigray Central Abi Adi town October   2021 1/10…             NA            NA
 5 Tigray Central Abi Adi town November  2021 1/11…             NA            NA
 6 Tigray Central Abi Adi town December  2021 1/12…             NA            NA
 7 Tigray Central Abi Adi town January   2021 1/01…             NA            NA
 8 Tigray Central Abi Adi town February  2021 1/02…             NA            NA
 9 Tigray Central Abi Adi town March     2021 1/03…             NA            NA
10 Tigray Central Abi Adi town April     2021 1/04…             NA            NA
# ℹ 542 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

To choose rows with more than 500 confirmed cases in children under 5:

routine_data %>% 
  filter(confirmed_u5 > 500)

# A tibble: 16 × 14
   region          zone  district Month  Year date  test_performed confirmed_all
   <chr>           <chr> <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>
 1 Gambella        Gamb… Gambela… Octo…  2021 1/10…           5221          2317
 2 Gambella        Gamb… Gambela… Nove…  2021 1/11…           5285          2556
 3 Gambella        Itan… Itang    Dece…  2021 1/12…            504          1271
 4 Somali          Afder Barey    Augu…  2021 1/08…           2202          2192
 5 Somali          Afder Barey    Octo…  2021 1/10…           1950          1950
 6 Beneshangul Gu… Asso… Sherkole July   2021 1/07…           5132          3984
 7 Amhara          Awi   Jawi     Sept…  2021 1/09…           8710          3051
 8 Amhara          Awi   Jawi     Octo…  2021 1/10…          11391          3616
 9 Amhara          Cent… Gonder … Octo…  2021 1/10…          10287          4593
10 Amhara          Cent… Gonder … Nove…  2021 1/11…          11089          4564
11 Amhara          Sout… Dera (A… June   2021 1/06…          13008          6017
12 Amhara          Sout… Fogera   Octo…  2021 1/10…          14768          5357
13 Amhara          Sout… Fogera   Nove…  2021 1/11…          19448          8367
14 Amhara          Sout… Fogera   Dece…  2021 1/12…          10552          4078
15 SNNP            Silte Sankura  Octo…  2021 1/10…           4781          2144
16 SNNP            Wola… Boloso … May    2021 1/05…          11681          3762
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

To filter by multiple conditions (e.g., regions “Amhara” or “Afar” with at least 500 confirmed cases in children under 5 ):

routine_data %>% 
  filter(
    (region == "Amhara" | region == "Afar") &
    (confirmed_u5 >= 500))

# A tibble: 8 × 14
  region zone           district  Month  Year date  test_performed confirmed_all
  <chr>  <chr>          <chr>     <chr> <dbl> <chr>          <dbl>         <dbl>
1 Amhara Awi            Jawi      Sept…  2021 1/09…           8710          3051
2 Amhara Awi            Jawi      Octo…  2021 1/10…          11391          3616
3 Amhara Central Gondar Gonder Z… Octo…  2021 1/10…          10287          4593
4 Amhara Central Gondar Gonder Z… Nove…  2021 1/11…          11089          4564
5 Amhara South Gondar   Dera (AM) June   2021 1/06…          13008          6017
6 Amhara South Gondar   Fogera    Octo…  2021 1/10…          14768          5357
7 Amhara South Gondar   Fogera    Nove…  2021 1/11…          19448          8367
8 Amhara South Gondar   Fogera    Dece…  2021 1/12…          10552          4078
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

Identifying Erroneous Data:

You can use filter() to spot inconsistencies. For instance, find rows where the number of confirmed cases exceeds the number tested:

routine_data %>% 
  filter(confirmed_all > test_performed)

# A tibble: 97 × 14
   region   zone         district Month  Year date  test_performed confirmed_all
   <chr>    <chr>        <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>
 1 Afar     Zone 2       Abaala   Augu…  2021 1/08…             10            12
 2 Afar     Zone 4       Euwa     Octo…  2021 1/10…            191           243
 3 Gambella Agniwa       Jore     Dece…  2021 1/12…             79           158
 4 Gambella Agniwa       Jore     April  2021 1/04…             63            73
 5 Gambella Agniwa       Jore     May    2021 1/05…             75            84
 6 Gambella Itang Speci… Itang    Dece…  2021 1/12…            504          1271
 7 Gambella Itang Speci… Itang    Janu…  2021 1/01…            864           867
 8 Gambella Majang       Mengesh  May    2021 1/05…            118           198
 9 Gambella Nuer         Lare     Janu…  2021 1/01…            132           201
10 Gambella Nuer         Makuey   Janu…  2021 1/01…            243           360
# ℹ 87 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

Renaming variable

There may be situations when you want to rename variables in a data frame to make it more comprehensive and easier to process. The rename() function allows you to change column names in a data frame. It’s useful for making variable names more descriptive. you pass to this function the data frame you are working with, rename(dataframe, new_name = old_name).

Example, if you wanted to change the variable “confirmed_u5” to “conf_u5”, and to overwrite the object “routine_data” with this you would simply write:

routine_data %>% 
  rename(conf_u5 =confirmed_u5)

# A tibble: 11,604 × 14
   region  zone  district Month  Year date  test_performed confirmed_all conf_u5
   <chr>   <chr> <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>   <dbl>
 1 Addis … Addi… Addis K… July   2021 1/07…             97             3       1
 2 Addis … Addi… Addis K… Augu…  2021 1/08…            148            70       0
 3 Addis … Addi… Addis K… Sept…  2021 1/09…            142             5       0
 4 Addis … Addi… Addis K… Octo…  2021 1/10…            190            93       0
 5 Addis … Addi… Addis K… Nove…    21 1/11…            194            87       6
 6 Addis … Addi… Addis K… Dece…  2021 1/12…            194            25       0
 7 Addis … Addi… Addis K… Janu…  2021 1/01…            146            62       0
 8 Addis … Addi… Addis K… Febr…  2021 1/02…            189            72       0
 9 Addis … Addi… Addis K… March  2021 1/03…            243            18       0
10 Addis … Addi… Addis K… April  2021 1/04…            184            23       0
# ℹ 11,594 more rows
# ℹ 5 more variables: confirmed_5_14 <dbl>, confirmed_15 <dbl>, pop_5 <dbl>,
#   pop_514 <dbl>, pop_15 <dbl>

Creating new variable

The mutate() function lets you add new variables or modify existing ones. You use the = sign to assign new values . For example, to calculate the incidence rate of malaria in children under five per 1000 population using the formula (confirmed_u5 / pop_5) * 1000, you could write:

routine_data %>% 
  mutate(incidence_rate = (confirmed_u5/pop_5)*1000)

# A tibble: 11,604 × 15
   region      zone      district Month  Year date  test_performed confirmed_all
   <chr>       <chr>     <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>
 1 Addis Ababa Addis Ke… Addis K… July   2021 1/07…             97             3
 2 Addis Ababa Addis Ke… Addis K… Augu…  2021 1/08…            148            70
 3 Addis Ababa Addis Ke… Addis K… Sept…  2021 1/09…            142             5
 4 Addis Ababa Addis Ke… Addis K… Octo…  2021 1/10…            190            93
 5 Addis Ababa Addis Ke… Addis K… Nove…    21 1/11…            194            87
 6 Addis Ababa Addis Ke… Addis K… Dece…  2021 1/12…            194            25
 7 Addis Ababa Addis Ke… Addis K… Janu…  2021 1/01…            146            62
 8 Addis Ababa Addis Ke… Addis K… Febr…  2021 1/02…            189            72
 9 Addis Ababa Addis Ke… Addis K… March  2021 1/03…            243            18
10 Addis Ababa Addis Ke… Addis K… April  2021 1/04…            184            23
# ℹ 11,594 more rows
# ℹ 7 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>,
#   incidence_rate <dbl>

Altering existing variable

Suppose you noticed an error in the region name, where “South Western Ethiopia” was mistakenly entered as “Sou.” You can correct this using an ifelse statement within mutate(). The condition is region == "Sou", and if it’s met, you replace the value with “South Western Ethiopia.” Otherwise, you keep the original value.

Here’s how you can do it:

routine_data %>% 
  mutate(region = 
           ifelse(region == "Sou",
                  "South Western Ethiopia",
                  region))

This will update the region variable in your data frame.

Using case_when for Multiple Changes:

If you need to make multiple changes based on different conditions, consider using case_when instead of ifelse. Let’s say there are other errors in the district names. You can correct them simultaneously using case_when.

There is a district name mismatch between district names in DHIS2 and names in shapefile. For this you will use the task_data.csv

task_data <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/data/task_data.csv") %>% 
  data.frame() %>% 
 mutate(district_new = case_when(
 district == "Addis Ketema" ~ "Addis Ketema (AA)",
 district == "Nifas Silk Lafto" ~ "Nefas Silk",
TRUE ~ district)) %>% 
  select(-district) %>% 
  rename("district" = "district_new") #rename to district

Rows: 11604 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): region, zone, district, Month
dbl  (7): Year, female, male, conf_5, conf_514, conf_15, cases
date (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Change the Year column

routine_data %>% 
  mutate(Year = 
    case_when(Year == 21 ~ 2021,
              Year == 3021 ~ 2021,
              .default = Year
              )
  )

# A tibble: 11,604 × 14
   region      zone      district Month  Year date  test_performed confirmed_all
   <chr>       <chr>     <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>
 1 Addis Ababa Addis Ke… Addis K… July   2021 1/07…             97             3
 2 Addis Ababa Addis Ke… Addis K… Augu…  2021 1/08…            148            70
 3 Addis Ababa Addis Ke… Addis K… Sept…  2021 1/09…            142             5
 4 Addis Ababa Addis Ke… Addis K… Octo…  2021 1/10…            190            93
 5 Addis Ababa Addis Ke… Addis K… Nove…  2021 1/11…            194            87
 6 Addis Ababa Addis Ke… Addis K… Dece…  2021 1/12…            194            25
 7 Addis Ababa Addis Ke… Addis K… Janu…  2021 1/01…            146            62
 8 Addis Ababa Addis Ke… Addis K… Febr…  2021 1/02…            189            72
 9 Addis Ababa Addis Ke… Addis K… March  2021 1/03…            243            18
10 Addis Ababa Addis Ke… Addis K… April  2021 1/04…            184            23
# ℹ 11,594 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

unique(routine_data$Year)

[1] 2021   21 3021

Removes duplicate rows

The distinct() function removes duplicate rows from a data frame based on specified columns. You can use it to keep only unique rows or to remove duplicates based on specific columns.

To remove duplicate rows based on the district column, you can use:

routine_data %>% 
  distinct(district,
           .keep_all = TRUE)

# A tibble: 967 × 14
   region      zone      district Month  Year date  test_performed confirmed_all
   <chr>       <chr>     <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>
 1 Addis Ababa Addis Ke… Addis K… July   2021 1/07…             97             3
 2 Addis Ababa Akaki Ka… Akaki -… July   2021 1/07…            105            17
 3 Addis Ababa Arada     Arada    July   2021 1/07…             62             4
 4 Addis Ababa Bole      Bole     July   2021 1/07…             73             7
 5 Addis Ababa Gulele    Gulele   July   2021 1/07…             32             6
 6 Addis Ababa Kirkos    Kirkos   July   2021 1/07…            190            11
 7 Addis Ababa Kolfe     Kolfe -… July   2021 1/07…             36             3
 8 Addis Ababa Lideta    Lideta   July   2021 1/07…             32            14
 9 Addis Ababa Nifas Si… Nefas S… July   2021 1/07…             70             7
10 Addis Ababa Yeka      Yeka     July   2021 1/07…             68            22
# ℹ 957 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

routine_data %>% 
  select(district) %>% 
  distinct() %>% 
  count()

# A tibble: 1 × 1
      n
  <int>
1   967

Null values

Null values are treated differently in R. They appear as NA in the dataset, so you may expect the following code to work for filtering data to remove all missing values for the number of people tested for malaria:

routine_data %>% 
  filter(test_performed =NA)

# A tibble: 0 × 14
# ℹ 14 variables: region <chr>, zone <chr>, district <chr>, Month <chr>,
#   Year <dbl>, date <chr>, test_performed <dbl>, confirmed_all <dbl>,
#   confirmed_u5 <dbl>, confirmed_5_14 <dbl>, confirmed_15 <dbl>, pop_5 <dbl>,
#   pop_514 <dbl>, pop_15 <dbl>

However, this does not work as R has a special way of dealing with missing values. You use the is.na() command, which checks fo NA values. As with the equals command, if you want the reverse of this, i.e. “not NA” you can use !is.na(). So the code to remove missing values would be:

routine_data %>% 
  filter(!is.na(test_performed))

# A tibble: 9,869 × 14
   region      zone      district Month  Year date  test_performed confirmed_all
   <chr>       <chr>     <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>
 1 Addis Ababa Addis Ke… Addis K… July   2021 1/07…             97             3
 2 Addis Ababa Addis Ke… Addis K… Augu…  2021 1/08…            148            70
 3 Addis Ababa Addis Ke… Addis K… Sept…  2021 1/09…            142             5
 4 Addis Ababa Addis Ke… Addis K… Octo…  2021 1/10…            190            93
 5 Addis Ababa Addis Ke… Addis K… Nove…    21 1/11…            194            87
 6 Addis Ababa Addis Ke… Addis K… Dece…  2021 1/12…            194            25
 7 Addis Ababa Addis Ke… Addis K… Janu…  2021 1/01…            146            62
 8 Addis Ababa Addis Ke… Addis K… Febr…  2021 1/02…            189            72
 9 Addis Ababa Addis Ke… Addis K… March  2021 1/03…            243            18
10 Addis Ababa Addis Ke… Addis K… April  2021 1/04…            184            23
# ℹ 9,859 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

Another method for removing missing data in tidyverse is using the drop_na() function from {tidyr} package. As with the filter function this takes the dataset as the first argument, followed by the variables for which you are dropping NA values.

routine_data_no_NA <- routine_data %>% 
  drop_na(test_performed)

Sorting and reordering data frames

Sorting a data frame by rows and reordering columns is easy in R. To sort a data frame by a column you use the function arrange(). You specify the data frame and the column to sort by, and the default is to sort in ascending order. To sort in a descending order you can specify this with desc(). Additionally, you can sort by multiple variables, and sorting will be undertaken in the order they appear in the command.

routine_data %>% 
  arrange(confirmed_u5) # the default is asce

routine_data %>% 
  arrange(desc(confirmed_u5))

# A tibble: 11,604 × 14
   region          zone  district Month  Year date  test_performed confirmed_all
   <chr>           <chr> <chr>    <chr> <dbl> <chr>          <dbl>         <dbl>
 1 Gambella        Gamb… Gambela… Nove…  2021 1/11…           5285          2556
 2 Amhara          Sout… Fogera   Nove…  2021 1/11…          19448          8367
 3 Amhara          Sout… Dera (A… June   2021 1/06…          13008          6017
 4 Beneshangul Gu… Asso… Sherkole July   2021 1/07…           5132          3984
 5 Somali          Afder Barey    Augu…  2021 1/08…           2202          2192
 6 Gambella        Itan… Itang    Dece…  2021 1/12…            504          1271
 7 SNNP            Silte Sankura  Octo…  2021 1/10…           4781          2144
 8 Gambella        Gamb… Gambela… Octo…  2021 1/10…           5221          2317
 9 Somali          Afder Barey    Octo…  2021 1/10…           1950          1950
10 Amhara          Sout… Fogera   Dece…  2021 1/12…          10552          4078
# ℹ 11,594 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
#   confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>

How to change the order of the column

The relocate() function is part of the dplyr package in R. It allows you to change the order of columns in a data frame. You can specify where a particular column should be placed relative to other columns using the .before and .after arguments.

The basic syntax of the relocate() function as follows:

relocate(data, column_name, .before = target_column, .after = target_column)

data: The data frame containing the columns.
column_name: The name of the column you want to move.
.before: Specify the column name before which the target column should be placed.
.after: Specify the column name after which the target column should be placed.

Example: Suppose we have a data frame called task_data with columns: “zone,” “district,” and “Month.” We want to move the “district” column after the “zone” column and before the “Month” column.

task_data %>% 
  relocate(district, .before =  Month) %>% 
  head()

       region         zone          district     Month Year       Date female
1 Addis Ababa Addis Ketema Addis Ketema (AA)      July 2022 2022-01-01     12
2 Addis Ababa Addis Ketema Addis Ketema (AA)    August 2022 2022-01-01      5
3 Addis Ababa Addis Ketema Addis Ketema (AA) September 2022 2022-01-01     22
4 Addis Ababa Addis Ketema Addis Ketema (AA)   October 2022 2022-01-01      5
5 Addis Ababa Addis Ketema Addis Ketema (AA)  November 2022 2022-01-01      5
6 Addis Ababa Addis Ketema Addis Ketema (AA)  December 2022 2022-01-01      2
  male conf_5 conf_514 conf_15 cases
1   19      2        2      27    31
2    6      1        1       9    11
3   50      0        0      72    72
4   13      1        3      14    18
5   13      0        1      17    18
6    8      0        0      10    10

Summarizing data

There are some useful functions in tidyverse to help you summarize the data. The first of these is the count() function. This is a quick function which will allow you to quickly count the occurrences of an item within a dataset.

routine_data %>% 
  filter(region=="Amhara") %>% 
  group_by(region) %>% 
  count(district)

# A tibble: 165 × 3
# Groups:   region [1]
   region district             n
   <chr>  <chr>            <int>
 1 Amhara Abergele (AM)       12
 2 Amhara Adagn Ager Chaqo    12
 3 Amhara Addi Arekay         12
 4 Amhara Albuko              12
 5 Amhara Alfa                12
 6 Amhara Ambasel             12
 7 Amhara Aneded              12
 8 Amhara Angolelana Tera     12
 9 Amhara Angot               12
10 Amhara Ankasha             12
# ℹ 155 more rows

By including multiple variables in the command you can count the numbers of times that combination of variables appears.

routine_data %>% 
  count(region, Month)

# A tibble: 156 × 3
   region      Month        n
   <chr>       <chr>    <int>
 1 Addis Ababa April       10
 2 Addis Ababa August      10
 3 Addis Ababa December    10
 4 Addis Ababa February    10
 5 Addis Ababa January     10
 6 Addis Ababa July        10
 7 Addis Ababa June        10
 8 Addis Ababa March       10
 9 Addis Ababa May         10
10 Addis Ababa November    10
# ℹ 146 more rows

If you want to summarize numerical variables you can use the function summarise(). This is used in conjunction with other mathematical functions such as sum(), mean(), median(), max()..

task_data %>% 
  summarise(total_case = sum(female, male, na.rm = TRUE),
          mean_cases = mean(conf_5),
          median_cases = median(conf_5),
          max_cases = max(conf_5),
          sd_case = sd(conf_5))

  total_case mean_cases median_cases max_cases  sd_case
1    1644107   24.09204            1      1441 66.48492

you can combine the summarise() function with group_by() to summarize the data by different variables in the dataset. To calculate the total number of people tested and positive for malaria in each district in our dataset, you would group by this variable first and then summarize the data. Grouping is not restricted to one variable, if you wanted to group the data by location and date then both variables would be included in the command. When you use the sum() function , na.rm = T logic required if the column has NULL values

routine_data %>%  group_by(region) %>% 
summarise(total_test = sum(test_performed),

          total_positive = sum(confirmed_all),

          total_u5 = sum(confirmed_u5),

          total_514 = sum(confirmed_5_14),

          total_ov15 = sum(confirmed_15))

# A tibble: 13 × 6
   region            total_test total_positive total_u5 total_514 total_ov15
   <chr>                  <dbl>          <dbl>    <dbl>     <dbl>      <dbl>
 1 Addis Ababa            12671           1657       92       111       1454
 2 Afar                      NA             NA       NA        NA         NA
 3 Amhara                    NA             NA       NA        NA         NA
 4 Beneshangul Gumuz         NA             NA       NA        NA         NA
 5 Dire Dawa                 NA             NA       NA        NA         NA
 6 Gambella                  NA             NA       NA        NA         NA
 7 Harari                    NA             NA       NA        NA         NA
 8 Oromiya                   NA             NA       NA        NA         NA
 9 SNNP                      NA             NA       NA        NA         NA
10 Sidama                    NA             NA       NA        NA         NA
11 Somali                    NA             NA       NA        NA         NA
12 Sou                       NA             NA       NA        NA         NA
13 Tigray                    NA             NA       NA        NA         NA

Subsetting data

Subsetting refers to extracting a portion of your data based on specific conditions. In this case, we want to focus on data related to the Amhara region.
The select() function allows us to choose specific columns from our data frame. We’ll keep the following columns: “region,” “zone,” “district,” “test_performed,” “confirmed_all,” and “confirmed_u5.”
We’ll group our data by the “zone” column using the group_by() function. This means that subsequent calculations will be performed within each zone.
The summarise() function computes summary statistics for each group (in this case, each zone).

routine_data %>% 
 select(region, zone, district,
         test_performed,confirmed_all,confirmed_u5) %>%  
         filter(region%in%"Amhara") %>%  
  group_by(zone) %>% 
  summarise(total_positive = sum(confirmed_all, na.rm = T),
          total_u5 = sum(confirmed_u5, na.rm = T),
          prop_u5 = round((total_u5/total_positive)*100,2))

# A tibble: 15 × 4
   zone           total_positive total_u5 prop_u5
   <chr>                   <dbl>    <dbl>   <dbl>
 1 Awi                     30757     4679   15.2 
 2 Bahirdar                16611     2630   15.8 
 3 Central Gondar          99939    11600   11.6 
 4 Dessie                     44        1    2.27
 5 East Gojjam             39948     1586    3.97
 6 Gondar                   5884      459    7.8 
 7 North Gondar            11811      994    8.42
 8 North Shewa             10752      541    5.03
 9 North Wollo             12704     1277   10.0 
10 Oromia Special           5693     1040   18.3 
11 South Gondar           107618    12023   11.2 
12 South Wollo              8356      488    5.84
13 Waghimera               11551     1185   10.3 
14 West Gojjam             41792     5132   12.3 
15 West Gondar             73432     9534   13.0

Know to take everything you have learnt to import and clean the routine dataset. If you want your output to contain the total numbers of malaria tests performed and the number of confirmed cases in children under 5, people over 5, and calculate a total for all ages. you want to have the total by district level and Year in the dataset. This is how you would go about building the code.

clean_routine_data <- routine_data %>% 
  # subset
dplyr::select(region, zone, district, Month, Year,
         test_performed,confirmed_all,
         confirmed_u5,confirmed_5_14,confirmed_15) %>% 
  drop_na(test_performed,confirmed_all,
         confirmed_u5,confirmed_5_14,confirmed_15) %>% 
  # filter rows if test number is less than confirmed
  filter(test_performed>confirmed_all) %>% 
  # update region
  mutate(region = case_when(
    region =='Sou'~'South Western Ethiopia',
             TRUE~region),
    # update year
             year = case_when(Year == 3021 ~ 2021,
                          Year == 21 ~ 2021,
                          TRUE ~ Year),
    # to date
         date_reported = make_date(year = Year,
                                   month = Month), 
         conf_ov5 = confirmed_5_14+confirmed_15) %>%
  # aggregate by region, zone, and district
  group_by(region, zone, district) %>% 
  summarise(test_total = sum(test_performed),
            conf_total = sum(confirmed_all),
            conf_u5 = sum(confirmed_u5),
           conf_ov5 = sum(conf_ov5)) %>% 
  # deselect test_total, -conf_total 
  dplyr::select(-test_total, -conf_total)

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date_reported = make_date(year = Year, month = Month)`.
Caused by warning in `make_date()`:
! NAs introduced by coercion

`summarise()` has grouped output by 'region', 'zone'. You can override using
the `.groups` argument.

# top six
head(clean_routine_data)

# A tibble: 6 × 5
# Groups:   region, zone [6]
  region      zone         district          conf_u5 conf_ov5
  <chr>       <chr>        <chr>               <dbl>    <dbl>
1 Addis Ababa Addis Ketema Addis Ketema (AA)       9      507
2 Addis Ababa Akaki Kality Akaki - Kalit          13      277
3 Addis Ababa Arada        Arada                  13       54
4 Addis Ababa Bole         Bole                    2       97
5 Addis Ababa Gulele       Gulele                  0      103
6 Addis Ababa Kirkos       Kirkos                  3       74

Advanced manipulation of data frames

In this section you introducing some more advanced functions for data manipulation. you will be using the “clean_routine_data” dataset you just created.

Reshaping data

Reshaping or pivoting data is an important part of data cleaning and manipulation. Tidyverse has introduced the functions pivot_wider() and pivot_longer() to improve the ease of reshaping data in R.

pivot_longer() takes a wide dataset and converts it into a long one, decreasing the number of columns and increasing the number of rows. Datasets are often created in a wide format, but for analysis a long format is often preferable, especially for data visualization.

To reshape the data long you need to pass the argument the columns which are being pivoted, a name for the new column to identify the columns being reshaped, and a name for the values of the columns being reshaped. you can also combine this with helper functions such as starts_with() to help identify the columns to reshape. For this demonstration you will use the “clean_data_routine.csv”. To reshape the dataset into long format :

clean_routine_data <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/data/clean_routine_data.csv")

Rows: 709 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): region, zone, district
dbl (2): conf_u5, conf_ov5

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

clean_long <- clean_routine_data %>% 
pivot_longer(cols = starts_with("conf"),
               names_to = "age_group",
               names_pattern = "conf_(.*)",
               values_to = 'cases')

to convert the clean_long dataset to wide forma:

clean_long |> 
  pivot_wider(id_cols = c('region',
                          'zone',
                          'district'),
              names_from = age_group,
              values_from = cases)

There are a range of different options in this function to help pivot the data in the cleanest way possible. To see these you can look at the vignette by typing the code vignette("pivot").

Joining data frames

In R you can easily join two data frames together based on one, or multiple variables. There are the following options for joining two data frames, xand y:

inner_join(): includes all rows that appear in both x and y
left_join(): includes all rows in x
right_join(): includes all rows in y
full_join(): includes all rows in x or y
anti_join(): return all rows in one data frame that do not have matching values in another data frame

To run the command you need to pass it the data frames you wish to join and the variable(s) you wish to join by. If there are matching variables in the data frames these will be detected and you do not need to specify them.

If you have two data frames with varying numbers of rows, you can investigate the outcomes of using the different join commands. Firstly, you create two data frames of different lengths, tested, and confirmed, then look at what the commands and outcomes would be.

tested <- data.frame(year = c(2015, 2016, 2017, 2018, 2019, 2020),

                     tested = c(1981, 1992, 2611, 2433, 2291, 2311))

positive <- data.frame(year = c(2013, 2014, 2015, 2016, 2017, 2018),

                     positive = c(1164, 1391, 981, 871, 1211, 998))

# Command written in full

inner_join(tested, positive, by = "year")

  year tested positive
1 2015   1981      981
2 2016   1992      871
3 2017   2611     1211
4 2018   2433      998

# Using the pipe operator

tested %>% 
  inner_join(positive)      # Keeps only matching records

Joining with `by = join_by(year)`

  year tested positive
1 2015   1981      981
2 2016   1992      871
3 2017   2611     1211
4 2018   2433      998

 left_join(tested, positive)     # Keeps all records for the first dataset

Joining with `by = join_by(year)`

  year tested positive
1 2015   1981      981
2 2016   1992      871
3 2017   2611     1211
4 2018   2433      998
5 2019   2291       NA
6 2020   2311       NA

tested %>% right_join(positive)    # Keeps all records for the second dataset

Joining with `by = join_by(year)`

  year tested positive
1 2015   1981      981
2 2016   1992      871
3 2017   2611     1211
4 2018   2433      998
5 2013     NA     1164
6 2014     NA     1391

tested %>% full_join(positive)   # Keeps all records from both datasets

Joining with `by = join_by(year)`

  year tested positive
1 2015   1981      981
2 2016   1992      871
3 2017   2611     1211
4 2018   2433      998
5 2019   2291       NA
6 2020   2311       NA
7 2013     NA     1164
8 2014     NA     1391

anti_join(tested, positive) #Keep all rows in tested that do not have matching values in positive dataset

Joining with `by = join_by(year)`

  year tested
1 2019   2291
2 2020   2311

You can also join datasets which have different names for the variables you wish to join on. Say if the variable for “year” in the positive dataset was “yr”, you could use the command:

positive <- data.frame(yr = c(2013, 2014, 2015, 2016, 2017, 2018),
                     positive = c(1164, 1391, 981, 871, 1211, 998))
tested %>% inner_join(positive,
                      by= c("year"="yr"))

  year tested positive
1 2015   1981      981
2 2016   1992      871
3 2017   2611     1211
4 2018   2433      998

Writing data

Once you have finished working on your data there are multiple ways you can save your work in R.

One of the basic ones is to save your dataset as a csv. This is useful as it can easily be opened in other software (such as excel). You may also want to save your data as a Stata (dta) file.

R has some specific formats you can use to store data, these are .RDS and .RData. RDS files store one R object, whilst RData can store multiple objects.

Here, you can select which format you want to save the data in and save the malaria_incidence_2021 data frame you created in this module.

Similarly to importing data, you can use the base R functions of `write.csv()`, or preferably the tidyverse option if write_csv().

#write_csv(clean_routine_data, "clean_routine_data.csv")

#write_dta(clean_routine_data, "outputs/clean_routine_data.dta")

#saveRDS(clean_routine_data, "outputs/clean_routine_data.RDS")

Data manipulation: dplyr

Data management in dplyr

Step-by-Step Guide

1. Setting Up Your Environment

2. Load the necessary packages

3. Reading and Cleaning Data

3. Skim the data

Data manipulation in dplyr()

The Pipe Operator `%>% OR "|>"`

Selecting Columns:

Filtering Rows:

Renaming variable

Creating new variable

Altering existing variable

Removes duplicate rows

Null values

Sorting and reordering data frames

Summarizing data

Advanced manipulation of data frames

Reshaping data

Joining data frames

Writing data

Reference

Data management in dplyr

Step-by-Step Guide

1. Setting Up Your Environment

2. Load the necessary packages

3. Reading and Cleaning Data

3. Skim the data

Data manipulation in dplyr()

The Pipe Operator %>% OR "|>"

Selecting Columns:

Filtering Rows:

Renaming variable

Creating new variable

Altering existing variable

Removes duplicate rows

Null values

Sorting and reordering data frames

Summarizing data

Advanced manipulation of data frames

Reshaping data

Joining data frames

Writing data

Reference

The Pipe Operator `%>% OR "|>"`