getwd()
[1] "C:/Users/User/OneDrive/Documents/GitHub/yalemgelaw.github.io/posts/2024-06-10-Data-manipulation-in-dplyr"
Basic data cleaning and management in dplyr
June 12, 2024
In this section, I will discuss about data management process, basic data cleaning and management in dplyr
.
Data can be defined as a collection of facts that can be used to draw a conclusion, make predictions and assist decision making.
Public health data may originate from various sources including:
ls()
- Lists all active files.
rm(list = ls())
- Cleans your environment.
Ctrl + L
- Cleans the console.
Assign function: Alt + -
.
Ctrl + 1
- shortcut to move to source pane
Ctrl + 2
- shortcut to move to console pane
Check your working directory using:
For data management, you will use the tidyverse
package. Install the package if you haven’t installed before using install.packages(“tidyverse”) function.
# Install necessary packages if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("janitor", quietly = TRUE)) install.packages("janitor")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("lubridate", quietly = TRUE)) install.packages("lubridate")
if (!requireNamespace("haven", quietly = TRUE)) install.packages("haven")
# Load the packages
library(tidyverse) # To read Excel files
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tibble' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.3
Warning: package 'forcats' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: package 'janitor' was built under R version 4.3.3
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(dplyr) # For data wrangling
library(lubridate) # Date functions
library(haven) # To read stata file
Warning: package 'haven' was built under R version 4.3.3
Read the malaria data: The data I used for this post is a sample of routine malaria surveillance data from Ethiopia. The data you’re collected monthly and collated at the district level (third administrative system) and stored in csv file format.
Rows: 11604 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): region, zone, district, Month, date
dbl (9): Year, test_performed, confirmed_all, confirmed_u5, confirmed_5_14, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Safeguard your data: Make a backup of the data:
When dealing with large datasets, it’s essential to take precautions. Imagine your data disappearing into the digital abyss—scary, right? To prevent that, always make a backup. Think of it as a safety net for your precious data. Whether you’re crunching numbers, visualizing trends, or building models, follow this golden rule: Back it up before you hack it up🔒!
Knowing your dataset well from file size to data types is another crucial step prior to hands-on data cleaning.
To look at how the data looks like either by clicking on it in the global environment window or by typing the command View(routine_data)
which opens up a window displaying the data.
Alternatively, you may just want to look at a few rows. You can do this by using the head()
function, which shows us the first six rows of data and tail()
function, which shows us the lass six rows of the data.
# A tibble: 6 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Addis Ababa Addis Ket… Addis K… July 2021 1/07… 97 3
2 Addis Ababa Addis Ket… Addis K… Augu… 2021 1/08… 148 70
3 Addis Ababa Addis Ket… Addis K… Sept… 2021 1/09… 142 5
4 Addis Ababa Addis Ket… Addis K… Octo… 2021 1/10… 190 93
5 Addis Ababa Addis Ket… Addis K… Nove… 21 1/11… 194 87
6 Addis Ababa Addis Ket… Addis K… Dece… 2021 1/12… 194 25
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
To understand the structure of the data you can use the str()
command or glimpse().
Rows: 11,604
Columns: 14
$ region <chr> "Addis Ababa", "Addis Ababa", "Addis Ababa", "Addis Aba…
$ zone <chr> "Addis Ketema", "Addis Ketema", "Addis Ketema", "Addis …
$ district <chr> "Addis Ketema (AA)", "Addis Ketema (AA)", "Addis Ketema…
$ Month <chr> "July", "August", "September", "October", "November", "…
$ Year <dbl> 2021, 2021, 2021, 2021, 21, 2021, 2021, 2021, 2021, 202…
$ date <chr> "1/07/2021", "1/08/2021", "1/09/2021", "1/10/2021", "1/…
$ test_performed <dbl> 97, 148, 142, 190, 194, 194, 146, 189, 243, 184, 166, 1…
$ confirmed_all <dbl> 3, 70, 5, 93, 87, 25, 62, 72, 18, 23, 15, 43, 17, 3, 10…
$ confirmed_u5 <dbl> 1, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 1, 2…
$ confirmed_5_14 <dbl> 0, 1, 0, 6, 3, 1, 2, 2, 0, 2, 0, 4, 1, 0, 0, 0, 0, 0, 1…
$ confirmed_15 <dbl> 2, 69, 5, 87, 78, 24, 60, 70, 18, 21, 15, 37, 14, 3, 10…
$ pop_5 <dbl> 50741.83, 50741.83, 50741.83, 50741.83, 50741.83, 50741…
$ pop_514 <dbl> 88159.40, 88159.40, 88159.40, 88159.40, 88159.40, 88159…
$ pop_15 <dbl> 212193.8, 212193.8, 212193.8, 212193.8, 212193.8, 21219…
From the above results, you can see that the data frame consists of 11,604 observations (rows) and 14 variables (columns). Each variable’s name and data type is also listed. The dataset can be read as data.frame
, lists
, tbl_df
, spatial
. Please refer my previous post on Mastering RStudio: A Beginner’s Guide for detail note about data structure in R and supplementary YouTube videos .
For a data frame you can select the nth row, or the nth column using square brackets (note where the comma is paced).
To view all variable names with the names()
, colnames()
, and variable.names()
function.
To view variable with the position number:
There are several incorrect data types in this dataset, but let’s continue using the “date” variable to demonstrate how to identify and update these errors:
“Character” is returned but the variable should in fact be a date. you can use the as.Date
function of lubridate
package to change the data type accordingly:
String inconsistencies: This includes typos, capitalization errors, misplaced punctuation, or similar character data errors that might interfere with data analysis. Take for instance your “Year” column. As you can see there are wrongly entered year 21 and 3021.
Outliers:
The dataset has 8 continuous variables: test_performed, confirmed_all, confirmed_u5, confirmed_5_14, confirmed_15, population, pop_5, pop_514, pop_15. To get a feeling for how the data is distributed, you can plot histograms for case variables:
Under 5 confirmed malaria cases:
All age confirmed malaria cases:
Data cleaning is one of the most important steps for analysis! No matter where your data comes from, always be sure the completeness, consistency, and trustworthiness of data before analyzing and utilizing the data. Once you get cleaned and organized, you can perform analysis to find clear and objective answers to any data question.
In this section, you’ll explore the powerful dplyr package, part of the tidyverse ecosystem. Tidyverse is a collection of R packages for data science, designed to make cleaning and analyses of data easy and tidy.
dplyr streamlines your data wrangling process, making it easier to work with data frames.
Let’s dive into the key functions:
select(): Choose specific variables (columns) from a data frame based on their names.
rename(): Transform and rename variables.
filter(): Select rows that meet specific criteria.
mutate(): Create new variables by applying functions to existing ones.
group_by(): Perform operations by group.
summarize(): Aggregate data.
arrange(): Order rows based on a specified column.
distinct(): Remove duplicate rows based on specified columns.
These functions simplify common data manipulation tasks.
%>% OR "|>"
This operator allows you to chain commands together. Instead of creating intermediate variables, you can directly link functions.
You use pipes
when you creating intermediate variables to link commands together . For example, the idea is instead of using :
# A tibble: 11,604 × 1
Year
<dbl>
1 2021
2 2021
3 2021
4 2021
5 21
6 2021
7 2021
8 2021
9 2021
10 2021
# ℹ 11,594 more rows
Building blocks that code readability and reproducibility, making your data workflows more efficient.
The select()
function in R allows you to choose specific columns (variables) from a data frame. Specify the data frame name as the first argument, followed by the column names you want to keep. Example:
# A tibble: 11,604 × 7
region zone district Month date confirmed_all test_performed
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Addis Ababa Addis Ketema Addis Kete… July 1/07… 3 97
2 Addis Ababa Addis Ketema Addis Kete… Augu… 1/08… 70 148
3 Addis Ababa Addis Ketema Addis Kete… Sept… 1/09… 5 142
4 Addis Ababa Addis Ketema Addis Kete… Octo… 1/10… 93 190
5 Addis Ababa Addis Ketema Addis Kete… Nove… 1/11… 87 194
6 Addis Ababa Addis Ketema Addis Kete… Dece… 1/12… 25 194
7 Addis Ababa Addis Ketema Addis Kete… Janu… 1/01… 62 146
8 Addis Ababa Addis Ketema Addis Kete… Febr… 1/02… 72 189
9 Addis Ababa Addis Ketema Addis Kete… March 1/03… 18 243
10 Addis Ababa Addis Ketema Addis Kete… April 1/04… 23 184
# ℹ 11,594 more rows
To select columns that have common prefix or suffix, you can use the start_witth()
contains()
, or end_with()
functions. Example to subset columns that captured confirmed cases
# A tibble: 11,604 × 4
confirmed_all confirmed_u5 confirmed_5_14 confirmed_15
<dbl> <dbl> <dbl> <dbl>
1 3 1 0 2
2 70 0 1 69
3 5 0 0 5
4 93 0 6 87
5 87 6 3 78
6 25 0 1 24
7 62 0 2 60
8 72 0 2 70
9 18 0 0 18
10 23 0 2 21
# ℹ 11,594 more rows
contains()
# A tibble: 11,604 × 4
confirmed_all confirmed_u5 confirmed_5_14 confirmed_15
<dbl> <dbl> <dbl> <dbl>
1 3 1 0 2
2 70 0 1 69
3 5 0 0 5
4 93 0 6 87
5 87 6 3 78
6 25 0 1 24
7 62 0 2 60
8 72 0 2 70
9 18 0 0 18
10 23 0 2 21
# ℹ 11,594 more rows
end_with()
# A tibble: 11,604 × 4
confirmed_u5 confirmed_15 pop_5 pop_15
<dbl> <dbl> <dbl> <dbl>
1 1 2 50742. 212194.
2 0 69 50742. 212194.
3 0 5 50742. 212194.
4 0 87 50742. 212194.
5 6 78 50742. 212194.
6 0 24 50742. 212194.
7 0 60 50742. 212194.
8 0 70 50742. 212194.
9 0 18 50742. 212194.
10 0 21 50742. 212194.
# ℹ 11,594 more rows
The filter()
function helps you retain only the rows (observations) that meet specific conditions.
For instance: To select rows for the Tigray region:
# A tibble: 552 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Tigray Central Abi Adi town July 2021 1/07… 407 111
2 Tigray Central Abi Adi town August 2021 1/08… 277 23
3 Tigray Central Abi Adi town Septemb… 2021 1/09… 192 23
4 Tigray Central Abi Adi town October 2021 1/10… NA NA
5 Tigray Central Abi Adi town November 2021 1/11… NA NA
6 Tigray Central Abi Adi town December 2021 1/12… NA NA
7 Tigray Central Abi Adi town January 2021 1/01… NA NA
8 Tigray Central Abi Adi town February 2021 1/02… NA NA
9 Tigray Central Abi Adi town March 2021 1/03… NA NA
10 Tigray Central Abi Adi town April 2021 1/04… NA NA
# ℹ 542 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
To choose rows with more than 500 confirmed cases in children under 5:
# A tibble: 16 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Gambella Gamb… Gambela… Octo… 2021 1/10… 5221 2317
2 Gambella Gamb… Gambela… Nove… 2021 1/11… 5285 2556
3 Gambella Itan… Itang Dece… 2021 1/12… 504 1271
4 Somali Afder Barey Augu… 2021 1/08… 2202 2192
5 Somali Afder Barey Octo… 2021 1/10… 1950 1950
6 Beneshangul Gu… Asso… Sherkole July 2021 1/07… 5132 3984
7 Amhara Awi Jawi Sept… 2021 1/09… 8710 3051
8 Amhara Awi Jawi Octo… 2021 1/10… 11391 3616
9 Amhara Cent… Gonder … Octo… 2021 1/10… 10287 4593
10 Amhara Cent… Gonder … Nove… 2021 1/11… 11089 4564
11 Amhara Sout… Dera (A… June 2021 1/06… 13008 6017
12 Amhara Sout… Fogera Octo… 2021 1/10… 14768 5357
13 Amhara Sout… Fogera Nove… 2021 1/11… 19448 8367
14 Amhara Sout… Fogera Dece… 2021 1/12… 10552 4078
15 SNNP Silte Sankura Octo… 2021 1/10… 4781 2144
16 SNNP Wola… Boloso … May 2021 1/05… 11681 3762
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
To filter by multiple conditions (e.g., regions “Amhara” or “Afar” with at least 500 confirmed cases in children under 5 ):
# A tibble: 8 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Amhara Awi Jawi Sept… 2021 1/09… 8710 3051
2 Amhara Awi Jawi Octo… 2021 1/10… 11391 3616
3 Amhara Central Gondar Gonder Z… Octo… 2021 1/10… 10287 4593
4 Amhara Central Gondar Gonder Z… Nove… 2021 1/11… 11089 4564
5 Amhara South Gondar Dera (AM) June 2021 1/06… 13008 6017
6 Amhara South Gondar Fogera Octo… 2021 1/10… 14768 5357
7 Amhara South Gondar Fogera Nove… 2021 1/11… 19448 8367
8 Amhara South Gondar Fogera Dece… 2021 1/12… 10552 4078
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
You can use filter()
to spot inconsistencies. For instance, find rows where the number of confirmed cases exceeds the number tested:
# A tibble: 97 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Afar Zone 2 Abaala Augu… 2021 1/08… 10 12
2 Afar Zone 4 Euwa Octo… 2021 1/10… 191 243
3 Gambella Agniwa Jore Dece… 2021 1/12… 79 158
4 Gambella Agniwa Jore April 2021 1/04… 63 73
5 Gambella Agniwa Jore May 2021 1/05… 75 84
6 Gambella Itang Speci… Itang Dece… 2021 1/12… 504 1271
7 Gambella Itang Speci… Itang Janu… 2021 1/01… 864 867
8 Gambella Majang Mengesh May 2021 1/05… 118 198
9 Gambella Nuer Lare Janu… 2021 1/01… 132 201
10 Gambella Nuer Makuey Janu… 2021 1/01… 243 360
# ℹ 87 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
There may be situations when you want to rename variables in a data frame to make it more comprehensive and easier to process. The rename()
function allows you to change column names in a data frame. It’s useful for making variable names more descriptive. you pass to this function the data frame you are working with, rename(dataframe, new_name = old_name).
Example, if you wanted to change the variable “confirmed_u5” to “conf_u5”, and to overwrite the object “routine_data” with this you would simply write:
# A tibble: 11,604 × 14
region zone district Month Year date test_performed confirmed_all conf_u5
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Addis … Addi… Addis K… July 2021 1/07… 97 3 1
2 Addis … Addi… Addis K… Augu… 2021 1/08… 148 70 0
3 Addis … Addi… Addis K… Sept… 2021 1/09… 142 5 0
4 Addis … Addi… Addis K… Octo… 2021 1/10… 190 93 0
5 Addis … Addi… Addis K… Nove… 21 1/11… 194 87 6
6 Addis … Addi… Addis K… Dece… 2021 1/12… 194 25 0
7 Addis … Addi… Addis K… Janu… 2021 1/01… 146 62 0
8 Addis … Addi… Addis K… Febr… 2021 1/02… 189 72 0
9 Addis … Addi… Addis K… March 2021 1/03… 243 18 0
10 Addis … Addi… Addis K… April 2021 1/04… 184 23 0
# ℹ 11,594 more rows
# ℹ 5 more variables: confirmed_5_14 <dbl>, confirmed_15 <dbl>, pop_5 <dbl>,
# pop_514 <dbl>, pop_15 <dbl>
The mutate()
function lets you add new variables or modify existing ones. You use the =
sign to assign new values . For example, to calculate the incidence rate of malaria in children under five per 1000 population using the formula (confirmed_u5 / pop_5) * 1000
, you could write:
# A tibble: 11,604 × 15
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Addis Ababa Addis Ke… Addis K… July 2021 1/07… 97 3
2 Addis Ababa Addis Ke… Addis K… Augu… 2021 1/08… 148 70
3 Addis Ababa Addis Ke… Addis K… Sept… 2021 1/09… 142 5
4 Addis Ababa Addis Ke… Addis K… Octo… 2021 1/10… 190 93
5 Addis Ababa Addis Ke… Addis K… Nove… 21 1/11… 194 87
6 Addis Ababa Addis Ke… Addis K… Dece… 2021 1/12… 194 25
7 Addis Ababa Addis Ke… Addis K… Janu… 2021 1/01… 146 62
8 Addis Ababa Addis Ke… Addis K… Febr… 2021 1/02… 189 72
9 Addis Ababa Addis Ke… Addis K… March 2021 1/03… 243 18
10 Addis Ababa Addis Ke… Addis K… April 2021 1/04… 184 23
# ℹ 11,594 more rows
# ℹ 7 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>,
# incidence_rate <dbl>
Suppose you noticed an error in the region name, where “South Western Ethiopia” was mistakenly entered as “Sou.” You can correct this using an ifelse
statement within mutate()
. The condition is region == "Sou"
, and if it’s met, you replace the value with “South Western Ethiopia.” Otherwise, you keep the original value.
Here’s how you can do it:
This will update the region
variable in your data frame.
case_when
for Multiple Changes:If you need to make multiple changes based on different conditions, consider using case_when
instead of ifelse
. Let’s say there are other errors in the district names. You can correct them simultaneously using case_when
.
There is a district name mismatch between district names in DHIS2 and names in shapefile. For this you will use the task_data.csv
task_data <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/data/task_data.csv") %>%
data.frame() %>%
mutate(district_new = case_when(
district == "Addis Ketema" ~ "Addis Ketema (AA)",
district == "Nifas Silk Lafto" ~ "Nefas Silk",
TRUE ~ district)) %>%
select(-district) %>%
rename("district" = "district_new") #rename to district
Rows: 11604 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): region, zone, district, Month
dbl (7): Year, female, male, conf_5, conf_514, conf_15, cases
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Change the Year column
routine_data %>%
mutate(Year =
case_when(Year == 21 ~ 2021,
Year == 3021 ~ 2021,
.default = Year
)
)
# A tibble: 11,604 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Addis Ababa Addis Ke… Addis K… July 2021 1/07… 97 3
2 Addis Ababa Addis Ke… Addis K… Augu… 2021 1/08… 148 70
3 Addis Ababa Addis Ke… Addis K… Sept… 2021 1/09… 142 5
4 Addis Ababa Addis Ke… Addis K… Octo… 2021 1/10… 190 93
5 Addis Ababa Addis Ke… Addis K… Nove… 2021 1/11… 194 87
6 Addis Ababa Addis Ke… Addis K… Dece… 2021 1/12… 194 25
7 Addis Ababa Addis Ke… Addis K… Janu… 2021 1/01… 146 62
8 Addis Ababa Addis Ke… Addis K… Febr… 2021 1/02… 189 72
9 Addis Ababa Addis Ke… Addis K… March 2021 1/03… 243 18
10 Addis Ababa Addis Ke… Addis K… April 2021 1/04… 184 23
# ℹ 11,594 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
[1] 2021 21 3021
The distinct() function removes duplicate rows from a data frame based on specified columns. You can use it to keep only unique rows or to remove duplicates based on specific columns.
To remove duplicate rows based on the district column, you can use:
# A tibble: 967 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Addis Ababa Addis Ke… Addis K… July 2021 1/07… 97 3
2 Addis Ababa Akaki Ka… Akaki -… July 2021 1/07… 105 17
3 Addis Ababa Arada Arada July 2021 1/07… 62 4
4 Addis Ababa Bole Bole July 2021 1/07… 73 7
5 Addis Ababa Gulele Gulele July 2021 1/07… 32 6
6 Addis Ababa Kirkos Kirkos July 2021 1/07… 190 11
7 Addis Ababa Kolfe Kolfe -… July 2021 1/07… 36 3
8 Addis Ababa Lideta Lideta July 2021 1/07… 32 14
9 Addis Ababa Nifas Si… Nefas S… July 2021 1/07… 70 7
10 Addis Ababa Yeka Yeka July 2021 1/07… 68 22
# ℹ 957 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
Null values are treated differently in R. They appear as NA
in the dataset, so you may expect the following code to work for filtering data to remove all missing values for the number of people tested for malaria:
# A tibble: 0 × 14
# ℹ 14 variables: region <chr>, zone <chr>, district <chr>, Month <chr>,
# Year <dbl>, date <chr>, test_performed <dbl>, confirmed_all <dbl>,
# confirmed_u5 <dbl>, confirmed_5_14 <dbl>, confirmed_15 <dbl>, pop_5 <dbl>,
# pop_514 <dbl>, pop_15 <dbl>
However, this does not work as R has a special way of dealing with missing values. You use the is.na()
command, which checks fo NA
values. As with the equals command, if you want the reverse of this, i.e. “not NA” you can use !is.na()
. So the code to remove missing values would be:
# A tibble: 9,869 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Addis Ababa Addis Ke… Addis K… July 2021 1/07… 97 3
2 Addis Ababa Addis Ke… Addis K… Augu… 2021 1/08… 148 70
3 Addis Ababa Addis Ke… Addis K… Sept… 2021 1/09… 142 5
4 Addis Ababa Addis Ke… Addis K… Octo… 2021 1/10… 190 93
5 Addis Ababa Addis Ke… Addis K… Nove… 21 1/11… 194 87
6 Addis Ababa Addis Ke… Addis K… Dece… 2021 1/12… 194 25
7 Addis Ababa Addis Ke… Addis K… Janu… 2021 1/01… 146 62
8 Addis Ababa Addis Ke… Addis K… Febr… 2021 1/02… 189 72
9 Addis Ababa Addis Ke… Addis K… March 2021 1/03… 243 18
10 Addis Ababa Addis Ke… Addis K… April 2021 1/04… 184 23
# ℹ 9,859 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
Another method for removing missing data in tidyverse is using the drop_na()
function from {tidyr} package. As with the filter function this takes the dataset as the first argument, followed by the variables for which you are dropping NA values.
Sorting a data frame by rows and reordering columns is easy in R. To sort a data frame by a column you use the function arrange()
. You specify the data frame and the column to sort by, and the default is to sort in ascending
order. To sort in a descending order you can specify this with desc().
Additionally, you can sort by multiple variables, and sorting will be undertaken in the order they appear in the command.
# A tibble: 11,604 × 14
region zone district Month Year date test_performed confirmed_all
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 Gambella Gamb… Gambela… Nove… 2021 1/11… 5285 2556
2 Amhara Sout… Fogera Nove… 2021 1/11… 19448 8367
3 Amhara Sout… Dera (A… June 2021 1/06… 13008 6017
4 Beneshangul Gu… Asso… Sherkole July 2021 1/07… 5132 3984
5 Somali Afder Barey Augu… 2021 1/08… 2202 2192
6 Gambella Itan… Itang Dece… 2021 1/12… 504 1271
7 SNNP Silte Sankura Octo… 2021 1/10… 4781 2144
8 Gambella Gamb… Gambela… Octo… 2021 1/10… 5221 2317
9 Somali Afder Barey Octo… 2021 1/10… 1950 1950
10 Amhara Sout… Fogera Dece… 2021 1/12… 10552 4078
# ℹ 11,594 more rows
# ℹ 6 more variables: confirmed_u5 <dbl>, confirmed_5_14 <dbl>,
# confirmed_15 <dbl>, pop_5 <dbl>, pop_514 <dbl>, pop_15 <dbl>
How to change the order of the column
The relocate()
function is part of the dplyr
package in R. It allows you to change the order of columns in a data frame. You can specify where a particular column should be placed relative to other columns using the .before
and .after
arguments.
The basic syntax of the relocate() function as follows:
relocate(data, column_name, .before = target_column, .after = target_column)
data
: The data frame containing the columns.
column_name
: The name of the column you want to move.
.before
: Specify the column name before which the target column should be placed.
.after
: Specify the column name after which the target column should be placed.
Example: Suppose we have a data frame called task_data
with columns: “zone,” “district,” and “Month.” We want to move the “district” column after the “zone” column and before the “Month” column.
region zone district Month Year Date female
1 Addis Ababa Addis Ketema Addis Ketema (AA) July 2022 2022-01-01 12
2 Addis Ababa Addis Ketema Addis Ketema (AA) August 2022 2022-01-01 5
3 Addis Ababa Addis Ketema Addis Ketema (AA) September 2022 2022-01-01 22
4 Addis Ababa Addis Ketema Addis Ketema (AA) October 2022 2022-01-01 5
5 Addis Ababa Addis Ketema Addis Ketema (AA) November 2022 2022-01-01 5
6 Addis Ababa Addis Ketema Addis Ketema (AA) December 2022 2022-01-01 2
male conf_5 conf_514 conf_15 cases
1 19 2 2 27 31
2 6 1 1 9 11
3 50 0 0 72 72
4 13 1 3 14 18
5 13 0 1 17 18
6 8 0 0 10 10
There are some useful functions in tidyverse to help you summarize the data. The first of these is the count()
function. This is a quick function which will allow you to quickly count the occurrences of an item within a dataset.
# A tibble: 165 × 3
# Groups: region [1]
region district n
<chr> <chr> <int>
1 Amhara Abergele (AM) 12
2 Amhara Adagn Ager Chaqo 12
3 Amhara Addi Arekay 12
4 Amhara Albuko 12
5 Amhara Alfa 12
6 Amhara Ambasel 12
7 Amhara Aneded 12
8 Amhara Angolelana Tera 12
9 Amhara Angot 12
10 Amhara Ankasha 12
# ℹ 155 more rows
By including multiple variables in the command you can count the numbers of times that combination of variables appears.
# A tibble: 156 × 3
region Month n
<chr> <chr> <int>
1 Addis Ababa April 10
2 Addis Ababa August 10
3 Addis Ababa December 10
4 Addis Ababa February 10
5 Addis Ababa January 10
6 Addis Ababa July 10
7 Addis Ababa June 10
8 Addis Ababa March 10
9 Addis Ababa May 10
10 Addis Ababa November 10
# ℹ 146 more rows
If you want to summarize numerical variables you can use the function summarise().
This is used in conjunction with other mathematical functions such as sum()
, mean()
, median()
, max()
..
task_data %>%
summarise(total_case = sum(female, male, na.rm = TRUE),
mean_cases = mean(conf_5),
median_cases = median(conf_5),
max_cases = max(conf_5),
sd_case = sd(conf_5))
total_case mean_cases median_cases max_cases sd_case
1 1644107 24.09204 1 1441 66.48492
you can combine the summarise()
function with group_by()
to summarize the data by different variables in the dataset. To calculate the total number of people tested and positive for malaria in each district in our dataset, you would group by this variable first and then summarize the data. Grouping is not restricted to one variable, if you wanted to group the data by location and date then both variables would be included in the command. When you use the sum()
function , na.rm = T
logic required if the column has NULL values
routine_data %>% group_by(region) %>%
summarise(total_test = sum(test_performed),
total_positive = sum(confirmed_all),
total_u5 = sum(confirmed_u5),
total_514 = sum(confirmed_5_14),
total_ov15 = sum(confirmed_15))
# A tibble: 13 × 6
region total_test total_positive total_u5 total_514 total_ov15
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Addis Ababa 12671 1657 92 111 1454
2 Afar NA NA NA NA NA
3 Amhara NA NA NA NA NA
4 Beneshangul Gumuz NA NA NA NA NA
5 Dire Dawa NA NA NA NA NA
6 Gambella NA NA NA NA NA
7 Harari NA NA NA NA NA
8 Oromiya NA NA NA NA NA
9 SNNP NA NA NA NA NA
10 Sidama NA NA NA NA NA
11 Somali NA NA NA NA NA
12 Sou NA NA NA NA NA
13 Tigray NA NA NA NA NA
Subsetting data
Subsetting refers to extracting a portion of your data based on specific conditions. In this case, we want to focus on data related to the Amhara region.
The select()
function allows us to choose specific columns from our data frame. We’ll keep the following columns: “region,” “zone,” “district,” “test_performed,” “confirmed_all,” and “confirmed_u5.”
We’ll group our data by the “zone” column using the group_by()
function. This means that subsequent calculations will be performed within each zone.
The summarise()
function computes summary statistics for each group (in this case, each zone).
routine_data %>%
select(region, zone, district,
test_performed,confirmed_all,confirmed_u5) %>%
filter(region%in%"Amhara") %>%
group_by(zone) %>%
summarise(total_positive = sum(confirmed_all, na.rm = T),
total_u5 = sum(confirmed_u5, na.rm = T),
prop_u5 = round((total_u5/total_positive)*100,2))
# A tibble: 15 × 4
zone total_positive total_u5 prop_u5
<chr> <dbl> <dbl> <dbl>
1 Awi 30757 4679 15.2
2 Bahirdar 16611 2630 15.8
3 Central Gondar 99939 11600 11.6
4 Dessie 44 1 2.27
5 East Gojjam 39948 1586 3.97
6 Gondar 5884 459 7.8
7 North Gondar 11811 994 8.42
8 North Shewa 10752 541 5.03
9 North Wollo 12704 1277 10.0
10 Oromia Special 5693 1040 18.3
11 South Gondar 107618 12023 11.2
12 South Wollo 8356 488 5.84
13 Waghimera 11551 1185 10.3
14 West Gojjam 41792 5132 12.3
15 West Gondar 73432 9534 13.0
Know to take everything you have learnt to import and clean the routine dataset. If you want your output to contain the total numbers of malaria tests performed and the number of confirmed cases in children under 5, people over 5, and calculate a total for all ages. you want to have the total by district level and Year in the dataset. This is how you would go about building the code.
clean_routine_data <- routine_data %>%
# subset
dplyr::select(region, zone, district, Month, Year,
test_performed,confirmed_all,
confirmed_u5,confirmed_5_14,confirmed_15) %>%
drop_na(test_performed,confirmed_all,
confirmed_u5,confirmed_5_14,confirmed_15) %>%
# filter rows if test number is less than confirmed
filter(test_performed>confirmed_all) %>%
# update region
mutate(region = case_when(
region =='Sou'~'South Western Ethiopia',
TRUE~region),
# update year
year = case_when(Year == 3021 ~ 2021,
Year == 21 ~ 2021,
TRUE ~ Year),
# to date
date_reported = make_date(year = Year,
month = Month),
conf_ov5 = confirmed_5_14+confirmed_15) %>%
# aggregate by region, zone, and district
group_by(region, zone, district) %>%
summarise(test_total = sum(test_performed),
conf_total = sum(confirmed_all),
conf_u5 = sum(confirmed_u5),
conf_ov5 = sum(conf_ov5)) %>%
# deselect test_total, -conf_total
dplyr::select(-test_total, -conf_total)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `date_reported = make_date(year = Year, month = Month)`.
Caused by warning in `make_date()`:
! NAs introduced by coercion
`summarise()` has grouped output by 'region', 'zone'. You can override using
the `.groups` argument.
# A tibble: 6 × 5
# Groups: region, zone [6]
region zone district conf_u5 conf_ov5
<chr> <chr> <chr> <dbl> <dbl>
1 Addis Ababa Addis Ketema Addis Ketema (AA) 9 507
2 Addis Ababa Akaki Kality Akaki - Kalit 13 277
3 Addis Ababa Arada Arada 13 54
4 Addis Ababa Bole Bole 2 97
5 Addis Ababa Gulele Gulele 0 103
6 Addis Ababa Kirkos Kirkos 3 74
In this section you introducing some more advanced functions for data manipulation. you will be using the “clean_routine_data” dataset you just created.
Reshaping or pivoting data is an important part of data cleaning and manipulation. Tidyverse has introduced the functions pivot_wider()
and pivot_longer()
to improve the ease of reshaping data in R.
pivot_longer()
takes a wide dataset and converts it into a long one, decreasing the number of columns and increasing the number of rows. Datasets are often created in a wide format, but for analysis a long format is often preferable, especially for data visualization.
To reshape the data long you need to pass the argument the columns which are being pivoted, a name for the new column to identify the columns being reshaped, and a name for the values of the columns being reshaped. you can also combine this with helper functions such as starts_with()
to help identify the columns to reshape. For this demonstration you will use the “clean_data_routine.csv”. To reshape the dataset into long format :
clean_routine_data <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/data/clean_routine_data.csv")
Rows: 709 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): region, zone, district
dbl (2): conf_u5, conf_ov5
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
to convert the clean_long
dataset to wide forma:
There are a range of different options in this function to help pivot the data in the cleanest way possible. To see these you can look at the vignette by typing the code vignette("pivot")
.
In R you can easily join two data frames together based on one, or multiple variables. There are the following options for joining two data frames, x
and y
:
inner_join()
: includes all rows that appear in both x and y
left_join()
: includes all rows in x
right_join()
: includes all rows in y
full_join()
: includes all rows in x or y
anti_join()
: return all rows in one data frame that do not have matching values in another data frame
To run the command you need to pass it the data frames you wish to join and the variable(s) you wish to join by. If there are matching variables in the data frames these will be detected and you do not need to specify them.
If you have two data frames with varying numbers of rows, you can investigate the outcomes of using the different join commands. Firstly, you create two data frames of different lengths, tested
, and confirmed
, then look at what the commands and outcomes would be.
tested <- data.frame(year = c(2015, 2016, 2017, 2018, 2019, 2020),
tested = c(1981, 1992, 2611, 2433, 2291, 2311))
positive <- data.frame(year = c(2013, 2014, 2015, 2016, 2017, 2018),
positive = c(1164, 1391, 981, 871, 1211, 998))
# Command written in full
inner_join(tested, positive, by = "year")
year tested positive
1 2015 1981 981
2 2016 1992 871
3 2017 2611 1211
4 2018 2433 998
Joining with `by = join_by(year)`
year tested positive
1 2015 1981 981
2 2016 1992 871
3 2017 2611 1211
4 2018 2433 998
Joining with `by = join_by(year)`
year tested positive
1 2015 1981 981
2 2016 1992 871
3 2017 2611 1211
4 2018 2433 998
5 2019 2291 NA
6 2020 2311 NA
Joining with `by = join_by(year)`
year tested positive
1 2015 1981 981
2 2016 1992 871
3 2017 2611 1211
4 2018 2433 998
5 2013 NA 1164
6 2014 NA 1391
Joining with `by = join_by(year)`
year tested positive
1 2015 1981 981
2 2016 1992 871
3 2017 2611 1211
4 2018 2433 998
5 2019 2291 NA
6 2020 2311 NA
7 2013 NA 1164
8 2014 NA 1391
anti_join(tested, positive) #Keep all rows in tested that do not have matching values in positive dataset
Joining with `by = join_by(year)`
year tested
1 2019 2291
2 2020 2311
You can also join datasets which have different names for the variables you wish to join on. Say if the variable for “year” in the positive dataset was “yr”, you could use the command:
Once you have finished working on your data there are multiple ways you can save your work in R.
One of the basic ones is to save your dataset as a csv. This is useful as it can easily be opened in other software (such as excel). You may also want to save your data as a Stata (dta) file.
R has some specific formats you can use to store data, these are .RDS
and .RData
. RDS files store one R object, whilst RData can store multiple objects.
Here, you can select which format you want to save the data in and save the malaria_incidence_2021
data frame you created in this module.
Similarly to importing data, you can use the base R functions of `write.csv()`, or preferably the tidyverse
option if write_csv()
.
---
title: "Data manipulation: dplyr"
description: |
Basic data cleaning and management in dplyr
author:
- name: Yalemzewod Gelaw
url: https://yalemzewodgelaw.com/
orcid: 0000-0002-5338-586
date: 2024-06-12
image: "media/pc.png"
slug: beginner’s guide
categories: [RStudio, Data management]
toc: true
toc-depth: 4
number-sections: false
highlight-style: github
format:
html:
self-contained: true
code-fold: false
code-summary: "Show the code"
code-tools: true
theme: united
knitr:
opts_knit:
warning: false
message: false
editor: visual
---
# Data management in dplyr
In this section, I will discuss about data management process, basic data cleaning and management in `dplyr`.
Data can be defined as a collection of facts that can be used to draw a conclusion, make predictions and assist decision making.
Public health data may originate from various sources including:
| Routine disease surveillance system such as hospital medical records (individual level information) or aggregated data captured through a reporting software - District Health Information Software (DHIS2)
| Research and surveys - Demographic Health Survey, Malaria Indicator Survey, Service Provision Assessment
| Administrative - logistic data
| Vital statistics - birth, death and marriage
| Literature and reports
- ::: {.callout-tip appearance="simple"}
- `ls()` - Lists all active files.
- `rm(list = ls())` - Cleans your environment.
- `Ctrl + L` - Cleans the console.
- Assign function: `Alt + -`.
- `Ctrl + 1` - shortcut to move to source pane
- `Ctrl + 2` - shortcut to move to console pane
:::
## Step-by-Step Guide
### 1. Setting Up Your Environment
Check your working directory using:
```{r, woringdir, echo=TRUE}
getwd()
```
### 2. Load the necessary packages
For data management, you will use the `tidyverse` package. Install the package if you haven't installed before using *install.packages("*tidyverse"*)* function.
```{r,load_package, echo=TRUE, eval=TRUE}
# Install necessary packages if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")
if (!requireNamespace("janitor", quietly = TRUE)) install.packages("janitor")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("lubridate", quietly = TRUE)) install.packages("lubridate")
if (!requireNamespace("haven", quietly = TRUE)) install.packages("haven")
# Load the packages
library(tidyverse) # To read Excel files
library(janitor) # To clean column names
library(dplyr) # For data wrangling
library(lubridate) # Date functions
library(haven) # To read stata file
```
### 3. Reading and Cleaning Data
Read the malaria data: The data I used for this post is a sample of routine malaria surveillance data from **Ethiopia**. The data you're collected monthly and collated at the district level (third administrative system) and stored in csv file format.
```{r,path, echo=TRUE, eval=TRUE}
malaria_data <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/data/routine_data.csv")
```
**Safeguard your data: Make a backup of the data**:
When dealing with large datasets, it’s essential to take precautions. Imagine your data disappearing into the digital abyss—scary, right? To prevent that, always make a backup. Think of it as a safety net for your precious data. Whether you’re crunching numbers, visualizing trends, or building models, follow this golden rule: **Back it up before you hack it up**🔒!
```{r, backup, echo=TRUE, eval=TRUE}
routine_data <- malaria_data #Make a backup
```
### 3. Skim the data
Knowing your dataset well from file size to data types is another crucial step prior to hands-on data cleaning.
To look at how the data looks like either by clicking on it in the global environment window or by typing the command `View(routine_data)` which opens up a window displaying the data.
```{r, view, echo=TRUE, eval=TRUE}
View(routine_data)
```
![*View data*](C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/scripts/tutorials/2024-06-10-Data-manipulation-in-dplyr/media/data_view.png)
Alternatively, you may just want to look at a few rows. You can do this by using the `head()` function, which shows us the first six rows of data and `tail()` function, which shows us the lass six rows of the data.
```{r, top,echo=TRUE, eval=TRUE}
head(routine_data) # top six rows
```
```{r, bottom,echo=TRUE,eval=FALSE}
tail(routine_data) # bottom six rows
```
To understand the structure of the data you can use the `str()` command or `glimpse().`
```{r, attributes,echo=TRUE, eval=TRUE}
glimpse(routine_data)
```
From the above results, you can see that the data frame consists of 11,604 observations (rows) and 14 variables (columns). Each variable's name and data type is also listed. The dataset can be read as `data.frame`, `lists`, `tbl_df`, `spatial`. Please refer my previous post on [**Mastering RStudio: A Beginner’s Guide**](https://yalemzewodgelaw.com/tutorials/2022-05-24-introduction-to-rstudio/) for detail note about data structure in R and supplementary [YouTube videos .](https://youtu.be/7MyAEQO3nqM?si=K0l2q8kU2hYaMUiW)
```{r, class,echo=TRUE,eval=FALSE}
class(routine_data)
```
For a data frame you can select the nth row, or the nth column using square brackets (note where the comma is paced).
```{r, second_col,echo=TRUE,eval=FALSE}
head(routine_data[,2], 5) #second column
```
```{r, first_row,echo=TRUE,eval=FALSE}
routine_data[1,] #firts row
```
To view all variable names with the `names()`, `colnames()` , and `variable.names()` function.
```{r, col_list,echo=TRUE,eval=FALSE}
colnames(routine_data)
names(routine_data)
variable.names(routine_data)
```
To view variable with the position number:
```{r,col_position,echo=TRUE,eval=FALSE}
names(routine_data)[5]
```
There are several incorrect data types in this dataset, but let's continue using the "date" variable to demonstrate how to identify and update these errors:
```{r, data_type,echo=TRUE,eval=FALSE}
typeof(routine_data$date)
```
"Character" is returned but the variable should in fact be a date. you can use the `as.Date` function of `lubridate` package to change the data type accordingly:
```{r, as_date, echo=TRUE, eval=TRUE}
#as.Date(routine_data$date) # to date
```
![*Convert to date from a character variable*](C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/scripts/tutorials/2024-06-10-Data-manipulation-in-dplyr/media/date.png)
String inconsistencies: This includes typos, capitalization errors, misplaced punctuation, or similar character data errors that might interfere with data analysis. Take for instance your "Year" column. As you can see there are wrongly entered year 21 and 3021.
```{r, unique_year,echo=TRUE, eval=TRUE}
unique(routine_data$Year)
```
**Outliers**:
The dataset has 8 continuous variables: *test_performed*, *confirmed_all, confirmed_u5, confirmed_5_14, confirmed_15, population, pop_5, pop_514, pop_15*. To get a feeling for how the data is distributed, you can plot histograms for case variables:
Under 5 confirmed malaria cases:
```{r, hist_u5, echo=TRUE, eval=TRUE}
hist(routine_data$confirmed_u5)
```
All age confirmed malaria cases:
```{r, hist_all,echo=TRUE, eval=TRUE}
hist(routine_data$confirmed_all)
```
Data cleaning is one of the most important steps for analysis! No matter where your data comes from, always be sure the completeness, consistency, and trustworthiness of data before analyzing and utilizing the data. Once you get cleaned and organized, you can perform analysis to find clear and objective answers to any data question.
## Data manipulation in dplyr()
In this section, you’ll explore the powerful dplyr package, part of the tidyverse ecosystem. Tidyverse is a collection of R packages for data science, designed to make cleaning and analyses of data easy and tidy.
**dplyr** streamlines your data wrangling process, making it easier to work with data frames.
Let’s dive into the key functions:
::: callout-note
***select()***: Choose specific variables (columns) from a data frame based on their names.
***rename()***: Transform and rename variables.
***filter()***: Select rows that meet specific criteria.
***mutate()***: Create new variables by applying functions to existing ones.
***group_by()***: Perform operations by group.
***summarize()***: Aggregate data.
***arrange()***: Order rows based on a specified column.
***distinct()***: Remove duplicate rows based on specified columns.
:::
These functions simplify common data manipulation tasks.
#### The Pipe Operator `%>% OR "|>"`
This operator allows you to chain commands together. Instead of creating intermediate variables, you can directly link functions.
You use `pipes` when you creating intermediate variables to link commands together . For example, the idea is instead of using :
```{r,select, echo=TRUE, eval=TRUE}
select(routine_data, Year)
```
Building blocks that code readability and reproducibility, making your data workflows more efficient.
```{r, pipe, echo=TRUE, eval=TRUE}
routine_data %>%
select(Year)
```
#### Selecting Columns:
The `select()` function in R allows you to choose specific columns (variables) from a data frame. Specify the data frame name as the first argument, followed by the column names you want to keep. Example:
```{r, colu, echo=TRUE, eval=TRUE}
routine_data %>%
select(region,zone,district,
Month,date,confirmed_all,test_performed)
```
To select columns that have common prefix or suffix, you can use the `start_witth()` `contains()`, or `end_with()` functions. Example to subset columns that captured confirmed cases
```{r, start, echo=TRUE, eval=TRUE}
routine_data %>%
select(
starts_with("confirmed")
)
```
contains()
```{r, contain, echo=TRUE, eval=TRUE}
routine_data %>%
select(
contains("confirmed")
)
```
end_with()
```{r, end, echo=TRUE, eval=TRUE}
routine_data %>%
select(
ends_with("5")
)
```
#### Filtering Rows:
The `filter()` function helps you retain only the rows (observations) that meet specific conditions.
For instance: To select rows for the *Tigray* region:
```{r, filter, echo=TRUE, eval=TRUE}
routine_data%>%
filter(region == "Tigray")
```
To choose rows with more than 500 confirmed cases in children under 5:
```{r, pick, echo=TRUE, eval=TRUE}
routine_data %>%
filter(confirmed_u5 > 500)
```
To filter by multiple conditions (e.g., regions “Amhara” or “Afar” with at least 500 confirmed cases in children under 5 ):
```{r, filter_m, echo=TRUE, eval=TRUE}
routine_data %>%
filter(
(region == "Amhara" | region == "Afar") &
(confirmed_u5 >= 500))
```
| *Identifying Erroneous Data:*
You can use `filter()` to spot inconsistencies. For instance, find rows where the number of confirmed cases exceeds the number tested:
```{r, inco, echo=TRUE, eval=TRUE}
routine_data %>%
filter(confirmed_all > test_performed)
```
#### Renaming variable
There may be situations when you want to rename variables in a data frame to make it more comprehensive and easier to process. The `rename()` function allows you to change column names in a data frame. It’s useful for making variable names more descriptive. you pass to this function the data frame you are working with, *`rename(dataframe, new_name = old_name).`*
Example, if you wanted to change the variable "confirmed_u5" to "conf_u5", and to overwrite the object "routine_data" with this you would simply write:
```{r, rename, echo=TRUE, eval=TRUE}
routine_data %>%
rename(conf_u5 =confirmed_u5)
```
#### Creating new variable
The ***`mutate()`*** function lets you add new variables or modify existing ones. You use the **`=`** sign to assign new values . For example, to calculate the incidence rate of malaria in children under five per 1000 population using the formula `(confirmed_u5 / pop_5) * 1000`, you could write:
```{r,prop, echo=TRUE, eval=TRUE}
routine_data %>%
mutate(incidence_rate = (confirmed_u5/pop_5)*1000)
```
#### Altering existing variable
Suppose you noticed an error in the region name, where “South Western Ethiopia” was mistakenly entered as “Sou.” You can correct this using an **`ifelse`** statement within **`mutate()`**. The condition is **`region == "Sou"`**, and if it’s met, you replace the value with “South Western Ethiopia.” Otherwise, you keep the original value.
Here’s how you can do it:
```{r, alter, eval=FALSE, echo=TRUE}
routine_data %>%
mutate(region =
ifelse(region == "Sou",
"South Western Ethiopia",
region))
```
This will update the **`region`** variable in your data frame.
| Using `case_when` for Multiple Changes:
If you need to make multiple changes based on different conditions, consider using **`case_when`** instead of `ifelse`. Let’s say there are other errors in the district names. You can correct them simultaneously using **`case_when`**.
There is a district name mismatch between district names in **DHIS2** and names in ***shapefile***. For this you will use the `task_data.csv`
```{r, multiple_change, eval=TRUE, echo=TRUE}
task_data <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/data/task_data.csv") %>%
data.frame() %>%
mutate(district_new = case_when(
district == "Addis Ketema" ~ "Addis Ketema (AA)",
district == "Nifas Silk Lafto" ~ "Nefas Silk",
TRUE ~ district)) %>%
select(-district) %>%
rename("district" = "district_new") #rename to district
```
![district_old vs district_new](C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/scripts/tutorials/2024-06-10-Data-manipulation-in-dplyr/media/name_change.png)
Change the Year column
```{r,match_year, echo=TRUE, eval=TRUE}
routine_data %>%
mutate(Year =
case_when(Year == 21 ~ 2021,
Year == 3021 ~ 2021,
.default = Year
)
)
unique(routine_data$Year)
```
#### Removes duplicate rows
The distinct() function removes duplicate rows from a data frame based on specified columns. You can use it to keep only unique rows or to remove duplicates based on specific columns.
To remove duplicate rows based on the district column, you can use:
```{r, keep_all, echo=TRUE, eval=TRUE}
routine_data %>%
distinct(district,
.keep_all = TRUE)
```
```{r, count, echo=TRUE, eval=TRUE}
routine_data %>%
select(district) %>%
distinct() %>%
count()
```
#### **Null values**
*Null values* are treated differently in R. They appear as `NA` in the dataset, so you may expect the following code to work for filtering data to remove all missing values for the number of people tested for malaria:
```{r, na, echo=TRUE, eval=TRUE}
routine_data %>%
filter(test_performed =NA)
```
However, this does not work as R has a special way of dealing with missing values. You use the `is.na()` command, which checks fo `NA` values. As with the equals command, if you want the reverse of this, i.e. "not NA" you can use `!is.na()`. So the code to remove missing values would be:
```{r,is_na, echo=TRUE, eval=TRUE}
routine_data %>%
filter(!is.na(test_performed))
```
Another method for removing missing data in tidyverse is using the `drop_na()` function from *{tidyr}* package. As with the filter function this takes the dataset as the first argument, followed by the variables for which you are dropping NA values.
```{r, drop_na, echo=TRUE, eval=TRUE}
routine_data_no_NA <- routine_data %>%
drop_na(test_performed)
```
#### Sorting and reordering data frames
Sorting a data frame by rows and reordering columns is easy in R. To sort a data frame by a column you use the function `arrange()`. You specify the data frame and the column to sort by, and the default is to sort in `ascending` order. To sort in a descending order you can specify this with `desc().` Additionally, you can sort by multiple variables, and sorting will be undertaken in the order they appear in the command.
```{r,sort, eval=FALSE, echo=TRUE}
routine_data %>%
arrange(confirmed_u5) # the default is asce
```
```{r, desc, echo=TRUE, eval=TRUE}
routine_data %>%
arrange(desc(confirmed_u5))
```
How to change the order of the column
The `relocate()` function is part of the `dplyr` package in R. It allows you to change the order of columns in a data frame. You can specify where a particular column should be placed relative to other columns using the `.before` and `.after` arguments.
The basic syntax of the relocate() function as follows:
`relocate(data, column_name, .before = target_column, .after = target_column)`
- **`data`**: The data frame containing the columns.
- **`column_name`**: The name of the column you want to move.
- **`.before`**: Specify the column name before which the target column should be placed.
- **`.after`**: Specify the column name after which the target column should be placed.
**Example:** Suppose we have a data frame called **`task_data`** with columns: “zone,” “district,” and “Month.” We want to move the “district” column after the “zone” column and before the “Month” column.
```{r, locate, echo=TRUE, eval=TRUE}
task_data %>%
relocate(district, .before = Month) %>%
head()
```
#### Summarizing data
There are some useful functions in tidyverse to help you summarize the data. The first of these is the `count()` function. This is a quick function which will allow you to quickly count the occurrences of an item within a dataset.
```{r, summary, echo=TRUE, eval=TRUE}
routine_data %>%
filter(region=="Amhara") %>%
group_by(region) %>%
count(district)
```
By including multiple variables in the command you can count the numbers of times that combination of variables appears.
```{r, n, echo=TRUE, eval=TRUE}
routine_data %>%
count(region, Month)
```
If you want to summarize numerical variables you can use the function `summarise().` This is used in conjunction with other mathematical functions such as `sum()`, `mean()`, `median()`, `max()`..
```{r, na_rm, echo=TRUE, eval=TRUE}
task_data %>%
summarise(total_case = sum(female, male, na.rm = TRUE),
mean_cases = mean(conf_5),
median_cases = median(conf_5),
max_cases = max(conf_5),
sd_case = sd(conf_5))
```
you can combine the `summarise()` function with `group_by()` to summarize the data by different variables in the dataset. To calculate the total number of people tested and positive for malaria in each district in our dataset, you would group by this variable first and then summarize the data. Grouping is not restricted to one variable, if you wanted to group the data by location and date then both variables would be included in the command. When you use the `sum()` function , `na.rm = T` logic required if the column has NULL values
```{r, sum, echo=TRUE, eval=TRUE}
routine_data %>% group_by(region) %>%
summarise(total_test = sum(test_performed),
total_positive = sum(confirmed_all),
total_u5 = sum(confirmed_u5),
total_514 = sum(confirmed_5_14),
total_ov15 = sum(confirmed_15))
```
**Subsetting data**
- Subsetting refers to extracting a portion of your data based on specific conditions. In this case, we want to focus on data related to the Amhara region.
- The **`select()`** function allows us to choose specific columns from our data frame. We’ll keep the following columns: “region,” “zone,” “district,” “test_performed,” “confirmed_all,” and “confirmed_u5.”
- We’ll group our data by the “zone” column using the **`group_by()`** function. This means that subsequent calculations will be performed within each zone.
- The **`summarise()`** function computes summary statistics for each group (in this case, each zone).
```{r, subset, echo=TRUE, eval=TRUE}
routine_data %>%
select(region, zone, district,
test_performed,confirmed_all,confirmed_u5) %>%
filter(region%in%"Amhara") %>%
group_by(zone) %>%
summarise(total_positive = sum(confirmed_all, na.rm = T),
total_u5 = sum(confirmed_u5, na.rm = T),
prop_u5 = round((total_u5/total_positive)*100,2))
```
Know to take everything you have learnt to import and clean the routine dataset. If you want your output to contain the total numbers of malaria tests performed and the number of confirmed cases in children under 5, people over 5, and calculate a total for all ages. you want to have the total by district level and Year in the dataset. This is how you would go about building the code.
```{r,all, echo=TRUE, eval=TRUE}
clean_routine_data <- routine_data %>%
# subset
dplyr::select(region, zone, district, Month, Year,
test_performed,confirmed_all,
confirmed_u5,confirmed_5_14,confirmed_15) %>%
drop_na(test_performed,confirmed_all,
confirmed_u5,confirmed_5_14,confirmed_15) %>%
# filter rows if test number is less than confirmed
filter(test_performed>confirmed_all) %>%
# update region
mutate(region = case_when(
region =='Sou'~'South Western Ethiopia',
TRUE~region),
# update year
year = case_when(Year == 3021 ~ 2021,
Year == 21 ~ 2021,
TRUE ~ Year),
# to date
date_reported = make_date(year = Year,
month = Month),
conf_ov5 = confirmed_5_14+confirmed_15) %>%
# aggregate by region, zone, and district
group_by(region, zone, district) %>%
summarise(test_total = sum(test_performed),
conf_total = sum(confirmed_all),
conf_u5 = sum(confirmed_u5),
conf_ov5 = sum(conf_ov5)) %>%
# deselect test_total, -conf_total
dplyr::select(-test_total, -conf_total)
# top six
head(clean_routine_data)
```
### Advanced manipulation of data frames
In this section you introducing some more advanced functions for data manipulation. you will be using the "clean_routine_data" dataset you just created.
#### Reshaping data
Reshaping or pivoting data is an important part of data cleaning and manipulation. Tidyverse has introduced the functions `pivot_wider()` and `pivot_longer()` to improve the ease of reshaping data in R.
`pivot_longer()` takes a wide dataset and converts it into a long one, decreasing the number of columns and increasing the number of rows. Datasets are often created in a wide format, but for analysis a long format is often preferable, especially for data visualization.
To reshape the data long you need to pass the argument the columns which are being pivoted, a name for the new column to identify the columns being reshaped, and a name for the values of the columns being reshaped. you can also combine this with helper functions such as `starts_with()` to help identify the columns to reshape. For this demonstration you will use the "clean_data_routine.csv". To reshape the dataset into long format :
```{r, to_long, echo=TRUE, eval=TRUE}
clean_routine_data <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/data/clean_routine_data.csv")
clean_long <- clean_routine_data %>%
pivot_longer(cols = starts_with("conf"),
names_to = "age_group",
names_pattern = "conf_(.*)",
values_to = 'cases')
```
to convert the `clean_long` dataset to wide forma:\
```{r, to_wide, eval=FALSE, echo=TRUE}
clean_long |>
pivot_wider(id_cols = c('region',
'zone',
'district'),
names_from = age_group,
values_from = cases)
```
There are a range of different options in this function to help pivot the data in the cleanest way possible. To see these you can look at the vignette by typing the code `vignette("pivot")`.
#### Joining data frames
In R you can easily join two data frames together based on one, or multiple variables. There are the following options for joining two data frames, `x`and `y`:
- `inner_join()`: includes all rows that appear in both x and y
- `left_join()`: includes all rows in x
- `right_join()`: includes all rows in y
- `full_join()`: includes all rows in x or y
- `anti_join()`: return all rows in one data frame that do not have matching values in another data frame
To run the command you need to pass it the data frames you wish to join and the variable(s) you wish to join by. If there are matching variables in the data frames these will be detected and you do not need to specify them.
If you have two data frames with varying numbers of rows, you can investigate the outcomes of using the different join commands. Firstly, you create two data frames of different lengths, `tested`, and `confirmed`, then look at what the commands and outcomes would be.
![Data frame Joining in dplyr](C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/scripts/tutorials/2024-06-10-Data-manipulation-in-dplyr/media/join.png)
```{r, join,echo=TRUE,eval=TRUE}
tested <- data.frame(year = c(2015, 2016, 2017, 2018, 2019, 2020),
tested = c(1981, 1992, 2611, 2433, 2291, 2311))
positive <- data.frame(year = c(2013, 2014, 2015, 2016, 2017, 2018),
positive = c(1164, 1391, 981, 871, 1211, 998))
# Command written in full
inner_join(tested, positive, by = "year")
# Using the pipe operator
tested %>%
inner_join(positive) # Keeps only matching records
left_join(tested, positive) # Keeps all records for the first dataset
tested %>% right_join(positive) # Keeps all records for the second dataset
tested %>% full_join(positive) # Keeps all records from both datasets
anti_join(tested, positive) #Keep all rows in tested that do not have matching values in positive dataset
```
You can also join datasets which have different names for the variables you wish to join on. Say if the variable for "year" in the positive dataset was "yr", you could use the command:
```{r,join_cmd, echo=TRUE, eval=TRUE}
positive <- data.frame(yr = c(2013, 2014, 2015, 2016, 2017, 2018),
positive = c(1164, 1391, 981, 871, 1211, 998))
tested %>% inner_join(positive,
by= c("year"="yr"))
```
#### Writing data
Once you have finished working on your data there are multiple ways you can save your work in R.
One of the basic ones is to save your dataset as a csv. This is useful as it can easily be opened in other software (such as excel). You may also want to save your data as a *Stata (dta)* file.
R has some specific formats you can use to store data, these are `.RDS` and `.RData`. RDS files store one R object, whilst *RData* can store multiple objects.
Here, you can select which format you want to save the data in and save the `malaria_incidence_2021` data frame you created in this module.
Similarly to importing data, you can use the base R functions of \`write.csv()\`, or preferably the `tidyverse` option if `write_csv()`.
```{r, write_data, echo=TRUE, eval=TRUE}
#write_csv(clean_routine_data, "clean_routine_data.csv")
#write_dta(clean_routine_data, "outputs/clean_routine_data.dta")
#saveRDS(clean_routine_data, "outputs/clean_routine_data.RDS")
```
# Reference
1. [MAP training material](https://malaria-atlas-project.gitlab.io/intro-to-spatial-analysis-for-infectious-diseases/02_datahandling.html)
2. [R for Reproducible Scientific Analysis](https://umn-dash.github.io/r-novice-gapminder/aio.html)
3. [Dataframe Manipulation with dplyr](https://r-crash-course.github.io/13-dplyr/)