What’s in it for you?

Overview of RStudio
Operators in R
Coding and arithmetic operations
Variable assignment, data types and structures
Installing packages
Getting help in R
Import and explore data

Overview

The basic layout of RStudio will have three panels

Console (entire left) - the interactive panel where you can type and execute R commands and it displays the output of those commands
Environment/History/Connections (tabbed in upper right) - shows loaded variables and their values
Files/Plots/Packages/Help/Viewer(tabbed in lower right) - displays project files and directories, show plots, list installed packages, provide access to R documentation

Project management in Rstudio

In RStudio, a project is a self-contained environment that manages all the files associated with a particular set of analyses or tasks. It’s a powerful tool for organizing your work, maintaining reproducibility, and simplifying collaboration.

Steps to set up

Create a New Project:

Click on File > New Project
Choose New Directory for a new project or Existing Directory if you want to associate the project with an already existing folder.
Follow the prompt to name your project and choose its location on your computer.

Open Existing Project:

Navigate to the project directory and open the .Rproj file

Using Version Control:

During the project creation, you can also initialize a Git repository if you want to use version control, which is highly recommended for tracking changes and collaborating with others.

Once a project is set up, RStudio will automatically set your working directory to the project’s root folder each time you open it, which is incredibly convenient for file management and relative paths in your code.

Set working directory

Set working directory is to tell R where to look for files and where to save outputs.

Note

To set working directory - which is the folder where your R session is focused - you can use the following method

Using the setwd() function:

Type setwd("path/to/your/directory") and replace '``path/to/your/directory' with the actual path to your folder. Make sure to use forward slashes / or double slashes in your path

You can also use the graphic interface:

Go to the Session menu at the top of RStudio.
Choose Set Working Directory.
Select Choose Directory… and navigate to the folder you want to set as your working directory.

path = "C:/Users/User/Documents/R_training"
setwd(path) # tell R to access the file from 'R_training' folder
getwd() # check the file folder

Operators in R

Basic arithmetic and logical operators to perform mathematical operations and expressions in R are:

Operators	Example
+ addition	Add two numbers `x + y`
- subtraction	Subtracts one number from the other `x - y`
/ division	Divides one number by another `x/y`
%% reminder/modulus	Reminder of the devision of one number by another `x%%y`
^ exponent	Raises a number to the power of another `x^2`
< Less than	`x < y` a logical comparison that checks if each element of the vector `x` is less than the corresponding elements of the vector `y`. The results are TRUE or FALSE.
<= less than or equal to	`x <= y` a logical comparison that checks if each element of the vector `x` is less than or equal to the corresponding elements of the vector `y`. The results are TRUE or FALSE.
> Greater than	`x > y` a logical comparison that checks if each element of the vector `x` is greater than to the corresponding elements of the vector `y`. The results are TRUE or FALSE.
>= Greater than	`x >= y` a logical comparison that checks if each element of the vector `x` is greater than or equal to the corresponding elements of the vector `y`. The results are TRUE or FALSE.
== Equal to	`x == y` a logical comparison that checks if each element of the vector `x` is equal to the corresponding elements of the vector `y`. The results are TRUE or FALSE.
= /<- Assign variable	<- A common assignment operator in R. It used to assign values to variables. `x <- 10` `Z <- c(1, 2, 3, 4, 5)` = also used for assignment in R and works similarly to <- `x = 10` `Z = "Hello, World"`
!= Not equal to	!= used for inequality comparison and returns a logical vector of TRUE or FALSE x != y
& AND	& used for element-wise logical “AND”. It returns TRUE only if both corresponding elements of the operator are TRUE. `x <- c(TRUE, FALSE, TRUE, FALSE)` `y <- c(TRUE, TRUE, FALSE, FALSE)` `result <- x & y` [1] TRUE FALSE FALSE FALSE`
\| OR	\| used for element-wise logical “OR”. It returns TRUE if either corresponding elements of the operator are TRUE.
! Not	! is used for logical negation. It inverts the value of a logical expression: TRUE becomes FALSE `x <- TRUE` `result <- !x` `result` [1] FALSE
%<% and \|> Pipes	`%<%` and `\|>` are pipe operators from `dplyr` package used to pass the output of one function directly into another, which can help in creating a more clear and concise code. `library(dplyr)` `countries%>%` `filter(Capital_city %in% "Addis Ababa") %>%` `mutate(Region= "Eeast Afirca")` Without pipel `filter`(`countries`, `Capital_city ="Addis Ababa"`)
%IN% contained	`%in% is used to determine if elements of one vector are contained in another vector` `Example: x <- c(1, 2, 3, 4, 5)` `y <- c(3, 4, 5, 6, 7)` `x %in% y` `[1] FALSE FALSE TRUE TRUE TRUE`

Coding Basic Expression

Welcome to your first R programming experience if it’s your first time using R. In RStudio, you can write in two places: the console and R scripts.
R Console: This is where you can directly enter and run R commands:

R Script is the source files (with a .R extension) where you can save your R code. To create an R script, got to the File Menu and select New File > R Script. This will open a fourth pane in RStudio for your script. Using R Scripts allows you to save your code and return it later. Let’s get started! 📊

1 + 1 # sum

[1] 2

3^2 # sqrt

[1] 9

3**2

[1] 9

13%%2 #reminder/modules

[1] 1

8/4 # divided

[1] 2

2*4 # multiplication

[1] 8

Order of Operation

Understanding the order of arithmetic operations in R is crucial. The order from highest to lowest:

Parenthesis: ()

Exponential: ^

Multiplication: *

Division: /

Addition: +

Subtraction: -

Let’s practice using these operations:

(1 + (2^2  * 4*8)) - (4/3) #what is the answer?

[1] 127.6667

Mathematical functions:

# natural logarithm
log(1)

[1] 0

# base-10 logarithm
log10(10)

[1] 1

# e^(1/2)
exp(0.5)

[1] 1.648721

Character operation

R can also work with text.
R uses the print function to display the variables
The function paste and past0 used to concatenate texts and variables together. For your challenge, what do you notice the difference between paste() and paste0()

print("Hello World")

[1] "Hello World"

# assign variable
greeting <- "Hello"
name <- "Yalem"
message <- paste(greeting, name) 
message

[1] "Hello Yalem"

# paste0()
message2 <- paste0(greeting, name)
print(message2)

[1] "HelloYalem"

# rep()
rep("hello",10)

 [1] "hello" "hello" "hello" "hello" "hello" "hello" "hello" "hello" "hello"
[10] "hello"

Logical operation

A logical value is often created via a comparison between variables.

test = 89
conf = 76

conf < test

[1] TRUE

x = 42
y =144
z = 12
is.even <- (x %% 2 == 0)

z

[1] 12

x < y & is.even

[1] TRUE

x > y | x > z

[1] TRUE

Let’s create a scenario related to COVID-19 data analysis where you compare the number of cases and deaths in different countries.

# Assume we have COVID-19 data for three countries: USA, India, and Brazil
# We'll compare the total number of confirmed cases and deaths in these countries

# Define COVID-19 data for each country
USA_cases <- 1000000
USA_deaths <- 50000

India_cases <- 500000
India_deaths <- 20000

Brazil_cases <- 800000
Brazil_deaths <- 40000

# Compare the total number of cases and deaths
USA_more_cases <- USA_cases > India_cases & USA_cases > Brazil_cases
India_more_cases <- India_cases > USA_cases & India_cases > Brazil_cases
Brazil_more_cases <- Brazil_cases > USA_cases & Brazil_cases > India_cases

USA_more_deaths <- USA_deaths > India_deaths & USA_deaths > Brazil_deaths
India_more_deaths <- India_deaths > USA_deaths & India_deaths > Brazil_deaths
Brazil_more_deaths <- Brazil_deaths > USA_deaths & Brazil_deaths > India_deaths

# Display the comparison results
if (USA_more_cases) {
  print("USA has the highest number of confirmed cases.")
} else if (India_more_cases) {
  print("India has the highest number of confirmed cases.")
} else {
  print("Brazil has the highest number of confirmed cases.")
}

[1] "USA has the highest number of confirmed cases."

if (USA_more_deaths) {
  print("USA has the highest number of deaths.")
} else if (India_more_deaths) {
  print("India has the highest number of deaths.")
} else {
  print("Brazil has the highest number of deaths.")
}

[1] "USA has the highest number of deaths."

Variable assignment

In this scenario, we’ll use malaria surveillance data to demonstrate variable declaration in R.

Variables in R:

A variable name must start with a letter.
It can contain numbers, letters, underscores (_), and periods (.).
Variable names cannot start with a number or contain spaces.

name <- "Yalem"
assessment <- "malaria"
diagn <- "microscopy"
result <- "Pf"

# Create the message
message <- paste(name, "'s result is ", result, " positive.", sep = "")
print(message)

[1] "Yalem's result is Pf positive."

In R, the traditional method of assigning a value to a variable is using the left arrow <-. For example:

x <-  2

Notice that assigning a value does not print it. Instead, the value is stored in a variable. To see the value, you need to call the variable.

Let’s see this in action:

# Assign a value to the variable 'tested'
tested <- 420 

# Display the value of 'tested'
tested

[1] 420

Example: Calculating and Rounding a Ratio

# number
positive = 284

# ratio
Test_positive <- positive/tested # float

The round() function rounds its first argument to the specified number of decimal places (default is 0). Use ?round to see the documentation for the round() function.

# round the result
round(Test_positive,3)*100 

#?round

You can also assign a date as a variable

# assign today 
today = Sys.Date() 
# print the date with the text
paste("Today is", today)

[1] "Today is 2024-07-11"

Data types in R

R has several basic data types:

*Data Types in R*
Data types	Description
Numeric	Numbers, including integers and floating-point numbers.
Character	Text strings.
Logical	Boolean values (`TRUE` or `FALSE`).
Integer	Whole numbers.
Complex	Complex numbers with real and imaginary parts

Let’s explore these in practice

Numeric

# Numeric
num <- 42
print(num)

[1] 42

Character

char <- "malaria"
print(char)

[1] "malaria"

Logical/Boolean

log <- TRUE
print(log)

[1] TRUE

Integer

int <- 5L
print(int)

[1] 5

Complex

comp <- 1+2i
print(comp)

[1] 1+2i

How can you check the data type?

#?typeof()

Data Structures in R

R has several data structures to organize data:

Vector: A sequence of data elements of the same basic type.

vector <- c(1, 2, 3, 4, 5)
print(vector)

[1] 1 2 3 4 5

**Indexing**

letter <- c('a', 'b','c', 'd', 'e')
letter[3] # retrive the third value

[1] "c"

List: An order of sequence of items that can have different data types.

    list <- list(name="Yalem", assessment="malaria", result="Pf", age = 35, height = 1.68)
    print(list)

$name
[1] "Yalem"

$assessment
[1] "malaria"

$result
[1] "Pf"

$age
[1] 35

$height
[1] 1.68

    # retrive third element of the list
    list[3]

$result
[1] "Pf"

::: callout-tip
***Challenge: update the list.***

Add the `weight`variable at the end with the value of 65 and update the `result` with 'neg' instead of 'pf'
:::

    list$weight <- 64

    print(list)

$name
[1] "Yalem"

$assessment
[1] "malaria"

$result
[1] "Pf"

$age
[1] 35

$height
[1] 1.68

$weight
[1] 64

Matrix: A 2-dimensional array where each element has the same type.

matrix <- matrix(1:6, nrow=2, ncol=3)
print(matrix)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Data Frame: A table where each column can contain different types of data.
```
malaria_data <- data.frame(name=c("Yalem", "Anna"), result=c("Pf", "Neg"))
print(malaria_data)

m_data <- cbind(name, result)
#? cbind()
```
Caution

May I ask you to identify the difference between malaria_data and m_data ?

Factor: Used to handle categorical data.

factor <- factor(c("Pf", "Neg", "Pf"))
print(factor)

[1] Pf  Neg Pf 
Levels: Neg Pf

Installing packages

A pacakge is a collection of functions developed by R experts for various purposes. R comes with built-in base function like sum(), mean(), summay(), and others. Additionally, you can install extra packages from CRAN (Comprehensive R Archive Network) or from GitHub repositories of the package developers. You can install packages from the CRAN in two ways: using the RStudio menu or via the command line.

Installing Packages Using the RStudio Menu:

Open RStudio: Start RStudio on your computer.
Navigate to Tools: Click on the Tools menu at the top of the screen.
Install Packages: Select Install Packages... from the drop down menu.
Choose Package: In the dialogue box, type the name of the package you want to install (e.g., tidyverse).
Install: Click Install to download and install the package from CRAN.

Install package

For example installing package via the command line

# Install for the first time
install.packages("tidyverse")

# Check if the package 'tidyverse' is already installed; if not, install it
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

There are two ways of loading installed packages in R: library() and require(), but they have some differences in terms of behaviour and usage.

The library() function is commonly used to load packages. If the specified package is not installed, library() will output an error and stop the execution of the code.

library(sf) #Error in library(sf) : there is no package called ‘sf’

Warning: package 'sf' was built under R version 4.3.3

Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

The require() function is useful for conditional loading within functions or scripts where you want running even if the package is missing. If the specified package is not installed, require() will output a warning but continue executing the code.

require(sf) #Warning: there is no package called ‘sf’

Getting Help in R

R has extensive documentation and help features to assist users. There are different options to search for help:

Help function: use ? followed by the function name to get help on that function ?round

# ?round()

Help search: Use help.search() to find help pages related to a topic.

# help.search("regression")

Vignettes: Detailed guides and documentation provided by package authors.

#vignette("ggplot2-specs")

R Help Website: Access the official R documentation online at CRAN R Documentation and R community help stack overflow.

Importing and exploring data

R provides various ways to import and explore data. You can use built-in datasets, import data from a website, or load data from a local file.

Load Built-In Dataset

R comes with several built-in datasets that are included in the datasets package. These datasets are useful for practice and learning.

To load and explore a built-in dataset, you need to load the datasets package first. Here’s how you can do it:

# load the dataset package
library(datasets)

# Load a built-in dataset, for example, the 'iris' dataset
data(iris)

# View the first few rows of the dataset
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# Summary statistics of the dataset
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

Import Data from a Website

You can import data directly from a website. COVID19 data is available online at COVID_19 HUB. As the file is stored as a ZIP file, to read and execute the content of a zipped CSV file from a URL in R, you can follow these steps:

Download the ZIP file from the URL.
Unzip the file.
Read the CSV file into R.

In the below example, you can load the COVID-19 data by country. Here’s an example:

# Install and load necessary packages
install.packages("downloader")
install.packages("utils")

library(downloader)
library(utils)

# Define the URL and destination file
url <- "https://storage.covid19datahub.io/country/AUS.csv.zip"
destfile <- "AUS.csv.zip" # AUstralia

# Download the ZIP file
download(url, destfile, mode = "wb")

# Unzip the file
unzip(destfile, exdir = ".")

# Read the CSV file into R
Asutralia_data <- read.csv("ETH.csv")

# Display the first few rows of the data
head(Asutralia_data)

Import Data From local file

Data can be stored in different file types such as csv, .xlsx, .dta and .sav. Remember, you need to know the file formats of our data to install any necessary packages (e.g., readxl, haven) using install.packages(“package_name”).

Let’s break down the information about reading different file types into R:

CSV files: The most common file format is CSV (Comma-Separated Values). To import CSV files into R, you can use the read.csv() function from the base R package.

Example using read.csv():

aus <- read.csv("C://Users/User/Documents/R_training/Tutorial_R/R_Tutorial/AUS.csv")

Example using read_csv():

aus2 <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/AUS.csv")

Whilst read.csv() is the most common way to read data into R, there is the alternative read_csv() function (note the underscore). This is part of the tidyverse group of packages and is often a better option for reading in CSV files. This function is quicker, it reads the data in as a tibble instead of a data frame and allows for non-standard variable names among other benefits.
There are various options you can use when importing data, such as whether to include headers, what character to use for decimal points, and what to import as missing values. To explore these options you can look at the help pages e.g. ?read_csv.
Excel Files .xlsx: To read Excel files into R, use the read_excel() function from the readxl package.

When reading files in R, you might encounter the use of / and \:

/: In R, / is used as the directory separator in file paths. For example, "C://Users/User/Documents/R_training/Tutorial_R/R_Tutorial/Leprosy_am_14.xlsx" uses / to specify the directory structure.

Example:

# read one sheet, 2014_Q1
library(readxl) 
leprosy_q1 <- read_excel("C://Users/User/Documents/R_training/Tutorial_R/R_Tutorial/Leprosy_am_14.xlsx",sheet = "2014_Q1", col_names = TRUE, na = "NA")

Stata Files .dta: To read the Stata files, use the read_dta() function from the haven package (part of the tidyverse).

Example:

library(haven)
my_stata_data <- read_dta("C://Users/User/Documents/R_training/Tutorial_R/R_Tutorial/data_exr.dta")

Other Data Formats: Other formats like SPSS, Matlab, and binary files can also be read using specific functions.
For SPSS files, use read_sav() from the haven package.
For Matlab files, use readMat() from the R.matlab package.
For binary files, explore relevant functions based on your specific needs

About the data:

The data I used for this demonstration is a sample of routine leprosy data from Amhara Region, Ethiopia Leprosy surveillance system in 2014. The data were collected and collated at the district level (third administrative system) and stored in Excel file with four sheets. Let’s explore the data

library(readxl)

Warning: package 'readxl' was built under R version 4.3.3

# Install and load the tidyverse package if required
#install.packages("tidyverse", "janitor")
#library(tidyverse) # data management and visualization
#library(janitor) #clean column 
# Specify the path to your Excel file
excel_file <- "C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/Data/Leprosy_am_14.xlsx"
excel_file <- read_excel(excel_file)
# Read all sheets into a list
#all_sheets <- excel_sheets(excel_file) %>%
 # map(~ read_excel(excel_file, sheet = .x)) %>%
 # bind_rows(.id = "sheet_name") %>%
#clean_names() #  Make the column names readable

Inspect the data

To get an overview of the data, you have several options:

View the whole data: You can view the entire dataset by clicking on it in the global environment window. Alternatively, you can use the command View(routine_data) to open a window displaying the data.

 # View(excel_file)

View the Top Five Rows: To quickly inspect the first few rows of the dataset, you can use the head() function to see the top rows and the tail() function to see the last rows. This function displays the first n rows of the data.

head(excel_file, 5) # top five

# A tibble: 5 × 10
   Year Woreda          quarter Leprosy (new cases) (MB…¹ Grade II disability …²
  <dbl> <chr>           <chr>                       <dbl>                  <dbl>
1  2014 Ankasha Guagusa Q1                              0                      0
2  2014 Banja Shekudad  Q1                              0                      0
3  2014 Chagni          Q1                              1                      0
4  2014 Dangila         Q1                              1                      0
5  2014 Dangla Town     Q1                              0                      0
# ℹ abbreviated names: ¹`Leprosy (new cases) (MB+PB)`,
#   ²`Grade II disability (new cases) (MB+PB)`
# ℹ 5 more variables: `New leprosy cases under 15` <dbl>,
#   `Treatment completed leprosy: MB` <dbl>,
#   `Cases in MB completed treatment cohort` <dbl>,
#   `Treatment completed leprosy: PB` <dbl>,
#   `Cases in PB completed treatment cohort` <dbl>

tail(excel_file, 5) # bottom five

# A tibble: 5 × 10
   Year Woreda        quarter Leprosy (new cases) (MB+P…¹ Grade II disability …²
  <dbl> <chr>         <chr>                         <dbl>                  <dbl>
1  2014 Yilmana Densa Q1                                1                      0
2  2014 Bahirdar Town Q1                                7                      2
3  2014 Dessie Town   Q1                                0                      0
4  2014 Gonder Town   Q1                                0                      0
5  2014 Total         <NA>                            159                     12
# ℹ abbreviated names: ¹`Leprosy (new cases) (MB+PB)`,
#   ²`Grade II disability (new cases) (MB+PB)`
# ℹ 5 more variables: `New leprosy cases under 15` <dbl>,
#   `Treatment completed leprosy: MB` <dbl>,
#   `Cases in MB completed treatment cohort` <dbl>,
#   `Treatment completed leprosy: PB` <dbl>,
#   `Cases in PB completed treatment cohort` <dbl>

Understand the structure of the data: To understand the structure of the data, you can use either the str() or glimpse() function.

str() provides a concise summary of the structure of the dataset, including the data types and the first few values of each variable.

glimpse() is part of the tidyverse ecosystem and offers a similar summary but with a focus on tibbles, providing additional information such as variable types and the first few values of each variable, displayed in a more compact format.

str(excel_file) # Use str() for a concise summary

tibble [154 × 10] (S3: tbl_df/tbl/data.frame)
 $ Year                                   : num [1:154] 2014 2014 2014 2014 2014 ...
 $ Woreda                                 : chr [1:154] "Ankasha Guagusa" "Banja Shekudad" "Chagni" "Dangila" ...
 $ quarter                                : chr [1:154] "Q1" "Q1" "Q1" "Q1" ...
 $ Leprosy (new cases) (MB+PB)            : num [1:154] 0 0 1 1 0 3 4 0 1 0 ...
 $ Grade II disability (new cases) (MB+PB): num [1:154] 0 0 0 0 0 0 2 0 0 0 ...
 $ New leprosy cases under 15             : num [1:154] 0 0 0 0 0 0 0 0 0 0 ...
 $ Treatment completed leprosy: MB        : num [1:154] 0 0 0 0 1 1 3 0 0 0 ...
 $ Cases in MB completed treatment cohort : num [1:154] 0 0 0 0 1 0 3 0 0 0 ...
 $ Treatment completed leprosy: PB        : num [1:154] 0 0 3 0 0 1 0 0 0 0 ...
 $ Cases in PB completed treatment cohort : num [1:154] 0 0 1 0 0 0 0 0 0 0 ...

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

glimpse(excel_file) # Use glimpse() for a tidyverse-friendly summary

Rows: 154
Columns: 10
$ Year                                      <dbl> 2014, 2014, 2014, 2014, 2014…
$ Woreda                                    <chr> "Ankasha Guagusa", "Banja Sh…
$ quarter                                   <chr> "Q1", "Q1", "Q1", "Q1", "Q1"…
$ `Leprosy (new cases) (MB+PB)`             <dbl> 0, 0, 1, 1, 0, 3, 4, 0, 1, 0…
$ `Grade II disability (new cases) (MB+PB)` <dbl> 0, 0, 0, 0, 0, 0, 2, 0, 0, 0…
$ `New leprosy cases under 15`              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Treatment completed leprosy: MB`         <dbl> 0, 0, 0, 0, 1, 1, 3, 0, 0, 0…
$ `Cases in MB completed treatment cohort`  <dbl> 0, 0, 0, 0, 1, 0, 3, 0, 0, 0…
$ `Treatment completed leprosy: PB`         <dbl> 0, 0, 3, 0, 0, 1, 0, 0, 0, 0…
$ `Cases in PB completed treatment cohort`  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…

Check the Class: You can check the class of the dataset, which could be either tbl or data.frame.

class(excel_file)

[1] "tbl_df"     "tbl"        "data.frame"

Select Specific Rows or Columns: If you want to select specific rows or columns, you can use square brackets.

head(excel_file[,2], 5) #View the second column

excel_file[1,] # view the first row

# A tibble: 1 × 10
   Year Woreda          quarter Leprosy (new cases) (MB…¹ Grade II disability …²
  <dbl> <chr>           <chr>                       <dbl>                  <dbl>
1  2014 Ankasha Guagusa Q1                              0                      0
# ℹ abbreviated names: ¹`Leprosy (new cases) (MB+PB)`,
#   ²`Grade II disability (new cases) (MB+PB)`
# ℹ 5 more variables: `New leprosy cases under 15` <dbl>,
#   `Treatment completed leprosy: MB` <dbl>,
#   `Cases in MB completed treatment cohort` <dbl>,
#   `Treatment completed leprosy: PB` <dbl>,
#   `Cases in PB completed treatment cohort` <dbl>

Check Column Names: Checking column names is essential for identifying any missing or misspelled variables. Ensure that variable names are clean, short, and readable.

colnames(excel_file) # using colnames

 [1] "Year"                                   
 [2] "Woreda"                                 
 [3] "quarter"                                
 [4] "Leprosy (new cases) (MB+PB)"            
 [5] "Grade II disability (new cases) (MB+PB)"
 [6] "New leprosy cases under 15"             
 [7] "Treatment completed leprosy: MB"        
 [8] "Cases in MB completed treatment cohort" 
 [9] "Treatment completed leprosy: PB"        
[10] "Cases in PB completed treatment cohort"

names(excel_file) # using names

 [1] "Year"                                   
 [2] "Woreda"                                 
 [3] "quarter"                                
 [4] "Leprosy (new cases) (MB+PB)"            
 [5] "Grade II disability (new cases) (MB+PB)"
 [6] "New leprosy cases under 15"             
 [7] "Treatment completed leprosy: MB"        
 [8] "Cases in MB completed treatment cohort" 
 [9] "Treatment completed leprosy: PB"        
[10] "Cases in PB completed treatment cohort"

variable.names(excel_file) # Assuming variable.names() is a custom function to retrieve variable names

 [1] "Year"                                   
 [2] "Woreda"                                 
 [3] "quarter"                                
 [4] "Leprosy (new cases) (MB+PB)"            
 [5] "Grade II disability (new cases) (MB+PB)"
 [6] "New leprosy cases under 15"             
 [7] "Treatment completed leprosy: MB"        
 [8] "Cases in MB completed treatment cohort" 
 [9] "Treatment completed leprosy: PB"        
[10] "Cases in PB completed treatment cohort"

Common Commands in R

$ : Used for accessing elements within an object.

ls() : Lists the objects in the current environment.

rm(list = ls()) : Removes all objects from the environment.

Ctrl + ENTER : Typically used in RStudio to execute the current line or selection.

Ctrl + C : Keyboard shortcut for copying.

Ctrl + L : Clears your console.

Alt + - : Keyboard shortcut for assigning values.

Reference

--- title: "Mastering RStudio: A Beginner’s Guide" description: | Essential Techniques and Tools for Effective R Programming author: - name: Yalemzewod Gelaw url: https://yalemzewodgelaw.com/ orcid: 0000-0002-5338-586 date: 2024-05-24 image: "media/rstudio.png" slug: beginner’s guide categories: [RStudio, R] toc: true toc-depth: 2 number-sections: false highlight-style: github format: html: self-contained: true code-fold: false code-summary: "Show the code" code-tools: true knitr: opts_knit: warning: false message: false --- <head> ```{=html} <script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-7959266255261935" crossorigin="anonymous"></script> ``` </head> ::: callout-note ## What's in it for you? - Overview of RStudio - Operators in R - Coding and arithmetic operations - Variable assignment, data types and structures - Installing packages - Getting help in R - Import and explore data ::: ## Overview The basic layout of RStudio will have three panels - Console (entire left) - the interactive panel where you can type and execute R commands and it displays the output of those commands - Environment/History/Connections (tabbed in upper right) - shows loaded variables and their values - Files/Plots/Packages/Help/Viewer(tabbed in lower right) - displays project files and directories, show plots, list installed packages, provide access to R documentation ### Project management in Rstudio In RStudio, a project is a self-contained environment that manages all the files associated with a particular set of analyses or tasks. It's a powerful tool for organizing your work, maintaining reproducibility, and simplifying collaboration. ::: {.callout-note appearance="simple"} **Steps to set up** 1. Create a New Project: - Click on `File` \> `New Project` - Choose `New Directory` for a new project or `Existing Directory` if you want to associate the project with an already existing folder. - Follow the prompt to name your project and choose its location on your computer. 2. Open Existing Project: - Navigate to the project directory and open the `.Rproj` file 3. Using Version Control: - During the project creation, you can also initialize a Git repository if you want to use version control, which is highly recommended for tracking changes and collaborating with others. ::: Once a project is set up, RStudio will automatically set your working directory to the project's root folder each time you open it, which is incredibly convenient for file management and relative paths in your code. ## Set working directory Set working directory is to tell R where to look for files and where to save outputs. ::: callout-note To set working directory - which is the folder where your R session is focused - you can use the following method 1. Using the `setwd()` function: Type ``` setwd("path/to/your/directory") and replace '``path/to/your/directory' with the actual path to your folder. Make sure to use forward slashes / or double slashes in your path ``` 2. You can also use the graphic interface: - Go to the **Session** menu at the top of RStudio. - Choose **Set Working Directory**. - Select **Choose Directory...** and navigate to the folder you want to set as your working directory. ::: ```{r,Set_working, eval=FALSE,out.width="100%"} path = "C:/Users/User/Documents/R_training" setwd(path) # tell R to access the file from 'R_training' folder getwd() # check the file folder ``` ## Operators in R **Basic arithmetic and logical operators to perform mathematical operations and expressions in R are:** +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Operators | Example | +===========================+==============================================================================================================================================================================================+ | \+ addition | Add two numbers `x + y` | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | \- subtraction | Subtracts one number from the other `x - y` | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | / division | Divides one number by another `x/y` | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | %% reminder/modulus | Reminder of the devision of one number by another `x%%y` | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | \^ exponent | Raises a number to the power of another `x^2` | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | \< Less than | | `x < y` a logical comparison that checks if each element of the vector `x` is less than the corresponding elements of the vector `y`. The results are **TRUE** or **FALSE**. | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | \<= less than or equal to | `x <= y` a logical comparison that checks if each element of the vector `x` is less than or equal to the corresponding elements of the vector `y`. The results are **TRUE** or **FALSE**. | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | \> Greater than | `x > y` a logical comparison that checks if each element of the vector `x` is greater than to the corresponding elements of the vector `y`. The results are **TRUE** or **FALSE**. | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | \>= Greater than | `x >= y` a logical comparison that checks if each element of the vector `x` is greater than or equal to the corresponding elements of the vector `y`. The results are **TRUE** or **FALSE**. | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | == Equal to | `x == y` a logical comparison that checks if each element of the vector `x` is equal to the corresponding elements of the vector `y`. The results are **TRUE** or **FALSE**. | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | = /\<- Assign variable | | **\<**- A common assignment operator in R. It used to assign values to variables. | | | | `x <- 10` | | | | `Z <- c(1, 2, 3, 4, 5)` | | | | **=** also used for assignment in R and works similarly to **\<-** | | | | `x = 10` | | | | `Z = "Hello, World"` | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | != Not equal to | | *!=* used for inequality comparison and returns a logical vector of **TRUE** or **FALSE** | | | | x != y | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | & AND | & used for element-wise logical "AND". It returns TRUE only if both corresponding elements of the operator are TRUE. | | | | | | | `x <- c(TRUE, FALSE, TRUE, FALSE)` | | | | `y <- c(TRUE, TRUE, FALSE, FALSE)` | | | | `result <- x & y` | | | | \[1\] TRUE FALSE FALSE FALSE\` | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | \| OR | \| used for element-wise logical "OR". It returns TRUE if either corresponding elements of the operator are TRUE. | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | ! Not | | **!** is used for logical negation. It inverts the value of a logical expression: **TRUE** becomes **FALSE** | | | | `x <- TRUE` | | | | `result <- !x` | | | | `result` | | | | \[1\] FALSE | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | %\<% and \|\> Pipes | | `%<%` and `|>` are pipe operators from `dplyr` package used to pass the output of one function directly into another, which can help in creating a more clear and concise code. | | | | `library(dplyr)` | | | | `countries%>%` | | | | `filter(Capital_city %in% "Addis Ababa") %>%` | | | | `mutate(Region= "Eeast Afirca")` | | | | Without pipel | | | | `filter`(`countries`, `Capital_city ="Addis Ababa"`) | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | %IN% contained | | `%in% is used to determine if elements of one vector are contained in another vector` | | | | `Example: x <- c(1, 2, 3, 4, 5)` | | | | `y <- c(3, 4, 5, 6, 7)` | | | | `x %in% y` | | | | `[1] FALSE FALSE TRUE TRUE TRUE` | +---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ## Coding Basic Expression | Welcome to your first R programming experience if it's your first time using R. In RStudio, you can write in two places: the console and R scripts. | R Console: This is where you can directly enter and run R commands: ![R Console](media/R_console.PNG){fig-align="center"} | R Script is the source files (with a `.R` extension) where you can save your R code. To create an R script, got to the `File Menu` and select `New File > R Script`. This will open a fourth pane in RStudio for your script. Using R Scripts allows you to save your code and return it later. Let's get started! 📊 ```{r, maths_operation, echo = TRUE, eval=TRUE} 1 + 1 # sum 3^2 # sqrt 3**2 13%%2 #reminder/modules 8/4 # divided 2*4 # multiplication ``` ::: callout-important ## Order of Operation Understanding the order of arithmetic operations in R is crucial. The order from highest to lowest: **Parenthesis:** `()` **Exponential:** `^` **Multiplication:** `*` **Division:** `/` **Addition:** `+` **Subtraction:** `-` ::: Let's practice using these operations: ```{r, order_of_operation, echo = TRUE, eval = TRUE} (1 + (2^2 * 4*8)) - (4/3) #what is the answer? ``` ## Mathematical functions: ```{r, maths_ope, echo = TRUE, eval = TRUE} # natural logarithm log(1) # base-10 logarithm log10(10) # e^(1/2) exp(0.5) ``` ## Character operation | R can also work with text. | R uses the print function to display the variables | The function `paste` and `past0` used to concatenate texts and variables together. For your challenge, what do you notice the difference between paste() and paste0() ```{r, text, echo = TRUE, eval=TRUE} print("Hello World") # assign variable greeting <- "Hello" name <- "Yalem" message <- paste(greeting, name) message # paste0() message2 <- paste0(greeting, name) print(message2) # rep() rep("hello",10) ``` ## Logical operation A logical value is often created via a comparison between variables. ```{r, comparing, echo = TRUE, eval = TRUE} test = 89 conf = 76 conf < test x = 42 y =144 z = 12 is.even <- (x %% 2 == 0) z x < y & is.even x > y | x > z ``` Let's create a scenario related to COVID-19 data analysis where you compare the number of cases and deaths in different countries. ```{r, comapr, echo = TRUE, eval = TRUE} # Assume we have COVID-19 data for three countries: USA, India, and Brazil # We'll compare the total number of confirmed cases and deaths in these countries # Define COVID-19 data for each country USA_cases <- 1000000 USA_deaths <- 50000 India_cases <- 500000 India_deaths <- 20000 Brazil_cases <- 800000 Brazil_deaths <- 40000 # Compare the total number of cases and deaths USA_more_cases <- USA_cases > India_cases & USA_cases > Brazil_cases India_more_cases <- India_cases > USA_cases & India_cases > Brazil_cases Brazil_more_cases <- Brazil_cases > USA_cases & Brazil_cases > India_cases USA_more_deaths <- USA_deaths > India_deaths & USA_deaths > Brazil_deaths India_more_deaths <- India_deaths > USA_deaths & India_deaths > Brazil_deaths Brazil_more_deaths <- Brazil_deaths > USA_deaths & Brazil_deaths > India_deaths ``` ```{r, display, echo = TRUE, eval = TRUE} # Display the comparison results if (USA_more_cases) { print("USA has the highest number of confirmed cases.") } else if (India_more_cases) { print("India has the highest number of confirmed cases.") } else { print("Brazil has the highest number of confirmed cases.") } if (USA_more_deaths) { print("USA has the highest number of deaths.") } else if (India_more_deaths) { print("India has the highest number of deaths.") } else { print("Brazil has the highest number of deaths.") } ``` ## Variable assignment <div> In this scenario, we'll use malaria surveillance data to demonstrate variable declaration in R. Variables in R: - A variable name must start with a letter. - It can contain numbers, letters, underscores (\_), and periods (.). - Variable names cannot start with a number or contain spaces. </div> ```{r, act_as_doctor, echo = TRUE, eval = TRUE} name <- "Yalem" assessment <- "malaria" diagn <- "microscopy" result <- "Pf" # Create the message message <- paste(name, "'s result is ", result, " positive.", sep = "") print(message) ``` In R, the traditional method of assigning a value to a variable is using the left arrow `<-`. For example: ```{r, variable, echo = TRUE, eval=TRUE} x <- 2 ``` Notice that assigning a value does not print it. Instead, the value is stored in a variable. To see the value, you need to call the variable. Let's see this in action: ```{r, assign_number, echo = TRUE, eval = TRUE} # Assign a value to the variable 'tested' tested <- 420 # Display the value of 'tested' tested ``` **Example: Calculating and Rounding a Ratio** ```{r,calculate, echo = TRUE, eval = TRUE} # number positive = 284 # ratio Test_positive <- positive/tested # float ``` The **`round()`** function rounds its first argument to the specified number of decimal places (default is 0). Use `?round` to see the documentation for the `round()` function. ```{r, rounding, echo = TRUE, eval=FALSE} # round the result round(Test_positive,3)*100 #?round ``` You can also assign a date as a variable ```{r, date_as_var, echo = TRUE, eval = TRUE} # assign today today = Sys.Date() # print the date with the text paste("Today is", today) ``` ## Data types in R R has several basic data types: | Data types | Description | |------------|---------------------------------------------------------| | Numeric | Numbers, including integers and floating-point numbers. | | Character | Text strings. | | Logical | Boolean values (**`TRUE`** or **`FALSE`**). | | Integer | Whole numbers. | | Complex | Complex numbers with real and imaginary parts | : *Data Types in R* Let's explore these in practice **Numeric** ```{r, num, echo = TRUE, eval = TRUE} # Numeric num <- 42 print(num) ``` **Character** ```{r, chr, echo = TRUE, eval = TRUE} char <- "malaria" print(char) ``` **Logical/Boolean** ```{r, logi, echo = TRUE, eval = TRUE} log <- TRUE print(log) ``` **Integer** ```{r, int, echo = TRUE, eval = TRUE} int <- 5L print(int) ``` **Complex** ```{r, complex, echo = TRUE, eval = TRUE} comp <- 1+2i print(comp) ``` **How can you check the data type?** ```{r, type, echo = TRUE, eval = TRUE} #?typeof() ``` ## Data Structures in R R has several data structures to organize data: - **Vector**: A sequence of data elements of the same basic type. ```{r, vector, echo = TRUE, eval = TRUE} vector <- c(1, 2, 3, 4, 5) print(vector) ``` **Indexing** ```{r, index, echo = TRUE, eval = TRUE} letter <- c('a', 'b','c', 'd', 'e') letter[3] # retrive the third value ``` - **List**: An order of sequence of items that can have different data types. ```{r, list, echo = TRUE, eval=TRUE} list <- list(name="Yalem", assessment="malaria", result="Pf", age = 35, height = 1.68) print(list) # retrive third element of the list list[3] ``` ::: callout-tip ***Challenge: update the list.*** Add the `weight`variable at the end with the value of 65 and update the `result` with 'neg' instead of 'pf' ::: ```{r, update, echo = TRUE, eval=TRUE} list$weight <- 64 print(list) ``` - **Matrix**: A 2-dimensional array where each element has the same type. ```{r, matrix, echo = TRUE, eval=TRUE} matrix <- matrix(1:6, nrow=2, ncol=3) print(matrix) ``` - **Data Frame**: A table where each column can contain different types of data. ```{r, data_f, echo = TRUE, eval=FALSE} malaria_data <- data.frame(name=c("Yalem", "Anna"), result=c("Pf", "Neg")) print(malaria_data) m_data <- cbind(name, result) #? cbind() ``` ::: callout-caution May I ask you to identify the difference between ***`malaria_data`*** and `m_data ?` ::: - **Factor**: Used to handle categorical data. ```{r, factor, echo = TRUE, eval=TRUE} factor <- factor(c("Pf", "Neg", "Pf")) print(factor) ``` ## Installing packages A `pacakge` is a collection of functions developed by R experts for various purposes. R comes with built-in base function like `sum()`, `mean()`, `summay()`, and others. Additionally, you can install extra packages from `CRAN (Comprehensive R Archive Network)` or from GitHub repositories of the package developers. You can install packages from the CRAN in two ways: using the RStudio menu or via the command line. **Installing Packages Using the RStudio Menu:** 1. **Open RStudio**: Start RStudio on your computer. 2. **Navigate to Tools**: Click on the **`Tools`** menu at the top of the screen. 3. **Install Packages**: Select **`Install Packages...`** from the drop down menu. 4. **Choose Package**: In the dialogue box, type the name of the package you want to install (e.g., **`tidyverse`**). 5. **Install**: Click **`Install`** to download and install the package from CRAN. ![Install package](media/install_pk.PNG){fig-align="center"} For example installing package via the command line ```{r, insatll_pck, echo = TRUE, eval=FALSE} # Install for the first time install.packages("tidyverse") # Check if the package 'tidyverse' is already installed; if not, install it if (!requireNamespace("tidyverse", quietly = TRUE)) { install.packages("tidyverse") } ``` There are two ways of loading installed packages in R: ***`library()`*** and ***`require()`***, but they have some differences in terms of behaviour and usage. The `library()` function is commonly used to load packages. If the specified package is not installed, *`library`()* will output an error and stop the execution of the code. ```{r, library, echo = TRUE, eval=TRUE, error=FALSE} library(sf) #Error in library(sf) : there is no package called ‘sf’ ``` *The require()* function is useful for conditional loading within functions or scripts where you want running even if the package is missing. If the specified package is not installed, **`require()`** will output a warning but continue executing the code. ```{r, require, echo = TRUE, eval=TRUE, warning=FALSE} require(sf) #Warning: there is no package called ‘sf’ ``` ## Getting Help in R R has extensive documentation and help features to assist users. There are different options to search for help: - Help function: use **`?`** followed by the function name to get help on that function `?round` ```{r, help_r, echo = TRUE, eval=FALSE} # ?round() ``` - Help search: Use `help.search()` to find help pages related to a topic. ```{r, help, echo = TRUE, eval=FALSE} # help.search("regression") ``` - **Vignettes**: Detailed guides and documentation provided by package authors. ```{r, help_guide, echo = TRUE, eval=FALSE} #vignette("ggplot2-specs") ``` - **R Help Website**: Access the official R documentation online at [CRAN R Documentation](https://cran.r-project.org/) and R community help [stack overflow](https://stackoverflow.com/). ## Importing and exploring data R provides various ways to import and explore data. You can use built-in datasets, import data from a website, or load data from a local file. 1. **Load Built-In Dataset** R comes with several built-in datasets that are included in the **`datasets`** package. These datasets are useful for practice and learning. To load and explore a built-in dataset, you need to load the **`datasets`** package first. Here's how you can do it: ```{r, data_in, echo = TRUE, eval = TRUE} # load the dataset package library(datasets) # Load a built-in dataset, for example, the 'iris' dataset data(iris) # View the first few rows of the dataset head(iris) # Summary statistics of the dataset summary(iris) ``` 2. **Import Data from a Website** You can import data directly from a website. COVID19 data is available online at [COVID_19 HUB](https://covid19datahub.io/). As the file is stored as a `ZIP` file, to read and execute the content of a zipped CSV file from a URL in R, you can follow these steps: 1. Download the ZIP file from the URL. 2. Unzip the file. 3. Read the CSV file into R. In the below example, you can load the `COVID-19` data by country. Here's an example: ```{r, web, echo = TRUE, eval=FALSE} # Install and load necessary packages install.packages("downloader") install.packages("utils") library(downloader) library(utils) # Define the URL and destination file url <- "https://storage.covid19datahub.io/country/AUS.csv.zip" destfile <- "AUS.csv.zip" # AUstralia # Download the ZIP file download(url, destfile, mode = "wb") # Unzip the file unzip(destfile, exdir = ".") # Read the CSV file into R Asutralia_data <- read.csv("ETH.csv") # Display the first few rows of the data head(Asutralia_data) ``` 3. **Import Data From local file** Data can be stored in different file types such as `csv`, `.xlsx`, `.dta` and `.sav`. Remember, you need to know the file formats of our data to install any necessary packages (e.g., `readxl`, haven) using install.packages("package_name"). Let's break down the information about reading different file types into R: CSV files: The most common file format is CSV (Comma-Separated Values). To import CSV files into R, you can use the `read.csv()` function from the base R package. Example using `read.csv()`: ```{r, base_R, echo = TRUE, eval=FALSE} aus <- read.csv("C://Users/User/Documents/R_training/Tutorial_R/R_Tutorial/AUS.csv") ``` Example using `read_csv()`: ```{r,option2, echo = TRUE, eval=FALSE} aus2 <- read_csv("C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/AUS.csv") ``` | Whilst `read.csv()` is the most common way to read data into R, there is the alternative `read_csv()` function (note the underscore). This is part of the tidyverse group of packages and is often a better option for reading in CSV files. This function is quicker, it reads the data in as a tibble instead of a data frame and allows for non-standard variable names among other benefits. | There are various options you can use when importing data, such as whether to include headers, what character to use for decimal points, and what to import as missing values. To explore these options you can look at the help pages e.g. `?read_csv`. | Excel Files `.xlsx`: To read Excel files into R, use the `read_excel()` function from the `readxl` package. | | When reading files in R, you might encounter the use of **`/`** and **`\`**: - **`/`**: In R, **`/`** is used as the directory separator in file paths. For example, **`"C://Users/User/Documents/R_training/Tutorial_R/R_Tutorial/Leprosy_am_14.xlsx"`** uses **`/`** to specify the directory structure. Example: ```{r, read_excel_file, echo = TRUE, eval=FALSE} # read one sheet, 2014_Q1 library(readxl) leprosy_q1 <- read_excel("C://Users/User/Documents/R_training/Tutorial_R/R_Tutorial/Leprosy_am_14.xlsx",sheet = "2014_Q1", col_names = TRUE, na = "NA") ``` Stata Files `.dta`: To read the Stata files, use the `read_dta()` function from the *haven* package (part of the tidyverse). Example: ```{r, stata_file, echo = TRUE, eval=FALSE} library(haven) my_stata_data <- read_dta("C://Users/User/Documents/R_training/Tutorial_R/R_Tutorial/data_exr.dta") ``` | Other Data Formats: Other formats like `SPSS`, `Matlab`, and `binary files` can also be read using specific functions. | For `SPSS` files, use `read_sav()` from the haven package. | For `Matlab` files, use `readMat()` from the `R.matlab` package. | For binary files, explore relevant functions based on your specific needs About the data: The data I used for this demonstration is a sample of routine leprosy data from Amhara Region, Ethiopia Leprosy surveillance system in 2014. The data were collected and collated at the district level (third administrative system) and stored in Excel file with four sheets. Let's explore the data ```{r, import_csv_data, echo = TRUE, eval = TRUE} library(readxl) # Install and load the tidyverse package if required #install.packages("tidyverse", "janitor") #library(tidyverse) # data management and visualization #library(janitor) #clean column # Specify the path to your Excel file excel_file <- "C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/Data/Leprosy_am_14.xlsx" excel_file <- read_excel(excel_file) # Read all sheets into a list #all_sheets <- excel_sheets(excel_file) %>% # map(~ read_excel(excel_file, sheet = .x)) %>% # bind_rows(.id = "sheet_name") %>% #clean_names() # Make the column names readable ``` **Inspect the data** To get an overview of the data, you have several options: View the whole data: You can view the entire dataset by clicking on it in the global environment window. Alternatively, you can use the command View(routine_data) to open a window displaying the data. ```{r, view, echo = TRUE, eval=FALSE} # View(excel_file) ``` View the Top Five Rows: To quickly inspect the first few rows of the dataset, you can use the head() function to see the top rows and the tail() function to see the last rows. This function displays the first n rows of the data. ```{r, view_rows_of_the_dataframe, echo = TRUE, eval=TRUE} head(excel_file, 5) # top five tail(excel_file, 5) # bottom five ``` Understand the structure of the data: To understand the structure of the data, you can use either the `str()` or `glimpse()` function. `str()` provides a concise summary of the structure of the dataset, including the data types and the first few values of each variable. `glimpse()` is part of the tidyverse ecosystem and offers a similar summary but with a focus on tibbles, providing additional information such as variable types and the first few values of each variable, displayed in a more compact format. ```{r, str, echo = TRUE, eval=TRUE} str(excel_file) # Use str() for a concise summary ``` ```{r, row_x_col, echo = TRUE, eval=TRUE} library(dplyr) glimpse(excel_file) # Use glimpse() for a tidyverse-friendly summary ``` Check the Class: You can check the class of the dataset, which could be either tbl or data.frame. ```{r, class, echo = TRUE, eval=TRUE} class(excel_file) ``` Select Specific Rows or Columns: If you want to select specific rows or columns, you can use square brackets. ```{r, view_the_col, echo = TRUE, eval = FALSE} head(excel_file[,2], 5) #View the second column ``` ```{r,first_row, echo = TRUE, eval=TRUE} excel_file[1,] # view the first row ``` Check Column Names: Checking column names is essential for identifying any missing or misspelled variables. Ensure that variable names are clean, short, and readable. ```{r, name, echo = TRUE, eval=TRUE} colnames(excel_file) # using colnames names(excel_file) # using names variable.names(excel_file) # Assuming variable.names() is a custom function to retrieve variable names ``` <div> **Common Commands in R** `$` : Used for accessing elements within an object. `ls()` : Lists the objects in the current environment. `rm(list = ls())` : Removes all objects from the environment. `Ctrl + ENTER` : Typically used in RStudio to execute the current line or selection. `Ctrl + C` : Keyboard shortcut for copying. `Ctrl + L` : Clears your console. `Alt + -` : Keyboard shortcut for assigning values. </div> # Reference 1. [R for Reproducible Scientific Analysis](https://umn-dash.github.io/r-novice-gapminder/aio.html) 2. [Data Manipulation with dplyr in R: A Comprehensive Guide](https://rstudiodatalab.medium.com/data-manipulation-with-dplyr-in-r-a-comprehensive-guide-2793eaaba421) 3. [Data Cleaning with R and the Tidyverse: Detecting Missing Values](https://towardsdatascience.com/data-cleaning-with-r-and-the-tidyverse-detecting-missing-values-ea23c519bc62) 4. [Dataframe Manipulation with dplyr](https://r-crash-course.github.io/13-dplyr/)