Introduction to spatial analysis in R

Introduction

Spatial epidemiology involves both describing and understanding geographical variation in disease with respect to demographic, environmental, behavioral, socioeconomic, genetic, and infectious risk factors using various methods [Elliott et al. (2001)](Elliott and Wartenberg 2004). Common techniques include proximity calculations, estimating summary measures across geographic areas, assessing clustering, applying spatial smoothing and interpolation techniques, and utilizing spatial regression for in-depth analysis.

Spatial Data

Spatial data, also known as geospatial data, can be broadly categorized into two types: vector and raster data.

Vector Data

Vector data describes the “geometry” or “shape” of objects and often includes additional variables. The three basic types of vector data are points, lines, and polygons (areas). Each point, line, and polygon has a spatial reference frame such as latitude and longitude.

A polygon refers to a set of closed polylines that represent a geographic location, composed of vertices (a series of x and y coordinates or spatial points).

Figure 1: Vector data types

Raster Data

Raster data is pixel-based, where each pixel has a value representing information. This format is particularly useful for representing data that varies continuously over space and is naturally suited to a grid format.

Check the packages required for this tutorial:

`sp` and `sf` packages

sp and sf are the two common spatail package in R and work in fundamentally different ways, specifically revolving around how they store spatial data within R.

sp uses something called S4 classes, where the different elements of spatial data (coordinates/geometry, attribute table, coordinate system, etc.) are split up into different ‘slots’ and accessed using the @ symbol. sfuses S3 classes and stores spatial data all within a standard data frame, with some extra header information and a geometry column.

Figure 2: `sp` and `sf` spatial package comparison

How to convert one file format to the other format?

Tip

To sp object <- as(object, Class = "Spatial")

To sf object_sf = st_as_sf(object_sp, "sf")

A word of advice: be flexible in the usage of sf and sp. Sometimes it may be hard to explain why functions work for one data type and do not for the other. But since transformation is quite easy, time is better spend on analyzing your data than on wondering why operations do not work. For the purpose of this material we will focus of teaching you the package sf as it is intended to succeed and replace R packages sp, rgeos and the vector parts of rgdal packages. It also connects nicely to tidyverse learnt in previous modules. For this session we focus on the sf package.

Vector - Spatial data

Area spatial data structure

After the package is loaded, you can check out all of the methods that are available for sf data objects.

# sf package functions
methods(class = "sf")

Import Data

Shapefiles for area data:

We will use the st_read() from sf package, which simply takes the path of the directory with the shapefile as argument. Where as readOGR() from the rgdal packages will take the path and file name readOGR("path","fileName") to a shapefile. In this instance we would load the part of the shapefile that ends with .shp. Setting quiet = T will keep R from being chatty when it imports the data.

Shepfile reads with the sp package has both sf and data.frame classes.

# class - data
class(eth_map)

[1] "sf"         "data.frame"

Shapefile Metadata & Attributes

You’ll notice that when you load the shapefile in, there will be a set of information explaining the features of the shapefile. The first sentence shows that you have loaded a ESRI shapefile, it contains 46 features (which are polygons in this case) and 5 columns of information stored as a data tabel. It mentions also there is a spatial extent (called bounding box) and the coordinate reference system (CRS).

What does this show us? Let’s break it down. In this case it is important to read in and check the metadata for the type of spatial information you have loaded. Metadata, describing the format, CRS, extent, and other components of the vector data, and the attributes which describe properties associated with each individual vector object.

Key metadata for all shapefiles include:

Object Type: the class of the imported object.
Coordinate Reference System (CRS): the projection of the data.
Extent: the spatial extent (geographic area that the shapefile covers) of the shapefile. Note that the spatial extent for a shapefile represents the extent for ALL spatial objects in the shapefile.

You can view shapefile metadata using the class, crs and extent methods:

# metadata
eth_map

The str() function will also give us the meta data of the dataset including the class and dimensions:

# structure - column list with details
str(eth_map)

Classes 'sf' and 'data.frame':  967 obs. of  19 variables:
 $ Shape_Leng: num  1.951 2.034 0.631 1.576 0.227 ...
 $ Shape_Area: num  0.17132 0.08034 0.01579 0.07866 0.00158 ...
 $ district  : chr  "Aba-Korow" "Abaala" "Abaala town" "Ababo" ...
 $ ADM3_PCODE: chr  "ET050699" "ET020203" "ET020209" "ET041904" ...
 $ ADM3_REF  : chr  NA NA NA NA ...
 $ ADM3ALT1EN: chr  NA "Aba Ala" "Aba Ala" NA ...
 $ ADM3ALT2EN: chr  NA NA NA NA ...
 $ ADM2_EN   : chr  "Shabelle" "Kilbati /Zone2" "Kilbati /Zone2" "Horo Gudru Wellega" ...
 $ ADM2_PCODE: chr  "ET0506" "ET0202" "ET0202" "ET0419" ...
 $ region    : chr  "Somali" "Afar" "Afar" "Oromia" ...
 $ ADM1_PCODE: chr  "ET05" "ET02" "ET02" "ET04" ...
 $ ADM0_EN   : chr  "Ethiopia" "Ethiopia" "Ethiopia" "Ethiopia" ...
 $ ADM0_PCODE: chr  "ET" "ET" "ET" "ET" ...
 $ date      : Date, format: "2019-08-19" "2019-08-19" ...
 $ validOn   : Date, format: "2020-10-08" "2020-10-08" ...
 $ validTo   : Date, format: "-001-11-30" "-001-11-30" ...
 $ X_coordina: num  42.7 39.9 39.8 37.5 42.1 ...
 $ Y_coordina: num  6.48 13.37 13.36 9.81 9.31 ...
 $ geometry  :sfc_MULTIPOLYGON of length 967; first list element: List of 1
  ..$ :List of 1
  .. ..$ : num [1:143, 1:2] 42.8 42.8 42.8 42.8 42.8 ...
  ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
  ..- attr(*, "names")= chr [1:18] "Shape_Leng" "Shape_Area" "district" "ADM3_PCODE" ...

The class function, applied on an sf layer, returns both "sf" and "data.frame" class names:

# spatial attributes
class(eth_map)

[1] "sf"         "data.frame"

A layer (geometry+attributes) is represented by an sf object. The second printed value, data.frame, implies that sf is, in fact, an extension of the data.frame class, inheriting many of its methods. Indeed, we will see that many functions in R work exactly the same way on sf and data.frame objects.

The geometric part can be extracted with st_geometry, resulting in an object of class sfc, i.e., a geometry column

# get geometry data
st_geometry(eth_map)

Geometry set for 967 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 32.9918 ymin: 3.40667 xmax: 47.98824 ymax: 14.84548
Geodetic CRS:  WGS 84
First 5 geometries:

MULTIPOLYGON (((42.79826 6.19605, 42.79641 6.19...

MULTIPOLYGON (((39.92615 13.57223, 39.92848 13....

MULTIPOLYGON (((39.76732 13.28527, 39.76242 13....

MULTIPOLYGON (((37.39245 10.04636, 37.39828 10....

MULTIPOLYGON (((42.13371 9.322936, 42.13386 9.3...

Conversely, If we want just the non-geometric part (the “attributes”), it can be extracted with st_drop_geometry which returns a data.frame:

# Extract nom-geometric part
map_nogeom <- st_drop_geometry(eth_map)

# top five rows
head(map_nogeom)

  Shape_Leng  Shape_Area    district ADM3_PCODE ADM3_REF ADM3ALT1EN ADM3ALT2EN
1  1.9505322 0.171322798   Aba-Korow   ET050699     <NA>       <NA>       <NA>
2  2.0341566 0.080342819      Abaala   ET020203     <NA>    Aba Ala       <NA>
3  0.6310232 0.015789541 Abaala town   ET020209     <NA>    Aba Ala       <NA>
4  1.5755748 0.078658877       Ababo   ET041904     <NA>       <NA>       <NA>
5  0.2274391 0.001580407      Abadir   ET130109     <NA>       <NA>       <NA>
6  1.5390176 0.069847562 Abay Chomen   ET041905     <NA>       <NA>       <NA>
             ADM2_EN ADM2_PCODE region ADM1_PCODE  ADM0_EN ADM0_PCODE
1           Shabelle     ET0506 Somali       ET05 Ethiopia         ET
2     Kilbati /Zone2     ET0202   Afar       ET02 Ethiopia         ET
3     Kilbati /Zone2     ET0202   Afar       ET02 Ethiopia         ET
4 Horo Gudru Wellega     ET0419 Oromia       ET04 Ethiopia         ET
5             Harari     ET1301 Harari       ET13 Ethiopia         ET
6 Horo Gudru Wellega     ET0419 Oromia       ET04 Ethiopia         ET
        date    validOn    validTo X_coordina Y_coordina
1 2019-08-19 2020-10-08 -001-11-30    42.6983    6.47533
2 2019-08-19 2020-10-08 -001-11-30    39.8761   13.36890
3 2019-08-19 2020-10-08 -001-11-30    39.7914   13.35980
4 2019-08-19 2020-10-08 -001-11-30    37.5210    9.80858
5 2019-08-19 2020-10-08 -001-11-30    42.1227    9.31044
6 2019-08-19 2020-10-08 -001-11-30    37.3524    9.72482

Spatial points

Another key type of spatial data is data linked to coordinates, frequently found in a table format. We can read these data in using the read_csv() function, then we may want to convert these into spatial information using the st_as_sf() function. Here we have a csv file containing the coordinates for the locations of health facilities in our routine data. Note that we set the projection for the spatial object using the crs command; crs 4326 is the standard WGS 84 CRS.

# Read the `csv` file with geometry information
hf_gps <- read_csv("C:/Users/User/OneDrive/from_unsw_drive/file_YG/TKI/MAP/training/HF_list.csv") %>%
    clean_names() %>%
    drop_na() # drop na

Rows: 1045036 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): ZONE, Woreda, HF name, Facility Type, Ownershio, Kebele_Locality, R...
dbl (2): X, Y

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check the class of the point spatial data. The dataset read as data.frame th output captured the geographic information: longitude and latitude.

# check the data class
class(hf_gps)

[1] "tbl_df"     "tbl"        "data.frame"

The st_as_sf()` function from the sf package will convert into point spatial data format.

# Make the gps data into a point sample feature and define it as shapefile `eth_hf_points`

eth_hf_points <- st_as_sf(hf_gps,
    coords = c("x", "y"),
    crs = st_crs(4326)
)

The map below plotted the each XY geometry health facility points.

# map the geometry
plot(eth_hf_points$geometry)

Coordinate Reference Systems (CRS)

A Coordinate Reference System (CRS) defines how the coordinates in our geometries relate to the surface of the Earth. There are two main types of CRS:

Geographic—longitude and latitude, in degrees
Projected—implying flat surface, usually in units of true distance (e.g., meters)

We can then change the projection using the command st_transform(). We can also combine the two commands to change the projection of one data type to be that of the other. The CRS of a given sf layer can be obtained using function st_crs. The CRS information is returned in the WKT format:

# coordinate system
st_crs(eth_map)

Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

A vector layer reprojection

Reprojection is the transformation of geometry coordinates, from one CRS to another. It is an important part of spatial analysis workflow, since we often need to:

Transform several layers into the same projection
Switch between geographic and projected CRS

A vector layer can be reprojected with st_transform. The st_transform function has two important parameters:

x—The layer to be reprojected
crs—The target CRS

The crs can be specified in one of four ways:

An EPSG code (e.g., 4326)
A PROJ4 string (e.g., "+proj=longlat +datum=WGS84 +no_defs")
A WKT string
A crs object of another layer, as returned by st_crs

For example, the following expression reprojects the nafot layer to the geographic UTM CRS, using its EPSG code (20138). In the metadata, the PROJCRS["Adindan / UTM zone 38N" showed us the projection is UTM zone 38N

# Change the projection to UTM zone 38N
eth_utm <- st_transform(eth_map, 20138)
st_crs(eth_utm)

Coordinate Reference System:
  User input: EPSG:20138 
  wkt:
PROJCRS["Adindan / UTM zone 38N",
    BASEGEOGCRS["Adindan",
        DATUM["Adindan",
            ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4201]],
    CONVERSION["UTM zone 38N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",45,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Ethiopia - east of 42°E."],
        BBOX[4.11,42,12.85,47.99]],
    ID["EPSG",20138]]

To change the projection of the eth_utm shapefile to WGS CRS:

# to WGS projection
eth_map2 <- st_transform(eth_utm, st_crs(eth_map), crs = 4326)

Geometric calculations

We will see geometric operations on vector layers:

Numeric values —Functions that summarize geometrical properties of:
- A single layer—e.g., area, length
- A pair of layers—e.g., distance
Spatial layers —Functions that create a new layer based on:
- A single layer—e.g., centroid, buffer
- A pair of layers—e.g., intersection area

Numeric

There are several functions to calculate numeric geometric properties of vector layers in package sf:

st_length—Length
st_area—Area
st_distance—Distance
st_bbox—Bounding box
st_dimension—Dimensions (0 = points, 1 = lines, 2 = polygons)

For example, we can calculate the area of each feature in the Ethiopia layer using st_area. The result is assigned to a new attribute named area in eth_map:

# calculate the area of each district
eth_map$area <- st_area(eth_map)

Summary of the area in \(m^2\)

# summary in m^2
summary(eth_map$area)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
5.809e+05 2.672e+08 6.652e+08 1.170e+09 1.337e+09 1.254e+10

Plot the area in \(m^2\).

# plot `m^2`
plot(eth_map["area"], main = "Area of the Districts (m^2)")

You may notice the area values are printed with their units [m^2]. The resulting area values, just like any other numeric calculations in sf, are returned as object of class units:

class(eth_map$area)

[1] "units"

A units object is basically a numeric vector associated with units of measurement. We can transform units measurements to different units, with function set_units from package units. First, we load the units library:

library(units)

Warning: package 'units' was built under R version 4.3.3

udunits database from C:/Users/User/AppData/Local/R/win-library/4.3/units/share/udunits/udunits2.xml

## udunits system database from /usr/share/xml/udunits

Then, we make the transformation to \(km^2\) units :

# km^2
eth_map$area <- set_units(eth_map$area, "km^2")

Lets check the summary of the transformed area in \(Km^2\). In the below summary, we’ve added across(geometry, st_union) to ensure proper summarization of the spatial data. The st_union function will be applied to the geometry column.

# summary with chain
eth_map %>%
    summarise(
        mean_area = mean(area),
        median_area = median(area),
        sd_area = sd(area),
        range_area = range(area),
        across(geometry, st_union)
    )

Plot the calculated area for Ethiopia using the plot() function

# plot `km^2`
plot(eth_map["area"], , main = "Area of the Districts (km^2)")

Spatial Geometry Function

When working with spatial data in R, the sf package is a powerful tool that allows you to manipulate and analyze geometric objects. In the below functions, we’ll explore some common geometry-generating functions provided by sf. These functions are essential for anyone getting started with spatial data analysis.

Centroids (st_centroid)

The st_centroid function calculates the centroid (geometric center) of a polygon. Let’s use an example where we want to find the centroids of polygons representing the “Amhara” region in Ethiopia.

# Filter Amhara
amhara_shp <- eth_map |>
    filter(region == "Amhara")

Now, let’s calculate the centroids: to calculate the centroids of the “Amhara” polygons we can use the st_centroid function as follows:

# Calculate centroids
amhara_ctr <- st_centroid(amhara_shp)

Warning: st_centroid assumes attributes are constant over geometries

# Filled polygons with points
plot(st_geometry(amhara_ctr),
    col = "red",
    pch = 3
)

Buffering (st_buffer)

The st_buffer function creates a buffer around a spatial object. Buffers are useful for visualizing areas of influence or proximity. For instance, you can create a buffer around health facilities to represent their service areas.

# Filter hospitals
hospital <- eth_hf_points %>%
    filter(facility_type == "Hospital")

# Create a buffer around health facilities (e.g., 2.5 km radius)
buffer_radius_km <- 2.5
health_facility_buffer <- st_buffer(hospital, dist = buffer_radius_km * 1000) # Convert km to meters

# Plot the health facility buffers first (as polygons need a base plot)
plot(st_geometry(health_facility_buffer),
    col = "red", lwd = 2, main = "A Buffer arround Health Facilities"
)

# Convert hospital geometries to points if they aren't already
hospital_points <- st_cast(hospital, "POINT")

# Plot the original health facilities on top of the buffers
plot(st_geometry(hospital_points), pch = 16, col = "blue", add = TRUE)

Random sample points (st_sample)

To generate random sample points within a polygon, use the st_sample function. This is handy for spatial sampling in epidemiology or environmental studies. The sampled points are overlaid on the original geometries.

# Randomly sample 10 points within hospital polygons
sample_points <- st_sample(hospital, size = 10)

# Plot the original hospital locations
plot(hospital$geometry, pch = 16, col = "blue", main = "Sample points of Health Facility Locations")

# Overlay the sampled points
plot(sample_points, add = TRUE, pch = 2, col = "red")

Joining non-spatial data to shapefiles

In spatial analysis, it’s common to join non-spatial data (e.g., tabular data) with shapefiles. This process allows you to create maps and perform spatial analyses by combining attribute data with geographic information. In this tutorial, we’ll walk through how to join non-spatial data to shapefile. We’ll also cover some important considerations to ensure your data is properly aligned.

# Load and filter the data for the Amhara region
u5_data <- read_csv("C:/Users/User/OneDrive/from_unsw_drive/training/R_training/Tutorial_R/data_excercise.csv") %>%
    filter(region == "Amhara")

Rows: 967 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): region, district, age_group
dbl (3): conf, pop, Incid_per_1000

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note

Note: Always make sure the data is loaded and filtered correctly before proceeding. Here, we’re focusing on a specific region, but you can adjust the filter condition as needed.

Before joining data, it’s crucial to ensure that the names in your non-spatial data match those in your shapefile. If they don’t match exactly, the join won’t work correctly.

Check names in the Shapefile:

#  Check if the district name is correctly spelled in the shapefile
eth_map %>%
    filter(district == "Hawi Gudina\n")

Simple feature collection with 1 feature and 19 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 40.35742 ymin: 7.863617 xmax: 41.15041 ymax: 8.40077
Geodetic CRS:  WGS 84
  Shape_Leng Shape_Area      district ADM3_PCODE ADM3_REF ADM3ALT1EN ADM3ALT2EN
1   2.771471  0.2528438 Hawi Gudina\n   ET040915     <NA>       <NA>       <NA>
       ADM2_EN ADM2_PCODE region ADM1_PCODE  ADM0_EN ADM0_PCODE       date
1 West Hararge     ET0409 Oromia       ET04 Ethiopia         ET 2019-08-19
     validOn    validTo X_coordina Y_coordina                       geometry
1 2020-10-08 -001-11-30    40.7048    8.12856 MULTIPOLYGON (((40.74895 8....
             area
1 3094.835 [km^2]

# Recode the district name to remove any newline characters
eth_map$district <- recode(eth_map$district,
    "Hawi Gudina\n" = "Hawi Gudina"
)

# Verify the correction
eth_map %>%
    filter(district == "Hawi Gudina")

Simple feature collection with 1 feature and 19 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 40.35742 ymin: 7.863617 xmax: 41.15041 ymax: 8.40077
Geodetic CRS:  WGS 84
  Shape_Leng Shape_Area    district ADM3_PCODE ADM3_REF ADM3ALT1EN ADM3ALT2EN
1   2.771471  0.2528438 Hawi Gudina   ET040915     <NA>       <NA>       <NA>
       ADM2_EN ADM2_PCODE region ADM1_PCODE  ADM0_EN ADM0_PCODE       date
1 West Hararge     ET0409 Oromia       ET04 Ethiopia         ET 2019-08-19
     validOn    validTo X_coordina Y_coordina                       geometry
1 2020-10-08 -001-11-30    40.7048    8.12856 MULTIPOLYGON (((40.74895 8....
             area
1 3094.835 [km^2]

# Note: Always check that the names are correctly updated in your shapefile.

Check names in your data. Please note that ensuring consistent naming between datasets is essential for a successful join

# Verify the district name in the non-spatial data

u5_data %>%
    filter(district == "Hawi Gudina")

# A tibble: 0 × 6
# ℹ 6 variables: region <chr>, district <chr>, age_group <chr>, conf <dbl>,
#   pop <dbl>, Incid_per_1000 <dbl>

Join Non-Spatial Data to the Shapefile

Once you’ve confirmed that the names match, you can proceed to join your non-spatial data with the shapefile.

# Perform the join operation
amhara_data_sf <- amhara_shp %>%
    left_join(u5_data, by = "district")

# Note: The `left_join` function keeps all rows from the shapefile and joins matching rows from the data.
# If a match isn't found, the resulting shapefile will still include the original geographic information.

After the join, ensure your data is still in the sf (simple features) format, which is essential for spatial operations.

# Convert the joined data to an sf object
amhara_data_sf <- st_as_sf(amhara_data_sf)

# Note: Converting to `sf` ensures that the data retains its spatial properties.

Writing shapefiles out

After you’ve completed your analysis or modifications, you may want to save your shapefile for future use or sharing. Here’s how you can write it back to a shapefile format:

# Write the shapefile to disk
st_write(
    amhara_data_sf,
    "C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/scripts/tutorials/2024-08-25-Spatial-analysis-in-R/data/amhara_data_with_shapefile.shp"
)

# Note: Replace "path_to_save" with the directory where you want to save the file.
# The .shp extension indicates it's a shapefile.

Introduction to Raster data

In this section, we’ll explore how to work with raster data using the terra package, which has replaced the older raster package. We’ll cover basic operations like loading and visualizing raster data, manipulating rasters, and performing spatial analysis.

Loading raster data

The terra package in R introduces a new class of raster data called SpatRaster. Let’s start by loading the terra package and importing Ethiopian 2016 population raster data.

The rast() function is used to load the raster data, and simply calling the object name (in this case, population) will display the raster properties like extent, resolution, and CRS (Coordinate Reference System).

# Load the population raster data
population <- rast("C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/scripts/tutorials/2024-08-25-Spatial-analysis-in-R/data/eth_ppp_2016.tif")

# View the raster properties
population

class       : SpatRaster 
dimensions  : 13893, 17983, 1  (nrow, ncol, nlyr)
resolution  : 0.0008333333, 0.0008333333  (x, y)
extent      : 32.99958, 47.98542, 3.322084, 14.89958  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326) 
source      : eth_ppp_2016.tif 
name        : eth_ppp_2016 
min value   : 2.938226e-03 
max value   : 4.783238e+02

Visualizing the Raster

Visualizing raster data helps you quickly understand its spatial distribution. We can start by plotting the raster using the plot() function.

This simple plot provides a visual overview of the raster data. You can use this for quick inspections before more detailed analysis.

# Plot the raster data
plot(population, main = "Raster level populations map")

Plotting, reprojecting and manipulating Rasters in R

Plotting with ggplot2

Before plotting, lets explore the distributions of the raster data with the log scaling.

# plotting the log of the population due to the scale
hist(log(population))


|---------|---------|---------|---------|
=========================================

Warning: [hist] a sample of 0% of the cells was used (of which 47% was NA)

Before plotting, first we need to convert the raster to a data frame. The geom_raster() function is used to plot raster data, and scale_fill_viridis_c() is used to apply a color scale that is perceptually uniform. To make it manageable, first, we need to reduce the resolutions of our data

# resolutions of our data
population_reduced <- aggregate(population, fact = 10, fun = mean)


|---------|---------|---------|---------|
=========================================

# Convert the raster to a data frame
pop_df <- as.data.frame(population_reduced, xy = TRUE)

# Plot the raster using ggplot2
ggplot() +
    geom_tile(data = pop_df, aes(x = x, y = y, fill = eth_ppp_2016)) +
    scale_fill_viridis_c() +
    labs(fill = "Population") +
    theme_bw()

Re-projecting rasters

Sometimes we need to change the projection of our raster data to align it with other spatial data. The crs() function checks the current projection of the raster. By assigning a new CRS, you can reproject the raster to a different coordinate system, which is crucial for spatial analysis when working with multiple datasets.

# Check the current CRS of the raster
crs(population)

# Reproject the raster to UTM zone 48
# crs(population) <- "+proj=utm +zone=48 +datum=WGS84"

Manipulating rasters

We can modify rasters by cropping them to a specific extent or masking out areas outside a certain region. ext() retrieves the spatial extent of the raster. crop() allows you to focus on a specific area by cutting out the rest of the data. mask() sets the raster cells outside a given region (like ‘Amhara’) to NA, effectively hiding them.

# Get the extent of the raster
ext(population)

SpatExtent : 32.999583216, 47.985416489, 3.322083521, 14.899583475 (xmin, xmax, ymin, ymax)

# Assume we have a spatial object 'Amhara' representing a region in Ethiopia
Amhara <- filter(eth_map, region == "Amhara")

# Crop the raster to the Amhara region
am_pop <- crop(population, Amhara)
ext(am_pop)

SpatExtent : 35.2612498735611, 40.2120831869623, 8.71458349957439, 13.7629168128496 (xmin, xmax, ymin, ymax)

# Mask the raster to keep only the Amhara region
am_pop2 <- terra::mask(am_pop, Amhara)
# plot
plot(am_pop2)

Extracting raster data

To extract the population values for each admin 3 district of Ethiopia. For this we provide exact_extract() function in exactextractr package with the raster and the shapefile. The output is then the population value for each cell of the raster, linked to each polygon ID of the shapefile. We can the perform summary statistic on this as we would a standard data frame.

terra::extract(am_pop2, amhara_ctr)

am_population <- exactextractr::exact_extract(am_pop, amhara_shp, "sum", append_cols = "district")

# write.csv(file = "C:/Users/user/Documents/Data_management_in_R/pop.csv", row.names = FALSE) # to save the population

head(am_population) # to six districts with the 2016 population size

Important

If the coordinate reference systems (CRS) of the population raster (am_pop2 or am_pop) and the shapefile (amhara_ctr or amhara_shp) do not match, the extraction might fail to correctly overlay the raster and the shapefile, resulting in incorrect or zero values.

If you need to check the projection:

terra::crs(am_pop2) # population raster data

terra::crs(amhara_ctr) # shapefile

The above script result showed us the CRS of the raster data (am_pop2) and shapefile (amhara_ctr) are similar. To ensure proper alignment between the raster and the shapefile for accurate extraction, we need to reproject one of them so that both layers share the same CRS.

Spatial Epidemiology

In epidemiology, it is concerned with the description, examination and spatial interaction of disease and focuses specifically on the spatial components to understand and explain why health and disease vary by place.

This blog provides how to approach spatial problems in epidemiology, with a specific focus on public health applications.

The framework illustrated in below outlines a logical, sequential process for conducting spatial analyses. However, it should be recognized that this is not a strictly linear process, as presenting the results from exploration and modeling often necessitates a return to visualization.

Figure: Conceptual framework of spatial epidemiological data analysis(Pfeiffer et al. 2008) sourced from (GIS=Geographic information system, DBMS=database management system, R = R programming language)

Making chloropleth maps in R

Plotting spatial data using base R

Visualizing the geographical distribution and patterns of a disease is a critical step in epidemiological research. In R, you can create maps using only a few essential packages like rgdal and sf. While simply calling the plot function on a shapefile object will produce a map, it tends to display all attributes in a somewhat unappealing and cluttered manner.

To create a more focused and meaningful map, you can plot a single attribute of interest by subsetting the sf object. This can be done using the standard R subsetting syntax: sf_object[["name_of_attribute"]]. For example, you can quickly visualize the distribution of a particular disease or condition across administrative regions in Ethiopia by selecting the relevant attribute and plotting it by administrative units.

Customizing the Legend

By default, R will automatically place the legend where it “thinks” it fits best on the plot. However, you can customize the legend’s position and size to enhance the map’s readability.

Positioning the Legend: Use the key.pos argument to specify the location of the legend:
- key.pos = 1 places it below the map.
- key.pos = 2 places it to the left.
- key.pos = 3 places it above the map.
- key.pos = 4 places it to the right.

# to lower legend
plot(amhara_data_sf["conf"],
    main = "Confirmed Malaria cases in Amhara Region",
    key.pos = 1
)

Mapping using ggplot2:

Adding features to maps

Recall from earlier the typical ggplot call: ggplot(data, aesthetic). When making chloropleth maps, our data will typically be an sf object, with a column corresponding to the variable you wish to map, and a geometry column describing the geography of the data.

Given an sf object, you can use geom_sf to produce a simple plot of your shapefile without even explicitly defining any aesthetics.

# base
ggplot(amhara_shp) +
    geom_sf() + # as we've called ggplot using an sf object as the data argument, geom_sf can figure it out itself
    coord_sf() # deals with map projection

viridis package - pretty color map with color scale using the scale_fill_viridis() function

# pretty color map with color scale
library(viridis)

Warning: package 'viridis' was built under R version 4.3.3

Loading required package: viridisLite

Warning: package 'viridisLite' was built under R version 4.3.3

ggplot(
    data = amhara_data_sf,
    aes(fill = conf)
) +
    geom_sf() +
    scale_fill_viridis() +
    theme_void() +
    labs(title = "Confirmed malaria in Amhara Region, Ethiopia")

leaflet package - an open-source JavaScript library for interactive maps to make easy to create Leaflet maps from R. leaflet will also allow you to zoom in and out from the world map view.

library(leaflet)

Warning: package 'leaflet' was built under R version 4.3.3

# create an interactive map
pal <- colorNumeric("YlOrRd", domain = amhara_data_sf$conf)

leaflet(amhara_data_sf) %>%
    addTiles() %>%
    addPolygons(
        color = "white", fillColor = ~ pal(conf),
        fillOpacity = 1
    ) %>%
    addLegend("bottomright", pal = pal, values = ~conf, opacity = 1, title = "Confirmed malaria")

Mapview pacakge - allows to very quickly create interactive visualizations to investigate both the spatial geometries and the variables in the data

library(mapview)

mapview(amhara_data_sf,
    zcol = "conf",
    legend = "TRUE",
    layer.name = "Confirmed malaria"
) # edit legend name

We can use the sync() function to produce a lattice-like view of multiple synchronized maps created with mapview or leaflet.

Create maps of confirmed cases and incidence with synchronized zoom and pan by first creating maps of the variables conf and Incid_per_1000with mapview(), and then passing those maps as arguments of the sync() function

library(leafsync)

Warning: package 'leafsync' was built under R version 4.3.3

# plot the confirmed case map
mcf <- mapview(amhara_data_sf, zcol = "conf", legend = "TRUE", layer.name = "Confirmed malaria")

# plot the incidence map
minc <- mapview(amhara_data_sf, zcol = "Incid_per_1000", legend = "TRUE", layer.name = "Incidence of malaria/1000 population")

# sync both map using the `leafsync` package
m <- sync(mcf, minc)

m

Lets present the confirmed cases in discrete breaks with a specified breaks to distinguish the case density with a clear cut and categories. Following Timo Grossenbacher’s and Milos Popovic code.

The first step is to find a natural breaks with specified interval (6 or 8 intervals).

# six natural interval with quantile breaks
interval <- classIntervals(amhara_data_sf$conf,
    n = 6,
    style = "quantile"
)$brks
am_df <- amhara_data_sf # let's copy the dataset for simplicity

Then we will create categories based on the above intervals:

labels <- c()
for (i in 1:length(interval)) {
    labels <- c(labels, paste0(
        round(interval[i], 0),
        "–",
        round(interval[i + 1], 0)
    ))
}
labels <- labels[1:length(labels) - 1]

Let’s create a new variable called category in the dataframe by dividing the confirmed malaria cases conf into intervals defined by breaks. Then we’ll assigns labels from labels to these intervals and include the lowest value in the first interval.

am_df$cat <- cut(am_df$conf,
    breaks = interval,
    labels = labels,
    include.lowest = T
)
levels(am_df$cat) # let's check how many levels it has (6)

[1] "0–7"       "7–33"      "33–107"    "107–391"   "391–976"   "976–13542"

As you can see the interval clearly showed the data distribution from (0 to 13542) confirmed cases reported across districts in the region. In this category, we are assuming 0 cases are not really zero rather unreported cases. If you have a dataset with missing data, you could create additional category No data. You could see the the Popovic example.

Now we are ready to plot. The below code has two steps. The first layer (geom_sf(data = eth_map, aes(group = district), color = "white", size = 0.2)) is drawing the base map using the eth_map dataset. It outlines the district boundaries in white with a thin line (size = 0.2), which gives the country map its structure. The second layer (geom_sf(data = am_df, aes(group = district, fill = cat))) adds another set of polygons using the am_df dataset. These polygons are filled with colors based on the cat(categorical grouped confirmed malaria cases) variable, which likely represents different categories or groups within the districts. The group = district ensures that the polygons are grouped by district.

p <-
    ggplot() +
    # geom_sf(data = eth_map, aes(group = district),
    #         color = "white", size = 0.2)+
    geom_sf(data = am_df, aes(group = district, fill = cat))

Let’s include title and/or subtitle, the source of our data.

p2 <- p + labs(
    x = "",
    title = "Confirmed malaria cases in Amhara Region,\nEthiopia in 2019",
    subtitle = "Districts level",
    caption = "©2024 Yalemzewod Gelaw https://yalemzewodgelaw.com/\nSource: Ethiopian Public Health Institute (EPHI)"
)

In the below code we will manually assign the color palette

p3 <- p2 + scale_fill_manual(
    name = "Confirmed malaria cases",
    values = c("#ffd080", "#f4a77a", "#e18079", "#c35e7d", "#9b4283", "#6c2d83"),
    labels = c("0–7", "7–33", "33–107", "107–391", "391–976", ">976"),
    drop = F
) +
    guides(fill = guide_legend(
        direction = "horizontal",
        keyheight = unit(1.15, units = "mm"),
        keywidth = unit(15, units = "mm"),
        title.position = "top",
        title.hjust = 0.5,
        label.hjust = .5,
        nrow = 1,
        byrow = T,
        reverse = F,
        label.position = "bottom"
    ))

In the below code, we’re customizing our plot with the following changes: The plot uses a clean minimal theme and removes background elements and grid lines for a clear look. The title is centered and colored, while the subtitle is bold and also centered. The caption is styled to fit neatly at the bottom. We’ve adjusted text sizes and colors for the legend and axis labels. Margins are set to provide ample space around the plot, and axis titles and ticks are hidden to reduce clutter. The plot dimensions are set to 10x8 inches and a high resolution of 300 DPI for clear, detailed visuals:

p4 <- p3 +
    theme_minimal() +
    theme(
        panel.background = element_blank(),
        legend.background = element_blank(),
        legend.position = c(.45, .04),
        panel.border = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        plot.title = element_text(size = 16, color = "#6c2d83", hjust = 0.5, vjust = 2),
        plot.subtitle = element_text(size = 12, color = "#bd5288", hjust = 0.5, vjust = 0.5, face = "bold"),
        plot.caption = element_text(size = 9, color = "grey60", hjust = 0.5, vjust = 1.5),
        axis.title.x = element_text(size = 7, color = "grey60"),
        legend.text = element_text(size = 9, color = "grey20"),
        legend.title = element_text(size = 10, color = "grey20"),
        strip.text = element_text(size = 12),
        plot.margin = margin(t = 10, r = 10, b = 30, l = 10),
        axis.title.y = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()
    )

Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.

p4

How to save the plot?

# Save the plot with higher resolution and larger dimensions
# ggsave("malaria.png", plot = p4, width = 10, height = 8, dpi = 300)

Plot the confirmed malaria case by splitting the confirmed malaria cases by administrative level 2 named zone

library(tmap)

Warning: package 'tmap' was built under R version 4.3.3

Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
remotes::install_github('r-tmap/tmap')

tm_shape(am_df) +
    tm_polygons(col = "conf") +
    tm_facets(by = "ADM2_EN", nrow = 2) +
    tm_layout(
        legend.position = c(.05, .04), # Position the legend
        legend.title.size = 1.1, # Adjust the title size
        legend.text.size = 0.8, # Adjust the label text size
        legend.frame = TRUE, # Add a frame around the legend
        legend.bg.color = "white", # Background color of the legend
        legend.bg.alpha = 0.7, # Transparency of the legend background
        legend.outside = TRUE, # Position inside the plot area
        legend.stack = "horizontal" # Horizontal layout for legend items
    )

Adding points to maps

Suppose we want to add the locations of health facilities in Ethiopia to our map of conf. eth_hf_pointsis an sf` object, but this time consisting of points. That doesn’t matter to geom_sf.

# Assign some of these points to be hospitals
eth_hospitals <- sample(eth_hf_points$facility_type, 0.1 * length(eth_hf_points$facility_type))

# filter hospital and health center
eth_hf_points <- eth_hf_points %>%
    mutate(hf_type = if_else(facility_type %in% "Hospital", "Hospital", "Health Center/Clinics"))

scico - package uses more or less the same api as viridis and provides scales for ggplot2 without requiring ggplot2 to be installed. Scico provides 39 different palettes, all of which are perceptually uniform and colourblind safe.

# load package

library(scico)

# Plot
ggplot() +
    geom_sf(
        data = eth_map,
        colour = "grey50", size = 0.2
    ) +
    scico::scale_fill_scico(palette = "davos", trans = "log") +
    geom_sf(data = eth_hf_points, aes(color = hf_type, shape = hf_type), size = 0.5) +
    labs(
        title = "Hospitals and Health Centers in Ethiopia",
        caption = "©2024 Yalemzewod Gelaw https://yalemzewodgelaw.com/\nSource: Ethiopian Public Health Institute (EPHI)",
        subtitle = "SPA, 2021",
        shape = "Health Facility"
    ) +
    coord_sf() +
    guides(color = "none", shape = guide_legend(title = "Health Facility")) + # Hide color legend and show shape legend
    theme_minimal() +
    theme(
        panel.background = element_blank(),
        legend.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        plot.title = element_text(size = 16, color = "#6c2d83", hjust = 0.5, vjust = 2),
        plot.subtitle = element_text(size = 12, color = "#bd5288", hjust = 0.5, vjust = 0.5, face = "bold"),
        plot.caption = element_text(size = 9, color = "grey60", hjust = 0.5, vjust = 1.5),
        axis.title.x = element_text(size = 7, color = "grey60"),
        legend.text = element_text(size = 9, color = "grey20"),
        legend.title = element_text(size = 10, color = "grey20"),
        strip.text = element_text(size = 12),
        plot.margin = margin(t = 10, r = 10, b = 30, l = 10),
        axis.title.y = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()
    )

Exploration: Autocorrelation

In this section discusses what it is spatial exploration (i.e., identify disease clustering) for area data, and how statistics describing it can be computed . For this demonstration we will use tuberculosis notification data in Amhara region from 2014 to 2017(Gelaw et al. 2019).

Measures of spatial autocorrelation describe the degree two which observations (values) at spatial locations (whether they are points, areas, or raster cells), are similar to each other. So we need two things: observations and locations.

# read using 'sf' package
map <- st_read("C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/scripts/tutorials/2024-08-25-Spatial-analysis-in-R/data/shapefile",
    layer = "Amhara_2013s"
)

Reading layer `Amhara_2013s' from data source 
  `C:\Users\User\Documents\R_training\Tutorial_R\R_Tutorial\scripts\tutorials\2024-08-25-Spatial-analysis-in-R\data\shapefile' 
  using driver `ESRI Shapefile'
Simple feature collection with 128 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 92178.34 ymin: 963271.7 xmax: 632414.3 ymax: 1523470
Projected CRS: Adindan / UTM zone 37N

# class
class(map)

[1] "sf"         "data.frame"

Transform from sf to sp spatial data :

# transform to sp
map <- as(map, "Spatial")
# calss
class(map)

[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"

Read tuberculosis data

# read tuberculosis notification data
tb <- read.csv("C:/Users/User/Documents/R_training/Tutorial_R/R_Tutorial/scripts/tutorials/2024-08-25-Spatial-analysis-in-R/data/nall.csv", sep = ",", header = T)

# merge
map_tb <- merge(map, tb, by = "WOREDANAME")

map_tb

class       : SpatialPolygonsDataFrame 
features    : 128 
extent      : 92178.34, 632414.3, 963271.7, 1523470  (xmin, xmax, ymin, ymax)
crs         : +proj=utm +zone=37 +a=6378249.145 +rf=293.465 +towgs84=-166,-15,204,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : WOREDANAME,   ZONENAME, WOR_P_CODE, xcoordinat, ycoordinat,    Area, Adj_ID,    propEPTB,  propPRB.ve,  propPTB.ve, propallcase 
min values  :   Abergele,   Awi/Agew,     030101,    35.8585,     8.9154, 239.959,      1, 4.394484044, 3.723932473, 8.167830059, 32.51918192 
max values  :    Ziquala, West Gojam,     031201,    40.0576,    13.4561, 8494.44,    128, 236.8518347, 113.2267308, 159.4417229,  523.384786

The Moran’s I statistic is not robust to outliers or strongly skewed datasets. It’s therefore good practice to check the distribution of the attribute values. You can plot the attribute values as follows:

Histogram : distribution

# histogram
hist(tb$propallcase, main = NULL)

Box-plot - outlier

# box
boxplot(tb$propallcase, horizontal = T)

The dataset seems to be not well behaved - skewed to the right and there is an outliers. To symbolize the polygons using the proportions of all tuberculosis casesattribute we will first define the classification breaks (style = quantile with n = 8 breaks) and the symbol colors (palette="Greens") using the tmap package.

library(tmap)
# plot
tm_shape(map_tb) +
    tm_polygons(
        col = "propallcase",
        style = "quantile",
        n = 8, palette = "Greens"
    ) +
    tm_legend(outside = TRUE)

Define neighboring polygons

The first step in a Moran’s I analysis requires that we define “neighboring” polygons. This could refer to contiguous polygons, polygons within a certain distance, or it could be non-spatial in nature and defined by social, political or cultural “neighbors”.

Here, we’ll adopt a contiguous neighbor definition. We’ll accept any contiguous polygons that share at least one vertex; this is the “queen” case (if one chooses to adopt the chess analogy) and it’s parameterized as queen = TRUE in the call to poly2nb. If we required that just edges be shared between polygons then we would set queen = FALSE (the rook analogy). A spatial weight matrix is used to specify the spatial relationships of the districts. We defined neighbours usingfirst-order Queen’s contiguity, where by districts sharing borders or a common vertex with each otherwere considered neighbours

For each polygon in our shape object, nb lists all neighboring polygons. A spatial weight matrix is used to specify the spatial relationships of the districts.

library(spdep) # calculate neighboring
## Spatial matrix

nb <- poly2nb(map, row.names = map$ADM3_EN)
# str(nb)# for more details of the neighborhood
cords <- coordinates(map_tb)

To see the neighbors (by ID number) for the first polygon in the shape object, type:

nb

Neighbour list object:
Number of regions: 128 
Number of nonzero links: 660 
Percentage nonzero weights: 4.02832 
Average number of links: 5.15625

or by plotting the cords:

# grey
plot(map_tb, border = "lightgrey")

# red
plot(nb, cords, lwd = 1, col = "red")

Calculate the Rook’s case neighbours - queen’s contiguity, mean districts sharing borders or a common vertex with each other is considered neighbour.

# queen's neighbour
nb2 <- poly2nb(map_tb, queen = TRUE)
nb2

Neighbour list object:
Number of regions: 128 
Number of nonzero links: 660 
Percentage nonzero weights: 4.02832 
Average number of links: 5.15625

Compares different types of neighbours

# Set up a 1x2 layout
layout(matrix(1:2, nrow = 1))

# First plot: Distance
plot(map_tb, border = "lightgrey") # Background map
plot(nb, cords, col = "blue", main = "Distance", add = TRUE)

# Second plot: Queen
plot(map_tb, border = "lightgrey") # Background map
plot(nb2, cords, col = "red", main = "Queen", add = TRUE)

# Reset the layout to default (optional)
layout(1)

Step 2: Assign weights to the neighbors

Assign equal weights to each neighboring polygon, each neighboring polygon will be assigned equal weight when computing the neighboring mean confirmed cases values. The equal neighboring matrix will be computed using nb2listw()function:

# Convert the neighbour data to a listw object
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

In this example, each neighboring polygon will be assigned equal weight when computing the neighboring mean proportions of tuberculosis values. To see the weight of the first polygon’s neighbors type:

# first polygon
lw$weights[1]

[[1]]
[1] 0.2 0.2 0.2 0.2 0.2

Step 3: Computing Global Moran’s I statistic

Investigation of global clustering patterns across regions is very important in spatial data analysis. Moran's I is a widely used spatial statistic for detecting global spatial patterns - unusually large cluster. Here, we intend to explore the presence of geographical clustering of TB in the region. The Moran’s I statistic can be computed using the moran function.

# Global MORAN
I <- moran(map_tb$propallcase, lw, length(nb2), Szero(lw))[1]
I

$I
[1] 0.3137684

Note

The output will contain: the value of the standard deviate of Moran’s I,the p-value of the test, the value of the observed Moran’s I. Moran’s I value is the slope of the line that best fits the relationship between neighboring proportions of tuberculosis values and each polygon’s proportion in the dataset.

Step 4: Performing a hypothesis test

The hypothesis we are testing states that “the proportions of tuberculosis are randomly distributed across districts following a completely random process”. There are two methods to testing this hypothesis: an analysis method and a Monte Carlo method. We’ll explore both approaches in the following examples.

4.1. Analysis method

To run the Moran’s I analysis using the analytic method, use the moran.test function.

# moran test
moran.test(map_tb$propallcase, lw,
    alternative = "greater"
)


    Moran I test under randomisation

data:  map_tb$propallcase  
weights: lw    

Moran I statistic standard deviate = 6.4757, p-value = 4.719e-11
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.313768383      -0.007874016       0.002467036

Spatial explanatory data analysis indicated significant spatial autocorrelation in district-level TB notification rates; the global Moran’s I = 0.314 test found significant positive autocorrelation (p<0.001) for the year 2014 - 2017 over the region.

4.2. Monte Carlo method

The analytical approach to the Moran’s I analysis benefits from being fast. But it may be sensitive to irregularly distributed polygons. A safer approach to hypothesis testing is to run an MC simulation using the moran.mc() function. The moran.mc function takes an extra argument n, the number of simulations.

# monte carlo
MC <- moran.mc(map_tb$propallcase, lw, nsim = 999, alternative = "greater")

# View results (including p-value)
MC


    Monte-Carlo simulation of Moran I

data:  map_tb$propallcase 
weights: lw  
number of simulations + 1: 1000 

statistic = 0.31377, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater

The MC simulation generates a very small p-value, 0.001 suggesting that proportions are strongly clustered. We can see the results graphically by passing the Moran’s I model to the plot function:

# Plot the Null distribution (note that this is a density plot instead of a histogram)
plot(MC)

The curve shows the distribution of Moran I values we could expect had the proportions of tuberculosis been randomly distributed across the counties. Note that our observed statistic, 0.314, falls way to the right of the distribution suggesting that the tuberculosis proportions are clustered (a positive Moran’s I value suggests clustering whereas a negative Moran’s I value suggests dispersion).

Now, had the Moran’s I statistic been negative (suggesting a dispersed pattern), you would probably want to set the alternative argument to less thus giving you the fraction of simulated I values more dispersed than your observed I value.

A visual exercise that you can perform to assess how “typical” or “atypical” your pattern may be relative to a randomly distributed pattern is to plot your observed pattern alongside a few simulated patterns generated under the null hypothesis.

set.seed(131) # Set the random seed to 131 for reproducibility of the random sampling.

# Create a new variable 'rand1' in 'map_tb' by shuffling the values of 'propallcase'.
# 'replace = FALSE' ensures that each value is used exactly once, meaning no duplicates.
map_tb$rand1 <- sample(map_tb$propallcase, length(map_tb$propallcase), replace = FALSE)

# Create another variable 'rand2' in 'map_tb' by shuffling 'propallcase' again.
# This generates a different random permutation from the previous 'rand1'.
map_tb$rand2 <- sample(map_tb$propallcase, length(map_tb$propallcase), replace = FALSE)

# Create a third variable 'rand3' in 'map_tb' by shuffling 'propallcase' once more.
# This produces yet another random permutation, different from both 'rand1' and 'rand2'.
map_tb$rand3 <- sample(map_tb$propallcase, length(map_tb$propallcase), replace = FALSE)


# Create a faceted plot with labeled facets
tm_shape(map_tb) +
    tm_polygons(
        col = c("propallcase", "rand1", "rand2", "rand3"),
        style = "quantile",
        n = 8,
        palette = "Greens",
        legend.show = FALSE
    ) +
    tm_facets(nrow = 1, free.coords = FALSE) + # Create facets in a single row
    tm_layout(
        panel.labels = c("Proportions", "Random 1", "Random 2", "Random 3"), # Labels for each plot
        panel.label.size = 1.5, # Size of the labels
        panel.label.bg.color = "white", # Background color for labels
        panel.label.fontface = "bold"
    ) # Make the labels bold

Is there a difference between our observed proportions of tuberculosis distribution and those generated from a completely random process? The map on the left is our observed distribution. The three maps on the right are realizations of a completely random process.

Local Moran’s I

The output contains: local moran’s statistic Ii, expectation of local moran’s statistic E.Ii, variance of local moran’s statistic Var.Ii, standard deviate of local moran’s statistic Z.Ii, and p-value of local moran statistic using pr().

# Local distribution
hist(localr[, 5])

In the map below, local moran I’s was negative for higher-order adjacencies, thus districts far away from the one of interest tended to have disparate notification rates

# maps the results
tm_shape(moran.mapr) +
    tm_polygons(
        col = "Ii", n = 6,
        style = "quantile",
        title = "Local moran statistic", Frame = T
    ) +
    tm_compass(
        north = 1, text.size = 0.2,
        position = c("left", "top")
    ) +
    tm_layout(title = "TB local clustering")

Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Let’s present this in Local Indicators of Spatial Association(LISA) using the tidyrgeoda. Details about the

LISA will be found here tidyrgeoda package

Install the development version of the package from github as below

# install.packages("devtools")
library(devtools)

Warning: package 'devtools' was built under R version 4.3.3

Loading required package: usethis

Warning: package 'usethis' was built under R version 4.3.3

devtools::install_github("SpatLyu/tidyrgeoda",
    build_vignettes = T,
    dep = T
)

Skipping install of 'tidyrgeoda' from a github remote, the SHA1 (e7d779b8) has not changed since last install.
  Use `force = TRUE` to force installation

Assessing the Local Moran at 0.05 significance level

library(tidyverse)
library(tidyrgeoda)

Welcome to tidyrgeoda 0.1.1!

# Convert the SpatialPolygonsDataFrame to an sf object
map_sf <- st_as_sf(map_tb)

map_sf %>%
    dplyr::mutate(lisa = st_local_moran(., "propallcase",
        wt = st_contiguity_weights(.),
        significance_cutoff = 0.05
    )) %>%
    dplyr::select(lisa) %>%
    ggplot() +
    geom_sf(aes(fill = lisa), lwd = .1, color = "grey") +
    scale_fill_lisa(name = "LISA 95%") +
    labs(
        title = "Tuberculosis Hotspots in Amhara Region, 2014–2017 ",
        caption = "©2019 Yalemzewod Gelaw https://doi.org/10.1093/trstmh/trz017/",
        subtitle = "LISA at 0.05 significance level"
    ) +
    theme_minimal() +
    theme(
        panel.background = element_blank(),
        legend.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        plot.title = element_text(size = 16, color = "#6c2d83", hjust = 0.5, vjust = 2),
        plot.subtitle = element_text(size = 12, color = "#bd5288", hjust = 0.5, vjust = 0.5, face = "bold"),
        plot.caption = element_text(size = 9, color = "grey60", hjust = 0.5, vjust = 1.5),
        axis.title.x = element_text(size = 7, color = "grey60"),
        legend.text = element_text(size = 9, color = "grey20"),
        legend.title = element_text(size = 10, color = "grey20"),
        strip.text = element_text(size = 12),
        plot.margin = margin(t = 10, r = 10, b = 30, l = 10),
        axis.title.y = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()
    )

Local Getis-Ord Statistics

map_sf %>%
    dplyr::mutate(lisa = st_local_g(., "propallcase",
        wt = st_contiguity_weights(.),
        significance_cutoff = 0.05
    )) %>%
    dplyr::select(lisa) %>%
    ggplot() +
    geom_sf(aes(fill = lisa), lwd = .1, color = "grey") +
    scale_fill_lisa(name = "Local G") +
    labs(
        title = "Tuberculosis Hotspots in Amhara Region, 2014–2017 ",
        caption = "©2019 Yalemzewod Gelaw https://doi.org/10.1093/trstmh/trz017/",
        subtitle = "LISA Getis-Ord Statistics"
    ) +
    theme_minimal() +
    theme(
        panel.background = element_blank(),
        legend.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        plot.title = element_text(size = 16, color = "#6c2d83", hjust = 0.5, vjust = 2),
        plot.subtitle = element_text(size = 12, color = "#bd5288", hjust = 0.5, vjust = 0.5, face = "bold"),
        plot.caption = element_text(size = 9, color = "grey60", hjust = 0.5, vjust = 1.5),
        axis.title.x = element_text(size = 7, color = "grey60"),
        legend.text = element_text(size = 9, color = "grey20"),
        legend.title = element_text(size = 10, color = "grey20"),
        strip.text = element_text(size = 12),
        plot.margin = margin(t = 10, r = 10, b = 30, l = 10),
        axis.title.y = element_blank(),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()
    )

LISA is used to detect the location of hotspot districts with high rates of TB notifications surrounded by others also with high rates of TB. The LISA led to the classification of districts into five classes: high–high (hotspot), low–low (LL), low–high (LH), high–low (HL). and not significant. The hotspot and low–low locations suggest clustering of similar areas, whereas the high–low and low–high locations indicate spatial outliers. A high–low spatial outlier describes a location where there is a mixture of high and low scores in neighbouring areas. The outliers (high–low and low–high TB cluster districts) have epidemiological significance for TB prevention and control. High–low clusters indicate a location with a high incidence of notified TB cases sharing borders with locations with low incidence, so that infection can be disseminated from the high-incidence location to the low-incidence locations. Low–high clusters indicate a location with low TB incidence sharing borders with locations with high TB incidence, so that areas of low TB incidence can be a receptor of infection.

References

Elliott, Paul, Jon Wakefield, Nicola Best, and David Briggs, eds. 2001. “Spatial Epidemiology: Methods and Applications.” In, 1st ed., 3–14. Oxford University PressOxford. https://doi.org/10.1093/acprof:oso/9780198515326.003.0001.

Elliott, Paul, and Daniel Wartenberg. 2004. “Spatial Epidemiology: Current Approaches and Future Challenges.” Environmental Health Perspectives 112 (9): 998–1006. https://doi.org/10.1289/ehp.6735.

Gelaw, Yalemzewod Assefa, Gail Williams, Yibeltal Assefa, Moges Asressie, and Ricardo J Soares Magalhães. 2019. “Sociodemographic Profiling of Tuberculosis Hotspots in Ethiopia, 20142017.” Transactions of The Royal Society of Tropical Medicine and Hygiene 113 (7): 379–91. https://doi.org/10.1093/trstmh/trz017.

Pfeiffer, Dirk U., Timothy P. Robinson, Mark Stevenson, Kim B. Stevens, David J. Rogers, and Archie C. A. Clements. 2008. Spatial Analysis in Epidemiology. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198509882.001.0001.

Introduction to spatial analysis in R

Introduction

Spatial Data

Vector Data

Raster Data

sp and sf packages

Vector - Spatial data

Area spatial data structure

Spatial points

Coordinate Reference Systems (CRS)

Joining non-spatial data to shapefiles

Writing shapefiles out

Introduction to Raster data

Manipulating rasters

Extracting raster data

Spatial Epidemiology

Making chloropleth maps in R

Adding points to maps

Exploration: Autocorrelation

Define neighboring polygons

Step 2: Assign weights to the neighbors

Step 3: Computing Global Moran’s I statistic

Step 4: Performing a hypothesis test

Local Moran’s I

References

`sp` and `sf` packages