COVID-19 Data Visualization in R

Li Wang

6/22/2022

Goal

My goal with this project is to create a vignette about contacting the [COVID19 API] (https://documenter.getpostman.com/view/10808728/SzS8rjbc) using functions I create to query, parse, return well-structured data, and then I will use the functions to obtain data from the API and do some exploratory data analysis.

Load Packages

To use the functions for interacting with the COVID19 API, I used the following packages:

library(httr)
library(jsonlite)
library(readr)
library(dplyr)
library(plyr)
library(tidyr)
library(ggplot2)
library(tidyverse)
library(lubridate)

Write functions to query the data

“By Country” endpoint

I write this function to interact with the “By Country” endpoint of the COVID19 API.

The following function can allow us to query the “country”,”status”, “from”, and “to” modifications to return a specific data. The status must be one of: confirmed, recovered, deaths.

# write a funtion to query the "country","status", "from", and "to" modifications to return a specific data.
covid19_data_by_country<- function(country,status,from,to){
  base<-"https://api.covid19api.com/country/"
  call1 <- paste(base,country,"/","status","/",status,"?","from","=",from, "&","to","=",to, sep="")
  by_country <- GET(call1)
  by_country_text <- content(by_country, "text")
  by_country_json <- fromJSON(by_country_text, flatten = TRUE)
  # Remove the column 4 to 5 which is non-meaningful
  covid19_by_country <- by_country_json[,-(4:5),drop=FALSE]
}

I query the “country” as “China”, the “status” as “confirmed”, the “from” as “2021/01/01”, and “to” as “2021/12/31”, then I got the data set of the COVID19 confirmed cases for china from “2021/01/01” to”2021/12/31”.

# Query the data set "china_confirmed" with the COVID19 confirmed cases for china from "2021/01/01" to"2021/12/31".
china_confirmed <-covid19_data_by_country("china","confirmed","2021/01/01","2021/12/31")
# change the Date to ""%Y/%m/%d""
china_confirmed$Date <- format(as.POSIXct(china_confirmed$Date,format="%Y-%m-%dT%H:%M:%SZ"),format="%Y/%m/%d")
head(china_confirmed)

china_confirmed

“By Country Total” endpoint

I write this function to interact with the “By Country Total” endpoint of the COVID19 API.

The following function can allow us to query the “country”, “status”, “from”, and “to” modifications to return a specific data. The status must be one of: confirmed, recovered, deaths.

# write a function to query the "country","status", "from", and "to" modifications to return a specific data.
covid19_Country_Total<- function(country,status,from,to){
  base<-"https://api.covid19api.com/total/country/"
  call2 <- paste(base,country,"/","status","/",status,"?","from","=",from, "&","to","=",to, sep="")
  country_total <- GET(call2)
  country_total_text <- content(country_total, "text")
  country_total_json <- fromJSON(country_total_text, flatten = TRUE)
  # Remove the column 2 to 7 which is non-meaningful.
  country_total_data <- country_total_json[,-(2:7)]
}

I query the “country” as “China”and “India”, the “status” as “Confirmed”and”deaths”, the “from” as “2021/01/01”, and “to” as “2021/12/31”, and return the COVID19 confirmed and deaths cases for China and India from 2021/01/01 to 2021/12/31.

# Query the data set "china_Total" with the total COVID19 confirmed cases for China from 2021/01/01 to 2021/12/31
china_Total <-covid19_Country_Total("China","confirmed","2021/01/01","2021/12/31")

#  Query the data set "India_Total" with the total COVID19 confirmed cases for India from 2021/01/01 to 2021/12/31
India_Total <-covid19_Country_Total("India","confirmed","2021/01/01","2021/12/31")

# Query the data set "India_Total" with the total COVID19 death cases for India from 2021/01/01 to 2021/12/31
India_death_data <-covid19_Country_Total("India","deaths","2021/01/01","2021/12/31")

# Combine the "India_Total" data and "India_death_data"
jointdataset <- join(India_Total, India_death_data, by=c("Date","Country"))

# change the Date to ""%Y/%m/%d""
china_Total$Date <- format(as.POSIXct(china_Total$Date,format="%Y-%m-%dT%H:%M:%SZ"),format="%Y/%m/%d")
India_Total$Date <- format(as.POSIXct(India_Total$Date,format="%Y-%m-%dT%H:%M:%SZ"),format="%Y/%m/%d")
jointdataset$Date <- format(as.POSIXct(jointdataset$Date,format="%Y-%m-%dT%H:%M:%SZ"),format="%Y/%m/%d")
# change column names of jointdataset
colnames(jointdataset)[2] <- "Confirmed_Cases"
colnames(jointdataset)[5] <- "Deaths_Cases"

# remove "Status" column of jointdataset
jointdataset<-jointdataset %>%
select(Country,Confirmed_Cases,Deaths_Cases,Date)

# View head of 
head(china_Total)
head(India_Total)
head(jointdataset)

China_Total data

India_Total data

joindataset jointdataset

Exploratory Data Analysis (EDA)

For EDA, I pull data from “By Country” endpoint, and “By Country Total” endpoint.

Create new columns

Create new columns named “Month” and “Cases_range” in “china_confirmed”, “china_total”, and “china_total” data :

# Create a new columns named "month" in "china_confirmed"
china_confirmed<-china_confirmed %>%
  mutate(month= month(ymd(china_confirmed$Date)))

# Create new columns named "month", "Daily_change", and "Daily_change_range" in "china_Total"
china_Total<-china_Total %>%
  mutate(month= month(ymd(china_Total$Date)))%>%
  mutate(Daily_change= Cases - lag(Cases))%>%
  mutate(Daily_change_range= cut(Daily_change, c(0, 75, 149,223,297, 373)))%>%group_by(Daily_change_range)
china_Total$month <- as.character(china_Total$month)

# Create three new columns named "month", "Daily_change", and "Daily_change_range" in "India_Total"
India_Total<-India_Total %>%
  mutate(month= month(ymd(India_Total$Date)))%>%
  mutate(Daily_change= Cases - lag(Cases))%>%
  mutate(Daily_change_range= cut(Daily_change, c(0, 82838,165675,248513,331351, 414189)))%>%group_by(Daily_change_range)

Create some contingency tables

Create a contingency table China_Total dataset: Daily_change_range v.s month

table(china_Total$Daily_change_range, china_Total$month)

China_table

From above table, 3 means there are 3 days with confirmed covid-19 cases in range (0,75] in January in China.

As we can see, the most Daily_change confirmed cases in the lowest range (0,75], and the range (75,149] is just in January, July to December, and the other higher Daily_change_range of confirmed cases is just in January and December.

Create a contingency table India_Total dataset: Daily_change_range vs month

table(India_Total$Daily_change_range, India_Total$month)

India_table

From above table, 28 means there are 28 days with confirmed cases in range (0,8.28e+04] in Feb 2021 in India.

As we can see, the most Daily_change confirmed cases in the lowest range (0,8.28e+04], and the other higher Daily_change_range of confirmed cases is almost in April and may.

Create numerical summaries

Create numerical summaries for “Daily_change” variable of china_Total

china_Total %>%
  drop_na(Daily_change)%>%
  group_by(month) %>%
  summarise(
          mean = mean(Daily_change)
            )

china_month_summary

From the above table, the largest of the mean of Daily_change confirmed Covid-19 cases in 2021 China is 135 in January, and the smallest is 17 in May.

Create numerical summaries for “Daily_change” variable of India_Total

India_Total %>% 
  drop_na(Daily_change)%>% 
  group_by(month)%>% 
  summarise(mean = mean(Daily_change)
  )

India_month_summary

From the above table, the largest mean of Daily_change confirmed Covid-19 cases in 2021 India is 290648,and the smallest is 9131 in December.

Create plots

Create Bar Plot of Daily_Change Confirmed Covid19 Cases of China in 2021

china_Total <- china_Total %>% drop_na(Daily_change_range)
g <- ggplot(data = china_Total, aes(x = Daily_change_range))
g + geom_bar(color = "blue", fill = "blue")+
labs(title = "Bar Plot of Daily_Change \nConfirmed Covid19 Cases of China in 2021")

Bar plot of china

From above bar plot, the most Daily_change confirmed cases of China are in the range (0,75], and the least Daily_change confirmed cases of china are in the range (297,373]. It’s decreasing trend.

Create Bar Plot of Daily_Change Confirmed Covid19 Cases of India in 2021

India_Total <- India_Total %>% drop_na(Daily_change_range)
g <- ggplot(data = India_Total, aes(x = Daily_change_range))
g + geom_bar(color = "blue", fill = "blue")+
labs(title = "Bar Plot of Daily_Change \nConfirmed Covid19 Cases of India in 2021")+
theme(axis.text.x = element_text(angle = 45,size = 5))

Bar plot of India

From above bar plot, the most Daily_change confirmed cases of India are in the range (0,8.28e+04], and the least Daily_change confirmed cases of India are in the range (1.66e+05,2.49e+05].

Create histogram of Daily_Change Confirmed Covid19 Cases of China in 2021

g <- ggplot(data = china_Total, aes(x = Daily_change))
g + geom_histogram(color = "blue", fill = "blue",
size = 2, binwidth = 3)+
labs(title = "Histogram of Daily_Change \nConfirmed Covid19 Cases of China in 2021")

Histogram

From above histogram, the most Daily_change confirmed cases of China are in the range (0,75].

Ceate Boxplot of Daily_Change Confirmed Covid19 Cases of Beijing and Shanghai in 2021

china_confirmed<-filter(china_confirmed, Province %in% c("Beijing","Shanghai"))
g <- ggplot(china_confirmed, aes(x = Province, y = Cases))
g + geom_boxplot(fill = "grey")+
labs(title = "Boxplot of Daily_Change \nConfirmed Covid19 Cases of Beijing and Shanghai in 2021")+
  stat_summary(fun = mean, geom = "line", lwd = 1.5, aes(group = month, col = month))

boxplot

From above box plot, the Daily_Change Confirmed Covid19 Cases of Beijing is lower than Shanghai in 2021, and the slope of line with mean of Daily_Change Confirmed Covid_19 Cases of Beijing VS Shanghai grouped-by month is increased.

Create Scatter Plot of Confirmed_Cases Vs. Deaths_Cases of India in 2021

g <- ggplot(jointdataset, aes(x = Confirmed_Cases, y = Deaths_Cases))
g + geom_point(color="blue")+
  labs(title = "Scatter Plot of Confirmed_Cases \n Vs. Deaths_Cases of India in 2021")

scatter plot

From above scatter plot, more confirmed_cases means more deaths_cases of India in 2021. This two variables are positive correlation.