Johns Hopkins Covid-19 Data and R, Part I -- data.table handling.

Summary: This blog showcases handling of the daily U.S. Covid-19 case/death data published by the Center for Systems Science and Engineering at Johns Hopkins University. The technology deployed to manage and explore the data is R along with its splendid data.table package. Analysts with several months of R experience should benefit from the notebook below.

It's pretty hard to consume any analytics media these days without seeing explorations of Covid-19 data. I was late to Covid EDA, but am now all in, hoping I can make even a small contribution to the pandemic response. A good starting point for Covid data is the Center for Systems Science and Engineering at Johns Hopkins University, my alma mater. The CSSE maintains a Covid-19 dashboard and posts confirmed case and fatality files daily for the U.S. and the world.

I started looking at that data about a week ago using R, planning later to examine the same data with Python and Julia. The downloadable case and death files resemble spreadsheets, with an ever-expanding repeating group of date columns holding the cumulative case/death counts. The granularity of the data is county or other jurisdiction within state, so ultimately a normalized relational structure would key on the combination of state, jurisdiction, and date. A problem with the data, noted on the website, is that "The time series tables are subject to be updated if inaccuracies are identified in our historical data. The daily reports will not be adjusted in these instances to maintain a record of raw data." In other words, there are some anomalies in the data that must be accounted for. I try to manage around them as best I can with summarization and moving averages.

Any data management work I do in R is built on the nonpareil data.table package, which adds immeasurable functionality to R's native data.frame. A newbie serious about learning R for analytics should make an investment in data.table. It'll take some time, but the rewards are well worth the effort. Python programmers are starting to see the Python data.table as a competitor to the venerable Pandas.
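For the uninitiated, the core of the package is the `dt[i, j, by]` bracket: subset (`i`), compute (`j`), and group (`by`) in one expression. A minimal sketch on toy data (illustrative values, not the CSSE files), contrasted with the base-R route:

```r
library(data.table)

# Toy daily counts for two states
dt <- data.table(state = c("DE", "DE", "MD", "MD"),
                 cases = c(10L, 20L, 5L, 15L))

# Base data.frame route: aggregate() with a formula
agg_df <- aggregate(cases ~ state, data = as.data.frame(dt), FUN = sum)

# data.table route: one bracket does subset, compute, and group
agg_dt <- dt[, .(cases = sum(cases)), by = state]

agg_dt
```

The same bracket scales from this toy to the three-hundred-thousand-row melted tables below.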

This is the first of a two-part series on R with the CSSE case/fatality data. Part I here details the loading/shaping/grouping of the data, while Part II will explore the data using ggplot. My hope is that readers will find some of the code useful in their own work.

The supporting platform is a Wintel 10 notebook with 128 GB RAM, along with software JupyterLab 1.2.4 and R 3.6.2. The R data.table, tidyverse, pryr, plyr, fst, and knitr packages are featured, as well as functions from my personal stash, detailed below.

Set options, load packages, include personal functions, and migrate to the working directory. The functions blanks, meta, mykab, and prheadtail are used extensively.

In [1]:
options(warn=-1) 
options(scipen = 20)
options(datatable.print.topn=100)
options(datatable.showProgress=FALSE)

usualsuspects <- c(
    'tidyverse', 'data.table', 'pryr', 'plyr','readxl', 'bit', 'grid', 'gridExtra',     
    'rvest', 'magrittr','lubridate','rlist', 'tictoc', 'skimr','pracma',
    'fst','feather',
    'knitr', 'kableExtra',  
    'ggplot2','RColorBrewer'
) 

suppressMessages(invisible(lapply(usualsuspects, library, character.only = TRUE)))

funcsdir <- "/steve/r/functions"
funcsfile <- "rfunctions.r"

setwd(funcsdir)
source(funcsfile)

blanks(1)
lsf.str()

wdir <- "/steve/covid"
setwd(wdir)

blanks(2)

allfreqs : function (dtn, catlim = 100)  
blanks : function (howmany)  
colsize : function (dt)  
dtmeta : function (df)  
freqsdt : function (DTstr, xstr, percent = TRUE)  
meta : function (df, data = FALSE, dict = TRUE)  
mksumfreq : function (freqalldt)  
mksumfreq2 : function (dt)  
mykab : function (dt)  
obj_sz : function (obj)  
prhead : function (df, howmany = 6)  
prheadtail : function (df, howmany = 6)  
 

Define several convenience functions that do nothing more than save typing.

In [2]:
diffna <- function (var) c(NA,diff(var))
sumna <- function (var) sum(var,na.rm=TRUE)
freadfac <- function (var) fread(var,stringsAsFactors=TRUE)

blanks(2)
 
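A quick illustration of the first two helpers: `diffna` turns a cumulative series into daily increments with a leading NA, and `sumna` totals a vector while ignoring that NA. Toy values, not CSSE data:

```r
# The convenience functions defined above
diffna <- function (var) c(NA, diff(var))
sumna  <- function (var) sum(var, na.rm = TRUE)

cumcases <- c(0, 0, 3, 7, 7, 12)   # a toy cumulative series
daycases <- diffna(cumcases)       # daily increments; first element is NA
sumna(daycases)                    # recovers the final cumulative count, 12
```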

Download the CSSE U.S. confirmed case file. The starting data.table is the epitome of unnormalized repeating groups.

In [3]:
burl = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series"

fname <- "time_series_covid19_confirmed_US.csv"

covidcasesus <-  freadfac(file.path(burl, fname)) 

meta(covidcasesus)

blanks(2)

|name        |class                        |rows|columns|size   |
|:-----------|:----------------------------|:---|:------|:------|
|covidcasesus|c("data.table", "data.frame")|3261|116    |1.91 MB|

Classes 'data.table' and 'data.frame':	3261 obs. of  116 variables:
 $ UID           : num  16 316 580 630 850 ...
 $ iso2          : Factor w/ 6 levels "AS","GU","MP",..: 1 2 3 4 6 5 5 5 5 5 ...
 $ iso3          : Factor w/ 6 levels "ASM","GUM","MNP",..: 1 2 3 4 6 5 5 5 5 5 ...
 $ code3         : int  16 316 580 630 850 840 840 840 840 840 ...
 $ FIPS          : num  60 66 69 72 78 ...
 $ Admin2        : Factor w/ 1902 levels "","Abbeville",..: 1 1 1 1 1 83 90 101 151 166 ...
 $ Province_State: Factor w/ 58 levels "Alabama","Alaska",..: 3 15 40 45 53 1 1 1 1 1 ...
 $ Country_Region: Factor w/ 1 level "US": 1 1 1 1 1 1 1 1 1 1 ...
 $ Lat           : num  -14.3 13.4 15.1 18.2 18.3 ...
 $ Long_         : num  -170.1 144.8 145.7 -66.6 -64.9 ...
 $ Combined_Key  : Factor w/ 3261 levels "Abbeville, South Carolina, US",..: 57 1148 2105 2410 3018 112 121 133 204 224 ...
 $ 1/22/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/23/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/24/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/25/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/26/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/27/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/28/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/29/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/30/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 1/31/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/1/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/2/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/3/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/4/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/5/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/6/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/7/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/8/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/9/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/10/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/11/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/12/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/13/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/14/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/15/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/16/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/17/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/18/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/19/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/20/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/21/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/22/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/23/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/24/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/25/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/26/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/27/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/28/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 2/29/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/1/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/2/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/3/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/4/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/5/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/6/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/7/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/8/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/9/20        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/10/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/11/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/12/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/13/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/14/20       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ 3/15/20       : int  0 0 0 0 0 0 1 0 0 0 ...
 $ 3/16/20       : int  0 3 0 5 1 0 1 0 0 0 ...
 $ 3/17/20       : int  0 3 0 5 2 0 1 0 0 0 ...
 $ 3/18/20       : int  0 5 0 5 2 0 1 0 0 0 ...
 $ 3/19/20       : int  0 12 0 5 3 0 1 0 0 0 ...
 $ 3/20/20       : int  0 14 0 14 3 0 2 0 0 0 ...
 $ 3/21/20       : int  0 15 0 21 6 0 2 0 0 0 ...
 $ 3/22/20       : int  0 27 0 23 6 0 2 0 0 0 ...
 $ 3/23/20       : int  0 29 0 31 7 0 3 0 0 0 ...
 $ 3/24/20       : int  0 32 0 39 17 1 4 0 0 0 ...
 $ 3/25/20       : int  0 37 0 51 17 4 4 0 0 1 ...
 $ 3/26/20       : int  0 45 0 64 17 6 5 0 0 2 ...
 $ 3/27/20       : int  0 51 0 79 19 6 5 0 0 4 ...
 $ 3/28/20       : int  0 55 0 100 22 6 10 0 0 5 ...
 $ 3/29/20       : int  0 56 0 127 23 6 15 0 0 5 ...
 $ 3/30/20       : int  0 58 0 174 30 6 18 0 2 5 ...
 $ 3/31/20       : int  0 69 2 239 30 7 19 0 3 5 ...
 $ 4/1/20        : int  0 77 6 286 30 8 20 0 3 5 ...
 $ 4/2/20        : int  0 82 6 316 30 10 24 0 4 6 ...
 $ 4/3/20        : int  0 84 8 316 37 12 28 1 4 9 ...
 $ 4/4/20        : int  0 93 8 452 40 12 29 2 4 10 ...
 $ 4/5/20        : int  0 112 8 475 42 12 29 2 5 10 ...
 $ 4/6/20        : int  0 113 8 513 43 12 38 2 7 10 ...
 $ 4/7/20        : int  0 121 8 573 43 12 42 3 8 10 ...
 $ 4/8/20        : int  0 121 11 620 45 12 44 3 9 10 ...
 $ 4/9/20        : int  0 128 11 683 45 15 56 4 9 11 ...
 $ 4/10/20       : int  0 130 11 725 50 17 59 9 11 12 ...
 $ 4/11/20       : int  0 133 11 788 51 19 66 9 13 12 ...
 $ 4/12/20       : int  0 133 11 897 51 19 71 10 16 13 ...
 $ 4/13/20       : int  0 133 11 903 51 19 72 10 17 14 ...
 $ 4/14/20       : int  0 133 11 923 51 23 87 11 17 16 ...
 $ 4/15/20       : int  0 135 13 974 51 24 91 12 18 17 ...
 $ 4/16/20       : int  0 135 13 1043 51 26 101 14 22 18 ...
 $ 4/17/20       : int  0 136 13 1068 51 26 103 15 24 20 ...
 $ 4/18/20       : int  0 136 14 1118 53 25 109 18 26 20 ...
  [list output truncated]
 - attr(*, ".internal.selfref")=<externalptr> 
NULL

 

Search for "keys". UID and Combined_Key qualify.

In [4]:
nrow(covidcasesus)
nrow(covidcasesus[,.N,.(UID)])
nrow(covidcasesus[,.N,.(Combined_Key)])
nrow(covidcasesus[,.N,.(Province_State)])

blanks(2)
3261
3261
3261
58
 

Take a quick look at the data. The granularity is sub-state geographies; the repeating-group date attributes represent cumulative cases.

In [5]:
mykab(covidcasesus[Province_State=="Delaware",c("Combined_Key","4/23/20","4/24/20","4/25/20","4/26/20","4/27/20","4/28/20","4/29/20")])

blanks(2)

|Combined_Key            |4/23/20|4/24/20|4/25/20|4/26/20|4/27/20|4/28/20|4/29/20|
|:-----------------------|:------|:------|:------|:------|:------|:------|:------|
|Kent, Delaware, US      |539    |558    |579    |625    |652    |728    |743    |
|New Castle, Delaware, US|1389   |1486   |1504   |1599   |1629   |1701   |1717   |
|Sussex, Delaware, US    |1377   |1394   |1490   |1801   |1870   |2114   |2169   |
|Out of DE, Delaware, US |0      |0      |0      |0      |0      |0      |0      |
|Unassigned, Delaware, US|3      |4      |3      |9      |11     |32     |26     |
 

"Pivot" or "melt" the data.table by date. Compute daily cases from the cumulative counts by combined_key. Sort by combined_key and date. Note the chaining to create mcovidcasesus.

In [6]:
nmes <- names(covidcasesus)
nc <- length(nmes)
startdt <- 12

names(covidcasesus) <- c(tolower(nmes[1:(startdt-1)]),nmes[startdt:nc])

varnums <- startdt:nc

mcovidcasesus <- melt(covidcasesus,measure.vars=varnums,variable.name="date",value.name="cumcases")[
    ,date:=mdy(date)][order(combined_key,date)][,`:=`(daycases=diffna(cumcases)),.(combined_key)] 

meta(mcovidcasesus)

blanks(2)

|name         |class                        |rows  |columns|size    |
|:------------|:----------------------------|:-----|:------|:-------|
|mcovidcasesus|c("data.table", "data.frame")|342405|14     |25.22 MB|

Classes 'data.table' and 'data.frame':	342405 obs. of  14 variables:
 $ uid           : num  84045001 84045001 84045001 84045001 84045001 ...
 $ iso2          : Factor w/ 6 levels "AS","GU","MP",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ iso3          : Factor w/ 6 levels "ASM","GUM","MNP",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ code3         : int  840 840 840 840 840 840 840 840 840 840 ...
 $ fips          : num  45001 45001 45001 45001 45001 ...
 $ admin2        : Factor w/ 1902 levels "","Abbeville",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ province_state: Factor w/ 58 levels "Alabama","Alaska",..: 47 47 47 47 47 47 47 47 47 47 ...
 $ country_region: Factor w/ 1 level "US": 1 1 1 1 1 1 1 1 1 1 ...
 $ lat           : num  34.2 34.2 34.2 34.2 34.2 ...
 $ long_         : num  -82.5 -82.5 -82.5 -82.5 -82.5 ...
 $ combined_key  : Factor w/ 3261 levels "Abbeville, South Carolina, US",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ date          : Date, format: "2020-01-22" "2020-01-23" ...
 $ cumcases      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ daycases      : int  NA 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL

 
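The chain above is easier to see on a toy table of the same shape, with two date columns standing in for the hundred-plus in the real file. A sketch, not CSSE data:

```r
library(data.table)
library(lubridate)

# Toy wide table mimicking the CSSE layout: one row per key, one column per date
wide <- data.table(combined_key = c("A", "B"),
                   `1/22/20` = c(0L, 1L),
                   `1/23/20` = c(2L, 4L))

# Melt to long, parse the dates, sort, then difference within each key
mwide <- melt(wide, measure.vars = 2:3,
              variable.name = "date", value.name = "cumcases")[
    , date := mdy(date)][order(combined_key, date)][
    , daycases := c(NA, diff(cumcases)), by = combined_key]

mwide
```

Two rows by two dates become four long records, each with a cumulative count and its daily increment.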

Examine the last two dates for Delaware in the melted data.table.

In [7]:
mykab(mcovidcasesus[province_state=="Delaware" & date>=max(date)-1])

blanks(2)

|uid     |iso2|iso3|code3|fips |admin2    |province_state|country_region|lat     |long_    |combined_key            |date      |cumcases|daycases|
|:-------|:---|:---|:----|:----|:---------|:-------------|:-------------|:-------|:--------|:-----------------------|:---------|:-------|:-------|
|84010001|US  |USA |840  |10001|Kent      |Delaware      |US            |39.08647|-75.56885|Kent, Delaware, US      |2020-05-04|833     |12      |
|84010001|US  |USA |840  |10001|Kent      |Delaware      |US            |39.08647|-75.56885|Kent, Delaware, US      |2020-05-05|847     |14      |
|84010003|US  |USA |840  |10003|New Castle|Delaware      |US            |39.57993|-75.64055|New Castle, Delaware, US|2020-05-04|1934    |31      |
|84010003|US  |USA |840  |10003|New Castle|Delaware      |US            |39.57993|-75.64055|New Castle, Delaware, US|2020-05-05|1979    |45      |
|84080010|US  |USA |840  |80010|Out of DE |Delaware      |US            |0.00000 |0.00000  |Out of DE, Delaware, US |2020-05-04|0       |0       |
|84080010|US  |USA |840  |80010|Out of DE |Delaware      |US            |0.00000 |0.00000  |Out of DE, Delaware, US |2020-05-05|0       |0       |
|84010005|US  |USA |840  |10005|Sussex    |Delaware      |US            |38.66144|-75.39031|Sussex, Delaware, US    |2020-05-04|2497    |36      |
|84010005|US  |USA |840  |10005|Sussex    |Delaware      |US            |38.66144|-75.39031|Sussex, Delaware, US    |2020-05-05|2520    |23      |
|84090010|US  |USA |840  |90010|Unassigned|Delaware      |US            |0.00000 |0.00000  |Unassigned, Delaware, US|2020-05-04|24      |1       |
|84090010|US  |USA |840  |90010|Unassigned|Delaware      |US            |0.00000 |0.00000  |Unassigned, Delaware, US|2020-05-05|25      |1       |
 

Check for negative daycases, which generally emerge when cases are removed from "unassigned" admin2. Alas, we must work with these anomalies. Aggregating to the state level and deploying moving averages help dampen the disruption.

In [8]:
mykab(mcovidcasesus[daycases<0,.(.N,daycases=sum(daycases))])
mykab(head(mcovidcasesus[province_state=="Illinois" & daycases<0][order(daycases)],10))

blanks(2)

|N   |daycases|
|:---|:-------|
|2323|-30359  |

|uid     |iso2|iso3|code3|fips |admin2    |province_state|country_region|lat     |long_    |combined_key            |date      |cumcases|daycases|
|:-------|:---|:---|:----|:----|:---------|:-------------|:-------------|:-------|:--------|:-----------------------|:---------|:-------|:-------|
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-03-27|0       |-668    |
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-04-04|58      |-357    |
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-03-21|3       |-164    |
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-05-02|278     |-85     |
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-03-24|0       |-63     |
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-05-04|214     |-43     |
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-04-13|65      |-36     |
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-04-23|193     |-35     |
|84090017|US  |USA |840  |90017|Unassigned|Illinois      |US            |0.00000 |0.00000  |Unassigned, Illinois, US|2020-05-03|257     |-21     |
|84017179|US  |USA |840  |17179|Tazewell  |Illinois      |US            |40.50716|-89.51405|Tazewell, Illinois, US  |2020-04-19|26      |-15     |
 
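The dampening effect of a trailing mean is easy to demonstrate. Below, `frollmean` with an illustrative 3-day window (the notebook later uses 14 days on the real data) absorbs a one-day negative correction; toy values, not CSSE data:

```r
library(data.table)

# Toy daily series with a one-day negative correction (day 4)
daycases <- c(10, 12, 15, -8, 11, 14, 13)

# Trailing 3-day mean; the first window-1 values are NA by design
mav <- frollmean(daycases, 3)

round(mav, 2)
all(mav[-(1:2)] > 0)   # the smoothed series never goes negative
```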

Next up is the deaths file, which looks much like the cases file.

In [9]:
burl = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series"

fname <- "time_series_covid19_deaths_US.csv"

coviddeathsus <-  freadfac(file.path(burl, fname)) 

meta(coviddeathsus,dict=FALSE)

blanks(2)

|name         |class                        |rows|columns|size   |
|:------------|:----------------------------|:---|:------|:------|
|coviddeathsus|c("data.table", "data.frame")|3261|117    |1.92 MB|


 

Melt/pivot the repeating date attributes. Note that startdt is now 13 rather than 12: the deaths file carries an extra population column ahead of the dates.

In [10]:
nmes <- names(coviddeathsus)
nc <- length(nmes)
startdt <- 13

names(coviddeathsus) <- c(tolower(nmes[1:(startdt-1)]),nmes[startdt:nc])

varnums <- startdt:nc

mcoviddeathsus <- melt(coviddeathsus,measure.vars=varnums,variable.name="date",value.name="cumdeaths")[
    ,date:=mdy(date)][order(combined_key,date)][,`:=`(daydeaths=diffna(cumdeaths)),.(combined_key)]

meta(mcoviddeathsus)

blanks(2)

|name          |class                        |rows  |columns|size    |
|:-------------|:----------------------------|:-----|:------|:-------|
|mcoviddeathsus|c("data.table", "data.frame")|342405|15     |26.53 MB|

Classes 'data.table' and 'data.frame':	342405 obs. of  15 variables:
 $ uid           : num  84045001 84045001 84045001 84045001 84045001 ...
 $ iso2          : Factor w/ 6 levels "AS","GU","MP",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ iso3          : Factor w/ 6 levels "ASM","GUM","MNP",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ code3         : int  840 840 840 840 840 840 840 840 840 840 ...
 $ fips          : num  45001 45001 45001 45001 45001 ...
 $ admin2        : Factor w/ 1902 levels "","Abbeville",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ province_state: Factor w/ 58 levels "Alabama","Alaska",..: 47 47 47 47 47 47 47 47 47 47 ...
 $ country_region: Factor w/ 1 level "US": 1 1 1 1 1 1 1 1 1 1 ...
 $ lat           : num  34.2 34.2 34.2 34.2 34.2 ...
 $ long_         : num  -82.5 -82.5 -82.5 -82.5 -82.5 ...
 $ combined_key  : Factor w/ 3261 levels "Abbeville, South Carolina, US",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ population    : int  24527 24527 24527 24527 24527 24527 24527 24527 24527 24527 ...
 $ date          : Date, format: "2020-01-22" "2020-01-23" ...
 $ cumdeaths     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ daydeaths     : int  NA 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL

 

342,405 records = 3261 combined_keys X 105 dates

In [11]:
nrow(mcoviddeathsus)
mykab(mcoviddeathsus[,.(N=.N),.(combined_key)][.N])
mykab(mcoviddeathsus[,.(N=.N),.(date)][.N])

blanks(2)
342405

|combined_key             |N  |
|:------------------------|:--|
|Ziebach, South Dakota, US|105|

|date      |N   |
|:---------|:---|
|2020-05-05|3261|
 

Check out the last two dates for Delaware -- each with five combined keys.

In [12]:
mykab(mcoviddeathsus[province_state=="Delaware" & date>=max(date)-1])

blanks(2)

|uid     |iso2|iso3|code3|fips |admin2    |province_state|country_region|lat     |long_    |combined_key            |population|date      |cumdeaths|daydeaths|
|:-------|:---|:---|:----|:----|:---------|:-------------|:-------------|:-------|:--------|:-----------------------|:---------|:---------|:--------|:--------|
|84010001|US  |USA |840  |10001|Kent      |Delaware      |US            |39.08647|-75.56885|Kent, Delaware, US      |180786    |2020-05-04|30       |0        |
|84010001|US  |USA |840  |10001|Kent      |Delaware      |US            |39.08647|-75.56885|Kent, Delaware, US      |180786    |2020-05-05|32       |2        |
|84010003|US  |USA |840  |10003|New Castle|Delaware      |US            |39.57993|-75.64055|New Castle, Delaware, US|558753    |2020-05-04|82       |3        |
|84010003|US  |USA |840  |10003|New Castle|Delaware      |US            |39.57993|-75.64055|New Castle, Delaware, US|558753    |2020-05-05|83       |1        |
|84080010|US  |USA |840  |80010|Out of DE |Delaware      |US            |0.00000 |0.00000  |Out of DE, Delaware, US |0         |2020-05-04|0        |0        |
|84080010|US  |USA |840  |80010|Out of DE |Delaware      |US            |0.00000 |0.00000  |Out of DE, Delaware, US |0         |2020-05-05|0        |0        |
|84010005|US  |USA |840  |10005|Sussex    |Delaware      |US            |38.66144|-75.39031|Sussex, Delaware, US    |234225    |2020-05-04|70       |2        |
|84010005|US  |USA |840  |10005|Sussex    |Delaware      |US            |38.66144|-75.39031|Sussex, Delaware, US    |234225    |2020-05-05|72       |2        |
|84090010|US  |USA |840  |90010|Unassigned|Delaware      |US            |0.00000 |0.00000  |Unassigned, Delaware, US|0         |2020-05-04|0        |0        |
|84090010|US  |USA |840  |90010|Unassigned|Delaware      |US            |0.00000 |0.00000  |Unassigned, Delaware, US|0         |2020-05-05|0        |0        |
 

Search for negative daily deaths, indications of updates.

In [13]:
mykab(mcoviddeathsus[daydeaths<0,.(.N,daydeaths=sum(daydeaths))])

mykab(head(mcoviddeathsus[daydeaths<0][order(daydeaths)],10))

blanks(2)

|N  |daydeaths|
|:--|:--------|
|525|-5442    |

|uid     |iso2|iso3|code3|fips |admin2    |province_state|country_region|lat|long_|combined_key                 |population|date      |cumdeaths|daydeaths|
|:-------|:---|:---|:----|:----|:---------|:-------------|:-------------|:--|:----|:----------------------------|:---------|:---------|:--------|:--------|
|84090036|US  |USA |840  |90036|Unassigned|New York      |US            |0  |0    |Unassigned, New York, US     |0         |2020-04-19|401      |-658     |
|84090025|US  |USA |840  |90025|Unassigned|Massachusetts |US            |0  |0    |Unassigned, Massachusetts, US|0         |2020-04-25|13       |-590     |
|84090036|US  |USA |840  |90036|Unassigned|New York      |US            |0  |0    |Unassigned, New York, US     |0         |2020-04-20|48       |-353     |
|84090034|US  |USA |840  |90034|Unassigned|New Jersey    |US            |0  |0    |Unassigned, New Jersey, US   |0         |2020-04-01|0        |-247     |
|84090034|US  |USA |840  |90034|Unassigned|New Jersey    |US            |0  |0    |Unassigned, New Jersey, US   |0         |2020-04-22|0        |-233     |
|84090036|US  |USA |840  |90036|Unassigned|New York      |US            |0  |0    |Unassigned, New York, US     |0         |2020-04-04|427      |-196     |
|84090036|US  |USA |840  |90036|Unassigned|New York      |US            |0  |0    |Unassigned, New York, US     |0         |2020-04-07|116      |-193     |
|84090025|US  |USA |840  |90025|Unassigned|Massachusetts |US            |0  |0    |Unassigned, Massachusetts, US|0         |2020-04-11|13       |-166     |
|84090051|US  |USA |840  |90051|Unassigned|Virginia      |US            |0  |0    |Unassigned, Virginia, US     |0         |2020-04-22|0        |-140     |
|84090025|US  |USA |840  |90025|Unassigned|Massachusetts |US            |0  |0    |Unassigned, Massachusetts, US|0         |2020-04-27|8        |-133     |
 

Roll up the cases data to state-by-date granularity the data.table way -- i.e., invoking a date-summarizing function on each state's .SD. Create a key for easy subsequent joining.

In [14]:
mkcases <- function(sd) sd[,.(cumcases=sum(cumcases)),.(date)][,`:=`(daycases=diffna(cumcases))]

covidcasesstdt <- mcovidcasesus[,mkcases(.SD),.(state=province_state)]

setkey(covidcasesstdt,state,date)

meta(covidcasesstdt)

blanks(2)

|name          |class                        |rows|columns|size     |
|:-------------|:----------------------------|:---|:------|:--------|
|covidcasesstdt|c("data.table", "data.frame")|6090|4      |124.95 KB|

Classes 'data.table' and 'data.frame':	6090 obs. of  4 variables:
 $ state   : Factor w/ 58 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ date    : Date, format: "2020-01-22" "2020-01-23" ...
 $ cumcases: int  0 0 0 0 0 0 0 0 0 0 ...
 $ daycases: int  NA 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr  "state" "date"
NULL

 
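The `mcovidcasesus[,mkcases(.SD),.(state=province_state)]` pattern hands each group's subset (`.SD`) to an ordinary function. A toy version of the grouped-.SD idiom (made-up values):

```r
library(data.table)

# Toy sub-state rows: group g, date, cumulative count
dt <- data.table(g = c("a", "a", "b", "b"),
                 date = as.Date("2020-05-04") + c(0, 1, 0, 1),
                 cum = c(3L, 7L, 1L, 5L))

# Summarize one group's subset: totals by date, then daily increments
summ <- function(sd) sd[, .(cum = sum(cum)), .(date)][, day := c(NA, diff(cum))]

# Apply it per group via .SD
out <- dt[, summ(.SD), by = g]
out
```

Each group gets its own date-level summary, stacked back into one table.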

There are a few "extra" states to be deleted later.

In [15]:
covidcasesstdt[,.N,.(state)] %>% nrow

prheadtail(covidcasesstdt)

blanks(2)
58
     state       date cumcases daycases
1: Alabama 2020-01-22        0       NA
2: Alabama 2020-01-23        0        0
3: Alabama 2020-01-24        0        0
4: Alabama 2020-01-25        0        0
5: Alabama 2020-01-26        0        0
6: Alabama 2020-01-27        0        0

     state       date cumcases daycases
1: Wyoming 2020-04-30      559       14
2: Wyoming 2020-05-01      566        7
3: Wyoming 2020-05-02      579       13
4: Wyoming 2020-05-03      586        7
5: Wyoming 2020-05-04      596       10
6: Wyoming 2020-05-05      604        8


 

Check dependable Massachusetts's latest calculations against the NYT's numbers. They agree.

In [16]:
mykab(covidcasesstdt[state=="Massachusetts"][.N])

blanks(2)

|state        |date      |cumcases|daycases|
|:------------|:---------|:-------|:-------|
|Massachusetts|2020-05-05|70271   |1184    |
 

An obvious case update here.

In [17]:
mykab(covidcasesstdt[state=="Massachusetts" & date %in% ymd(c("2020-04-19","2020-04-20","2020-04-21"))])

blanks(2)

|state        |date      |cumcases|daycases|
|:------------|:---------|:-------|:-------|
|Massachusetts|2020-04-19|38077   |1705    |
|Massachusetts|2020-04-20|38077   |0       |
|Massachusetts|2020-04-21|41199   |3122    |
 

Now a similar deaths rollup across state and date.

In [18]:
mkdeaths <- function(sd) sd[,.(cumdeaths=sum(cumdeaths)),.(date)][,`:=`(daydeaths=diffna(cumdeaths))]

coviddeathsstdt <- mcoviddeathsus[,mkdeaths(.SD),.(state=province_state)]

setkey(coviddeathsstdt,state,date)

meta(coviddeathsstdt)

blanks(2)

|name           |class                        |rows|columns|size     |
|:--------------|:----------------------------|:---|:------|:--------|
|coviddeathsstdt|c("data.table", "data.frame")|6090|4      |124.95 KB|

Classes 'data.table' and 'data.frame':	6090 obs. of  4 variables:
 $ state    : Factor w/ 58 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ date     : Date, format: "2020-01-22" "2020-01-23" ...
 $ cumdeaths: int  0 0 0 0 0 0 0 0 0 0 ...
 $ daydeaths: int  NA 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr  "state" "date"
NULL

 

Massachusetts delivers again.

In [19]:
mykab(coviddeathsstdt[state=="Massachusetts"][.N])

blanks(2)

|state        |date      |cumdeaths|daydeaths|
|:------------|:---------|:--------|:--------|
|Massachusetts|2020-05-05|4212     |122      |
 

Finally, join cases and deaths the data.table way.

In [20]:
finalcovidstdt <- covidcasesstdt[coviddeathsstdt]

meta(finalcovidstdt)

blanks(2)

|name          |class                        |rows|columns|size     |
|:-------------|:----------------------------|:---|:------|:--------|
|finalcovidstdt|c("data.table", "data.frame")|6090|6      |172.78 KB|

Classes 'data.table' and 'data.frame':	6090 obs. of  6 variables:
 $ state    : Factor w/ 58 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ date     : Date, format: "2020-01-22" "2020-01-23" ...
 $ cumcases : int  0 0 0 0 0 0 0 0 0 0 ...
 $ daycases : int  NA 0 0 0 0 0 0 0 0 0 ...
 $ cumdeaths: int  0 0 0 0 0 0 0 0 0 0 ...
 $ daydeaths: int  NA 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "sorted")= chr  "state" "date"
 - attr(*, ".internal.selfref")=<externalptr> 
NULL

 
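The `covidcasesstdt[coviddeathsstdt]` syntax joins the two tables on the key columns set earlier with `setkey`. A toy reproduction of the X[Y] idiom, using the Delaware and Maryland figures from the state summary below:

```r
library(data.table)

cases  <- data.table(state = c("DE", "MD"),
                     date = as.Date("2020-05-05"),
                     cumcases = c(5371L, 27117L))
deaths <- data.table(state = c("DE", "MD"),
                     date = as.Date("2020-05-05"),
                     cumdeaths = c(187L, 1390L))

setkey(cases, state, date)
setkey(deaths, state, date)

# X[Y]: for each row of deaths, look up the matching keyed row of cases
joined <- cases[deaths]
joined
```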

Totals -- they match the figures published early in the day.

In [21]:
mykab(finalcovidstdt[,.(cumcases=sumna(daycases),cumdeaths=sumna(daydeaths))])

blanks(2)

|cumcases|cumdeaths|
|:-------|:--------|
|1204350 |71064    |
 

Reliable Massachusetts.

In [22]:
state <- "Massachusetts"

mykab(finalcovidstdt[state][.N])


blanks(2)

|state        |date      |cumcases|daycases|cumdeaths|daydeaths|
|:------------|:---------|:-------|:-------|:--------|:--------|
|Massachusetts|2020-05-05|70271   |1184    |4212     |122      |
 

The outlist enumerates "non-state" jurisdictions that nonetheless add to the U.S. totals.

In [23]:
outlist <- c("Guam","Grand Princess","Virgin Islands","Diamond Princess","Northern Mariana Islands","American Samoa","Puerto Rico")

blanks(2)
 

Latest numbers for each state and DC.

In [24]:
mykab(finalcovidstdt[!is.element(state,outlist),.SD[.N],.(state)][order(-cumcases)])

blanks(2)

|state               |date      |cumcases|daycases|cumdeaths|daydeaths|
|:-------------------|:---------|:-------|:-------|:--------|:--------|
|New York            |2020-05-05|321192  |2239    |25124    |125      |
|New Jersey          |2020-05-05|130593  |2324    |8244     |334      |
|Massachusetts       |2020-05-05|70271   |1184    |4212     |122      |
|Illinois            |2020-05-05|65889   |2112    |2834     |175      |
|California          |2020-05-05|58456   |2572    |2381     |103      |
|Pennsylvania        |2020-05-05|53864   |1048    |3179     |341      |
|Michigan            |2020-05-05|44451   |461     |4183     |44       |
|Florida             |2020-05-05|37439   |542     |1471     |72       |
|Texas               |2020-05-05|33912   |1129    |925      |24       |
|Connecticut         |2020-05-05|30621   |648     |2633     |77       |
|Georgia             |2020-05-05|29998   |547     |1299     |53       |
|Louisiana           |2020-05-05|29996   |323     |2115     |51       |
|Maryland            |2020-05-05|27117   |709     |1390     |73       |
|Indiana             |2020-05-05|21033   |526     |1211     |60       |
|Ohio                |2020-05-05|20969   |495     |1135     |78       |
|Virginia            |2020-05-05|20257   |764     |713      |29       |
|Colorado            |2020-05-05|17367   |449     |903      |52       |
|Washington          |2020-05-05|15594   |132     |862      |21       |
|Tennessee           |2020-05-05|13690   |119     |226      |7        |
|North Carolina      |2020-05-05|12510   |539     |472      |30       |
|Iowa                |2020-05-05|10111   |408     |207      |19       |
|Rhode Island        |2020-05-05|9933    |281     |355      |14       |
|Arizona             |2020-05-05|9305    |381     |395      |33       |
|Missouri            |2020-05-05|9121    |175     |409      |23       |
|Wisconsin           |2020-05-05|8566    |330     |353      |13       |
|Alabama             |2020-05-05|8437    |325     |315      |17       |
|Mississippi         |2020-05-05|8207    |330     |342      |32       |
|Minnesota           |2020-05-05|7851    |617     |455      |28       |
|South Carolina      |2020-05-05|6841    |84      |296      |13       |
|Nebraska            |2020-05-05|6373    |336     |78       |0        |
|Kentucky            |2020-05-05|5822    |577     |275      |14       |
|Nevada              |2020-05-05|5660    |30      |270      |5        |
|Kansas              |2020-05-05|5648    |265     |161      |5        |
|Utah                |2020-05-05|5449    |132     |56       |6        |
|Delaware            |2020-05-05|5371    |83      |187      |5        |
|District of Columbia|2020-05-05|5322    |152     |264      |6        |
|New Mexico          |2020-05-05|4138    |107     |162      |6        |
|Oklahoma            |2020-05-05|4127    |83      |247      |9        |
|Arkansas            |2020-05-05|3525    |34      |83       |3        |
|Oregon              |2020-05-05|2839    |80      |113      |4        |
|South Dakota        |2020-05-05|2721    |53      |24       |3        |
|New Hampshire       |2020-05-05|2636    |48      |92       |6        |
|Idaho               |2020-05-05|2127    |21      |65       |1        |
|North Dakota        |2020-05-05|1266    |41      |25       |0        |
|West Virginia       |2020-05-05|1246    |22      |48       |0        |
|Maine               |2020-05-05|1226    |21      |61       |4        |
|Vermont             |2020-05-05|907     |5       |52       |0        |
|Hawaii              |2020-05-05|625     |4       |17       |0        |
|Wyoming             |2020-05-05|604     |8       |7        |0        |
|Montana             |2020-05-05|456     |-1      |16       |0        |
|Alaska              |2020-05-05|371     |1       |9        |0        |
 

Save the final data.table to an fst file.

In [25]:
fname <- 'finalcovidstdt.fst'
write_fst(finalcovidstdt,fname)

blanks(2)
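
The fst round trip can be verified on toy data; a minimal sketch, assuming the fst and data.table packages are attached (the table and file name here are hypothetical, not the finalcovidstdt table above):

```r
library(data.table)
library(fst)

# Write a small data.table to a temporary fst file, read it back,
# and confirm the contents survive the round trip.
dt <- data.table(state = c("Maryland", "Virginia"),
                 cumcases = c(27117L, 20257L))
tmp <- tempfile(fileext = ".fst")
write_fst(dt, tmp)
back <- read_fst(tmp, as.data.table = TRUE)
isTRUE(all.equal(dt, back, check.attributes = FALSE))
```

read_fst returns a data.frame by default; as.data.table = TRUE restores the data.table class directly.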
 

Show how to build hyper-normalized "work" data.tables for use with ggplot2 graphics, introducing moving averages. I will use these techniques widely in the next blog.

In [26]:
lagdays <- 14

stacked <- finalcovidstdt[,.(lag=lagdays,daycases=sumna(daycases),daydeaths=sumna(daydeaths)),.(date)][
        ,`:=`(cumcases=cumsum(daycases),cumdeaths=cumsum(daydeaths))][
        ,`:=`(mavdaycases=round(frollmean(daycases,lagdays)),mavdaydeaths=round(frollmean(daydeaths,lagdays)))] 

meta(stacked)

mstacked <- melt(stacked,id.vars=c(2,1),measure.vars=c(3,4,7,8),variable.name="type",value.name="cases")
meta(mstacked)

blanks(2)

|name   |class                        |rows|columns|size   |
|:------|:----------------------------|:---|:------|:------|
|stacked|c("data.table", "data.frame")|105 |8      |6.98 KB|

Classes 'data.table' and 'data.frame':	105 obs. of  8 variables:
 $ date        : Date, format: "2020-01-22" "2020-01-23" ...
 $ lag         : num  14 14 14 14 14 14 14 14 14 14 ...
 $ daycases    : int  0 0 1 0 3 0 0 0 0 2 ...
 $ daydeaths   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cumcases    : int  0 0 1 1 4 4 4 4 4 6 ...
 $ cumdeaths   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mavdaycases : num  NA NA NA NA NA NA NA NA NA NA ...
 $ mavdaydeaths: num  NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL


|name    |class                        |rows|columns|size    |
|:-------|:----------------------------|:---|:------|:-------|
|mstacked|c("data.table", "data.frame")|420 |4      |13.62 KB|

Classes 'data.table' and 'data.frame':	420 obs. of  4 variables:
 $ lag  : num  14 14 14 14 14 14 14 14 14 14 ...
 $ date : Date, format: "2020-01-22" "2020-01-23" ...
 $ type : Factor w/ 4 levels "daycases","daydeaths",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cases: num  0 0 1 0 3 0 0 0 0 2 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL
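
The frollmean-then-melt pattern above can be illustrated on a toy series; a minimal sketch with hypothetical counts and a 3-day window instead of 14:

```r
library(data.table)

# Toy daily counts: cumulative sum, 3-day moving average, then melt
# wide to long so ggplot can facet or color by series type.
toy <- data.table(date = as.Date("2020-03-01") + 0:5,
                  daycases = c(2L, 4L, 6L, 8L, 10L, 12L))
toy[, `:=`(cumcases = cumsum(daycases),
           mavdaycases = frollmean(daycases, 3))]

long <- melt(toy, id.vars = "date",
             measure.vars = c("daycases", "mavdaycases"),
             variable.name = "type", value.name = "cases")
long
```

As in the output above, the first window-1 moving-average values are NA, since frollmean needs a full window before it emits a mean.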

 

Take a look.

In [27]:
prheadtail(stacked)
blanks(1)
prheadtail(mstacked)

blanks(2)
         date lag daycases daydeaths cumcases cumdeaths mavdaycases
1: 2020-01-22  14        0         0        0         0          NA
2: 2020-01-23  14        0         0        0         0          NA
3: 2020-01-24  14        1         0        1         0          NA
4: 2020-01-25  14        0         0        1         0          NA
5: 2020-01-26  14        3         0        4         0          NA
6: 2020-01-27  14        0         0        4         0          NA
   mavdaydeaths
1:           NA
2:           NA
3:           NA
4:           NA
5:           NA
6:           NA

         date lag daycases daydeaths cumcases cumdeaths mavdaycases
1: 2020-04-30  14    29515      2029  1069423     62996       28702
2: 2020-05-01  14    34037      1947  1103460     64943       28840
3: 2020-05-02  14    29078      1426  1132538     66369       28899
4: 2020-05-03  14    25501      1313  1158039     67682       28833
5: 2020-05-04  14    22335      1240  1180374     68922       28621
6: 2020-05-05  14    23976      2142  1204350     71064       28381
   mavdaydeaths
1:         2012
2:         1967
3:         1901
4:         1910
5:         1876
6:         1856


   lag       date     type cases
1:  14 2020-01-22 daycases     0
2:  14 2020-01-23 daycases     0
3:  14 2020-01-24 daycases     1
4:  14 2020-01-25 daycases     0
5:  14 2020-01-26 daycases     3
6:  14 2020-01-27 daycases     0

   lag       date         type cases
1:  14 2020-04-30 mavdaydeaths  2012
2:  14 2020-05-01 mavdaydeaths  1967
3:  14 2020-05-02 mavdaydeaths  1901
4:  14 2020-05-03 mavdaydeaths  1910
5:  14 2020-05-04 mavdaydeaths  1876
6:  14 2020-05-05 mavdaydeaths  1856


 

Summaries. The final cumulative counts should equal the sums of the daily counts -- a useful consistency check.

In [28]:
mykab(stacked[,.(cumcases=cumcases[.N],cumdeaths=cumdeaths[.N],cumcases=sumna(daycases),cumdeaths=sumna(daydeaths))])
blanks(1)
mykab(mstacked[type=='daycases',.(cumcases=sumna(cases))])

blanks(2)

|cumcases|cumdeaths|cumcases|cumdeaths|
|:-------|:--------|:-------|:--------|
|1204350 |71064    |1204350 |71064    |


|cumcases|
|:-------|
|1204350 |
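
The consistency check above rests on a simple identity: the last value of a cumulative sum equals the sum of the daily increments. A minimal sketch on hypothetical counts:

```r
library(data.table)

# The final cumulative count and the sum of daily counts must agree.
d <- data.table(daycases = c(1L, 4L, 2L))
d[, cumcases := cumsum(daycases)]
d[, .(lastcum = cumcases[.N], total = sum(daycases))]
```

cumcases[.N] picks the last row of the cumulative column; sum(daycases) totals the dailies, and both come to 7 here.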
 

Look at daily cases, recovering the cumulative counts with cumsum.

In [29]:
prheadtail(mstacked[type=='daycases',.(lag,date,type,cases,cumcases=cumsum(cases))])

blanks(2)
   lag       date     type cases cumcases
1:  14 2020-01-22 daycases     0        0
2:  14 2020-01-23 daycases     0        0
3:  14 2020-01-24 daycases     1        1
4:  14 2020-01-25 daycases     0        1
5:  14 2020-01-26 daycases     3        4
6:  14 2020-01-27 daycases     0        4

   lag       date     type cases cumcases
1:  14 2020-04-30 daycases 29515  1069423
2:  14 2020-05-01 daycases 34037  1103460
3:  14 2020-05-02 daycases 29078  1132538
4:  14 2020-05-03 daycases 25501  1158039
5:  14 2020-05-04 daycases 22335  1180374
6:  14 2020-05-05 daycases 23976  1204350
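
The cell above computes cumsum within the filtered daycases subset; the same running totals can be produced for every series type in one pass with a by clause. A minimal sketch on hypothetical data:

```r
library(data.table)

# Running cumulative sums within each series type, computed with by=.
m <- data.table(type = rep(c("daycases", "daydeaths"), each = 3),
                cases = c(1L, 2L, 3L, 10L, 20L, 30L))
m[, cumcases := cumsum(cases), by = type]
m
```

Each type's cumulative column restarts at its own first row, so no pre-filtering is needed.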


 

That's it for now. Next week, exploration with graphics.
