Using functions to be more efficient

In yesterday’s post I focused on the task of using R to “drive” MS Excel. I deliberately ended the post with a fully functioning (pun intended) but very ugly set of code. Why “ugly”? Well, because the last set of code wound up repeating 4 lines of code 12 times!

### Manual and painful way
## Create a new workbook
wb <- createWorkbook()
# education by each of the other 4 variables
NameofSheet <- "CoverageByEducation"
TheData <- table(OfInterest$EDUCATION,OfInterest$NOTCOV)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
NameofSheet <- "MedbillByEducation"
TheData <- table(OfInterest$EDUCATION,OfInterest$MEDBILL)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
### many repeating lines removed ###
NameofSheet <- "PNMED12MByAge"
TheData <- table(OfInterest$AGE,OfInterest$PNMED12M)
addWorksheet(wb = wb, sheetName = NameofSheet)
writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
#
saveWorkbook(wb, "BetterExcelExample.xlsx", overwrite = TRUE) ## save to working directory

Now, to be honest, in a nice modern IDE environment like RStudio, cutting, pasting and making small edits 12 times is not all that difficult. I did do exactly that my first pass through. But as I looked at it I also knew I could do much better if I used automation. The code would be easier to read or modify in the future and I’d be much less likely to make a mistake.

Background and catchup

My colleague wanted to be able to do some simple analysis around health care using the Centers for Disease Control and Prevention (https://www.cdc.gov), National Center for Health Statistics (https://www.cdc.gov/nchs/index.htm), National Health Interview Survey (https://www.cdc.gov/nchs/nhis/nhis_2016_data_release.htm). They wanted a series of cross tabulated sets of summary data for variable pairings (for example whether or not the respondent had a formal health care provider by region of the country). They wanted one Excel “workbook” with 12 worksheets each one of which was the summary of counts for a pair of variables. From there they could use Excel’s native plotting tools to make the graphs they needed.

You can review everything that happened yesterday (which I recommend) or you can pick up here. To join us in progress make sure you load the right libraries and grab the dataset we wound up on which is called OfInterest.

knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
theme_set(theme_bw()) # set theme to my personal preference
# install.packages("openxlsx")
require(openxlsx)

## Loading required package: openxlsx

OfInterest <- read.csv("ofinterest.csv")
### OfInterest <- read.csv("https://raw.githubusercontent.com/ibecav/ibecav.github.io/master/Rmdfiles/ofinterest.csv") available through Github about 9Mb
str(OfInterest)

## 'data.frame':    103789 obs. of  9 variables:
##  $ AGE      : Factor w/ 3 levels "19 to 60","Less than 18",..: 1 1 1 2 2 1 1 1 3 1 ...
##  $ REGION   : Factor w/ 4 levels "Midwest","Northeast",..: 3 4 4 4 4 4 3 3 4 3 ...
##  $ SEX      : Factor w/ 2 levels "Female","Male": 2 1 2 2 1 1 1 2 2 2 ...
##  $ EDUCATION: Factor w/ 3 levels "Bachelor's degree or higher",..: 3 2 1 2 NA 1 2 2 1 3 ...
##  $ EARNINGS : Factor w/ 3 levels "$01-$34,999",..: 1 2 2 NA NA 3 1 NA 2 3 ...
##  $ PDMED12M : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ PNMED12M : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NOTCOV   : Factor w/ 2 levels "Covered","Not covered": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MEDBILL  : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 1 1 ...

Breaking the task down

So as my tagline indicates I don’t consider myself a “programmer”. I love analyzing data and I love R but I approach programming slowly and cautiously. What I’m about to explain as my method will likely seem quaint and even antiquated to some but has the advantage of being very methodical and very practical. There are lots of places on the web to read about this stuff, I’m simply making the case I hope mine is slow and enough and methodical enough for a beginner.

What do we have there in those 12 iterations? Well what we have are 3 variables EDUCATION, EARNINGS, and AGE that we want to “cross” with 4 other variables NOTCOV, MEDBILL, PDMED12M, and PNMED12M to give us the total of 12. In my discipline we would call the first three the independent variables and the second four the dependent variables.

So as a first step in automating our work lets just make two lists that acknowledge that fact. We’ll even name the list elements so we can make use of those names in future steps. That means we can use the short hand depvars$Coverage instead of depvars[[1]] which I find difficult to keep track of. Notice that the list contains the actual data and is not just a pointer at the dataframe.

depvars <- list(Coverage = OfInterest$NOTCOV, ProbPay = OfInterest$MEDBILL, CareDelay = OfInterest$PDMED12M, NeedNotGet = OfInterest$PNMED12M)
indvars <- list(Education = OfInterest$EDUCATION, Earnings = OfInterest$EARNINGS, Age = OfInterest$AGE)
# these two are identical use head because the list is more than 100,000 entries long
head(depvars$Coverage)

## [1] Covered Covered Covered Covered Covered Covered
## Levels: Covered Not covered

head(depvars[[1]])

## [1] Covered Covered Covered Covered Covered Covered
## Levels: Covered Not covered

Okay, so far so good! Now what? Well R has a wonderful function called lapply which as the documentation ?lapply and numerous websites will tell you applies a function to a list. It sequentially walks its way through a list and applies the function you tell it to use.

Let’s take a small step. We want tables. 12 of them. Each table is of the form table(indvars,depvars) like “education” by “coverage”. Let’s try lapply for just part of that process. The command becomes lapply(depvars, function (x) table(OfInterest$EDUCATION,x)) which you can read as “Take the list of dependent variables depvars. Apply the function called table and wherever you see an x substitute the current value of depvars”. So the very first thing it would do is table(OfInterest$EDUCATION,OfInterest$NOTCOV)) and it would do it for all four variables in the list.

lapply(depvars, function (x) table(OfInterest$EDUCATION,x))

## $Coverage
##                                   x
##                                    Covered Not covered
##   Bachelor's degree or higher        20802         895
##   High School Grad or less           43404        6846
##   Some college or Associate degree   20667        2313
## 
## $ProbPay
##                                   x
##                                       No   Yes
##   Bachelor's degree or higher      19994  1789
##   High School Grad or less         41375  9217
##   Some college or Associate degree 19231  3899
## 
## $CareDelay
##                                   x
##                                       No   Yes
##   Bachelor's degree or higher      20583  1219
##   High School Grad or less         47576  3155
##   Some college or Associate degree 21001  2185
## 
## $NeedNotGet
##                                   x
##                                       No   Yes
##   Bachelor's degree or higher      21108   694
##   High School Grad or less         48191  2532
##   Some college or Associate degree 21545  1645

Perfect! Just what we were looking for. We get back a list of 4 tables. Progress! No surprise it works the other way around as well. We can hold the second part of the table command constant and just vary the independent variable via indvars!

lapply(indvars, function (y) table(y,OfInterest$NOTCOV))

## $Education
##                                   
## y                                  Covered Not covered
##   Bachelor's degree or higher        20802         895
##   High School Grad or less           43404        6846
##   Some college or Associate degree   20667        2313
## 
## $Earnings
##                   
## y                  Covered Not covered
##   $01-$34,999        16148        4121
##   $35,000-$74,999    12116         783
##   $75,000 and over    6253         158
## 
## $Age
##               
## y              Covered Not covered
##   19 to 60       47258        8623
##   Less than 18   24193        1312
##   More than 60   20730         571

What we need, of course, is both of those things. A “nested” set of calls to lapply to walk through both lists and give us 12 tables not 3 or 4. So the next command is a bit ugly to read but hopefully if you have been following along it will make perfectly good sense. We’re going to call lapply and the function we will tell it to run is lapply! The second lapply will in turn call table and all we have to do is to keep our x’s and y’s correct!

lapply(depvars, function (x) lapply(indvars, function (y) table(y,x)))

## $Coverage
## $Coverage$Education
##                                   x
## y                                  Covered Not covered
##   Bachelor's degree or higher        20802         895
##   High School Grad or less           43404        6846
##   Some college or Associate degree   20667        2313
## 
## $Coverage$Earnings
##                   x
## y                  Covered Not covered
##   $01-$34,999        16148        4121
##   $35,000-$74,999    12116         783
##   $75,000 and over    6253         158
## 
## $Coverage$Age
##               x
## y              Covered Not covered
##   19 to 60       47258        8623
##   Less than 18   24193        1312
##   More than 60   20730         571
## 
## 
## $ProbPay
## $ProbPay$Education
##                                   x
## y                                     No   Yes
##   Bachelor's degree or higher      19994  1789
##   High School Grad or less         41375  9217
##   Some college or Associate degree 19231  3899
## 
## $ProbPay$Earnings
##                   x
## y                     No   Yes
##   $01-$34,999      16229  4139
##   $35,000-$74,999  11321  1596
##   $75,000 and over  6069   341
## 
## $ProbPay$Age
##               x
## y                 No   Yes
##   19 to 60     46924  9504
##   Less than 18 20841  4747
##   More than 60 19331  2031
## 
## 
## $CareDelay
## $CareDelay$Education
##                                   x
## y                                     No   Yes
##   Bachelor's degree or higher      20583  1219
##   High School Grad or less         47576  3155
##   Some college or Associate degree 21001  2185
## 
## $CareDelay$Earnings
##                   x
## y                     No   Yes
##   $01-$34,999      17937  2447
##   $35,000-$74,999  12049   877
##   $75,000 and over  6239   174
## 
## $CareDelay$Age
##               x
## y                 No   Yes
##   19 to 60     51621  5020
##   Less than 18 25058   612
##   More than 60 20307  1112
## 
## 
## $NeedNotGet
## $NeedNotGet$Education
##                                   x
## y                                     No   Yes
##   Bachelor's degree or higher      21108   694
##   High School Grad or less         48191  2532
##   Some college or Associate degree 21545  1645
## 
## $NeedNotGet$Earnings
##                   x
## y                     No   Yes
##   $01-$34,999      18481  1907
##   $35,000-$74,999  12403   522
##   $75,000 and over  6327    86
## 
## $NeedNotGet$Age
##               x
## y                 No   Yes
##   19 to 60     52841  3794
##   Less than 18 25262   402
##   More than 60 20638   782

Please note that it took me quite some time to get that nested lapply correct! A lot of searching on Stack Overflow, a lot of trial and error, but I won’t forget it now! If you want to test yourself try on your own to make the change necessary to invert the output to have the columns and rows the other way around.

Now that we have demonstrated we can do it, lets put our 12 tables someplace safe. Let’s call it TablesList and to ensure we know how to get these tables back out again let’s pull just one of them from our list of 12, this is where using names not numbers helps.

TablesList <- lapply(depvars, function (x) lapply(indvars, function (y) table(y,x)))
TablesList$Coverage$Education

##                                   x
## y                                  Covered Not covered
##   Bachelor's degree or higher        20802         895
##   High School Grad or less           43404        6846
##   Some college or Associate degree   20667        2313

TablesList$Coverage

## $Education
##                                   x
## y                                  Covered Not covered
##   Bachelor's degree or higher        20802         895
##   High School Grad or less           43404        6846
##   Some college or Associate degree   20667        2313
## 
## $Earnings
##                   x
## y                  Covered Not covered
##   $01-$34,999        16148        4121
##   $35,000-$74,999    12116         783
##   $75,000 and over    6253         158
## 
## $Age
##               x
## y              Covered Not covered
##   19 to 60       47258        8623
##   Less than 18   24193        1312
##   More than 60   20730         571

Okay we made a list of tables now let’s work on putting those tables where we want them in a series of sheets in a workbook.

For a little more fun

Pardon the pun (for those who got it) but yes we’re now going to use for loops to pull the tables out of the list and put them somewhere in an organized fashion. So the name of our list is TablesList and inside that list are sublists with names like TablesList$Coverage and the individual tables have names like TablesList$Coverage$Education.

So why don’t we walk down TablesList$Coverage and extract the 3 tables in there one by one? Many ways to do it but I’ll use a for loop with for (i in seq_along(TablesList$Coverage)) {print(TablesList$Coverage[[i]])} which says sequence along the list TablesList$Coverage using i as a placeholder for where we are in the list. Then for each item in the list i print the table.

for (i in seq_along(TablesList$Coverage)) {print(TablesList$Coverage[[i]])}

##                                   x
## y                                  Covered Not covered
##   Bachelor's degree or higher        20802         895
##   High School Grad or less           43404        6846
##   Some college or Associate degree   20667        2313
##                   x
## y                  Covered Not covered
##   $01-$34,999        16148        4121
##   $35,000-$74,999    12116         783
##   $75,000 and over    6253         158
##               x
## y              Covered Not covered
##   19 to 60       47258        8623
##   Less than 18   24193        1312
##   More than 60   20730         571

Good. Once again though we want all 12 not just three so we need to nest again. So let’s walk down just EDUCATION first…

for (j in seq_along(TablesList)) {print(TablesList[[j]][[1]])}

##                                   x
## y                                  Covered Not covered
##   Bachelor's degree or higher        20802         895
##   High School Grad or less           43404        6846
##   Some college or Associate degree   20667        2313
##                                   x
## y                                     No   Yes
##   Bachelor's degree or higher      19994  1789
##   High School Grad or less         41375  9217
##   Some college or Associate degree 19231  3899
##                                   x
## y                                     No   Yes
##   Bachelor's degree or higher      20583  1219
##   High School Grad or less         47576  3155
##   Some college or Associate degree 21001  2185
##                                   x
## y                                     No   Yes
##   Bachelor's degree or higher      21108   694
##   High School Grad or less         48191  2532
##   Some college or Associate degree 21545  1645

Driving Excel through automation

Okay now we’re ready to make magic. Scroll back to the top of this post or review the last post and you’ll see the logic is:

Create a new empty workbook object wb <- createWorkbook() once
Invent a name for the tab or worksheet inside the workbook NameofSheet 12 times
Make a table for a pair of variables like TheData 12 times
Add a worksheet (tab) into the workbook addWorksheet 12 times
Write the table we made onto the worksheet with writeData 12 times
Save the workbook with the 12 sheets in it once

So steps 1 & 6 occur once and steps 2-5 need to occur in our loop structure

wb <- createWorkbook()
for loop
1. get table TheData
2. create NameofSheet
3. make an empty worksheet named NameofSheet with addWorksheet
4. writeData called TheData into NameofSheet
saveWorkbook or save it

## Create a new empty workbook
wb <- createWorkbook()
## nested for loop
for (j in seq_along(TablesList)) { #top list with depvars
  for (i in seq_along(TablesList[[j]])) { #for each depvar walk the indvars
    TheData <- TablesList[[j]][[i]]
    NameofSheet <- paste0(names(TablesList[j]), "By", names(TablesList[[j]][i]))
    addWorksheet(wb = wb, sheetName = NameofSheet)
    writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")
    }
  }
## Save our new workbook
saveWorkbook(wb, "newversion.xlsx", overwrite = TRUE) ## save to working directory

Breaking it down:

These two lines create the nested for loops to walk the 12 tables. j is the outer loop of dependent variables and i is the inner loop of independent variables.

for (j in seq_along(TablesList)) { #top list with depvars
  for (i in seq_along(TablesList[[j]])) { #for each depvar walk the indvars

NameofSheet is nice because since our list items are “named” we can put those names together with the word “By” to make the name of the worksheet sensible to a human.

    TheData <- TablesList[[j]][[i]]
    NameofSheet <- paste0(names(TablesList[j]), "By", names(TablesList[[j]][i]))
    addWorksheet(wb = wb, sheetName = NameofSheet)
    writeData(wb = wb, sheet = NameofSheet, x = TheData, borders = "n")

All done (not yet!)

Hopefully this post helps you understand how to put automation in the guise of lapply and for for you. In my next post I’ll build on this scaffolding to discuss how to make these very same graphs in ggplot2 (which IMHO runs circles around Excel for scientific plotting), as well as making this all more efficient through the use of our own functions to take care of some of the more repetitive chores.

I hope you’ve found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Using functions to be more efficient – March 28, 2018