I Be Cav Home

R Lover but !a Programmer

Chuck Powell

Writing better R functions part four – April 17, 2018

Tagged as: [ R  ggplot2  functions  dplyr  for-loop  ]

In my last four posts I have been working at automating a process, that I am likely to repeat many times, by turning it into a proper R function. In my last post I overcame some real performance problems, combined two sub-functions into one and generally had a workable piece of code. In the final post in this series today I’ll accomplish two more important tasks. I’ll once again refactor the code to streamline it, and I’ll give the user a lot more flexibility on how they input their request.

I’m going to output all the plots in a smaller size for the benefit of you the readers. I’m doing that via RMarkdown and it won’t happen automatically for you if you download and use the code. I’ll initially be using, fig.width=4.5, fig.height=2 and later slighly larger than that

Background and catch-up

We originally started with a simple task – using dplyr and ggplot2 in the console. Take two of the mtcars variables, in this case am and cyl, and conduct a cross tabulation and then plot it. Since it’s the sort of thing I’m likely to do often seemed like a good candidate to write a function for. Then I decided that I might very well want the function to allow me to input more than two variables at a time. It would be very tedious to execute the command 25 times if I had 5 dependent variables and 5 independent variables and needed to fully cross them.

So the code grew and you can see below I address those two use cases in the code. I have very deliberately left comments in the code so that you trace the process. Roughly speaking if you follow the code here’s what’s happening.

  1. Error checking. Do we have the right packages? The right number of arguments passed in? Does the first object even exist and is it a data frame?
  2. Which type of plot does the user want?
  3. Some if logic that says in effect if both the parameters passed in as bare names are columns (variables) in the dataframe, make the plot and leave the function or return.
  4. If the user has given us two column numbers, or vectors e.g. c(8:15) of column numbers as input then run them through some for-loops to completely cross them and print the plots.
  5. If you make it that far then exit the function and return the list of arguments we were passed in to begin with.
PlotMeX <- function(dataframe, xwhich, ywhich, plottype = "side"){
  # error checking
  if (!require(ggplot2)) {
    stop("Can't continue can't load ggplot2")
  }
  theme_set(theme_bw())
  if (!require(dplyr)) {
    stop("Can't continue can't load dplyr")
  }
  if (length(match.call()) <= 3) {
    stop("Not enough arguments passed... requires a dataframe, plus at least two variables")
  }
  argList <-  as.list(match.call()[-1])
  if (!exists(deparse(substitute(dataframe)))) {
    stop("The first object in your list does not exist. It should be a dataframe")
  }
  if (!is(dataframe, "data.frame")) {
    stop("The first name you passed does not appear to be a data frame")
  }
  switch(plottype,
         side =  list(geom_bar(position="dodge", stat="identity"),
                      ylab("Count")) -> whichbar,
         stack = list(geom_bar(stat="identity"),
                      ylab("Count")) -> whichbar,
         percent = list(geom_bar(stat="identity", position="fill"),
                        ylab("Percent")) -> whichbar
  )
# If both variables are found in the dataframe immediately print the plot
  if (deparse(substitute(xwhich)) %in% names(dataframe) & deparse(substitute(ywhich)) %in% names(dataframe)) {
    aaa <- enquo(xwhich)
    bbb <- enquo(ywhich)
    xname <- deparse(substitute(xwhich))
    yname <- deparse(substitute(ywhich))
    dfname <- deparse(substitute(dataframe))
    dataframe %>%
      filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
      mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
      group_by(!! aaa,!! bbb) %>%
      count() -> tempdf
    tempdf %>%
      ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
      whichbar +
      ggtitle(bquote("Crosstabs dataset: "*.(dfname)*" variables "*.(xname)~"by "*.(yname))) -> p
    return(p)
  }
# If the user has given us integers indicating the column numbers rather than bare variable names
  # we need to build a list of what is to be plotted and then do the plotting
  # Build two lists
  indvars<-list() # create empty list to add to
  depvars<-list() # create empty list to add to
  totalcombos <- 1 # keep track of where we are
  message("Creating the variable pairings...")
  for (j in seq_along(xwhich)) {
    for (k in seq_along(ywhich)) {
      depvarsbare <- as.name(colnames(dataframe[xwhich[[j]]]))
      indvarsbare <- as.name(colnames(dataframe[ywhich[[k]]]))
      cat("Pairing #", totalcombos, " ", as.name(colnames(dataframe[xwhich[[j]]])),
          " with ", as.name(colnames(dataframe[ywhich[[k]]])), "\n", sep = "")
      aaa <- enquo(depvarsbare)
      bbb <- enquo(indvarsbare)
      xname <- deparse(substitute(depvarsbare))
      yname <- deparse(substitute(indvarsbare))
      dfname <- deparse(substitute(dataframe))
      dataframe %>%
        filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
        mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
        group_by(!! aaa,!! bbb) %>%
        count() -> tempdf
      tempdf %>%
        ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
        whichbar +
        ggtitle(bquote("Crosstabs dataset: "*.(dfname)*" variables "*.(xname)~"by "*.(yname))) -> p
      print(p)
      totalcombos <- totalcombos +1
    }
  }
  return(argList)
}
PlotMeX(mtcars, am, cyl)

# exact same thing using column numbers rather than names
PlotMeX(mtcars, 9, 2)
## Creating the variable pairings...

## Pairing #1 am with cyl

## $dataframe
## mtcars
## 
## $xwhich
## [1] 9
## 
## $ywhich
## [1] 2
# a more complex example and using a different plot type
PlotMeX(mtcars, c(9, 2), c(8,10), "percent")
## Creating the variable pairings...

## Pairing #1 am with vs

## Pairing #2 am with gear

## Pairing #3 cyl with vs

## Pairing #4 cyl with gear

## $dataframe
## mtcars
## 
## $xwhich
## c(9, 2)
## 
## $ywhich
## c(8, 10)
## 
## $plottype
## [1] "percent"

If you’d like a detailed review of the sub components it’s covered in earlier posts. In particular, if you’re not familiar with the “tricks” of working with dplyr and ggplot2 inside of functions you may want to look at this post if you’re not familiar with enquo or the !! notation.

As you can see from the output, it works fine as far as we can tell on the mtcars data set for the two cases we have identified so far. It’s not pretty and there is clearly some repetition but hey it is good enough for my needs at this point.

Except it isn’t…

Complexity and simplicity and user choices

Turns out I needed more. In my line of work it’s not uncommon to have a small number of dependent variables and a much larger number of independent variables that you want to examine. Although it’s a very contrived example let’s keep using the mtcars dataset (I promise I’ll use something more fun in a bit). Let’s say that I’m interested in plotting the relationship between cyl (the number of cylinders) and all of vs, am, gear & carb. That will be a total of four plots. My current function would allow me to call PlotMeX(mtcars, 2, c(8:11), "percent") which is all well and good. But, I know my working preferences and I know at some point I will think PlotMeX(mtcars, cyl, c(8:11), "percent"). So lets see what happens.

PlotMeX(mtcars, cyl, c(8:11), "percent")
# Quitting from lines 32-116 (betterfunctions4git.Rmd) 
# Error in PlotMeX(mtcars, cyl, c(8:11), "percent") : 
#  object 'cyl' not found
# Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> PlotMeX
# Execution halted

Ugh! Not only does it fail but it fails ugly. What the heck does it mean object cyl not found? Everyone knows that cyl is a column in mtcars!

One approach is simply make it clear to the user that she can choose either one way of calling the function (with bare variables) or the other by using integers to proxy for column numbers. But they can’t mix and match like that. That’s a very classic response to this issue. But I actually see it as an opportunity to make the function more flexible and useful.

That means we have four major possibilities not two. So we’re going to have to figure out which of those possible paths we’re on. That’s what I set out to do.

Refactoring again

It didn’t take me long to realize that as I moved from two possibilities to four possibilities if I wasn’t careful I’d wind up repeating myself (by cutting and pasting the same code) over and over again. In particular the central piece of code in the function, the one that does the real work is likely to be the same not just in two cases or four cases but literally in almost any case. That is the very snippet of code we developed several posts ago.

    dataframe %>%
      filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
      mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
      group_by(!! aaa,!! bbb) %>%
      count() -> tempdf
    tempdf %>%
      ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
      whichbar +
      ggtitle(bquote("Crosstabs dataset: "*.(dfname)*" variables "*.(xname)~"by "*.(yname))) -> p

It takes our dataframe (dataframe) and two variables (aaa and bbb) and does magic to produce the plot (p). tempdf is just a temporary dataframe (well tibble actually) to hold our cross-tabulation. dfname, xname, and yname are really just character versions of the names that we used to make nice labels.

Now, I already told you as I rewrote the code on my first pass I simply did just cut and paste this chunk a few times. It’s not hard at all in RStudio. Could have stayed that way. But here’s the real rub. A week from now when someone asks me to change the Plot title or make some other change I’ll have to do the same thing at least four times. That’s annoying. And prone to error.

In my mind I always new the smart thing to do was to make this a function within the main function and just call it when needed. And that what I did. Here’s what that looks like.

PlotMagic <-  function(dataframe,aaa,bbb,whichbar,dfname,xname,yname) {
  dataframe %>%
    filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
    mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
    group_by(!! aaa,!! bbb) %>%
    count() -> tempdf
  tempdf %>%
    ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
      whichbar +
      ggtitle(bquote("Crosstabs dataset = "*.(dfname)*" and variables = "*.(xname)~"by "*.(yname))) -> p
  print(p)
  }

Now no matter what path I’m on I know what information I have to gather and just pass it along to do the work.

So before I do the big reveal and lay out all the code let’s address some other issues as well.

Back in this post we introduced a switch to allow the user to choose a plot type.

  switch(plottype,
       side =  list(geom_bar(position="dodge", stat="identity"),
                    ylab("Count")) -> whichbar,
       stack = list(geom_bar(stat="identity"),
                    ylab("Count")) -> whichbar,
       percent = list(geom_bar(stat="identity", position="fill"),
                    ylab("Percent")) -> whichbar
    )

It works reliably, but there is a strong possibility that along the way users will make a typo or some other mistake. So let’s add a fail-safe default to the switch. We do that by adding a final unnamed entry, it could be anything, but in our case we’ll set it so that if all else fails the user will get a side by side plot.

  switch(plottype,
         side =  list(geom_bar(position="dodge", stat="identity"),
                      ylab("Count")) -> whichbar,
         stack = list(geom_bar(stat="identity"),
                      ylab("Count")) -> whichbar,
         percent = list(geom_bar(stat="identity", position="fill"),
                        ylab("Percent")) -> whichbar,
         list(geom_bar(position="dodge", stat="identity"),
              ylab("Count")) -> whichbar
  )

And as long as we are rewriting we may as well be defensive in our list building. As I mentioned last time for performance reasons it is a best practice to specify list lengths in advance rather than constantly copying the list and appending a new entry. We have several entries scattered around that look like indvars<-list() # create empty list to add to that we’re going to change to the form indvars <- vector("list", length = length(ywhich)). While it’s unlikely I’ll personally ever feed this function a very long vector better safe than sorry.

The logical tree

On to the main event. We need some logic that generally follows this pattern.

Take the user’s input and parse it into one of four known possibilities and then take appropriate action. The possibilities are:

  1. If both are bare variables and found in the dataframe immediately print the plot
  2. At least one of the variables is bare and found in the dataframe (variable x) and the other is one or more column numbers (variable y)
  3. At least one of the variables is bare and found in the dataframe (variable y) and the other is one or more column numbers (variable x)
  4. Both the variables were passed to us as numbers. Could be one or more numbers for either variable.

We already have logic in our current version for #1 above and #4 happens by default.

# If both variables are found in the dataframe immediately print the plot
  if (deparse(substitute(xwhich)) %in% names(dataframe) & deparse(substitute(ywhich)) %in% names(dataframe))
# If the user has given us integers indicating the column numbers rather than bare variable names 

My first plan was simply to write a series of if statements to lay out these conditions. I like to avoid nested if statements whenever possible since personally when I go back to look at them later I usually have a hard time following what becomes relatively complex logic.

I couldn’t manage it. I’d be thrilled is someone comes along and shows me how I should have, but personally I had to resort to some amount of nesting.

Tip: Rstudio has a great feature that allows you to fold the code while you’re editing it. Small triangles over by the line numbers. They were immensely helpful! Try them.

I’m not going to try and describe the if structure in detail. Instead I am going to show you the code and let you examine it. For your convenience I have tried to place as many comments within the code as possible to help you parse the logic yourself. I acknowledge in advance if I really tried I’m sure there is other duplication or inefficiency here I could eliminate, but it’s good enough for now. Take a look at the code and we’ll see how it runs in a minute.

PlotMeX <- function(dataframe, xwhich, ywhich, plottype = "side"){
# error checking
  if (!require(ggplot2)) {
    stop("Can't continue can't load ggplot2")
  }
  theme_set(theme_bw())
  if (!require(dplyr)) {
    stop("Can't continue can't load dplyr")
  }
  if (length(match.call()) <= 3) {
    stop("Not enough arguments passed... requires a dataframe, plus at least two variables")
  }
  argList <-  as.list(match.call()[-1])
  if (!exists(deparse(substitute(dataframe)))) {
    stop("The first object in your list does not exist. It should be a dataframe")
  }
  if (!is(dataframe, "data.frame")) {
    stop("The first name you passed does not appear to be a data frame")
  }
# process plottype logic -- default is side anything mispelled or not listed is also side
  switch(plottype,
         side =  list(geom_bar(position="dodge", stat="identity"),
                      ylab("Count")) -> whichbar,
         stack = list(geom_bar(stat="identity"),
                      ylab("Count")) -> whichbar,
         percent = list(geom_bar(stat="identity", position="fill"),
                        ylab("Percent")) -> whichbar,
         list(geom_bar(position="dodge", stat="identity"),
              ylab("Count")) -> whichbar
  )

  PlotMagic <-  function(dataframe,aaa,bbb,whichbar,dfname,xname,yname) {
     dataframe %>%
        filter(!is.na(!! aaa), !is.na(!! bbb))  %>%
        mutate(!!quo_name(aaa) := factor(!!aaa), !!quo_name(bbb) := factor(!!bbb)) %>%
        group_by(!! aaa,!! bbb) %>%
        count() -> tempdf
     tempdf %>%
        ggplot(aes_(fill=aaa, y=~n, x=bbb)) +
        whichbar +
        ggtitle(bquote("Crosstabs dataset = "*.(dfname)*" and variables = "*.(xname)~"by "*.(yname))) -> p
     print(p)
  }

# If both are bare variables and found in the dataframe immediately print the plot
  if (deparse(substitute(xwhich)) %in% names(dataframe) & deparse(substitute(ywhich)) %in% names(dataframe)) { # both are names in the dataframe
    aaa <- enquo(xwhich)
    bbb <- enquo(ywhich)
    xname <- deparse(substitute(xwhich))
    yname <- deparse(substitute(ywhich))
    dfname <- deparse(substitute(dataframe))
    PlotMagic(dataframe,aaa,bbb,whichbar,dfname,xname,yname)
    return(message(paste("Plotted dataset", argList$dataframe, "variables", argList$xwhich, "by", argList$ywhich)))
  } else { # is at least one in the dataframe?
# Is at least one of them a bare variable in the dataframe
    if (deparse(substitute(xwhich)) %in% names(dataframe)) { # xwhich is in the dataframe
      aaa <- enquo(xwhich)
      if (class(try(eval(ywhich))) %in% c("integer","numeric")) { # ywhich is column numbers
        indvars <- vector("list", length = length(ywhich))
        totalcombos <- 1 # keep track of where we are
        xname <- deparse(substitute(xwhich))
        dfname <- deparse(substitute(dataframe))
        message("Creating the variable pairings from dataframe ", dfname)
        for (k in seq_along(ywhich)) { #for loop
          indvarsbare <- as.name(colnames(dataframe[ywhich[[k]]]))
          cat("Plot #", totalcombos, " ", xname,
              " with ", as.name(colnames(dataframe[ywhich[[k]]])), "\n", sep = "")
          bbb <- enquo(indvarsbare)
          yname <- deparse(substitute(indvarsbare))
          PlotMagic(dataframe,aaa,bbb,whichbar,dfname,xname,yname)
          totalcombos <- totalcombos +1
        } # end of for loop
          return(message("Plotting complete"))
        } else { # ywhich is NOT suitable
        stop("Sorry I don't understand your ywhich variable(s)")
        } #

      } else { # xwhich wasn't try ywhich
        if (deparse(substitute(ywhich)) %in% names(dataframe)) { # yes ywhich is
          bbb <- enquo(ywhich)
          if (class(try(eval(xwhich))) %in% c("integer","numeric")) { # then xwhich a suitable number
            # Build one list two ways
            depvars <- vector("list", length = length(xwhich))
            totalcombos <- 1 # keep track of where we are
            yname <- deparse(substitute(ywhich))
            dfname <- deparse(substitute(dataframe))
            message("Creating the variable pairings from dataframe ", dfname)
            for (j in seq_along(xwhich)) {
              depvarsbare <- as.name(colnames(dataframe[xwhich[[j]]]))
              cat("Plot #", totalcombos, " ", as.name(colnames(dataframe[xwhich[[j]]])),
                  " with ", yname, "\n", sep = "")
              aaa <- enquo(depvarsbare)
              xname <- deparse(substitute(depvarsbare))
              PlotMagic(dataframe,aaa,bbb,whichbar,dfname,xname,yname)
              totalcombos <- totalcombos +1
            } #end of for loop
              return(message("Plotting complete"))
          } else { # xwhich is NOT suitable
            stop("Sorry I don't understand your xwhich variable(s)")
          } #end of else because xwhich not suitable
        } #end of if
     }
  }

# If both variables are numeric print the plot(s)
  if (class(try(eval(xwhich))) %in% c("integer","numeric") & class(try(eval(ywhich))) %in% c("integer","numeric")) {
     indvars <- vector("list", length = length(ywhich))
     depvars <- vector("list", length = length(xwhich))
     dfname <- deparse(substitute(dataframe))
     totalcombos <- 1 # keep track of where we are
     message("Creating the variable pairings from dataframe ", dfname)
     for (j in seq_along(xwhich)) {
        for (k in seq_along(ywhich)) {
           depvarsbare <- as.name(colnames(dataframe[xwhich[[j]]]))
           indvarsbare <- as.name(colnames(dataframe[ywhich[[k]]]))
           cat("Plot #", totalcombos, " ", as.name(colnames(dataframe[xwhich[[j]]])),
               " with ", as.name(colnames(dataframe[ywhich[[k]]])), "\n", sep = "")
           aaa <- enquo(depvarsbare)
           bbb <- enquo(indvarsbare)
           xname <- deparse(substitute(depvarsbare))
           yname <- deparse(substitute(indvarsbare))
                      PlotMagic(dataframe,aaa,bbb,whichbar,dfname,xname,yname)
           totalcombos <- totalcombos +1
        } # end of inner for loop
     }  # end of outer for loop
        return(message("Plotting complete"))
  } # end of if case where all are numeric
} # end of function

Not mentioned above but clear when we run the code is that I’ve also tried to rationalize the messaging you see in the console and the labeling of the plots themselves.

Are you happy I’m all done?

For most of this series of posts I have focused on using the mtcars built-in dataset. It’s handy, convenient, and it’s installed by default. To actually show the function in action I’m going to use a different dataset. Something that should allow you to better see the value of making plots of the crosstabs rather than simple tables. It also has the happy property of being much much larger than mtcars so we can see if there are lags in performance due to the number of rows.

Rather than provide my own or make anyone work too hard I selected that happy dataset that comes bundled with several R packages including productplots and GGally. From the description:

The data is a small sample of variables related to happiness from the general social survey (GSS). The GSS is a yearly cross-sectional survey of Americans, run from 1976. We combine data for 25 years to yield 51,020 observations, and of the over 5,000 variables, we select nine related to happiness.

We’ll be focusing on the non numeric variables. I certainly can’t claim to do a detailed analysis here but at least the questions will be fun I hope…

PackageList <- .packages(all.available = TRUE)
if ("productplots" %in% PackageList) {
  data("happy",package = "productplots")
} else {
  stop("Can't load productplots can't use the following examples")
}
# who's happier by gender
PlotMeX(happy,happy,sex)
## Plotted dataset happy variables happy by sex

# same thing using column numbers and a stacked bar
PlotMeX(happy,2,5,"stack")
## Creating the variable pairings from dataframe happy

## Plot #1 happy with sex

## Plotting complete

# happiness by a variety of possible factors as a percent
PlotMeX(happy, 2, c(5:9), plottype = "percent")
## Creating the variable pairings from dataframe happy

## Plot #1 happy with sex

## Plot #2 happy with marital

## Plot #3 happy with degree

## Plot #4 happy with finrela

## Plot #5 happy with health

## Plotting complete

# turn the numbers around and change them up basically just showing all
# the permutations
PlotMeX(happy, c(2,5), 9, plottype = "side")
## Creating the variable pairings from dataframe happy

## Plot #1 happy with health

## Plot #2 sex with health

## Plotting complete

PlotMeX(happy, c(2,5), c(6:9), plottype = "percent")
## Creating the variable pairings from dataframe happy

## Plot #1 happy with marital

## Plot #2 happy with degree

## Plot #3 happy with finrela

## Plot #4 happy with health

## Plot #5 sex with marital

## Plot #6 sex with degree

## Plot #7 sex with finrela

## Plot #8 sex with health

## Plotting complete

PlotMeX(happy, happy, c(6,7,9), plottype = "percent")
## Creating the variable pairings from dataframe happy

## Plot #1 happy with marital

## Plot #2 happy with degree

## Plot #3 happy with health

## Plotting complete

PlotMeX(happy, c(6,7,9), happy, plottype = "percent")
## Creating the variable pairings from dataframe happy

## Plot #1 marital with happy

## Plot #2 degree with happy

## Plot #3 health with happy

## Plotting complete

It’s probably the case that no function is ever truly “done” but this one is good enough for now. This has become yet another very long post so I’m going to end here.

I hope you’ve found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)

License

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Written on April 17, 2018