
R Lover but !a Programmer

Chuck Powell

Upgrading to R 3.6.0 on a Mac – May 14, 2019

Every time there is a new major update from The R Foundation (like the recent 3.6.0 release in April), I’m always happy to see the continuing progress and the combination of new features and bug fixes. But I also dread the upgrade, because it means I have to address the issue of what to do about the burgeoning number of packages (libraries) I have installed.
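One common way to handle it, sketched below with base R only (the file name is my own illustrative choice, not anything the post prescribes):

```r
# Before the upgrade: save the names of currently installed packages
pkgs <- installed.packages()[, "Package"]
saveRDS(pkgs, "installed_packages.rds")

# After installing the new R: reinstall whatever is missing
old_pkgs <- readRDS("installed_packages.rds")
missing <- setdiff(old_pkgs, installed.packages()[, "Package"])
# install.packages(missing)  # uncomment to actually reinstall
```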

Read More

ANCOVA example – April 18, 2019

I recently had the need to run an ANCOVA, not a task I perform all that often, and my first time using R to do so (I’ve done it in SPSS and SAS before). Having a reasonable theoretical idea of what I had to do, I set off in search of good documentation on how to accomplish it in R. I was quite disappointed with what I found after a fair amount of time scouring the web. I found “answers” in places like Stack Overflow and Cross Validated, as well as various free and open notes from academic courses. Many were dated, a few off topic, a few outright incorrect if you ask me, but nothing I could just pick up and use.
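For reference, the basic mechanics in base R look like this (a minimal sketch using the built-in mtcars data, which is not the data set the post itself analyzes):

```r
# ANCOVA: does transmission type (am) affect mpg after adjusting for
# the covariate weight (wt)? mtcars ships with base R.
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))

# Covariate first, then the factor, so the factor is tested after
# adjusting for wt (aov uses sequential Type I sums of squares)
fit <- aov(mpg ~ wt + am, data = mtcars)
summary(fit)
```

Note that the order of terms matters here; for Type II tests you would reach for something like car::Anova instead.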

Read More

CHAID v ranger v xgboost -- a comparison – July 27, 2018

In an earlier post, I focused on an in-depth visit with CHAID (Chi-square automatic interaction detection). Quoting myself, I said “As the name implies it is fundamentally based on the venerable Chi-square test – and while not the most powerful (in terms of detecting the smallest possible differences) or the fastest, it really is easy to manage and more importantly to tell the story after using it”. In this post I’ll spend a little time comparing CHAID with a random forest algorithm from the ranger library and with a gradient boosting algorithm via the xgboost library. I’ll use the exact same data set for all three so we can draw some easy comparisons about their speed and their accuracy.
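To give a flavor of what fitting one of the challengers involves, here is the general shape of a ranger fit (a sketch on the built-in iris data, not the data set the post actually compares on; note that CHAID itself installs from R-Forge rather than CRAN):

```r
library(ranger)

set.seed(42)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))

# A random forest (500 trees by default) on the training split
rf <- ranger(Species ~ ., data = iris[train_idx, ])

# Out-of-sample accuracy on the held-out 30%
preds <- predict(rf, data = iris[-train_idx, ])$predictions
accuracy <- mean(preds == iris$Species[-train_idx])
accuracy
```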

Read More

CHAID and caret – a good combo – June 6, 2018

In an earlier post I focused on an in-depth visit with CHAID (Chi-square automatic interaction detection). There are lots of tools that can help you predict an outcome, or classify, but CHAID is especially good at helping you explain to any audience how the model arrives at its prediction or classification. It’s also incredibly robust from a statistical perspective, making almost no assumptions about your data regarding distribution or normality. In this post I’ll focus on marrying CHAID with the awesome caret package to make our predicting easier and hopefully more accurate. Although not strictly necessary, you’re probably best served by reading the original post first.
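The general caret pattern is sketched below, with rpart standing in for CHAID purely for illustration (the actual CHAID-with-caret wiring is what the post itself covers):

```r
library(caret)

# 5-fold cross-validation, handled for us by caret
ctrl <- trainControl(method = "cv", number = 5)

set.seed(1)
fit <- train(Species ~ ., data = iris,
             method = "rpart",   # stand-in method for this sketch
             trControl = ctrl)
fit$results                      # accuracy per tuning parameter value
```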

Read More

Slopegraphs and R -- A pleasant diversion – May 26, 2018

I try to at least scan the R-bloggers feed every day. Not every article is of interest to me, but I often have one of two different reactions to at least one article. Sometimes it is an “ah ha” moment, because the article is right on point for a problem I have now or have had in the past, and the article provides a (better) solution. Other times my reaction is more of an “oh yeah”, because it is something I have been meaning to investigate, or something I once knew, but the article brings a different perspective to it.

Read More

CHAID and R -- When you need explanation – May 15, 2018

A modern data scientist using R has access to an almost bewildering number of tools, libraries and algorithms to analyze the data. In my next two posts I’m going to focus on an in-depth visit with CHAID (Chi-square automatic interaction detection). The title should give you a hint for why I think CHAID is a good “tool” for your analytical toolbox. There are lots of tools that can help you predict or classify, but CHAID is especially good at helping you explain to any audience how the model arrives at its prediction or classification. It’s also incredibly robust from a statistical perspective, making almost no assumptions about your data regarding distribution or normality. I’ll try to elaborate on that as we work the example.

Read More

Announcing CGPfunctions 0.3 – April 20, 2018

As I continue to learn and grow in using R, I have been trying to develop the habit of being more formal in documenting and maintaining the various functions and pieces of code I write. It’s not that I think they are major inventions, but they are useful and I like having them stored in one place where I can keep track of them. So I started building them as a package, and even publishing them to CRAN, for any of you who might find them of interest as well.

Read More

Writing better R functions part four – April 17, 2018

In my last three posts I have been working at automating a process, one that I am likely to repeat many times, by turning it into a proper R function. In my last post I overcame some real performance problems, combined two sub-functions into one, and generally had a workable piece of code. In this final post in the series I’ll accomplish two more important tasks. I’ll once again refactor the code to streamline it, and I’ll give the user a lot more flexibility in how they input their request.

Read More

Writing better R functions part three – April 13, 2018

In my last post I worked on two functions that took pairs of variables from a dataset and produced some nice, useful ggplot plots from them. We started with the simplest case, plotting counts of how two variables cross-tabulate. Then we worked our way up to being able to automate the process of plotting lots of pairings of variables from the same dataframe. We added a feature to change the plot type and tested for general functionality. To be honest, I thought I was in great shape until I started trying the function on a much larger dataset. Performance was terrible; I had made at least a couple of mistakes. Today I’ll fix those problems and combine our two functions into one.

Read More

Writing better R functions part two – April 10, 2018

In my last post I started to build two functions that took pairs of variables from a dataset and produced some nice, useful ggplot plots from them. We started with the simplest case, plotting counts of how two variables cross-tabulate. Then we worked our way up to being able to automate the process of plotting lots of pairings of variables from the same dataframe. Today we’ll improve our functions and add a feature. For now, we’re going to continue to make use of the built-in dataframe known as mtcars. We’re doing that to make sure that whatever we do in this post works from a known starting point, so we can compare and contrast.
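The kind of helper the series builds toward looks roughly like this (my own minimal sketch, not the series’ actual code):

```r
library(ggplot2)

# Plot counts of how two variables in a dataframe cross-tabulate.
# Column names come in as strings so it's easy to loop over pairings.
plot_counts <- function(df, xvar, fillvar) {
  ggplot(df, aes(x = factor(.data[[xvar]]),
                 fill = factor(.data[[fillvar]]))) +
    geom_bar(position = "dodge") +
    labs(x = xvar, fill = fillvar)
}

# The built-in mtcars dataframe as a known starting point
p <- plot_counts(mtcars, "cyl", "am")
```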

Read More

Writing better R functions part one – April 6, 2018

One of the nicest things about working with R is that with very little effort you can customize and automate activities to produce the output you want – just the way you want it. You can contrast that with more monolithic packages that may allow you to do a bit of scripting, but for the most part the price of a GUI, or of packaging everything together, is that you lose the ability to have things just your way. Since everything in R is pretty much a function already, you may as well invest a little time and energy in making functions… your way, and to exactly your tastes and needs. This post is not meant to be an exhaustive or complete treatment of writing a function. For that you probably want a book, or at least a chapter like the one Hadley Wickham has in Advanced R. This post will focus on a very practical, and hopefully useful, single example.

Read More

Fun with M&M's – April 3, 2018

In this post we’re going to explore the Chi-squared goodness of fit test using M&M’s as our subject material. From there we’ll take a look at simultaneous confidence intervals, a.k.a. multiple comparisons. On the R side of things we’ll make use of some old friends like ggplot2 and dplyr, but we’ll also make use of two packages that were new to me: scimp and ggimage. We’ll also make heavy use of the kable function (from the knitr package) to make our output tables look nicer.
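The core test itself is one line of base R. A sketch with made-up counts and the default uniform null (not the actual M&M color proportions the post investigates):

```r
# Observed color counts from a hypothetical bag of M&M's
observed <- c(blue = 25, orange = 22, green = 19,
              yellow = 14, red = 12, brown = 8)

# Goodness of fit against equal proportions (chisq.test's default null)
result <- chisq.test(observed)
result$statistic   # the Chi-squared value, with 6 - 1 = 5 df
result$p.value
```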

Read More

Writing functions for dplyr and ggplot2 – April 2, 2018

In my last two posts I have been writing about the task of using R to “drive” MS Excel. The first post focused on just the basic mechanics of getting my colleague what she needed. The second post picked up with some ugly, inefficient code and made it better using lapply and a for loop, just good old-fashioned automation (the thing that computers excel at). Today I’ll take it another step and show how to produce the same graphs in R using ggplot2, as well as how to write some simple functions to make your programming life easier.

Read More

Using R to ‘drive’ MS Excel – March 27, 2018

I have until recently made it a habit to draw a clear distinction between using R for data analysis and Microsoft Excel for other office productivity tasks. I know there are people who use Excel to process data and even (gasp) to teach statistics with it. But I’m a bit snobbish that way, and to date all my efforts have been in getting data out of Excel and into R, either by simple methods like read.csv or, if the task was more meaty, by using Hadley Wickham’s marvelous readxl package.
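For the record, the readxl route is pleasantly short (a sketch using the example workbook that ships with readxl, not my colleague’s actual file):

```r
library(readxl)

# readxl bundles example workbooks; readxl_example() returns a path
path <- readxl_example("datasets.xlsx")
excel_sheets(path)                       # list the sheets in the workbook
iris_tbl <- read_excel(path, sheet = "iris")
head(iris_tbl)
```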

Read More

Using functions to be more efficient – March 28, 2018

In yesterday’s post I focused on the task of using R to “drive” MS Excel. I deliberately ended the post with a fully functioning (pun intended) but very ugly set of code. Why “ugly”? Well, because the last set of code wound up repeating 4 lines of code 12 times!
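The classic fix, and where the post heads, is to wrap the repeated lines in a function and call it as many times as needed. A minimal sketch of the idea (not the post’s actual Excel code):

```r
# Instead of repeating the same lines for every column, write the
# steps once as a function...
summarize_column <- function(df, column) {
  x <- df[[column]]
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# ...then apply it to as many columns as needed
results <- lapply(names(mtcars), function(col) summarize_column(mtcars, col))
names(results) <- names(mtcars)
results$mpg
```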

Read More

Oneway ANOVA Explanation and Example in R – September 18, 2017

The Oneway ANOVA is a statistical technique that allows us to compare mean differences of one outcome (dependent) variable across two or more groups (levels) of one independent variable (factor). If there are only two levels (e.g. Male/Female) of the independent (predictor) variable the results are analogous to Student’s t test.
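In base R that comparison is nearly a one-liner (a quick sketch using the built-in PlantGrowth data, not necessarily the post’s example):

```r
# PlantGrowth: plant weights under a control and two treatment groups
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)     # overall F test across the three groups

# If the overall test is significant, follow up with pairwise comparisons
TukeyHSD(fit)
```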

Read More