I Be Cav Home

R Lover but !a Programmer

Chuck Powell

Slopegraphs and R -- A pleasant diversion -- May 26, 2018

Tagged as: [ R  ggplot2  dplyr  ggrepel  slopegraph  forcats  kable  reshape2  ]

I try to at least scan the R-bloggers feed everyday. Not every article is of interest to me, but I often have one of two different reactions to at least one article. Sometimes it is an “ah ha” moment because the article is right on point for a problem I have now or have had in the past and the article provides a (better) solution. Other times my reaction is more of an “oh yeah”, because it is something I have been meaning to investigate, or something I once knew, but the article brings a different perspective to it.

The second case happened to me this week. I’ve been aware of slopegraphs and bumpcharts for quite some time, and I certainly am aware of Tufte’s work. As an amateur military historian I’ve always loved, for example, his poster depicting Napoleon’s Russian Campaign. So when I saw the article from Murtaza Haider titled “Edward Tufte’s Slopegraphs and political fortunes in Ontario” I just had to take a peek and revisit the topic.

The article does a good job of looking at slopegraphs in both R (via plotrix) and Stata, even providing the code to do the work. My challenge was that even though I’m adequate at plotting in base R, I much prefer using ggplot2 wherever and whenever possible. My memory was that I had seen another article on the related topic of a bumpchart on R-bloggers in the not too distant past. A little sleuthing turned up this earlier article from Dominik Koch who wrote some code to compare national performance at the Winter Olympics, “Bump Chart - Track performance over time”.

Finally, I wound up at this Github repository for a project called “Edward Tufte-Inspired Slopegraphs” from Thomas J. Leeper who has been building code to make slopegraphs using both base plotting functions and ggplot2.

My post today will draw a little bit from all their work and hopefully provide some useful samples for others to draw on if they share some of my quirks about data layout and a preference for ggplot2 versus base plot. I’m going to focus almost exclusively on slopegraphs, although much of the work could be extended to bumpcharts as well.

Setup and library loading

We’re going to make occasional use of dplyr to manipulate the data, extensive use of ggplot2 to do the plotting and ggrepel to solve one specific labeling problem. We’ll load them and I am suppressing the message from dplyr about namespace overrides.

require(dplyr)
require(ggplot2)
require(ggrepel)
require(kableExtra)

Politics in Ontario

The original post is about plotting the data from some polling results in Ontario. For the reader’s convenience I’ve made the data available via a structure command. We have data about two different polling dates, for 5 political parties, and the measured variable is percent of people supporting expressed as x.x (i.e. already multiplied by 100).

data <- structure(list( Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), 
                                        .Label = c("11-May-18", "18-May-18"), 
                                        class = "factor"), 
                        Party = structure(c(5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L), 
                                         .Label = c("Green", "Liberal", "NDP", "Others", "PC"), 
                                         class = "factor"), 
                        Pct = c(42.3, 28.4, 22.1, 5.4, 1.8, 41.9, 29.3, 22.3, 5, 1.4)), 
                  class = "data.frame", 
                  row.names = c(NA, -10L))

str(data)
## 'data.frame':    10 obs. of  3 variables:
##  $ Date : Factor w/ 2 levels "11-May-18","18-May-18": 1 1 1 1 1 2 2 2 2 2
##  $ Party: Factor w/ 5 levels "Green","Liberal",..: 5 3 2 1 4 5 3 2 1 4
##  $ Pct  : num  42.3 28.4 22.1 5.4 1.8 41.9 29.3 22.3 5 1.4
head(data)
##        Date   Party  Pct
## 1 11-May-18      PC 42.3
## 2 11-May-18     NDP 28.4
## 3 11-May-18 Liberal 22.1
## 4 11-May-18   Green  5.4
## 5 11-May-18  Others  1.8
## 6 18-May-18      PC 41.9

Let’s just take the data as we have it and feed it to ggplot in a nice simple fashion and see what we get with very little effort.

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  #  Labelling as desired
  labs(
    title = "Voter's stated preferences for June 7 elections in Ontario",
    subtitle = "(Mainstreet Research)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

The nice thing about ggplot is once you get used to the syntax it becomes very “readable”. We’ve identified our dataset, the x & y variables and our grouping variable. Lines too big? An adjustment to size = 2 does it. Don’t like colors? Pull the color = Party clause.

So we’re already pretty close to what we need. Things are scaled properly and the basic labeling of titles etc. is accomplished. Our biggest “problem” is that ggplot has been a little too helpful and adding some things we’d like to remove to give it a more “Tuftesque” look. So what we’ll do in the next few steps is add lines of code – but they are mainly designed to remove unwanted elements. This is in contrast to a base plot where we have to write the code to add elements.

So lets:

  • Move the x axis labels to the top with scale_x_discrete(position = "top")
  • Change to a nice clean black and white theme theme_bw()
  • Not display any legend(s) theme(legend.position = "none")
  • Remove the default border from our plot theme(panel.border = element_blank())
ggplot(data = data, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  # move the x axis labels up top
  scale_x_discrete(position = "top") +
  theme_bw() +
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none") +
  # Remove the panel border
  theme(panel.border     = element_blank()) +
  #  Labelling as desired
  labs(
    title = "Voter's stated preferences for June 7 elections in Ontario",
    subtitle = "(Mainstreet Research)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Nice progress! Continuing to remove things that can be considered “clutter” we add some additional lines that all end in element_blank() and are invoked to remove default plot items such as the plot grid, the y axcis text, etc..

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  # move the x axis labels up top
  scale_x_discrete(position = "top") +
  theme_bw() +
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none") +
  # Remove the panel border
  theme(panel.border     = element_blank()) +
  # Remove just about everything from the y axis
  theme(axis.title.y     = element_blank()) +
  theme(axis.text.y      = element_blank()) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.grid.minor.y = element_blank()) +
  # Remove a few things from the x axis and increase font size
  theme(axis.title.x     = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(axis.text.x.top      = element_text(size=12)) +
  # Remove x & y tick marks
  theme(axis.ticks       = element_blank()) +
  #  Labelling as desired
  labs(
    title = "Voter's stated preferences for June 7 elections in Ontario",
    subtitle = "(Mainstreet Research)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Very nice! We’re almost there! The “almost” is because now that we have removed both the legend and all scales and tick marks we no longer know who is who, and what the numbers are! Plus, I’m a little unhappy with the way the titles are formatted, so we’ll play with that. Later, I’ll get fancy but for now let’s just add some simple text labels on the left and right to show the party name and their percentage. The code geom_text(aes(label = Party)) will place the party name right on top of the points that anchor either end of the line. If we make that geom_text(aes(label = paste0(Party, " - ", Pct, "%"))) then we’ll get labels that have both the party and the percent all neatly formatted, but still right on top of the points that anchor the ends of the line. hjust controls horizontal justification so if we change it to geom_text(aes(label = paste0(Party, " - ", Pct, "%")), hjust = 1.35) both sets of labels will slide to the left which is exactly what we want for the May 11 labels but not the May 18 labels. If we feed hjust a negative number they’ll go the other way. So what we’ll do is filter the data using the filter function from dplyr and place the left hand labels differently than the right hand labels. While we’re at it we’ll make it bold face font and a little larger…

ggplot(data = data, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  geom_text(data = data %>% filter(Date == "11-May-18"), 
            aes(label = paste0(Party, " - ", Pct, "%")) , 
            hjust = 1.35, 
            fontface = "bold", 
            size = 4) +
  geom_text(data = data %>% filter(Date == "18-May-18"), 
            aes(label = paste0(Party, " - ", Pct, "%")) , 
            hjust = -.35, 
            fontface = "bold", 
            size = 4) +
  # move the x axis labels up top
  scale_x_discrete(position = "top") +
  theme_bw() +
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none") +
  # Remove the panel border
  theme(panel.border     = element_blank()) +
  # Remove just about everything from the y axis
  theme(axis.title.y     = element_blank()) +
  theme(axis.text.y      = element_blank()) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.grid.minor.y = element_blank()) +
  # Remove a few things from the x axis and increase font size
  theme(axis.title.x     = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(axis.text.x.top      = element_text(size=12)) +
  # Remove x & y tick marks
  theme(axis.ticks       = element_blank()) +
  # Format title & subtitle
  theme(plot.title       = element_text(size=14, face = "bold", hjust = 0.5)) +
  theme(plot.subtitle    = element_text(hjust = 0.5)) +
  #  Labelling as desired
  labs(
    title = "Voter's stated preferences for June 7 elections in Ontario",
    subtitle = "(Mainstreet Research)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Eureka! Not perfect yet but definitely looking good.

Adding complexity

I’m feeling pretty good about the solution so far but there are three things I’d like to make better.

  1. How well will this solution work when we have more than two time periods? Need to make sure it generalizes to a more complex case.
  2. As Murtaza Haider notes in his post we’ll have issues if the data points are identical or very close together. Our very neat little labels will overlap each other. In his post I believe he mentions that he manually moved them in some cases. Let’s try and fix that.
  3. Oh my, that’s a lot of code to keep cutting and pasting, can we simplify?

To test #1 and #2 I have “invented”” a new dataset called moredata. It is fictional it’s labelled May 25th but today is actually May 24th. But I created it to add a third polling date and to make sure that we had a chance to test what happens when we have two identical datapoints on the same day. Notice that on May 25th the polling numbers for the Liberals and the NDP are identical at 26.8%.

moredata <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), 
                                            .Label = c("11-May-18", "18-May-18", "25-May-18"), 
                                            class = "factor"), 
                           Party = structure(c(5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L), 
                                             .Label = c("Green", "Liberal", "NDP", "Others", "PC"), 
                                             class = "factor"), 
                           Pct = c(42.3, 28.4, 22.1, 5.4, 1.8, 41.9, 29.3, 22.3, 5, 1.4, 41.9, 26.8, 26.8, 5, 1.4)), 
                      class = "data.frame", 
                      row.names = c(NA, -15L))
tail(moredata)
##         Date   Party  Pct
## 10 18-May-18  Others  1.4
## 11 25-May-18      PC 41.9
## 12 25-May-18     NDP 26.8
## 13 25-May-18 Liberal 26.8
## 14 25-May-18   Green  5.0
## 15 25-May-18  Others  1.4

You’ll notice at the beginning of this post I loaded the ggrepel library. ggrepel works with ggplot2 to repel things that overlap, in this case our geom_text labels. The invocation is geom_text_repel and it is very similar to geom_text but allows us to deconflict the overlaps. We’ll use hjust = "left" and hjust = "right" to control justifying the labels. We’ll use a fixed nudge left and right nudge_x = -.45 and nudge_x = .5 to move the labels left and right off the plotted data points and we will explicitly tell geom_text_repel to only move the labels vertically to avoid overlap with direction = "y". Everything else remains the same.

ggplot(data = moredata, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 2) +
  geom_point(aes(color = Party, alpha = 1), size = 4) +
  geom_text_repel(data = moredata %>% filter(Date == "11-May-18"), 
                  aes(label = paste0(Party, " - ", Pct, "%")) , 
                  hjust = "left", 
                  fontface = "bold", 
                  size = 4, 
                  nudge_x = -.45, 
                  direction = "y") +
  geom_text_repel(data = moredata %>% filter(Date == "25-May-18"), 
                  aes(label = paste0(Party, " - ", Pct, "%")) , 
                  hjust = "right", 
                  fontface = "bold", 
                  size = 4, 
                  nudge_x = .5, 
                  direction = "y") +
  # move the x axis labels up top
  scale_x_discrete(position = "top") +
  theme_bw() +
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none") +
  # Remove the panel border
  theme(panel.border     = element_blank()) +
  # Remove just about everything from the y axis
  theme(axis.title.y     = element_blank()) +
  theme(axis.text.y      = element_blank()) +
  theme(panel.grid.major.y = element_blank()) +
  theme(panel.grid.minor.y = element_blank()) +
  # Remove a few things from the x axis and increase font size
  theme(axis.title.x     = element_blank()) +
  theme(panel.grid.major.x = element_blank()) +
  theme(axis.text.x.top      = element_text(size=12)) +
  # Remove x & y tick marks
  theme(axis.ticks       = element_blank()) +
  # Format title & subtitle
  theme(plot.title       = element_text(size=14, face = "bold", hjust = 0.5)) +
  theme(plot.subtitle    = element_text(hjust = 0.5)) +
  #  Labelling as desired
  labs(
    title = "Bogus Data",
    subtitle = "(Chuck Powell)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Very nice! We have confirmed that our solution works for more than two dates without any additional changes and we have found a solution to the label overlap issue. In a little while we’ll talk about labeling the data points in the center (if we want to).

Before we move on let’s make our life a little simpler. While the output plot is good it’s a lot of code to produce one graph. Let’s see if we can simplify…

Since ggplot2 objects are just regular R objects, you can put them in a list. This means you can apply all of R’s great functional programming tools. For example, if you wanted to add different geoms to the same base plot, you could put them in a list and use lapply().

But for now let’s at least take all the invariant lines of code and put them in a list. Then when we go to plot we can just invoke the list and remain confident we get the right formatting. For now let’s name this list something quaint and obvious like MySpecial.

MySpecial <- list(  
  # move the x axis labels up top
  scale_x_discrete(position = "top"),
  theme_bw(),
  # Format tweaks
  # Remove the legend
  theme(legend.position = "none"),
  # Remove the panel border
  theme(panel.border     = element_blank()),
  # Remove just about everything from the y axis
  theme(axis.title.y     = element_blank()),
  theme(axis.text.y      = element_blank()),
  theme(panel.grid.major.y = element_blank()),
  theme(panel.grid.minor.y = element_blank()),
  # Remove a few things from the x axis and increase font size
  theme(axis.title.x     = element_blank()),
  theme(panel.grid.major.x = element_blank()),
  theme(axis.text.x.top      = element_text(size=12)),
  # Remove x & y tick marks
  theme(axis.ticks       = element_blank()),
  # Format title & subtitle
  theme(plot.title       = element_text(size=14, face = "bold", hjust = 0.5)),
  theme(plot.subtitle    = element_text(hjust = 0.5))
)
summary(MySpecial)
##       Length Class                 Mode       
##  [1,] 17     ScaleDiscretePosition environment
##  [2,] 57     theme                 list       
##  [3,]  1     theme                 list       
##  [4,]  1     theme                 list       
##  [5,]  1     theme                 list       
##  [6,]  1     theme                 list       
##  [7,]  1     theme                 list       
##  [8,]  1     theme                 list       
##  [9,]  1     theme                 list       
## [10,]  1     theme                 list       
## [11,]  1     theme                 list       
## [12,]  1     theme                 list       
## [13,]  1     theme                 list       
## [14,]  1     theme                 list

MySpecial is actually an incredibly complex structure so I used the summary function. What’s important to us is that in the future all we need to do is include it in the ggplot command and magic happens. Perhaps another day I’ll make it a proper function but for now I can change little things like line size or titles and labels without worrying about the rest. So here it is with some little things changed.

ggplot(data = moredata, aes(x = Date, y = Pct, group = Party)) +
  geom_line(aes(color = Party, alpha = 1), size = 1) +
  geom_point(aes(color = Party, alpha = 1), size = 3) +
  geom_text_repel(data = moredata %>% filter(Date == "11-May-18"), 
                  aes(label = paste0(Party, " : ", Pct, "%")) , 
                  hjust = "left", 
                  fontface = "bold", 
                  size = 4, 
                  nudge_x = -.45, 
                  direction = "y") +
  geom_text_repel(data = moredata %>% filter(Date == "25-May-18"), 
                  aes(label = paste0(Party, " : ", Pct, "%")) , 
                  hjust = "right", 
                  fontface = "bold", 
                  size = 4, 
                  nudge_x = .5, 
                  direction = "y") +
  MySpecial +
  labs(
    title = "Bogus Data",
    subtitle = "(Chuck Powell)",
    caption = "https://www.mainstreetresearch.ca/gap-between-ndp-and-pcs-narrows-while-liberals-hold-steady/"
  )

Even more complex

Feeling good about the solution so far I decided to press on to a much more complex problem. Thomas J. Leeper has a nice plot of Tufte’s Cancer survival slopegraph N.B. that the original Tufte is not accurate on the vertical scale. Look at Prostate and Thyroid for example since visually I would argue they should cross to reflect the data.

Let’s grab the data as laid out by Tufte.

cancer <- structure(list(Year.5 = c(99, 96, 95, 89, 86, 85, 84, 82, 71, 69, 63, 62, 62, 58, 57, 55, 43, 32, 30, 24, 15, 14, 8, 4), 
                         Year.10 = c(95, 96, 94, 87, 78, 80, 83, 76, 64, 57, 55, 54, 55, 46, 46, 49, 32, 29, 13, 19, 11, 8, 6, 3), 
                         Year.15 = c(87, 94, 91, 84, 71, 74,  81, 70, 63, 46, 52, 50, 54, 38, 38, 50, 30, 28, 7, 19, 7, 8, 6, 3), 
                         Year.20 = c(81, 95, 88, 83, 75, 67, 79, 68, 60, 38, 49, 47, 52, 34, 33, 50, 26, 26, 5, 15, 6, 5, 8, 3)), 
                    class = "data.frame", 
                    row.names = c("Prostate", "Thyroid", "Testis", "Melanomas", "Breast", "Hodgkin's", "Uterus", "Urinary", "Cervix", "Larynx", "Rectum", "Kidney", "Colon", "Non-Hodgkin's", "Oral", "Ovary", "Leukemia", "Brain", "Multiple myeloma", "Stomach", "Lung", "Esophagus", "Liver", "Pancreas"))

str(cancer)
## 'data.frame':    24 obs. of  4 variables:
##  $ Year.5 : num  99 96 95 89 86 85 84 82 71 69 ...
##  $ Year.10: num  95 96 94 87 78 80 83 76 64 57 ...
##  $ Year.15: num  87 94 91 84 71 74 81 70 63 46 ...
##  $ Year.20: num  81 95 88 83 75 67 79 68 60 38 ...
kable(head(cancer,10)) %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Year.5 Year.10 Year.15 Year.20
Prostate 99 95 87 81
Thyroid 96 96 94 95
Testis 95 94 91 88
Melanomas 89 87 84 83
Breast 86 78 71 75
Hodgkin’s 85 80 74 67
Uterus 84 83 81 79
Urinary 82 76 70 68
Cervix 71 64 63 60
Larynx 69 57 46 38

There, we have it in a neat data frame but not organized as we need it. Not unusual, and an opportunity to use some other tools from broom and reshape2. Let’s do the following:

  1. Let’s transpose the data with t
  2. Let’s use broom::fix_data_frame to get valid column names and convert rownames to a proper column all in one function. Right now the types of cancer are nothing but rownames.
  3. Use reshape2::melt to take our transposed dataframe and convert it to long format so we can send it off to ggplot. Along the way we’ll rename the resulting dataframe newcancer with columns named Year, Type and Survival.
# stepping through for demonstration purposes
t(cancer) # returns a matrix
##         Prostate Thyroid Testis Melanomas Breast Hodgkin's Uterus Urinary
## Year.5        99      96     95        89     86        85     84      82
## Year.10       95      96     94        87     78        80     83      76
## Year.15       87      94     91        84     71        74     81      70
## Year.20       81      95     88        83     75        67     79      68
##         Cervix Larynx Rectum Kidney Colon Non-Hodgkin's Oral Ovary
## Year.5      71     69     63     62    62            58   57    55
## Year.10     64     57     55     54    55            46   46    49
## Year.15     63     46     52     50    54            38   38    50
## Year.20     60     38     49     47    52            34   33    50
##         Leukemia Brain Multiple myeloma Stomach Lung Esophagus Liver
## Year.5        43    32               30      24   15        14     8
## Year.10       32    29               13      19   11         8     6
## Year.15       30    28                7      19    7         8     6
## Year.20       26    26                5      15    6         5     8
##         Pancreas
## Year.5         4
## Year.10        3
## Year.15        3
## Year.20        3
broom::fix_data_frame(
   t(cancer), 
   newcol = "Year") # make it a dataframe with Year as a proper column
##      Year Prostate Thyroid Testis Melanomas Breast Hodgkin.s Uterus
## 1  Year.5       99      96     95        89     86        85     84
## 2 Year.10       95      96     94        87     78        80     83
## 3 Year.15       87      94     91        84     71        74     81
## 4 Year.20       81      95     88        83     75        67     79
##   Urinary Cervix Larynx Rectum Kidney Colon Non.Hodgkin.s Oral Ovary
## 1      82     71     69     63     62    62            58   57    55
## 2      76     64     57     55     54    55            46   46    49
## 3      70     63     46     52     50    54            38   38    50
## 4      68     60     38     49     47    52            34   33    50
##   Leukemia Brain Multiple.myeloma Stomach Lung Esophagus Liver Pancreas
## 1       43    32               30      24   15        14     8        4
## 2       32    29               13      19   11         8     6        3
## 3       30    28                7      19    7         8     6        3
## 4       26    26                5      15    6         5     8        3
reshape2::melt(
   broom::fix_data_frame(
      t(cancer), 
      newcol = "Year"), 
   id="Year", variable.name="Type", value.name = "Survival") # melt it to long form
##       Year             Type Survival
## 1   Year.5         Prostate       99
## 2  Year.10         Prostate       95
## 3  Year.15         Prostate       87
## 4  Year.20         Prostate       81
## 5   Year.5          Thyroid       96
## 6  Year.10          Thyroid       96
## 7  Year.15          Thyroid       94
## 8  Year.20          Thyroid       95
## 9   Year.5           Testis       95
## 10 Year.10           Testis       94
## 11 Year.15           Testis       91
## 12 Year.20           Testis       88
## 13  Year.5        Melanomas       89
## 14 Year.10        Melanomas       87
## 15 Year.15        Melanomas       84
## 16 Year.20        Melanomas       83
## 17  Year.5           Breast       86
## 18 Year.10           Breast       78
## 19 Year.15           Breast       71
## 20 Year.20           Breast       75
## 21  Year.5        Hodgkin.s       85
## 22 Year.10        Hodgkin.s       80
## 23 Year.15        Hodgkin.s       74
## 24 Year.20        Hodgkin.s       67
## 25  Year.5           Uterus       84
## 26 Year.10           Uterus       83
## 27 Year.15           Uterus       81
## 28 Year.20           Uterus       79
## 29  Year.5          Urinary       82
## 30 Year.10          Urinary       76
## 31 Year.15          Urinary       70
## 32 Year.20          Urinary       68
## 33  Year.5           Cervix       71
## 34 Year.10           Cervix       64
## 35 Year.15           Cervix       63
## 36 Year.20           Cervix       60
## 37  Year.5           Larynx       69
## 38 Year.10           Larynx       57
## 39 Year.15           Larynx       46
## 40 Year.20           Larynx       38
## 41  Year.5           Rectum       63
## 42 Year.10           Rectum       55
## 43 Year.15           Rectum       52
## 44 Year.20           Rectum       49
## 45  Year.5           Kidney       62
## 46 Year.10           Kidney       54
## 47 Year.15           Kidney       50
## 48 Year.20           Kidney       47
## 49  Year.5            Colon       62
## 50 Year.10            Colon       55
## 51 Year.15            Colon       54
## 52 Year.20            Colon       52
## 53  Year.5    Non.Hodgkin.s       58
## 54 Year.10    Non.Hodgkin.s       46
## 55 Year.15    Non.Hodgkin.s       38
## 56 Year.20    Non.Hodgkin.s       34
## 57  Year.5             Oral       57
## 58 Year.10             Oral       46
## 59 Year.15             Oral       38
## 60 Year.20             Oral       33
## 61  Year.5            Ovary       55
## 62 Year.10            Ovary       49
## 63 Year.15            Ovary       50
## 64 Year.20            Ovary       50
## 65  Year.5         Leukemia       43
## 66 Year.10         Leukemia       32
## 67 Year.15         Leukemia       30
## 68 Year.20         Leukemia       26
## 69  Year.5            Brain       32
## 70 Year.10            Brain       29
## 71 Year.15            Brain       28
## 72 Year.20            Brain       26
## 73  Year.5 Multiple.myeloma       30
## 74 Year.10 Multiple.myeloma       13
## 75 Year.15 Multiple.myeloma        7
## 76 Year.20 Multiple.myeloma        5
## 77  Year.5          Stomach       24
## 78 Year.10          Stomach       19
## 79 Year.15          Stomach       19
## 80 Year.20          Stomach       15
## 81  Year.5             Lung       15
## 82 Year.10             Lung       11
## 83 Year.15             Lung        7
## 84 Year.20             Lung        6
## 85  Year.5        Esophagus       14
## 86 Year.10        Esophagus        8
## 87 Year.15        Esophagus        8
## 88 Year.20        Esophagus        5
## 89  Year.5            Liver        8
## 90 Year.10            Liver        6
## 91 Year.15            Liver        6
## 92 Year.20            Liver        8
## 93  Year.5         Pancreas        4
## 94 Year.10         Pancreas        3
## 95 Year.15         Pancreas        3
## 96 Year.20         Pancreas        3
# all those steps in one long line saved to a new dataframe
newcancer <- reshape2::melt(broom::fix_data_frame(t(cancer), newcol = "Year"), id="Year", variable.name="Type", value.name = "Survival")

Now we have whipped the data into the shape we need it. 96 rows with the three columns we want to plot, Year, Type, and Survival. If you look at the data though, you’ll notice two small faults. First, Year is not a factor. The plot will work but have an annoying limitation. Since “Year.5” is a character string it will be ordered after all the other years. We could fix that on the fly within our ggplot call but I find it cleaner and more understandable if I take care of that first. I’ll use the factor function from base R to accomplish that and while I’m at it make the values nicer looking. Second in three cases R changed cancer type names because they couldn’t be column names in a dataframe. I’ll use forcats::fct_recode to make them look better.

newcancer$Year <- factor(newcancer$Year, 
                         levels = c("Year.5", "Year.10", "Year.15", "Year.20"), 
                         labels = c("5 Year","10 Year","15 Year","20 Year"), 
                         ordered = TRUE)
newcancer$Type <- forcats::fct_recode(newcancer$Type, 
                                      "Hodgkin's" = "Hodgkin.s", 
                                      "Non-Hodgkin's" = "Non.Hodgkin.s", 
                                      "Multiple myeloma" = "Multiple.myeloma")
head(newcancer)
##      Year     Type Survival
## 1  5 Year Prostate       99
## 2 10 Year Prostate       95
## 3 15 Year Prostate       87
## 4 20 Year Prostate       81
## 5  5 Year  Thyroid       96
## 6 10 Year  Thyroid       96

Now that we have the data the way we want it we can make our slopegraph. Some of the necessary changes are obvious x = Year, y = Survival and group = Type for example. Since there are a lot of plotted lines I’ve reduced the weight or size of the individual lines. We no longer want to plot the big round points, we’re going to substitute in the actual numbers, so that line gets commented out. The left and right labels require no change and geom_text_repel will keep them from overlapping which is almost inevitable given the data. To put the actual survival numbers on the plot we’ll turn to geom_label. It’s like geom_text only it puts a label box around the text. We’ll choose a smallish size, minimize the amount of padding, and make the border of the box invisible. The end result is what we want. It overlays on top of the lines we’ve already plotted and the invisible padding gives us just enough room.

ggplot(data = newcancer, aes(x = Year, y = Survival, group = Type)) +
  geom_line(aes(color = Type, alpha = 1), size = 1) +
#  geom_point(aes(color = Type, alpha = .1), size = 4) +
  geom_text_repel(data = newcancer %>% filter(Year == "5 Year"), 
                  aes(label = Type) , 
                  hjust = "left", 
                  fontface = "bold", 
                  size = 3, 
                  nudge_x = -.45, 
                  direction = "y") +
  geom_text_repel(data = newcancer %>% filter(Year == "20 Year"), 
                  aes(label = Type) , 
                  hjust = "right", 
                  fontface = "bold", 
                  size = 3, 
                  nudge_x = .5, 
                  direction = "y") +
  geom_label(aes(label = Survival), 
             size = 2.5, 
             label.padding = unit(0.05, "lines"), 
             label.size = 0.0) +
  MySpecial +
  labs(
    title = "Estimates of Percent Survival Rates",
    subtitle = "Based on: Edward Tufte, Beautiful Evidence, 174, 176.",
    caption = "https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk"
  )

Done for now

I hope you’ve found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)

License

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Written on May 25, 2018