Data Visualisation using ggplot

Learning Objectives

  • Understand that data may need to be formatted correctly for plotting.
  • Perform basic filtering and treatment for missing data
  • Be able to reshape data into the correct format (long vs wide) for making your desired plot.
  • Be able to perform basic calculations on your data for making plots.

Combining data wrangling and plotting

Dealing with NAs

Let’s now try something a little more advanced (and a little more realistic). We will use a messier example dataset that is available in R.

Make sure you have the following packages loaded in your R session:

library('tidyr')
library('dplyr')
library('ggplot2')

Now let’s bring up a dataset of sleep patterns across various animal species.

data('msleep')

image

This dataset will be more like your data. There are multiple categorical character columns with grouping metadata, as well as several numerical data columns with measurement data and a lot of NAs/missing data that will affect the behaviour of ggplot.

Let’s start simple by seeing whether diet affects sleep. Let’s plot the vore column vs sleep_cycle.

ggplot(data = msleep)+
  geom_point(mapping = aes(x = vore, y = sleep_cycle))+
  theme_classic()

image

And we get a warning message:

Warning message:
Removed 51 rows containing missing values or values outside the scale range (`geom_point()`). 

These are likely NA for sleep_cycle. You can check whether this is the case by filtering for NAs in sleep_cycle:

msleep %>%
  filter(is.na(sleep_cycle))

What prints out in the console is # A tibble: 51 × 11 - so there are 51 rows (i.e. animals in this dataset) that have missing sleep_cycle data.

Generally, missing y axis values will not be plotted by ggplot, while missing x axis values will.

You can also see that there is a point “NA” in the x axes. You can check what animal that is by running:

msleep %>%
  filter(is.na(vore) & !is.na(sleep_cycle))

Looks like researchers don’t know the diet of the musk shrew or not. If you happen to know it, you can update the entry and have it included. Alternatively, remove it. We will remove missing entrie in this case. Let’s do that by filtering again.

I also felt like the range for herbivore and omnivore were too high to be informative. There’s a lot of data points there so why don’t we split it further?

I still want to retain info about the diet, but lets also split further on order. I also want to make the theme less ugly so lets copy what we did previously and add a common premade theme and rotate the axes.

msleep %>% 
  drop_na(order, vore) %>% 
    ggplot()+
    geom_point(mapping = aes(x = order, y = sleep_cycle, colour = vore))+
    theme_classic()+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

image

This is better, but the colours are all over the place! I want to more easily distinguish the diets of these animals better. This is where facet_wrap and facet_grid are your best friends and why R is better than Excel/Graphpad for plotting. Faceting lets your group by variables in a very powerful way. Let’s try it:

msleep %>% 
  drop_na(order, vore) %>% 
  ggplot()+
    geom_point(mapping = aes(x = order, y = sleep_cycle, colour = vore))+
    facet_wrap(vars(vore), nrow = 1, scales = 'free_x')+
    theme_classic()+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

image

This is now looking even more informative but I have 2 gripes with it:

  1. We should order it in ascending order of sleep
  2. The number of variables per group is different across the different diets and this unevenness is ugly

In order to fix the first problem, this is where factors come back in. It turns out we’ll need an extra package forcats. This is loaded automatically if you load tidyverse at the very beginning instead of tidyr, dplyr and ggplot2 separately. However, this will also load a lot of packages you may not necessarily need, so I usually prefer to just load required packages.

install.packages("forcats")
library('forcats')

## Here we are sending the filtered and reordered data directly into the ggplot function

msleep %>%
  drop_na(order, vore, sleep_cycle) %>%
  mutate(order_new = fct_reorder(order, sleep_cycle)) %>%
    ggplot()+
        geom_point(mapping = aes(x = order_new, y = sleep_cycle, colour = vore))+
        facet_grid(cols = vars(vore), scales = 'free_x', space = 'free_x')+
        theme_classic()+
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

image

Bonus: Add a boxplot!

You can combine plots on top of each other if you’d like. The geoms get stacked on top of each other in the order you write them.

msleep %>%
  drop_na(order, vore, sleep_cycle) %>%
  mutate(order_new = fct_reorder(order, sleep_cycle)) %>%
    ggplot()+
        geom_boxplot(mapping = aes(x = order_new, y = sleep_cycle, fill = vore), colour = 'black')+
        geom_point(mapping = aes(x = order_new, y = sleep_cycle), colour = 'black')+
        facet_grid(cols = vars(vore), scales = 'free_x', space = 'free_x')+
        theme_classic()+
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

image

Extra calculations

Let’s say you want to check how many animals you actually have sleep_cycle data for for each diet (i.e. no NAs!)

You need to calculate the “counts”. Let’s see an example of how you’d ask the LLM:

image

Let’s try it!

msleep %>%
  # keep only rows with non-missing sleep_cycle
  filter(!is.na(sleep_cycle), !is.na(vore)) %>%
  # count animals by diet type
  count(vore) %>%
  # plot
  ggplot(aes(x = vore, y = n)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Count of Animals by Diet Type",
    x = "Diet (vore)",
    y = "Number of Animals"
  ) +
  theme_minimal()

image

Exercise: Can you compare the brain to body weight ratio of each animal between diets?

This is the plot we are looking for:

image

For anything where you have a different number of samples plotted, I HIGHLY recommend using geom_text to plot the number of samples, such as in this boxplot. However, this is a bit more advanced. Hint: You need to use a different data argument for the geom_text… A counts data perhaps…?

Reshaping data

The animal sleep dataset was actually in mostly the correct format for plotting. What if we have one that is completely wrong?

Lets investigate the world population by country longitudinal data set:

data('world_bank_pop', package = 'tidyr')

image

Super long format:

image

Most sensible format for plotting longitudinal data:

image

Exercise: Can you plot the population over time for Australia?

image

Hint: You can try to be very specific with LLMs:

image


Further Reading =======================

We have really only touched the surface of basic calculations and data wrangling.

Imagine if for the world population data, someone could give you a separate table which is basically just an index of country codes, country names and continent they’re on. Example:

country country_name region
AUS Australia Oceania
CAN Canada North America
GER Germany Europe

If you then merge or join this table (matching the country code) to the world_bank_pop table, you could further group the data and start analysing regional variations in population trends for example.

Similarly, if some countries had missing data and you got an extra table with more countries in them, you could join that table based on the column names as well.

Then there are also so many other types of data that could be represented as tables which we haven’t even touched on - heatmaps are super common to plot and there are dedicated heatmap plotting packages (e.g. ComplexHeatmap or you could go and use geom_tile in ggplot2.

This site is also a good resource that lists both common data transformations and also how you could do those transformations using different syntaxes.

There are also many cheat sheets available online for various methods of reshaping data with some handy illustrations of what the reshaping does.

Choosing a “dialect”

As a beginner, just do whatever makes the most sense to you, but try to keep it consistent or you’ll find learning extra difficult. If your goal is to just replace Excel/Graphpad prism, it really doesn’t matter what you use so do whatever you like best.

However, if you plan to start working on analysis of more complex data, the type that can’t really be opened on Excel:

  • If you plan to do a lot of Xenium/Phenocycler analysis, then the data.table syntax would be strongly recommended over the tidy/dplyr for data wrangling and calculations. You can read the vignettes here.

  • If your goal is to learn single cell RNAseq analysis, it would be prudent to stick to tidyverse because the syntax is more similar.