Data Visualisation using ggplot

Learning Objectives

Understand that data may need to be formatted correctly for plotting.

Perform basic filtering and treatment for missing data

Be able to reshape data into the correct format (long vs wide) for making your desired plot.

Be able to perform basic calculations on your data for making plots.

Combining data wrangling and plotting

Dealing with NAs

Let’s now try something a little more advanced (and a little more realistic). We will use a messier example dataset that is available in R.

Make sure you have the following packages loaded in your R session:

library('tidyr')
library('dplyr')
library('ggplot2')

Now let’s bring up a dataset of sleep patterns across various animal species.

data('msleep')

This dataset will be more like your data. There are multiple categorical character columns with grouping metadata, as well as several numerical data columns with measurement data and a lot of NAs/missing data that will affect the behaviour of ggplot.

Let’s start simple by seeing whether diet affects sleep. Let’s plot the vore column vs sleep_cycle.

ggplot(data = msleep)+
  geom_point(mapping = aes(x = vore, y = sleep_cycle))+
  theme_classic()

And we get a warning message:

Warning message:
Removed 51 rows containing missing values or values outside the scale range (`geom_point()`). 

These are likely NA for sleep_cycle. You can check whether this is the case by filtering for NAs in sleep_cycle:

msleep %>%
  filter(is.na(sleep_cycle))

What prints out in the console is # A tibble: 51 × 11 - so there are 51 rows (i.e. animals in this dataset) that have missing sleep_cycle data.

Generally, missing y axis values will not be plotted by ggplot, while missing x axis values will.

You can also see that there is a point “NA” in the x axes. You can check what animal that is by running:

msleep %>%
  filter(is.na(vore) & !is.na(sleep_cycle))

Looks like researchers don’t know the diet of the musk shrew or not. If you happen to know it, you can update the entry and have it included. Alternatively, remove it. We will remove missing entrie in this case. Let’s do that by filtering again.

I also felt like the range for herbivore and omnivore were too high to be informative. There’s a lot of data points there so why don’t we split it further?

I still want to retain info about the diet, but lets also split further on order. I also want to make the theme less ugly so lets copy what we did previously and add a common premade theme and rotate the axes.

msleep %>% 
  drop_na(order, vore) %>% 
    ggplot()+
    geom_point(mapping = aes(x = order, y = sleep_cycle, colour = vore))+
    theme_classic()+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

This is better, but the colours are all over the place! I want to more easily distinguish the diets of these animals better. This is where facet_wrap and facet_grid are your best friends and why R is better than Excel/Graphpad for plotting. Faceting lets your group by variables in a very powerful way. Let’s try it:

msleep %>% 
  drop_na(order, vore) %>% 
  ggplot()+
    geom_point(mapping = aes(x = order, y = sleep_cycle, colour = vore))+
    facet_wrap(vars(vore), nrow = 1, scales = 'free_x')+
    theme_classic()+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

This is now looking even more informative but I have 2 gripes with it:

We should order it in ascending order of sleep
The number of variables per group is different across the different diets and this unevenness is ugly

In order to fix the first problem, this is where factors come back in. It turns out we’ll need an extra package forcats. This is loaded automatically if you load tidyverse at the very beginning instead of tidyr, dplyr and ggplot2 separately. However, this will also load a lot of packages you may not necessarily need, so I usually prefer to just load required packages.

install.packages("forcats")
library('forcats')

## Here we are sending the filtered and reordered data directly into the ggplot function

msleep %>%
  drop_na(order, vore, sleep_cycle) %>%
  mutate(order_new = fct_reorder(order, sleep_cycle)) %>%
    ggplot()+
        geom_point(mapping = aes(x = order_new, y = sleep_cycle, colour = vore))+
        facet_grid(cols = vars(vore), scales = 'free_x', space = 'free_x')+
        theme_classic()+
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Bonus: Add a boxplot!

You can combine plots on top of each other if you’d like. The geoms get stacked on top of each other in the order you write them.

msleep %>%
  drop_na(order, vore, sleep_cycle) %>%
  mutate(order_new = fct_reorder(order, sleep_cycle)) %>%
    ggplot()+
        geom_boxplot(mapping = aes(x = order_new, y = sleep_cycle, fill = vore), colour = 'black')+
        geom_point(mapping = aes(x = order_new, y = sleep_cycle), colour = 'black')+
        facet_grid(cols = vars(vore), scales = 'free_x', space = 'free_x')+
        theme_classic()+
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Extra calculations

Let’s say you want to check how many animals you actually have sleep_cycle data for for each diet (i.e. no NAs!)

You need to calculate the “counts”. Let’s see an example of how you’d ask the LLM:

Let’s try it!

msleep %>%
  # keep only rows with non-missing sleep_cycle
  filter(!is.na(sleep_cycle), !is.na(vore)) %>%
  # count animals by diet type
  count(vore) %>%
  # plot
  ggplot(aes(x = vore, y = n)) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Count of Animals by Diet Type",
    x = "Diet (vore)",
    y = "Number of Animals"
  ) +
  theme_minimal()

Exercise: Can you compare the brain to body weight ratio of each animal between diets?

This is the plot we are looking for:

For anything where you have a different number of samples plotted, I HIGHLY recommend using geom_text to plot the number of samples, such as in this boxplot. However, this is a bit more advanced. Hint: You need to use a different data argument for the geom_text… A counts data perhaps…?

Reshaping data

The animal sleep dataset was actually in mostly the correct format for plotting. What if we have one that is completely wrong?

Lets investigate the world population by country longitudinal data set:

data('world_bank_pop', package = 'tidyr')

Super long format:

Most sensible format for plotting longitudinal data:

Exercise: Can you plot the population over time for Australia?

Hint: You can try to be very specific with LLMs:

country	country_name	region
AUS	Australia	Oceania
CAN	Canada	North America
GER	Germany	Europe

Choosing a “dialect”

As a beginner, just do whatever makes the most sense to you, but try to keep it consistent or you’ll find learning extra difficult. If your goal is to just replace Excel/Graphpad prism, it really doesn’t matter what you use so do whatever you like best.

However, if you plan to start working on analysis of more complex data, the type that can’t really be opened on Excel:

If you plan to do a lot of Xenium/Phenocycler analysis, then the data.table syntax would be strongly recommended over the tidy/dplyr for data wrangling and calculations. You can read the vignettes here.
If your goal is to learn single cell RNAseq analysis, it would be prudent to stick to tidyverse because the syntax is more similar.