4 Data wrangling: beyond basics

4.1 Outline / cheat sheet

  • Manipulating date variables:
    • lubridate package: Use date format functions (e.g. ymd) to specify date data formats and extract particular date-related data
  • Joining multiple data frames:
    • _join series: Combine two data frames by one or more shared columns. Specify with kind of join(full_, left_, etc.) to determine which observations are kept/dropped.
  • Reshaping (lengthening/widening) data:
    • pivot_longer: Take values that are spread across multiple columns and ‘lengthen’ the data by stacking those columns on top of one another
    • pivot_wider: Take ‘long’ values stacked into one column and ‘widen’ the data by spreading them out
  • Working with text:
    • separate: Divide the contents of one column into multiple based on particular symbols
    • stringr package: Identify patterns using ‘regular expressions’ and apply a series of commands (e.g. remove, extract, detect) to manipulate text data
    • paste: combine contents from multiple columns (or external text)

4.2 Manipulating date variables

Remember the weather data we loaded in at the start. We have weather data from the Palmer Station in Antarctica from 1989 - 2019, where data are available for each day

str(weather)
## 'data.frame':    10674 obs. of  24 variables:
##  $ Date                           : chr  "1989-04-01" "1989-04-02" "1989-04-03" "1989-04-04" ...
##  $ Temperature.High..C.           : num  2.8 1.1 -0.6 1.1 -0.6 2.5 -1.4 -0.8 -1 -1.5 ...
##  $ Temperature.Low..C.            : num  -1 -2.7 -3.5 -4.4 -2.9 -3.1 -3.2 -4.5 -4 -3.8 ...
##  $ Temperature.Average..C.        : num  0.9 -0.8 -2.05 -1.65 -1.75 -0.3 -2.3 -2.65 -2.5 -2.65 ...
##  $ Sea.Surface.Temperature..C.    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Sea.Ice..WMO.Code.             : chr  "" "" "" "" ...
##  $ Pressure.High..mbar.           : num  1004 998 998 1002 1002 ...
##  $ Pressure.Low..mbar.            : num  998 995 995 998 997 ...
##  $ Pressure.Average..mbar.        : num  1001 996 997 1000 1000 ...
##  $ Windspeed.Peak                 : int  18 24 13 14 14 15 16 39 27 16 ...
##  $ Windspeed.5.Sec.Peak           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Windspeed.2.Min.Peak           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Windspeed.Average              : num  4 9 8 6 4 4 8 14 11 6 ...
##  $ Wind.Peak.Direction..True...º. : int  110 30 60 210 230 180 40 120 50 60 ...
##  $ Wind.Peak.Direction..º.        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Wind.5.Sec.Peak.Direction...º. : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Wind.2.Min.Peak.Direction...º. : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Wind.Direction.Prevailing      : chr  "SW" "NE" "NE" "NW" ...
##  $ Rainfall..mm.                  : num  0 0 0 0 0 0 1 0 0 -998 ...
##  $ Precipitation.Snow..cm.        : num  0 0 0 0 0 0 -998 0 0 -998 ...
##  $ Depth.at.Snowstake..cm.        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Cloud.Cover                    : int  4 2 5 5 10 2 9 5 0 10 ...
##  $ Data.flag...Temperature.Average: chr  "C" "C" "C" "C" ...
##  $ Data.flag...Pressure.Average   : chr  "C" "C" "C" "C" ...

However, the structure of the data above indicate that the first variable, Date, is a ‘character’. This means that R is not reading these dates as numbers, and so it may be hard to manipulate them any further (e.g. extracting certain years).

head(weather$Date)
## [1] "1989-04-01" "1989-04-02" "1989-04-03" "1989-04-04" "1989-04-05"
## [6] "1989-04-06"
class(weather$Date)
## [1] "character"

We want to use the lubridate package (loaded in when we load in tidyverse) to convert these variables into dates and create a column for the year. This package is handy because it has very intuitive function names and argument structures. For instance, you can convert a character type into a date type by using a function with letters representing the date format. You can see all of these formats by specifically calling upon the lubridate package and then using semicolons to call upon the functions, such as those starting with ‘y’.

lubridate::y

In lubridate, y = year, m = month, d = day, h = hour, m = minute, s = second, etc.

We can see that our data is formatted in the year-month-day format, and so we can use the ymd function to indicate to R what format we want it to detect as it converts characters to dates. If we use this function and assign the output to a date vector, we can then evaluate the class to see that it is a date.

date_vector <- ymd(weather$Date)
class(date_vector)
## [1] "Date"

It is important to remember that manipulating a column does not mean it automatically saves into the data frame. Instead, we need to remember to write (or overwrite) that column into our data frame. Remember that we can use the mutate function to do this, where we can overwrite out date column. Note two different ways of doing this. First, we could use the ymd function directly in our mutate function:

weather <- mutate(weather, Date = ymd(Date))

Or because we have saved the new date data as its own vector, we can feed that vector into the data frame as is.

weather <- mutate(weather, Date = date_vector)

Now we have our date data saved in our data frame.

class(weather$Date)
## [1] "Date"

Now that R recognizes these as dates, we can extract certain date-related features, such as the year, month, and day. The lubridate package has functions conveniently named year, month, day that can be used with a date argument, and that element of the date will be extracted. For example:

head(year(weather$Date), n = 75)
##  [1] 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989
## [16] 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989
## [31] 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989
## [46] 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989
## [61] 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989 1989

Again, let’s use mutate to assign this as a new column

weather <- mutate(weather, year = year(Date))

4.2.1 Check in challenge

We are going to review our basic wrangling skills to prepare the Palmer Station weather data to merge with the penguins data.

We would like to merge a summary of the weather data with the penguins. For this summary, we want the weather data summarized by year, rather than by day (as it currently is). Then per year, we want to know the maximum, minimum, and average temperatures. Create this summary table called ‘weather_summary’.

Check your answer
weather_summary <- weather %>% 
  group_by(year) %>% 
  summarize(average_temp = mean(Temperature.Average..C., na.rm = T),
            max_temp = max(Temperature.High..C., na.rm = T),
            min_temp = min(Temperature.Low..C., na.rm = T))

4.3 Joining multiple data frames

In your data wrangling journey, you will often find yourself wanting to combine one data frame with some kind of supplementary or partner data frame. In our case, we have the penguins and weather data stored separately, but if we ever wanted to explore any relationships between them, we’d likely want all of that data stored in one place so that we can line up data points. This lining up of data points is called joining in the tidyverse language.

The key to joining is to identify the variables by which you want to join the data. In other words, which columns in each data are the ones that link them together? In some cases these may be one-to-one matches (e.g. ID numbers to IDs numbers), or in other cases there is data at different levels that need to be lined up.

By default, join functions will link the data together based on all of the shared column names, but usually you’ll want to double check to make sure that this will get you the desired output. Typically you can do this by just checking out the column names of each data frame (for this exercise we will use the weather_summary object created in the last challenge).

colnames(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
colnames(weather_summary)
## [1] "year"         "average_temp" "max_temp"     "min_temp"

We have one column overlapping, year, which makes it the likely candidate for joining by. But it is also important to think: what is it that actually links penguins to weather? Penguins experience weather at any given time, and so we will need to choose a shared time scale at which penguins are experiencing weather and the scale of data we collected for weather. In the penguins data set we only know the year that the observations were taken, so we’re interested in the annual weather summaries. This is why we are looking at the weather summary object, so that we can align the data relatively well.

Now, let’s join. The join functions take three arguments: the two data frames you’d like to join, and the name of the column by which to join. There are also several types of joins, which we can explore in the challenge, but for now let’s use a left_join:

penguins_leftjoined <- left_join(penguins, weather_summary, by = "year")
dim(penguins_leftjoined)
## [1] 344  11

We go from 8 columns in the penguins data to 11 now that we’ve appended the three weather-related columns to the data. And we stick with 344 rows because that is the number of rows in our ‘left’ data frame, penguins. These dimensions might change depending on the kind of join you use.

4.3.1 Check in challenge

Go ahead and check out the documentation for join by using the help function, a question mark:

?join

You’ll see the variety of joins that exist in dplyr. Try to experiment with them: What happens when you use another kind of join? Check how the dimensions of the data change depending on the join you choose.

Check your answer
penguins_fulljoined <- full_join(penguins, weather_summary, by = "year")
dim(penguins_fulljoined)
## [1] 372  11
# We now have more rows. Why? We had weather data for more years than we have penguin data, so we have a lot of weather data that is unmatched to any pengion observation

4.4 Reshaping (lengthening/widening) data

There are often cases where you’d like to change the shape of your data. For instance, you might want to turn several columns into two: one with the measurements or values from the previous columns, and one with a categorical indicator of what is represented in the new column. This is a case for pivoting longer.

Let’s say we wanted to take the key measures of our penguins and stack them on top of one another (this is not actually useful given our data, but it is the practice that counts). When we stack all of these columns on top of one another we are going to be creating two new columns: one with the names of the columns and another with the values of the columns. In the pivot_longer arguments, it looks like this:

# Pivot the data to make it longer
penguins_longer <- pivot_longer(penguins, 
                                cols = c(bill_length_mm, bill_depth_mm,
                                         flipper_length_mm, body_mass_g), 
                                names_to = "measure", values_to = "value")
head(penguins_longer)
## # A tibble: 6 × 6
##   species island    sex     year measure            value
##   <fct>   <fct>     <fct>  <int> <chr>              <dbl>
## 1 Adelie  Torgersen male    2007 bill_length_mm      39.1
## 2 Adelie  Torgersen male    2007 bill_depth_mm       18.7
## 3 Adelie  Torgersen male    2007 flipper_length_mm  181  
## 4 Adelie  Torgersen male    2007 body_mass_g       3750  
## 5 Adelie  Torgersen female  2007 bill_length_mm      39.5
## 6 Adelie  Torgersen female  2007 bill_depth_mm       17.4

The reverse scenario is having data within one column that you’d like to widen out into several columns. Here we have to identify the column we’d like to un-stack (values_from), how we’d like to name the new columns filled with the widened values (names_from). Note that we don’t need to put these column names in quotes because they already exist in the data frame!

penguins_wider <- pivot_wider(penguins_longer, 
                              names_from = measure, 
                              values_from = value)
## Warning: Values from `value` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
##   {data} |>
##   dplyr::summarise(n = dplyr::n(), .by = c(species, island, sex, year,
##   measure)) |>
##   dplyr::filter(n > 1L)

BUT, we get a bit of an issue here. Our result has condensed many of the measures down into list forms, leaving us with only 35 observations and several repeated measured for each observation. Why? When dplyr is widening out the data it wants to match the new columns to unique observations. So the function created one row for each species-island-sex-year combination, thinking those are unique. We can fix this by creating an ID column (id_col).

Here want to specify what it is that actually identifies unique observations in the data. Our issue is that we don’t have unique identifiers in the data. Whoops. Let’s fix that and re-run these functions to see why this matters.

First, let’s assign a unique ID column to the original penguins data and then re-run our pivot_longer:

penguins$ID <- 1:nrow(penguins)
penguins_longer <- pivot_longer(penguins, 
                                cols = c(bill_length_mm, bill_depth_mm,
                                         flipper_length_mm, body_mass_g), 
                                names_to = "measure", values_to = "value")
head(penguins_longer)
## # A tibble: 6 × 7
##   species island    sex     year    ID measure            value
##   <fct>   <fct>     <fct>  <int> <int> <chr>              <dbl>
## 1 Adelie  Torgersen male    2007     1 bill_length_mm      39.1
## 2 Adelie  Torgersen male    2007     1 bill_depth_mm       18.7
## 3 Adelie  Torgersen male    2007     1 flipper_length_mm  181  
## 4 Adelie  Torgersen male    2007     1 body_mass_g       3750  
## 5 Adelie  Torgersen female  2007     2 bill_length_mm      39.5
## 6 Adelie  Torgersen female  2007     2 bill_depth_mm       17.4

Now this will plat an impotant part in the pivot_wider. Let’s set ID as the the id_cols argument in the function:

penguins_widerID <- pivot_wider(penguins_longer, 
                                id_cols = ID,
                                names_from = measure, 
                                values_from = value)
head(penguins_widerID)
## # A tibble: 6 × 5
##      ID bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <int>          <dbl>         <dbl>             <dbl>       <dbl>
## 1     1           39.1          18.7               181        3750
## 2     2           39.5          17.4               186        3800
## 3     3           40.3          18                 195        3250
## 4     4           NA            NA                  NA          NA
## 5     5           36.7          19.3               193        3450
## 6     6           39.3          20.6               190        3650

Uh, close, but still some issues. What happened to all of our other columns? Well, since we said ID was our only id column, it dropped everything else. So we can actually pile everything into our ID column that we wan to keep.

penguins_widerID_allcols <- pivot_wider(penguins_longer, 
                                        id_cols = species:ID,
                                        names_from = "measure", 
                                        values_from = "value")
head(penguins_widerID_allcols)
## # A tibble: 6 × 9
##   species island    sex     year    ID bill_length_mm bill_depth_mm
##   <fct>   <fct>     <fct>  <int> <int>          <dbl>         <dbl>
## 1 Adelie  Torgersen male    2007     1           39.1          18.7
## 2 Adelie  Torgersen female  2007     2           39.5          17.4
## 3 Adelie  Torgersen female  2007     3           40.3          18  
## 4 Adelie  Torgersen <NA>    2007     4           NA            NA  
## 5 Adelie  Torgersen female  2007     5           36.7          19.3
## 6 Adelie  Torgersen male    2007     6           39.3          20.6
## # ℹ 2 more variables: flipper_length_mm <dbl>, body_mass_g <dbl>

4.5 Working with text

Let’s back up a bit to the cleaning process. We’ve been working on getting our data in summary form, bringing in new data, and changing data shape, all of which is pretty far down the data wrangling pipeline. But we are going to take this last moment to talk about an earlier stage task: cleaning text data.

When we first create and collect data, it may not be in its tidy-est form. One common un-tidy data practice is to have multiple types of data in one column. For instance, we can look at the ‘raw’ version of the penguins data, also available through the package. Go ahead and read that in:

penguins_raw <- penguins_raw

Let’s look at the third and sixth columns, for instance.

head(select(penguins_raw, 3,6))
## # A tibble: 6 × 2
##   Species                             Stage             
##   <chr>                               <chr>             
## 1 Adelie Penguin (Pygoscelis adeliae) Adult, 1 Egg Stage
## 2 Adelie Penguin (Pygoscelis adeliae) Adult, 1 Egg Stage
## 3 Adelie Penguin (Pygoscelis adeliae) Adult, 1 Egg Stage
## 4 Adelie Penguin (Pygoscelis adeliae) Adult, 1 Egg Stage
## 5 Adelie Penguin (Pygoscelis adeliae) Adult, 1 Egg Stage
## 6 Adelie Penguin (Pygoscelis adeliae) Adult, 1 Egg Stage

The Stage column is a good example of where multiple data points might be combined into one cell as text. Here we have ‘Adult, 1 Egg Stage’ reported. But, we might want to separate out the title of the penguin’s stage, Adult, from their egg stage count. For this we can use the separate function to divide the contents of one column into multiple based on particular symbols. In this case, there is a convention to separate the word Adult (or other stage) from the egg count using a comma (,).

For the separate function we need to enter a few arguments: the data, the column to separate, the names of the new columns, and the separator symbol. Note that the help file says that the separator symbol is “interpreted as a regular expression” – we’ll get to to that in a second.

penguins_raw_sep <- separate(penguins_raw, Stage, c("age_cat", "egg_stage"), ',')

Now we can see that these two have replaced the previous Stage column which adds one more column to the data frame.

head(select(penguins_raw_sep, 4:7))
## # A tibble: 6 × 4
##   Region Island    age_cat egg_stage     
##   <chr>  <chr>     <chr>   <chr>         
## 1 Anvers Torgersen Adult   " 1 Egg Stage"
## 2 Anvers Torgersen Adult   " 1 Egg Stage"
## 3 Anvers Torgersen Adult   " 1 Egg Stage"
## 4 Anvers Torgersen Adult   " 1 Egg Stage"
## 5 Anvers Torgersen Adult   " 1 Egg Stage"
## 6 Anvers Torgersen Adult   " 1 Egg Stage"

Now, we could even go one step further. The data here are well organized and uniform, meaning that separate works quite well. Every Stage column has a comma, and even now in the egg_stage column there are three words, so we could separate by a space ” ” and get three new columns. However, when data is messy we might need some more powerful text manipulation tools.

Enter: the stringr package and regular expressions. stringr is a package providing a set of functions for manipulating “strings”, also called text or character data. Check out the great stringr cheat sheet. Part of what makes stringr so powerful is that it relies on patterns interpreted as ‘regular expressions’ (or regex). Regex is its own pattern language that allows you to generalize string patterns for great adaptability to string cleaning.

We will practice just one example, which is to extract the number from the egg_stage column. The power of regex is that instead of specifying every possible number that could exist in the column (1,2,3,4,5,6,7,8,9), there is a symbol for numbers in the regex language, which is \\d. There is also a symbol to quantify the number of symbols, where ? is zero or one, * is zero or more, + is one or more, etc. By combining \\d+ we’ve created a regular expression for the pattern: one or more digits. We can then use this in one of the string functions, str_extract, with the two arguments it needs: the character vector and the pattern.

str_extract(penguins_raw_sep$egg_stage, '\\d+')
##   [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##  [19] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##  [37] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##  [55] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##  [73] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##  [91] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [109] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [127] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [145] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [163] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [181] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [199] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [217] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [235] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [253] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [271] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [289] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [307] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [325] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
## [343] "1" "1"

Excellent. Our pattern worked and we’ve successfully extracted the number pattern from the column. One tiny tweak is that the numbers are in quotation marks, meaning that they are still character data. We can coerce them to numeric once they are all numbers.

penguins_raw_sep$egg_no <- as.numeric(str_extract(penguins_raw_sep$egg_stage,
                                                  '\\d+'))
head(penguins_raw_sep$egg_no)
## [1] 1 1 1 1 1 1

Finally, maybe we want to put things back together? The paste function is a straightforward way to paste columns or values back together. The arguments for paste are as many value you’d like to paste together, separated by a comma. For instance:

head(paste(penguins_raw_sep$age_cat, penguins_raw_sep$egg_stage))
## [1] "Adult  1 Egg Stage" "Adult  1 Egg Stage" "Adult  1 Egg Stage"
## [4] "Adult  1 Egg Stage" "Adult  1 Egg Stage" "Adult  1 Egg Stage"

And you can also mix and match data types and lengths, etc:

paste("Paste together items of different lengths, for example:", 
      1:10, "different items")
##  [1] "Paste together items of different lengths, for example: 1 different items" 
##  [2] "Paste together items of different lengths, for example: 2 different items" 
##  [3] "Paste together items of different lengths, for example: 3 different items" 
##  [4] "Paste together items of different lengths, for example: 4 different items" 
##  [5] "Paste together items of different lengths, for example: 5 different items" 
##  [6] "Paste together items of different lengths, for example: 6 different items" 
##  [7] "Paste together items of different lengths, for example: 7 different items" 
##  [8] "Paste together items of different lengths, for example: 8 different items" 
##  [9] "Paste together items of different lengths, for example: 9 different items" 
## [10] "Paste together items of different lengths, for example: 10 different items"

4.5.1 Check in challenge

Try to separate the common name from the scientific name from the ‘Species’ column in penguins_raw. E.g. separate ‘Adelie Penguin (Pygoscelis adeliae)’ into two columns: ‘Adelie Penguin’ and ‘Pygoscelis adeliae’. You may need to combine separate with a stringr function (e.g. str_remove) depending on how you’d like to deal with the parentheses. Note that parentheses in regular expressions are special characters and will need to be treated as such both in the separate and stringr functions – see the cheat sheet!

Check your answer
# Separate by the first opening parentheses
penguins_raw <- separate(penguins_raw, Species, 
                         c("common_name", "sci_name"), "\\(")
# Then remove the second closing parentheses with stringr
penguins_raw$sci_name <- str_remove(penguins_raw$sci_name, "\\)")
# I also like to use the base R function 'trimws' to trim white space
penguins_raw$common_name <- trimws(penguins_raw$common_name)
penguins_raw$sci_name <- trimws(penguins_raw$sci_name)
# Let's take a look
head(select(penguins_raw, common_name, sci_name))
## # A tibble: 6 × 2
##   common_name    sci_name          
##   <chr>          <chr>             
## 1 Adelie Penguin Pygoscelis adeliae
## 2 Adelie Penguin Pygoscelis adeliae
## 3 Adelie Penguin Pygoscelis adeliae
## 4 Adelie Penguin Pygoscelis adeliae
## 5 Adelie Penguin Pygoscelis adeliae
## 6 Adelie Penguin Pygoscelis adeliae