Project 7 - Joins and transformations

Author
Affiliation

Andrew Moles

Learning Developer, Digital Skills Lab

Published

October 14, 2025

Learning objectives:

  • Bring together two datasets that have related information
  • Tidy the datasets using pattern matching and replacement, and row wise calculations
  • Reshape that dataset to a longer format so we can work with it a different way
  • Visualise the data as a line plot

Outcomes

We will write a program that brings together various film review ratings across different datasets, which you will then visualise. Your code will automate the process of loading the data, performing operations on that data, and making it into a presentable format.

We will be aiming for two outcomes in this project.

Outcome 1 - Preparing the dataset

Outcome 2 - Making a line plot from the dataset

The data

We will be using two csv files provided (imdb.csv & rotten_tom_ratings.csv). Click the links below to download the data.

The datasets are film ratings from Rotten Tomatoes and IMDB.

Steps to help you get to the outcome

Part 1 - the setup

Open an R script file and save it.

Make sure to save the datasets in the same folder as your R script file. You can download them from the data section.

Part 2 - load in the dataset

Import the csv files into R.

View the dataset you just loaded into R.

Part 3 - removed duplicates

The Rotten Tomatoes dataset has some duplicated data where film reviews are shown multiple times. For example, there are four entries for the film The Godfather but the data is the same for all of them.

Remove the duplicated data from the Rotten Tomatoes dataset using the release_date column as your reference point.

Part 4 - joining datasets

Join the IMD and Rotten Tomatoes datasets together only keeping observations that are matching in both datasets.

Part 5 - string manipulation and type conversion

Using pattern matching and replacement convert the data in the columns critic_score and audience_score from percent scores to numbers. Store the newly converted data as new columns in your data frame.

Part 6 - column calculations

Convert the avg_vote column to be on the same scale as the rotten tomato scores. Store the result as a new column.

Part 7 - row-wise calculations

Calculate the average ratings across the critic_score, audience_score, and imd scores. Store the result as a new column.

If you have done parts 5, 6, and 7 correctly your new columns should look like the output below, which is showing the first five rows.

# A tibble: 5 × 4
  rotten_tom_critic rotten_tom_audience   imd avg_ratings
              <int>               <int> <dbl>       <dbl>
1                98                  80    60        79.3
2                91                  47    63        67  
3                78                  NA    64        71  
4                NA                  NA    69        69  
5                NA                  NA    64        64  

Part 8 - reshaping the data to be longer

Select the title, year, critic reviews, audience reviews, IMD reviews, and the average rating columns from your dataset. Then reshape the data from wide to long. Make sure to save the result as a new data frame.

The columns that are being transformed are the critic reviews, audience reviews, IMD reviews. You should end up with a column with the names (rating type) and a column for the rating (or value).

By this point your data should look like what is shown in Section 2.1.

Part 9 - filtering films

There is a lot of data in this dataset. We want to only look at the highest rated films on IMD.

Make a vector that contains the following films:

"The Shawshank Redemption","The Godfather","The Dark Knight","The Lord of the Rings: The Return of the King","Schindler's List","The Lord of the Rings: The Fellowship of the Ring","Pulp Fiction","The Good, the Bad and the Ugly","Forrest Gump","The Lord of the Rings: The Two Towers"

Filter your now transformed (long) dataset to only have the above films, using the vector you just made. Make sure to save the result as a new data frame.

Part 10 - using factors

Make the title column in your dataset a factor, ordering the factor by the same order that appears in the vector you made in part 9.

Part 11 - making a line plot

Make a line graph visualisation of the data, with the title on the x axis, and ratings on the y. Colour the data by the review type.

Part 12 - aesthetic adjustments

Add aesthetic changes to make the visual more appealing and to look like what we see in Section 2.2.

  • Add horizontal line to show the average rating across all films. Add an annotation to that line to say what it means.
  • Wrap x axis labels so they are more readable.
  • Adjust the y axis so we only see data from 70-100, with breaks of 5.
  • Add manual colours. We used "#F5C518", "#068098", "#961E06".
  • Move the legend to be at the bottom of the plot.
  • Add a title, and remove the y and x axis labels.
  • Adjust the theme to be more minimal.

Final task - fill out the survey!

We are always looking to improve and iterate our workshops. Follow the link to give your feedback.