Project 4 - Football data analysis

Author: Andrew Moles

Affiliation: Learning Developer, Digital Skills Lab

Published: September 25, 2025

Learning objectives:

  • Load in data
  • Modify your data using calculations on and across columns
  • Get summary statistics from data
  • Filter rows and select columns of a dataset

Outcomes

We will write a program that provides us with various metrics on Harry Kane (a football player). Your code will automate the process of loading the data, performing operations on that data, and making it into a presentable format.

We will be aiming for two outcomes in this project.

Outcome 1 - total career summary statistics

You should end up with two lines of text output that contain the following information:

  1. The name of the player. Harry Kane in this case
  2. The number of seasons played
  3. The total number of appearances
  4. The total number of goals scored
  5. The total number of goals and assists combined
  6. The average expected goals

[1] "Player: Harry Kane | Seasons: 15 | Appearances: 436"
[1] "Goals: 289 | Goals and assists: 351 | Average expected goals: 20.3"

Outcome 2 - filtered dataset showing highest scoring years

You should end up with an output that shows a filtered dataset with some selected columns.

You will need to create the columns goals_assists, goals_xg_diff, and goals_no_pens to get this output.


      season age         squad goals goals_assists goals_xg_diff goals_no_pens
9  2016-2017  23     Tottenham    29            34            NA            24
10 2017-2018  24     Tottenham    30            32           5.2            28
13 2020-2021  27     Tottenham    23            37           2.9            19
15 2022-2023  29     Tottenham    30            33           8.6            25
16 2023-2024  30 Bayern Munich    36            44           5.4            31
17 2024-2025  31 Bayern Munich    26            35           5.7            17

The data

We will be using the csv file provided (harry_kane_stats.csv). Click the link below to download the data.

The dataset provides general football metrics on each season Harry Kane has played, from 2010 to 2025.

If you are not sure what the expected goals and expected assists columns mean, you can find the definitions on the Stats Perform webpage, under the header ‘Expected Goals & Expected Assists’.

Steps to help you get to the outcome

Part 1 - the setup

Open an R script file and save it. Make sure to save the dataset in the same folder as your R script file.

Part 2 - load in the dataset

Import the csv file into R.

View the dataset you just loaded into R.
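As a minimal sketch of this step, assuming base R's read.csv() is used for the import (the handout does not name a function): the example below writes a tiny stand-in CSV to a temporary file so that it runs on its own. The values are made up, and stats and csv_path are illustrative names.

```r
# Write a tiny stand-in CSV to a temporary file so this sketch runs anywhere.
# The values are made up; only the header names season, age, squad and goals
# are taken from the expected output shown earlier.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("season,age,squad,goals",
    "2001-2002,20,Team A,10",
    "2002-2003,21,Team B,15"),
  csv_path
)

# read.csv() returns a data frame: one column per CSV column
stats <- read.csv(csv_path)

head(stats)  # print the first rows
str(stats)   # show column names and types
```

For the project itself, pass "harry_kane_stats.csv" to read.csv() instead of the temporary path; this works when the file sits in your working directory. In RStudio, View(stats) opens the data in a spreadsheet-style viewer.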

Part 3 - column calculations

Our dataset is missing some useful information about our player's season statistics. Fortunately, we can calculate it!

Calculate the following metrics and add them as columns to your dataset.

  • Combined goals and assists per season
  • Goals excluding penalties
  • Difference between goals scored and expected goals
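The three calculations above could be sketched as follows on a toy data frame. The input column names assists, penalties and xg are assumptions, as are all the values; check names() on the real dataset to see what the columns are actually called. The output column names match those in Outcome 2.

```r
# Toy data frame with made-up values. The column names assists, penalties
# and xg are assumptions; the real dataset may use different names.
toy <- data.frame(
  season    = c("2001-2002", "2002-2003", "2003-2004"),
  goals     = c(10, 15, 20),
  assists   = c(2, 5, 3),
  penalties = c(1, 4, 2),
  xg        = c(8.5, NA, 18.1)
)

# Arithmetic on columns is element-wise: one result per row (season)
toy$goals_assists <- toy$goals + toy$assists
toy$goals_no_pens <- toy$goals - toy$penalties
toy$goals_xg_diff <- toy$goals - toy$xg  # NA wherever xg is missing

toy
```

Assigning to toy$new_name adds a column to the data frame. A missing value in either input column gives NA in the result, which would explain the NA in the goals_xg_diff column of the expected output.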

Part 4 - calculating the total career statistics

To build the career summary, create a variable for each of the following metrics by performing calculations on columns in the dataset:

  • The name of the player. Harry Kane in this case
  • The number of seasons played
  • The total number of appearances
  • The total number of goals scored
  • The total number of combined goals and assists
  • The average expected goals

The metrics you produce should match the values shown in Outcome 1.
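One way to compute such metrics is sketched below on toy data. The column names appearances and xg are assumptions, and every variable name is illustrative. Rounding the average to one decimal place follows the format of the expected output.

```r
# Toy data with made-up values; appearances and xg are assumed column names
toy <- data.frame(
  season        = c("2001-2002", "2002-2003", "2003-2004"),
  appearances   = c(30, 35, 38),
  goals         = c(10, 15, 20),
  goals_assists = c(12, 20, 23),
  xg            = c(8.5, NA, 18.1)
)

player_name         <- "Example Player"
n_seasons           <- length(unique(toy$season))  # distinct seasons
total_apps          <- sum(toy$appearances)
total_goals         <- sum(toy$goals)
total_goals_assists <- sum(toy$goals_assists)

# na.rm = TRUE skips missing values; without it the mean would be NA
avg_xg <- round(mean(toy$xg, na.rm = TRUE), 1)
```

Counting unique seasons differs from nrow() only if a season appears on more than one row. It is worth checking which applies to the real data, since the row labels in Outcome 2 run higher than the number of seasons in Outcome 1.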

Part 5 - making the career summary

Print a two-line message that shows the career summary metrics.
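A minimal sketch of building and printing the message with paste(), which joins its arguments with a space by default. The variable names and values here are illustrative stand-ins for the metrics from Part 4.

```r
# Illustrative stand-in values; in the project these come from Part 4
player_name         <- "Example Player"
n_seasons           <- 3
total_apps          <- 103
total_goals         <- 45
total_goals_assists <- 55
avg_xg              <- 13.3

# paste() converts numbers to text and joins the pieces with single spaces
line_1 <- paste("Player:", player_name,
                "| Seasons:", n_seasons,
                "| Appearances:", total_apps)
line_2 <- paste("Goals:", total_goals,
                "| Goals and assists:", total_goals_assists,
                "| Average expected goals:", avg_xg)

print(line_1)
print(line_2)
```

print() shows each string with a leading [1] and surrounding quotes, as in the expected output. If you prefer plain text, cat(line_1, line_2, sep = "\n") prints both lines without them.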

Part 6 - filtering rows of a dataset

Subset your dataset to only contain seasons where our player's combined goals and assists were greater than 30.

It is good practice to assign the results of your subset to a new data frame under a different name.
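Row filtering with a logical condition could look like the sketch below, which uses base R indexing on toy data. The threshold of 15 is chosen to suit the made-up values (the project uses 30), and toy_high is an illustrative name.

```r
# Toy data with made-up values
toy <- data.frame(
  season        = c("2001-2002", "2002-2003", "2003-2004"),
  goals         = c(10, 15, 20),
  goals_assists = c(12, 20, 23)
)

# The condition yields TRUE/FALSE per row; rows where it is TRUE are kept.
# Leaving the part after the comma empty keeps every column.
toy_high <- toy[toy$goals_assists > 15, ]

toy_high  # the original toy data frame is left unchanged
```

The row labels on the left (2 and 3 here) are carried over from the original data frame, which is likely why the labels in the expected output are not consecutive.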

Part 7 - selecting columns of a dataset

Select the columns shown in Outcome 2 to make your results more presentable.

You should be able to adjust the code you have already written to select the columns. Your result should match the table shown in Outcome 2.
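As a sketch of adjusting the filtering code, base R indexing accepts a vector of column names after the comma. The data, the threshold of 15, and the names cols and toy_high are all illustrative.

```r
# Toy data with made-up values; assists is an assumed column name
toy <- data.frame(
  season        = c("2001-2002", "2002-2003", "2003-2004"),
  goals         = c(10, 15, 20),
  assists       = c(2, 5, 3),
  goals_assists = c(12, 20, 23)
)

# Columns to keep, in the order they should appear
cols <- c("season", "goals", "goals_assists")

# Rows are chosen before the comma, columns after it
toy_high <- toy[toy$goals_assists > 15, cols]

toy_high
```

The order of names in cols sets the column order of the result, so listing them as they appear in Outcome 2 reproduces that layout.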

Final task - fill out the survey!

We are always looking to improve and iterate on our workshops. Follow the link to give your feedback.