What Movies and TV Shows Share Actors?
In this blog post, I’m going to make a super cool web scraper to answer the following question:
What movie or TV shows share actors with your favorite movie or show?
Here’s a link to my project repository:
https://github.com/acgowda/imdb_scraper
§1. Setup
§1.1. Locate the Starting IMDB Page
I decided to start with my favorite movie, Fast Five. Its IMDB page is at:
https://www.imdb.com/title/tt1596343/
§1.2 Initialize Your Project
- Create a new GitHub repository, and sync it with GitHub Desktop. This repository will house the scraper.
- Verify that you have the latest version of Scrapy installed in your environment. Open a terminal in the location of the repository on your computer, and type:
conda activate pic
scrapy startproject IMDB_scraper
cd IMDB_scraper
Note: Edit the first line with the name of your conda environment.
This will create quite a lot of files, but you won’t really need to touch most of them.
§1.3 Tweak Settings
For now, add the following line to the file settings.py
:
CLOSESPIDER_PAGECOUNT = 20
This line prevents the scraper from downloading too much data while we’re still testing things out.
§2. Writing the Scraper
§2.1 Create the Spider
Now, create a file inside the spiders
directory called imdb_spider.py
. Add the following lines to the file:
# to run
# scrapy crawl imdb_spider -o movies.csv
import scrapy
class ImdbSpider(scrapy.Spider):
name = 'imdb_spider'
start_urls = ['https://www.imdb.com/title/tt1596343/']
We will proceed by implementing three parsing methods for the ImdbSpider
class to aquire the data we are searching for.
§2.2 Access the Movie Credits
First, we start on a movie page, and then navigate to the Cast & Crew page. This page has url <movie_url>fullcredits
. Once there, the parse_full_credits(self,response)
should be called.
def parse(self, response):
"""
Callback used by Scrapy to process downloaded responses.
Args:
response (Response): An object that represents an HTTP response, fed to the Spiders for processing
Yields:
A new request with the movie's credits page and a callback to parse_full_credits().
"""
# Set the url to go to next.
next_page = response.urljoin('fullcredits/')
yield scrapy.Request(next_page, callback=self.parse_full_credits)
This method simply joins the relative url of the credits page to the starting url and then calls the next method, parse_full_credits()
.
§2.3 Find all the Actors
Next, starting on the Cast & Crew page, we want to yield a scrapy.Request
for the page of each actor listed on the page. Crew members are not included. The yielded request should specify the method parse_actor_page(self, response)
.
def parse_full_credits(self, response):
"""
Callback used to process each actor included in the credits page.
Args:
response (Response): An object that represents an HTTP response, fed to the Spiders for processing
Yields:
A new request with an actor's page and a callback to parse_full_credits().
"""
# Get relative links for all actors who worked on the specified movie.
links = [a.attrib["href"] for a in response.css("td.primary_photo a")]
for link in links:
# Set the url to go to next.
next_page = response.urljoin(link)
yield scrapy.Request(next_page, callback=self.parse_actor_page)
This method works by creating a list of relative links to actors’ pages. It then loops through each link, calling the next method, parse_actor_page()
.
§2.4 Find all the Movies
Then, starting on the page of an actor, we want to yield a dictionary with two key-value pairs, of the form {"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name}
. The method should yield one such dictionary for each of the movies or TV shows on which that actor has worked.
def parse_actor_page(self, response):
"""
Callback used to process each actor's credits page and read each movie or show they have acted in.
Args:
response (Response): An object that represents an HTTP response, fed to the Spiders for processing
Yields:
A dictionary containing the actor's name and the movie's name.
"""
# Get the actor's name.
actor = response.css('h1.header span.itemprop::text').get()
# Only looks at movies and shows the actor has acted in. (Not produced, written, etc.)
for movie in response.css('div#filmo-head-actor + div > div.filmo-row'):
# Get the movie's name
name = movie.css('a::text').get()
yield {
"actor" : actor,
"movie_or_TV_name" : name
}
This method works by finding the actor’s name and then outputing dictionaries with the actor and each movie or tv show they have acted in.
§3. Using the Scraper
Now that the spider is fully written, in the settings.py
file, comment out the following line:
CLOSESPIDER_PAGECOUNT = 20
Then, open a terminal the scrapy project folder and run the following command:
scrapy crawl imdb_spider -o results.csv
This will run the scraper and store the output dictionaries in a csv file.
§4. Analysis
§4.1 Requirements
To follow this part, the following packages are required.
import pandas as pd
import plotly.express as px
§4.2 Most Common Movies
Now, we want to view the movies with the most shared actors. We can do this using Pandas.
# Load webscraping results.
df = pd.read_csv('results.csv')
# Count the number of actors per movie and sort in descending order.
df = df.groupby('movie_or_TV_name').count().sort_values(['actor'], ascending=False).reset_index()
# Rename columns.
df.columns = ['movie/tv', 'number of shared actors']
df.head(10)
movie/tv | number of shared actors | |
---|---|---|
0 | Fast Five | 77 |
1 | Fast & Furious 6 | 13 |
2 | CSI: Miami | 9 |
3 | Furious 7 | 9 |
4 | The Fast and the Furious | 8 |
5 | Drop Dead Diva | 8 |
6 | F9: The Fast Saga | 8 |
7 | Fast & Furious | 8 |
8 | The Game | 8 |
9 | Bones | 7 |
We can also visualize this data using Plotly.
# Create barchart of the movies with the most shared actors.
fig = px.bar(df[:10],
x='movie/tv',
y='number of shared actors',
title="Movies with the Most Shared Actors")
fig.show()
§4.3 Recommendations
In theory, this data could be used to make recommendations on what movies and individual should watch. In this scenario, a fan of Fast Five might also enjoy several of the other movies from the Fast and Furious franchise, which logically makes sense.