Predictive Film Revenue¶

By: Owen Butler

Link for Github: https://ozbutler12.github.io/

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Colab Notebooks
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks

Introduction¶

For this project, I have decided to analyze movies to determine if there is a way to predict different characteristics of the movie. More specifically, I will use a movie's genre, runtime, and rating to predict a movie's gross revenue or box office returns. This information could be very important as it can show what characteristic of a film has the strongest relationship with its gross revenue. As a result, this information is very important for actors as it directly influences how much money they will make, and it also influences movie theaters as it can help them to determine which movies they show and which ones they should not.

These datasets can be used to answer a variety of important questions in the film industry. Some of the questions that the datasets could answer are...

  • Can you predict a movie’s success based on its characteristics?
  • Which characteristics of a movie has the largest influence on a movies success?
  • Is a specific genre of movies more successful than others?

Collaboration Plan¶

For this project, I plan on working by myself and using a GitHub repository to store all my information and research. I will also be using Google Drive to help me to stay organized as well. Due to this being an individual project, it will be easy for me to keep track of everything occurring and what needs to be done. I am planning on also using Google Calendar to keep track of project deadlines to ensure that nothing is forgotten about.


Importing Libraries¶

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import re
import ast
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
pd.options.display.max_rows = 10000
pd.options.display.width = 1000

Datasets¶

I have decided to analyze three different datasets to predict a movie's rating. These data sets are from kaggle.com. The data sets are from IMDB, the box office returns of the top 1000 grossing movies, and Amazon Prime. By using these three datasets, I can analyze over 10,000 different movies. I will also use a dataset of the inflation rates from 1900 to 2022. This data will adjust different gross revenues to their current dollar amount.

Dataset 1: Inflation (1900 - 2022)

This dataset contains the inflation rate for each year from 1900 to 2022. This data will be used to convert the gross revenue into its current value. It stores the conversion rate for each year's currency to the 1900's dollar.

Dataset 2: IMDB Movies Dataset

This dataset contains the top 1000 rated movies and tv shows on IMDB, and it has a lot of helpful information for each movie. This includes its release year, certificate (rating), runtime, genre, IMDB rating, meta score, and gross revenue. This data is also going to be used in the prediction algorithm as well.

Dataset 3: Top 1000 Highest Grossing Movies

This dataset contains information about the top 1000 grossing movies of all time. It includes information on its domestic revenue, global revenue, rating, runtime, and genre. This dataset will be used to create a prediction model for a movie's gross revenue by using its genre, runtime, and rating.

Dataset 4: Amazon Prime Movies and TV Shows

This dataset contains more than 10,000 movies and tv shows available on their platform as of mid-2021. It includes its cast, director, ratings, release year, duration, and genre. This dataset will be used to show one possible application of the prediction algorithm, and I will attempt to predict the gross income of movies in this dataset.

In [3]:
boxOffice_df = pd.read_csv(r'Data/Highest Holywood Grossing Movies.csv', encoding='latin-1') #import box office information 
IMDB_df = pd.read_csv(r'Data/IMDB.csv') #import IMDB
amazon_df = pd.read_csv(r'Data/amazon.csv') #import amazon
inflation_df = pd.read_csv(r'Data/inflation_data.csv') #import inflation

Extraction, Transform, and Load¶

While all of these datasets have similar information, they are all organized in different ways. So to solve this problem, I need to reorganize and change the labels of each dataset.

Inflation

In [4]:
inflation_df.head()
Out[4]:
year amount inflation rate
0 1900 1.00 0.01
1 1901 1.01 0.01
2 1902 1.02 0.01
3 1903 1.05 0.02
4 1904 1.06 0.01

For this dataset, I made a function that converts the gross revenue from the past to what it would be in today's currency. This function works by converting the gross revenue into the 1900 currency, then changing it to 2022 currency from there.

In [5]:
def inflationChange(gross,year):
  year=year-1900
  rate=inflation_df["amount"][year] 
  oldValue=gross/rate #Converts to 1900 currency
  newValue=oldValue*inflation_df.loc[122,'amount'] #Converts to 2022 currency
  return newValue

IMDB

In [6]:
IMDB_df.head()
Out[6]:
Poster_Link Series_Title Released_Year Certificate Runtime Genre IMDB_Rating Overview Meta_score Director Star1 Star2 Star3 Star4 No_of_Votes Gross
0 https://m.media-amazon.com/images/M/MV5BMDFkYT... The Shawshank Redemption 1994 A 142 min Drama 9.3 Two imprisoned men bond over a number of years... 80.0 Frank Darabont Tim Robbins Morgan Freeman Bob Gunton William Sadler 2343110 28,341,469
1 https://m.media-amazon.com/images/M/MV5BM2MyNj... The Godfather 1972 A 175 min Crime, Drama 9.2 An organized crime dynasty's aging patriarch t... 100.0 Francis Ford Coppola Marlon Brando Al Pacino James Caan Diane Keaton 1620367 134,966,411
2 https://m.media-amazon.com/images/M/MV5BMTMxNT... The Dark Knight 2008 UA 152 min Action, Crime, Drama 9.0 When the menace known as the Joker wreaks havo... 84.0 Christopher Nolan Christian Bale Heath Ledger Aaron Eckhart Michael Caine 2303232 534,858,444
3 https://m.media-amazon.com/images/M/MV5BMWMwMG... The Godfather: Part II 1974 A 202 min Crime, Drama 9.0 The early life and career of Vito Corleone in ... 90.0 Francis Ford Coppola Al Pacino Robert De Niro Robert Duvall Diane Keaton 1129952 57,300,000
4 https://m.media-amazon.com/images/M/MV5BMWU4N2... 12 Angry Men 1957 U 96 min Crime, Drama 9.0 A jury holdout attempts to prevent a miscarria... 96.0 Sidney Lumet Henry Fonda Lee J. Cobb Martin Balsam John Fiedler 689845 4,360,000

For this dataset, several columns are not needed in the analysis. This includes the poster link, IMDB rating, overview, meta score, director, cast, and the number of votes it received. To rectify this, I removed any unnecessary columns and renamed the remaining columns to make them easier to understand.

In [7]:
IMDB_df=IMDB_df.drop(columns = ["Poster_Link","IMDB_Rating","Overview","Meta_score","Star1",	"Star2","Star3","Star4","No_of_Votes"])
IMDB_df = IMDB_df.rename(columns={'Director':'director','Released_Year': 'release_year','Genre':'listed_in','Series_Title':'title','Certificate':'rating',"Runtime":"duration"})
IMDB_df.head()
Out[7]:
title release_year rating duration listed_in director Gross
0 The Shawshank Redemption 1994 A 142 min Drama Frank Darabont 28,341,469
1 The Godfather 1972 A 175 min Crime, Drama Francis Ford Coppola 134,966,411
2 The Dark Knight 2008 UA 152 min Action, Crime, Drama Christopher Nolan 534,858,444
3 The Godfather: Part II 1974 A 202 min Crime, Drama Francis Ford Coppola 57,300,000
4 12 Angry Men 1957 U 96 min Crime, Drama Sidney Lumet 4,360,000

This data is currently using the Central Board of Film Certification. To use this data, I need to convert it to the United States rating system.

In [8]:
IMDB_df["rating"]=IMDB_df["rating"].map({"A": "R","U":"PG","UA":"PG-13","U/A":"PG-13","S":"NC-17"})

Also, I need to remove any empty values and the commas from the gross values.

In [9]:
IMDB_df=IMDB_df.dropna()
IMDB_df.Gross = IMDB_df['Gross'].str.replace(',', '') #Removes ,

Next, I need to convert the duration and gross revenue into an integer.

In [10]:
IMDB_df['Gross'] = IMDB_df['Gross'].astype(int)
IMDB_df['duration'] = IMDB_df['duration'].str.replace(r' min$', '', regex=True) #Removes the 'min' from the duration
IMDB_df["duration"]=IMDB_df["duration"].astype(int)

Lastly, I need to adjust the gross revnue for inflation.

In [11]:
IMDB_df['currentGross']=""
for ind in IMDB_df.index: #Adjusts for inflation
  gross=IMDB_df["Gross"][ind]
  year=IMDB_df["release_year"][ind]
  IMDB_df.loc[ind,"currentGross"]=inflationChange(gross,year)
IMDB_df['currentGross'] = IMDB_df['currentGross'].astype(int)

IMDB_df.head()
Out[11]:
title release_year rating duration listed_in director Gross currentGross
0 The Shawshank Redemption 1994 R 142 Drama Frank Darabont 28341469 57004269
1 The Godfather 1972 R 175 Crime, Drama Francis Ford Coppola 134966411 961567924
2 The Dark Knight 2008 PG-13 152 Action, Crime, Drama Christopher Nolan 534858444 740412703
3 The Godfather: Part II 1974 R 202 Crime, Drama Francis Ford Coppola 57300000 346337989
4 12 Angry Men 1957 PG 96 Crime, Drama Sidney Lumet 4360000 46176955

Box Office Data

In [12]:
boxOffice_df.head()
Out[12]:
Unnamed: 0 Title Movie Info Distributor Release Date Domestic Sales (in $) International Sales (in $) World Sales (in $) Genre Movie Runtime License
0 0 Star Wars: Episode VII - The Force Awakens (2015) As a new threat to the galaxy rises, Rey, a de... Walt Disney Studios Motion Pictures December 16, 2015 936662225 1132859475 2069521700 ['Action', 'Adventure', 'Sci-Fi'] 2 hr 18 min PG-13
1 1 Avengers: Endgame (2019) After the devastating events of Avengers: Infi... Walt Disney Studios Motion Pictures April 24, 2019 858373000 1939128328 2797501328 ['Action', 'Adventure', 'Drama', 'Sci-Fi'] 3 hr 1 min PG-13
2 2 Avatar (2009) A paraplegic Marine dispatched to the moon Pan... Twentieth Century Fox December 16, 2009 760507625 2086738578 2847246203 ['Action', 'Adventure', 'Fantasy', 'Sci-Fi'] 2 hr 42 min PG-13
3 3 Black Panther (2018) T'Challa, heir to the hidden but advanced king... Walt Disney Studios Motion Pictures NaN 700426566 647171407 1347597973 ['Action', 'Adventure', 'Sci-Fi'] 2 hr 14 min NaN
4 4 Avengers: Infinity War (2018) The Avengers and their allies must be willing ... Walt Disney Studios Motion Pictures NaN 678815482 1369544272 2048359754 ['Action', 'Adventure', 'Sci-Fi'] 2 hr 29 min NaN

To analyze this data set, I need to make several alterations to it. The first one is that I need to remove any columns that are not needed in my analysis. This includes the first column (Unnamed: 0), Movie Info, Distributor, and domestic and international sales. I also need to remove any movies with missing information and match the column names with the previous dataset.

In [13]:
boxOffice_df=boxOffice_df.drop(columns = ["Unnamed: 0","Movie Info","Distributor","Domestic Sales (in $)","International Sales (in $)"])
boxOffice_df = boxOffice_df.rename(columns={'Title':'title','Release Date':'release_year','World Sales (in $)':'Gross','Genre':'listed_in','Movie Runtime':'duration','License':'rating'})
boxOffice_df=boxOffice_df.dropna()
boxOffice_df.head()
Out[13]:
title release_year Gross listed_in duration rating
0 Star Wars: Episode VII - The Force Awakens (2015) December 16, 2015 2069521700 ['Action', 'Adventure', 'Sci-Fi'] 2 hr 18 min PG-13
1 Avengers: Endgame (2019) April 24, 2019 2797501328 ['Action', 'Adventure', 'Drama', 'Sci-Fi'] 3 hr 1 min PG-13
2 Avatar (2009) December 16, 2009 2847246203 ['Action', 'Adventure', 'Fantasy', 'Sci-Fi'] 2 hr 42 min PG-13
6 Titanic (1997) December 19, 1997 2201647264 ['Drama', 'Romance'] 3 hr 14 min PG-13
7 Jurassic World (2015) June 10, 2015 1670516444 ['Action', 'Adventure', 'Sci-Fi'] 2 hr 4 min PG-13

Also, I need to reformat the release_year column only to have the year instead of the entire date. Additionally, I need to change the listed_in column to a string instead of an array.

In [14]:
boxOffice_df['release_year'] = boxOffice_df['release_year'].str[-4:] #Makes it the year only
boxOffice_df['listed_in'] = boxOffice_df['listed_in'].str.replace('[', '', regex=True) #Removes [
boxOffice_df['listed_in'] = boxOffice_df['listed_in'].str.replace('\'', '', regex=True) #Removes '
boxOffice_df['listed_in'] = boxOffice_df['listed_in'].str.replace(']', '', regex=True) #Removes ]
boxOffice_df.head()
Out[14]:
title release_year Gross listed_in duration rating
0 Star Wars: Episode VII - The Force Awakens (2015) 2015 2069521700 Action, Adventure, Sci-Fi 2 hr 18 min PG-13
1 Avengers: Endgame (2019) 2019 2797501328 Action, Adventure, Drama, Sci-Fi 3 hr 1 min PG-13
2 Avatar (2009) 2009 2847246203 Action, Adventure, Fantasy, Sci-Fi 2 hr 42 min PG-13
6 Titanic (1997) 1997 2201647264 Drama, Romance 3 hr 14 min PG-13
7 Jurassic World (2015) 2015 1670516444 Action, Adventure, Sci-Fi 2 hr 4 min PG-13

Lastly, I need to remove the year from the title column and make the duration and release year an integer that shows how long the movie is in minutes. I also converted the gross revenue into the current dollar equivalence.

In [15]:
boxOffice_df['title'] = boxOffice_df['title'].str[:-6] #Removes the year from the title column
In [16]:
def durationConvert(time): #Converts the string runtime into an integer
  hour=time[0:time.find('hr')-1]
  minutes=time[time.find('hr')+3:time.find('min')]
  if(hour==""):
    hour=0
  if(minutes==""):
    minutes=0
  runtime = int(float(hour))*60+ int(float(minutes)) #Creates the new integer runtime
  return runtime
for ind in boxOffice_df.index: #Cycles through the dataframe
  boxOffice_df.loc[ind,"duration"]=durationConvert(boxOffice_df["duration"][ind])
boxOffice_df["release_year"]=boxOffice_df["release_year"].astype(int)
boxOffice_df['currentGross']=""
for ind in boxOffice_df.index: #Adjusts for inflation
  gross=boxOffice_df["Gross"][ind]
  year=boxOffice_df["release_year"][ind]
  boxOffice_df.loc[ind,"currentGross"]=inflationChange(gross,year)
boxOffice_df.head()
Out[16]:
title release_year Gross listed_in duration rating currentGross
0 Star Wars: Episode VII - The Force Awakens 2015 2069521700 Action, Adventure, Sci-Fi 138 PG-13 2601935858.114812
1 Avengers: Endgame 2019 2797501328 Action, Adventure, Drama, Sci-Fi 181 PG-13 3260688144.462549
2 Avatar 2009 2847246203 Action, Adventure, Fantasy, Sci-Fi 162 PG-13 3955375696.258418
6 Titanic 1997 2201647264 Drama, Romance 194 PG-13 4087621398.572475
7 Jurassic World 2015 1670516444 Action, Adventure, Sci-Fi 124 PG-13 2100280773.675407

Amazon Prime

In [17]:
amazon_df.head()
Out[17]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 Movie The Grand Seduction Don McKellar Brendan Gleeson, Taylor Kitsch, Gordon Pinsent Canada March 30, 2021 2014 NaN 113 min Comedy, Drama A small fishing village must procure a local d...
1 s2 Movie Take Care Good Night Girish Joshi Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar India March 30, 2021 2018 13+ 110 min Drama, International A Metro Family decides to fight a Cyber Crimin...
2 s3 Movie Secrets of Deception Josh Webber Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R... United States March 30, 2021 2017 NaN 74 min Action, Drama, Suspense After a man discovers his wife is cheating on ...
3 s4 Movie Pink: Staying True Sonia Anderson Interviews with: Pink, Adele, Beyoncé, Britney... United States March 30, 2021 2014 NaN 69 min Documentary Pink breaks the mold once again, bringing her ...
4 s5 Movie Monster Maker Giles Foster Harry Dean Stanton, Kieran O'Brien, George Cos... United Kingdom March 30, 2021 1989 NaN 45 min Drama, Fantasy Teenage Matt Banting wants to work with a famo...

To analyze the ratings of the movies in this dataset, I need to convert the ratings to the American rating system and remove any columns that are not required.

In [18]:
amazon_df=amazon_df.drop(columns = ["show_id","cast","country","date_added","description"], axis=1)
amazon_df["rating"]=amazon_df["rating"].map({
    "All_Ages": "G",
    "ALL":"G",
    "7+":"PG",
    "13+":"PG-13",
    "16+":"PG-13",
    "18+":"R",
    "16":"PG-13",
    "AGES_16":"PG-13",
    "AGES_16_":"PG-13",
    "AGES_18":"R",
    "AGES_18_":"R",
    "NOT_RATE":"UNRATED"
})
amazon_df.head()
Out[18]:
type title director release_year rating duration listed_in
0 Movie The Grand Seduction Don McKellar 2014 NaN 113 min Comedy, Drama
1 Movie Take Care Good Night Girish Joshi 2018 PG-13 110 min Drama, International
2 Movie Secrets of Deception Josh Webber 2017 NaN 74 min Action, Drama, Suspense
3 Movie Pink: Staying True Sonia Anderson 2014 NaN 69 min Documentary
4 Movie Monster Maker Giles Foster 1989 NaN 45 min Drama, Fantasy

I also need to remove any tv shows from the dataset and any empty values.

In [19]:
amazon_df=amazon_df.dropna()
amazon_df=amazon_df[amazon_df['type']=='Movie']
amazon_df=amazon_df.drop(columns = ["type"])
amazon_df.head()
Out[19]:
title director release_year rating duration listed_in
1 Take Care Good Night Girish Joshi 2018 PG-13 110 min Drama, International
16 Zoombies Glenn Miller 2016 PG-13 87 min Horror, Science Fiction
25 Zambezi: Force of Life Dr. Rudolf Lammers 2005 G 50 min Documentary
32 Young Lovers Ida Lupino 1949 R 86 min Drama
38 Yoga with Sound Meditations Mark Knight 2016 G 9 min Fitness, Special Interest

Lastly, I need to convert the duration to an integer of the runtime.

In [20]:
amazon_df['duration'] = amazon_df['duration'].str.replace(r' min$', '', regex=True)
amazon_df["duration"]=amazon_df["duration"].astype(int)
amazon_df.head()
Out[20]:
title director release_year rating duration listed_in
1 Take Care Good Night Girish Joshi 2018 PG-13 110 Drama, International
16 Zoombies Glenn Miller 2016 PG-13 87 Horror, Science Fiction
25 Zambezi: Force of Life Dr. Rudolf Lammers 2005 G 50 Documentary
32 Young Lovers Ida Lupino 1949 R 86 Drama
38 Yoga with Sound Meditations Mark Knight 2016 G 9 Fitness, Special Interest

Merging the Data¶

To analyze the entirety of the movie industry, I needed to merge my data frames. I decided to combine the box office return and the IMDB datasets to check the relationship between a movie's revenue and its genre, runtime, and rating. Additionally, there are more than 1,000 movies in this new dataset. After merging the data, it leaves me with two data frames, one of the box office returns and the IMDB dataset, and one from Amazon Prime.

In [21]:
movies_df=pd.concat([IMDB_df, boxOffice_df],axis=0)
movies_df['duration']=movies_df['duration'].astype(int)
movies_df['currentGross']=movies_df['currentGross'].astype(int)
movies_df.head()
Out[21]:
title release_year rating duration listed_in director Gross currentGross
0 The Shawshank Redemption 1994 R 142 Drama Frank Darabont 28341469 57004269
1 The Godfather 1972 R 175 Crime, Drama Francis Ford Coppola 134966411 961567924
2 The Dark Knight 2008 PG-13 152 Action, Crime, Drama Christopher Nolan 534858444 740412703
3 The Godfather: Part II 1974 R 202 Crime, Drama Francis Ford Coppola 57300000 346337989
4 12 Angry Men 1957 PG 96 Crime, Drama Sidney Lumet 4360000 46176955

After merging the data, I realized that the genres were stored as a list in the column. So to analyze the data, I need to change the data frame so that each movie only has a single genre that corresponds with it. To do this, I created copies of each film, with each having only one of its corresponding genre. I did this with both data frames.

In [22]:
movies_df["listed_in_new"] = movies_df["listed_in"].apply(lambda x: x.split(","))
exploded_movies = movies_df.explode("listed_in_new").reset_index(drop=True)
exploded_movies['listed_in_new'] = exploded_movies['listed_in_new'].str.replace(" ", "")
exploded_movies.head()
Out[22]:
title release_year rating duration listed_in director Gross currentGross listed_in_new
0 The Shawshank Redemption 1994 R 142 Drama Frank Darabont 28341469 57004269 Drama
1 The Godfather 1972 R 175 Crime, Drama Francis Ford Coppola 134966411 961567924 Crime
2 The Godfather 1972 R 175 Crime, Drama Francis Ford Coppola 134966411 961567924 Drama
3 The Dark Knight 2008 PG-13 152 Action, Crime, Drama Christopher Nolan 534858444 740412703 Action
4 The Dark Knight 2008 PG-13 152 Action, Crime, Drama Christopher Nolan 534858444 740412703 Crime
In [23]:
amazon_df["listed_in_new"] = amazon_df["listed_in"].apply(lambda x: x.split(","))
exploded_amazon = amazon_df.explode("listed_in_new").reset_index(drop=True)
exploded_amazon['listed_in_new'] = exploded_amazon['listed_in_new'].str.replace(" ", "")
exploded_amazon.head()
Out[23]:
title director release_year rating duration listed_in listed_in_new
0 Take Care Good Night Girish Joshi 2018 PG-13 110 Drama, International Drama
1 Take Care Good Night Girish Joshi 2018 PG-13 110 Drama, International International
2 Zoombies Glenn Miller 2016 PG-13 87 Horror, Science Fiction Horror
3 Zoombies Glenn Miller 2016 PG-13 87 Horror, Science Fiction ScienceFiction
4 Zambezi: Force of Life Dr. Rudolf Lammers 2005 G 50 Documentary Documentary

Exploratory Data Analysis¶

Now that the data is organized into one data frame of box office returns, I can look for different patterns that it has. However, I decided to create four new data frames to do this. Each of these data frames will contain one-quarter of the data, and the current gross revenue will sort it. For example, the third quarter data frame will include movies with a current gross revenue between $313,166,900 and $556,023,900.

In [24]:
movies_df["currentGross"].describe()
Out[24]:
count    1.287000e+03
mean     4.215479e+08
std      4.266132e+08
min      1.019700e+04
25%      1.446810e+08
50%      3.131492e+08
75%      5.554031e+08
max      4.087621e+09
Name: currentGross, dtype: float64
In [25]:
#Spliting the dataframe in half based on gross revenue
Half1=movies_df[movies_df["currentGross"]<= 3.131669e+08]
Half2=movies_df[movies_df["currentGross"]>= 3.131669e+08]
#Dividing it into quarters
Q1=Half1[Half1["currentGross"]< 1.447342e+08]
Q2=Half1[Half1["currentGross"]> 1.447342e+08]
Q3=Half2[Half2["currentGross"]< 5.560239e+08]
Q4=Half2[Half2["currentGross"]> 5.560239e+08]

After doing this, I can compare each movie's ratings. Doing this allows me to see which ratings occur more frequently in each quartile of the movie's current gross revenue.

In [26]:
#Counts the frequency of the ratings
ratings=movies_df["rating"].value_counts().iloc[:]
ratings1=Q1["rating"].value_counts().iloc[:]
ratings2=Q2["rating"].value_counts().iloc[:]
ratings3=Q3["rating"].value_counts().iloc[:]
ratings4=Q4["rating"].value_counts().iloc[:]

fig, ax= plt.subplots(1,5, figsize=(25,5))
fig.suptitle('Movie Ratings', fontsize=22)
ax[0].pie(ratings1, autopct='%1.1f%%')#Quarter 1
ax[1].pie(ratings2, autopct='%1.1f%%')#Quarter 2
ax[2].pie(ratings3, autopct='%1.1f%%')#Quarter 3
ax[3].pie(ratings4, autopct='%1.1f%%')#Quarter 4
ax[4].pie(ratings, autopct='%1.1f%%')#All Movies
#Labels
labels=movies_df["rating"].unique()
ax[0].set_title("Quarter 1")
ax[1].set_title("Quarter 2")
ax[2].set_title("Quarter 3")
ax[3].set_title("Quarter 4")
ax[4].set_title("All Movies")
plt.legend(labels,bbox_to_anchor=(1.0, 1.0))
utopct='%1.1f%%'

Based on this information, rated R movies are typically more successful than other genres. This is shown through the number of rated R movies increasing as gross revenue increases each quarter. It is also interesting that the percentage of PG movies decreases as gross revenue increases.

Additionally, I can see if a movie that is more successful and makes more money typically has a longer or shorter duration. To do this, I created a histogram that shows the frequency of a movie's runtime in each quartile of its gross revenue.

In [27]:
#Creating new dataframes that only contain the duration of each movie
duration=movies_df["duration"]
duration1=Q1["duration"]
duration2=Q2["duration"]
duration3=Q3["duration"]
duration4=Q4["duration"]

fig, ax= plt.subplots(1,4, figsize=(25,8))
fig.suptitle('Movie Runtime', fontsize=22)
#Plotting the histograms
sns.histplot(duration1, ax=ax[0])#Quarter 1
sns.histplot(duration2, ax=ax[1])#Quarter 2
sns.histplot(duration3, ax=ax[2])#Quarter 3
sns.histplot(duration4, ax=ax[3])#Quarter 4
#Labeling
labels=movies_df["rating"].unique()
ax[0].set_title("Quarter 1")
ax[1].set_title("Quarter 2")
ax[2].set_title("Quarter 3")
ax[3].set_title("Quarter 4")

ax[0].set_xlabel("Runtime")
ax[0].set_ylabel("Frequency")
ax[1].set_xlabel("Frequency")
ax[1].set_ylabel("Runtime")
ax[2].set_xlabel("Frequency")
ax[2].set_ylabel("Runtime")
ax[3].set_xlabel("Frequency")
ax[3].set_ylabel("Runtime")
Out[27]:
Text(0, 0.5, 'Runtime')

This graph shows that the runtime slightly influences a movie's box office performance. This is shown by the average distribution on each chart moving slightly towards a longer movie as the gross revenue increases. Additionally, this graph shows that the ideal runtime for a film is around 140 minutes or 2 hours and 20 minutes.

To look at how a movie's release year influences its gross revenue, I created scatter plots that display a movie's gross revenue and what year it was released.

In [28]:
#Grouping the dataframes into gross revenue and release year
release_year=movies_df.groupby("release_year", as_index=False)["currentGross"].mean() #All movies
release_year1=Q1.groupby("release_year", as_index=False)["currentGross"].mean() #Quarter 1
release_year2=Q2.groupby("release_year", as_index=False)["currentGross"].mean() #Quarter 2
release_year3=Q3.groupby("release_year", as_index=False)["currentGross"].mean() #Quarter 3
release_year4=Q4.groupby("release_year", as_index=False)["currentGross"].mean() #Quarter 4
#Creating the lines of best fit for each graph
a1, b1 = np.polyfit(release_year1["release_year"], release_year1["currentGross"], 1)
a2, b2 = np.polyfit(release_year2["release_year"], release_year2["currentGross"], 1)
a3, b3 = np.polyfit(release_year3["release_year"], release_year3["currentGross"], 1)
a4, b4 = np.polyfit(release_year4["release_year"], release_year4["currentGross"], 1)
a, b = np.polyfit(release_year["release_year"], release_year["currentGross"], 1)

fig, ax= plt.subplots(1,5, figsize=(25, 8))
fig.suptitle('Movie Release Year', fontsize=22)
#Plotting the scatter plot
ax[0].scatter(release_year1["release_year"],release_year1["currentGross"])#Quarter 1
ax[1].scatter(release_year2["release_year"],release_year2["currentGross"])#Quarter 2
ax[2].scatter(release_year3["release_year"],release_year3["currentGross"])#Quarter 3
ax[3].scatter(release_year4["release_year"],release_year4["currentGross"])#Quarter 4
ax[4].scatter(release_year["release_year"],release_year["currentGross"])#All movies
#Plotting the lines of best fit
ax[0].plot(release_year1["release_year"], a1*release_year1["release_year"]+b1) 
ax[1].plot(release_year2["release_year"], a2*release_year2["release_year"]+b2) 
ax[2].plot(release_year3["release_year"], a3*release_year3["release_year"]+b3) 
ax[3].plot(release_year4["release_year"], a4*release_year4["release_year"]+b4) 
ax[4].plot(release_year["release_year"], a*release_year["release_year"]+b) 
#Labels
ax[0].set_title("Quarter 1")
ax[1].set_title("Quarter 2")
ax[2].set_title("Quarter 3")
ax[3].set_title("Quarter 4")
ax[4].set_title("All Movies")

ax[0].set_xlabel("Release Year")
ax[0].set_ylabel("Gross Revenue (in millions)")
ax[1].set_xlabel("Release Year")
ax[1].set_ylabel("Gross Revenue (in millions)")
ax[2].set_xlabel("Release Year")
ax[2].set_ylabel("Gross Revenue (in millions)")
ax[3].set_xlabel("Release Year")
ax[3].set_ylabel("Gross Revenue (in billions)")
ax[4].set_xlabel("Release Year")
ax[4].set_ylabel("Gross Revenue (in billions)")
Out[28]:
Text(0, 0.5, 'Gross Revenue (in billions)')

This graph shows that a movie's release year has minimal impact on a movie's box office performance. However, it does show that as the release year increases, there is an increase in the average box office performance each year.

To analyze a movie's genre's impact on its revenue, I created a pie chart that shows the frequency of each of the five most popular genres. These genres include drama, action, adventure, comedy, and thriller.

In [29]:
#Creates a new dataframe storing movies with only the 5 most popular genres
exploded_movies_new = exploded_movies[exploded_movies['listed_in_new'].isin(exploded_movies['listed_in_new'].value_counts()[:5].index.tolist())]
#Spliting the dataframe in half based on gross revenue
Half1=exploded_movies_new[exploded_movies_new["currentGross"]<= 3.131669e+08]
Half2=exploded_movies_new[exploded_movies_new["currentGross"]>= 3.131669e+08]
#Dividing it into quarters
eQ1=Half1[Half1["currentGross"]< 1.447342e+08]
eQ2=Half1[Half1["currentGross"]> 1.447342e+08]
eQ3=Half2[Half2["currentGross"]< 5.560239e+08]
eQ4=Half2[Half2["currentGross"]> 5.560239e+08]
In [30]:
fig, ax= plt.subplots(1,5, figsize=(25,5))
fig.suptitle('Top 5 Genres', fontsize=22)
ax[0].pie(eQ1["listed_in_new"].value_counts().iloc[:], autopct='%1.1f%%') #Quarter 1
ax[1].pie(eQ2["listed_in_new"].value_counts().iloc[:], autopct='%1.1f%%') #Quarter 2
ax[2].pie(eQ3["listed_in_new"].value_counts().iloc[:], autopct='%1.1f%%') #Quarter 3
ax[3].pie(eQ4["listed_in_new"].value_counts().iloc[:], autopct='%1.1f%%') #Quarter 4
ax[4].pie(exploded_movies_new["listed_in_new"].value_counts().iloc[:], autopct='%1.1f%%') #All movies
labels=exploded_movies_new["listed_in_new"].unique()
#Labels
ax[0].set_title("Quarter 1")
ax[1].set_title("Quarter 2")
ax[2].set_title("Quarter 3")
ax[3].set_title("Quarter 4")
ax[4].set_title("All Movies")
plt.legend(labels,bbox_to_anchor=(1.0, 1.0))
autopct='%1.1f%%'

This graph shows that there is a relationship between the genre of a movie and its gross revenue. More specifically, there is a positive relationship between a movie's gross revenue and the action genre. This is shown through the frequency of action movies increasing as the gross revenue increases. There are not any strong relationships between the other four genres and a movie's gross revenue.

To analyze a director's influence on a movie's gross revenue, I created a boxplot of the five most frequent directors and a pie chart of the ten most frequent directors.

In [31]:
#Creates a new dataframe storing movies with only the 10 most popular directors
movies_new = movies_df[movies_df['director'].isin(movies_df['director'].value_counts()[:10].index.tolist())] 
#Spliting the dataframe in half based on gross revenue
Half1=movies_new[movies_new["currentGross"]<= 3.131669e+08]
Half2=movies_new[movies_new["currentGross"]>= 3.131669e+08]
#Dividing it into quarters
dQ1=Half1[Half1["currentGross"]< 1.447342e+08]
dQ2=Half1[Half1["currentGross"]> 1.447342e+08]
dQ3=Half2[Half2["currentGross"]< 5.560239e+08]
dQ4=Half2[Half2["currentGross"]> 5.560239e+08]
In [32]:
fig, ax= plt.subplots(1,1, figsize=(10,6))
Spielberg=movies_df[movies_df['director']=='Steven Spielberg'] #Spielberg's Movies
Scorsese=movies_df[movies_df['director']=='Martin Scorsese'] #Scorsese's Movies
Nolan=movies_df[movies_df['director']=='Christopher Nolan'] #Nolan's Movies
Hitchcock=movies_df[movies_df['director']=='Alfred Hitchcock'] #Hitchcock's Movies
Fincher=movies_df[movies_df['director']=='David Fincher'] #Fincher's Movies
Kubrick=movies_df[movies_df['director']=='Stanley Kubrick'] #Kubrick's Movies

#Merging the new data frames
PopularDirector = pd.DataFrame({'Steven Spielberg': Spielberg['currentGross'],
                             'Martin Scorsese': Scorsese['currentGross'],
                             'Christopher Nolan': Nolan['currentGross'],
                             'Alfred Hitchcock': Hitchcock['currentGross'],
                             'David Fincher': Fincher['currentGross'],
                             'Stanley Kubrick':Kubrick['currentGross']})
ax.set_ylabel("Gross Revenue (in billions)")
fig.tight_layout()
sns.boxplot(data=PopularDirector, palette='Set1', ax=ax) #Displaying the graph
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd1247f5850>

This graph shows that a movie's director significantly impacts a movie's gross revenue. Interestingly, Steven Spielberg's films make signifanctly more money on average than the other directors. For example, only 25% of Steven Spielberg's movies make less money than the most successful movies by Martin Scorsese.

In [33]:
fig, ax= plt.subplots(1,5, figsize=(25,5))
fig.suptitle('Top 10 Directors', fontsize=22)
#Creates five pie charts 
ax[0].pie(dQ1["director"].value_counts().iloc[:], autopct='%1.1f%%') #Quarter 1
ax[1].pie(dQ2["director"].value_counts().iloc[:], autopct='%1.1f%%') #Quarter 2
ax[2].pie(dQ3["director"].value_counts().iloc[:], autopct='%1.1f%%') #Quarter 3
ax[3].pie(dQ4["director"].value_counts().iloc[:], autopct='%1.1f%%') #Quarter 4
ax[4].pie(movies_new["director"].value_counts().iloc[:], autopct='%1.1f%%') #All movies
#Labeling the graph
labels=movies_new["director"].unique()
ax[0].set_title("Quarter 1")
ax[1].set_title("Quarter 2")
ax[2].set_title("Quarter 3")
ax[3].set_title("Quarter 4")
ax[4].set_title("All Movies")
plt.legend(labels,bbox_to_anchor=(1.0, 1.0))
utopct='%1.1f%%' #Adds percentages to each part of the circle

This graph further shows a movie's director's influence on its gross revenue. Even though the box plot showed that Steven Spielberg was the most successful director, this graph shows that Christopher Nolan is. This is interesting because while Steven Spielberg's movies make more money on average, Christopher Nolan's films make up 55.6% of movies in the top 25% of gross revenue, while Spielberg only makes up 22.6%.

Lastly, I created a heatmap showing how strong the correlation is between a movie's gross revenue, release year, duration, director, rating, and genre.

In [34]:
#Creating the correlation for the top 4 directors
director_exploded_movies= exploded_movies[exploded_movies['director'].isin(exploded_movies['director'].value_counts()[:4].index.tolist())]
director_exploded_movies=director_exploded_movies.drop(columns = ['title','listed_in','listed_in_new','rating'])
director_exploded_movies=pd.get_dummies(director_exploded_movies)
director_exploded_movies=director_exploded_movies.corr()
#Creating the correlation for the ratings
rating_exploded_movies=exploded_movies.drop(columns = ['title','listed_in','listed_in_new','director'])
rating_exploded_movies=pd.get_dummies(rating_exploded_movies)
rating_exploded_movies=rating_exploded_movies.corr()
#Creating the correlation for top 4 genres
genre_exploded_movies= exploded_movies[exploded_movies['listed_in_new'].isin(exploded_movies['listed_in_new'].value_counts()[:4].index.tolist())]
genre_exploded_movies=genre_exploded_movies.drop(columns = ['title','listed_in','director','rating'])
genre_exploded_movies=pd.get_dummies(genre_exploded_movies)
genre_exploded_movies=genre_exploded_movies.corr()
#Creating the heatmap
fig, ax =plt.subplots(1,3, figsize=(20,5))
sns.heatmap(director_exploded_movies, annot=True,ax=ax[0])
sns.heatmap(rating_exploded_movies, annot=True, ax=ax[1])
sns.heatmap(genre_exploded_movies, annot=True, ax=ax[2])
fig.tight_layout()
ax[0].set_title("Directors")
ax[1].set_title("Rating")
ax[2].set_title("Genre")
Out[34]:
Text(0.5, 1.0, 'Genre')

This graph shows that the most significant influence on a movie's gross revenue is its director. This is shown through Steven Spielberg's .46 correlation with its current gross revenue. The second strongest influence on a movie's current gross revenue is its genre, and the third largest is its rating. Additionally, this graph shows that a movie's runtime and release year have a correlation of .1 with current gross revenue. While this is not as strong as the other categories, it still greatly influences a movie's gross revenue. This information reinforces the ideas and trends that the previous graphs display. Based on this information, it is essential to include all of these factors when attempting to predict a movie's current gross revenue.


Prediction Model¶

The information above shows a relationship between a movie's rating, genre, release year, runtime, and gross revenue. As a result, it is possible to attempt to create a model that can predict a movie's gross income with these characteristics. To do this, I will use a K Nearest Neighbors function. By using this model, I can attempt to predict a movie's gross revenue by comparing to movies with similar characteristics.

When using this model for forecasting, I need to decide how many neighbors I should use. To do this, I used a cross-validation model with ten folds and the mean absolute error to determine the optimal value of K. By doing this, I learned that the best number for K is 2.

In [35]:
exploded_movies=exploded_movies.dropna()
# Define the features.
features = ["duration", "release_year","director","rating","listed_in_new"]
# Define the training data.
X_train_dict = exploded_movies[features].to_dict(orient="records")
X_dict = exploded_movies[features].to_dict(orient="records")
y=exploded_movies["currentGross"]
# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
model = KNeighborsRegressor(n_neighbors=20)
pipeline = Pipeline([
    ("scaler", scaler),
    ("model", model)
])
cross_val_score(pipeline, X_train, y, 
                cv=15, scoring="neg_mean_absolute_error")
# calculates estimate of test error based on 15-fold cross validation
def get_cv_error(k):
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    mae = np.absolute(-cross_val_score(
        pipeline, X_dict, y, 
        cv=15, scoring="neg_mean_absolute_error"
    )).mean()
    return mae
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
ax = ks.apply(get_cv_error)
ax.argmin()
Out[35]:
2

After this, I created a function to forecast a movie's current gross revenue by using its duration, release year, rating, genre, and director.

In [36]:
def prediction(duration, release_year, rating, listed_in_new, director):
  # Define the features.
  features = ["duration", "release_year","director","rating","listed_in_new"]
  # Define the training data.
  X_train_dict = exploded_movies[features].to_dict(orient="records")
  X_dict = exploded_movies[features].to_dict(orient="records")
  y=exploded_movies["currentGross"]
  X_new_dict = [{
      "duration":duration,
      "release_year":release_year,
      "rating":rating,
      "listed_in_new":listed_in_new,
      "director":director
  }]
  y_train = exploded_movies["currentGross"]

  # Dummy encoding
  vec = DictVectorizer(sparse=False)
  vec.fit(X_train_dict)
  X_train = vec.transform(X_train_dict)
  X_new = vec.transform(X_new_dict)

  # Standardization
  scaler = StandardScaler()
  scaler.fit(X_train)
  X_train_sc = scaler.transform(X_train)
  X_new_sc = scaler.transform(X_new)

  # K-Nearest Neighbors Model
  model = KNeighborsRegressor(n_neighbors=2)
  model.fit(X_train_sc, y_train)
  x=model.predict(X_new_sc)
  revenue=x[0]
  return revenue

For example, here is the predicted gross revenue for The Dark Knight.

In [37]:
print(prediction(152, 2008, 'PG-13', 'Action', 'Christopher Nolan'))
570140931.5

Application¶

Now that I can estimate a movie's gross revenue, there are many possible applications of it. One of these potential applications is to project the gross revenue of movies for Amazon Prime.

For example, I organized the first 100 movies in the amazon dataset from highest projected revenue to lowest.

In [38]:
exploded_amazon["currentGross"]=0 #Makes a new column called currentGross and sets all values equal to 0
for ind in range(100): #Loops through the first 100 movies
  duration=exploded_amazon["duration"][ind]
  release_year=exploded_amazon["release_year"][ind]
  rating=exploded_amazon["rating"][ind]
  listed_in_new=exploded_amazon["listed_in_new"][ind]
  director=exploded_amazon["director"][ind]
  exploded_amazon.loc[ind,"currentGross"]=prediction(duration, release_year, rating, listed_in_new, director)
exploded_amazon["currentGross"]=exploded_amazon["currentGross"].astype(int)

After this, I am able to show which 5 movies have the highest revenue.

In [39]:
exploded_amazon.sort_values(by=['currentGross'], ascending=False).head()
Out[39]:
title director release_year rating duration listed_in listed_in_new currentGross
30 Yellowneck R. John Hugh 1955 R 84 Action, Military and War, Western MilitaryandWar 1127559762
95 Western Gold Howard Bretherton 1937 R 56 Military and War, Western MilitaryandWar 1127559762
91 What Price Crime Albert Herman 1935 R 65 Suspense Suspense 1127559762
40 Women of Valor Buzz Kulik 1986 PG-13 95 Action, Drama Action 821961324
87 When the Lights Go On Again William K. Howard 1944 PG-13 103 Drama, Military and War MilitaryandWar 740089392

Conclusion¶

Despite this algorithm taking many factors into account, there are several factors that this model needs to consider, which hinder its accuracy. This includes the movie's cast, the public demand for it, and any elements outside of the film itself that influence its gross revenue. This is shown through the model only having an accuracy of .522, which is fairly low.

In [40]:
X_train_sc = scaler.transform(X_train)
model = KNeighborsRegressor(n_neighbors=2)
model.fit(X_train_sc, y)
y_train_pred=model.predict(X_train_sc)
accuracy = (y_train_pred == y).mean()
accuracy
Out[40]:
0.5224839400428265

Despite the low accuracy of this model, this model does reveal several interesting factors about the film industry. This includes that a movie's director has the most substantial influence on a movie's gross revenue, compared to its rating, genre, runtime, and release year. It also has shown that action and PG-13 movies typically make more money than movies of different genres and ratings.