nhslogo CS4132 Data Analytics

What makes a movie successful? by [Kai Wen]¶

Important Note: Please keep your report concise and relevant (i.e. show only relevant steps and visualizations used to answer your research questions).

Table of Content (with relevant hyperlinks to sections)¶

  1. Motivation & Background
  2. Summary of Research Questions & Results
  3. Dataset
  4. Methodology
    1. Data Acquisition
    2. Data Cleaning
    3. Data Exploration and Analysis
  5. Results
    1. Research Question 1
    2. Research Question 2
    3. Research Question 3
    4. Research Question 4
    5. Research Question 5
  6. References

Motivation and Background¶

Give an overview of the project, motivation, background and goals.

Movies are a huge portion of the entertainment industry, with the movie industry approximately worth more than $100 billion globally. Movies are a popular and widespread form of entertainment amongst people, with top movies grossing millions of dollars at the box office. However, as movies require a lot of initial capital to produce, they can be a risky endeavour for investors and production companies alike.

Hence, it is important for the movie industry to find out the different factors that affect the revenue that a movie generate, and maximise the quality of production, as well as the optimal time to release a movie.

Summary of Research Questions & Results¶

Repeat your research questions in a numbered list. After each research question, clearly state the answer/conclusion you determined. Do not give details or justifications yet — just the answer
  1. What are the best genres for a movie?
    • Adventure is the best genre. The other top 5 genres are Animation, Sci-Fi, Action, Family.
  2. What other factors affect a movie's revenue?
    • There is a large positive correlation between budget and movie revenue.
    • English is the best original language for a movie.
    • Chinese, Latin, Japanese, Korean, Spanish and English are the best languages for a movie to be spoken in.
    • The keyword which makes the most revenue is "superhero".
    • One should also base our movie on existing literature (eg. books, comics, real life story), or be a sequel to pre-existing movies to increase the chance of success.
  3. Who should you hire to produce a movie?
    • The actor which contributes highest revenue to a movie is Robert Downey Jr. For actresses, it is Emma Watson. For directors, it is Michael Bay. For writers, it is Larry Lieber.
    • Directors bring the most to a movie's success, and writers bring the least. Hence, one should focus most budget on hiring directors, and least on writers.
    • Gender plays no difference in how much revenue they bring to a movie, hence one should focus budget equally on hiring actors and actresses.
    • Walt Disney Pictures is the best production company, so one should seek to emulate them or hire them.
  4. When should one release a movie?

    • In June, July, December
    • on a Friday
    • on the 1st or 25th of the month.
  5. Does popularity correlate with revenue?

    • There is a weak positive correlation between revenue and popularity.

Dataset¶

Numbered list of dataset (with downloadable links) and a brief but clear description of each dataset used. Draw reference to the numbering when describing methodology (data cleaning and analysis).
  1. IMDB metadata.csv (contains metadata about each movie from IMDB)
  2. IMDB director writers.csv (contains all the directors and writers for each movie)
  3. IMDB ratings.csv (contains average rating and vote counts for each movie)
  4. IMDB names.csv (contains information about names in crew / principals, eg. name, profession, titles they are known for)
  5. IMDB principal cast crew.csv (contains all the principle cast / crew for each movie)
  6. tmdb.xlsx (contains more metadata about each movie from TMDB)
  7. language-codes.json (contains language codes for each country)

For 1-5, downloaded from https://www.imdb.com/interfaces/ (data at https://datasets.imdbws.com/)

For 6, webscraped from https://www.themoviedb.org/ using TMDB api

For 7, taken from https://gist.github.com/josantonius/b455e315bc7f790d14b136d61d9ae469

Link to uploaded datasets here https://nushighedu-my.sharepoint.com/:f:/g/personal/h1910065_nushigh_edu_sg/Ekr9HahnBB1Ekkw3ntGH2m8BvUrSyQ15IFgTv-qbemV2kA?e=kpqog9

Methodology¶

You should demonstrate the data science life cycle here (from data acquisition to cleaning to EDA and analysis etc).

Data Acquisition ¶

Display the data which will be used in the project. The data should be saved in .xlsx or .csv format to be submitted with the project. If webscraping has been done to obtain your data, save your webscraping code in another jupyter notebook as appendix to be submitted separately from the report. Import and display each dataset in a dataframe. For each dataset, give a brief overview of the data it contains, and explain the meaning of columns that are relevant to the project.
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from scipy import stats
from wordcloud import WordCloud

Dataset 1: IMDB metadata.csv

This dataset contains metadata about imdb entries, including the genres, format (tv, movie) in titleType, release year in startYear, runtime, is it an adult title. tconst is the unique IMDB id for each entry.

In [2]:
imdb_metadata = pd.read_csv('datasets/IMDB metadata.csv', index_col=0,na_values="\\N")
print(imdb_metadata.tconst.value_counts().count())
imdb_metadata.head()
C:\Users\kaiwen\AppData\Local\Temp\ipykernel_23184\2907941367.py:1: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False.
  imdb_metadata = pd.read_csv('datasets/IMDB metadata.csv', index_col=0,na_values="\\N")
9126554
Out[2]:
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
0 tt0000001 short Carmencita Carmencita 0.0 1894.0 NaN 1.0 Documentary,Short
1 tt0000002 short Le clown et ses chiens Le clown et ses chiens 0.0 1892.0 NaN 5.0 Animation,Short
2 tt0000003 short Pauvre Pierrot Pauvre Pierrot 0.0 1892.0 NaN 4.0 Animation,Comedy,Romance
3 tt0000004 short Un bon bock Un bon bock 0.0 1892.0 NaN 12.0 Animation,Short
4 tt0000005 short Blacksmith Scene Blacksmith Scene 0.0 1893.0 NaN 1.0 Comedy,Short

Dataset 2: IMDB director writers.csv

This dataset contains the id of directors and writers for each entry. More information can be found about the directors and writers in Dataset 4.

In [3]:
imdb_directors_writers = pd.read_csv('datasets/IMDB director writers.csv', index_col=0,na_values="\\N")
imdb_directors_writers.head()
Out[3]:
tconst directors writers
0 tt0000001 nm0005690 NaN
1 tt0000002 nm0721526 NaN
2 tt0000003 nm0721526 NaN
3 tt0000004 nm0721526 NaN
4 tt0000005 nm0005690 NaN

Dataset 3: IMDB ratings.csv

This dataset contains the average IMDB rating and number of votes for each entry.

In [4]:
imdb_ratings = pd.read_csv('datasets/IMDB ratings.csv', index_col=0,na_values="\\N")
print(imdb_ratings.tconst.value_counts().count())
imdb_ratings.head()
1258745
Out[4]:
tconst averageRating numVotes
0 tt0000001 5.7 1903
1 tt0000002 5.9 255
2 tt0000003 6.5 1698
3 tt0000004 5.7 167
4 tt0000005 6.2 2518

Dataset 4: IMDB names.csv

This dataset contains information about all the people referenced in other IMDB datasets, using the name id (nconst), including their name, birth and death year, primary profession, and movies they are known for.

In [5]:
imdb_names = pd.read_csv("datasets/IMDB names.csv", na_values="\\N", index_col=0)
imdb_names.head()
Out[5]:
nconst primaryName birthYear deathYear primaryProfession knownForTitles
0 nm0000001 Fred Astaire 1899.0 1987.0 soundtrack,actor,miscellaneous tt0031983,tt0050419,tt0053137,tt0072308
1 nm0000002 Lauren Bacall 1924.0 2014.0 actress,soundtrack tt0038355,tt0071877,tt0117057,tt0037382
2 nm0000003 Brigitte Bardot 1934.0 NaN actress,soundtrack,music_department tt0054452,tt0049189,tt0057345,tt0056404
3 nm0000004 John Belushi 1949.0 1982.0 actor,soundtrack,writer tt0077975,tt0072562,tt0078723,tt0080455
4 nm0000005 Ingmar Bergman 1918.0 2007.0 writer,director,actor tt0050976,tt0083922,tt0060827,tt0050986

Dataset 5: IMDB principal cast crew.csv

This dataset contains information about the principal cast crew of each movie. Each row is a person, and tconst is the movie id.

In [6]:
imdb_cast_crew = pd.read_csv("datasets/IMDB principal cast crew.csv", na_values="\\N", index_col=0)
imdb_cast_crew.head()
Out[6]:
tconst ordering nconst category job characters
0 tt0000001 1 nm1588970 self NaN ["Self"]
1 tt0000001 2 nm0005690 director NaN NaN
2 tt0000001 3 nm0374658 cinematographer director of photography NaN
3 tt0000002 1 nm0721526 director NaN NaN
4 tt0000002 2 nm1335271 composer NaN NaN

Dataset 6: tmdb.xlsx

This dataset contains more movie metadata from webscraping TMDB. It contains the imdb_id (for merging with IMDB database), status (indicator if the TMDB entry existed), budget, revenue, release_date, original_language, TMDB popularity, production_companies, spoken_languages, movie keywords.

In [7]:
tmdb = pd.read_excel("datasets/tmdb.xlsx", index_col=0, header=0, na_values="\\N")
tmdb
Out[7]:
imdb_id status budget revenue release_date original_language popularity production_companies production_countries spoken_languages keywords
0 tt0002130 Released 0.0 0.0 1911-07-01 it 6.472 [[['Milano Films'], ['']]] ['IT'] [] ['hell', 'afterlife', 'purgatory', 'devil', 'd...
1 tt0002844 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 tt0003014 Released 0.0 0.0 1913-10-27 sv 3.308 [[['Svenska Biografteatern'], ['SE']]] ['SE'] [] ['foster parents', 'widow', 'run away', 'reuni...
3 tt0003037 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 tt0003165 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ...
35930 tt3706352 Released 0.0 0.0 2022-04-08 en 162.576 [[['Nick Wechsler Productions'], ['US']], [['C... ['US'] ['en'] ['spy', 'cia', 'based on novel or book', 'cons...
35931 tt6710474 Released 25000000.0 97000000.0 2022-03-24 en 190.306 [[['A24'], ['US']], [['AGBO'], ['US']], [['Ley... ['US'] ['cn', 'zh', 'en'] ['mother', 'philosophy', 'martial arts', 'kung...
35932 tt8323668 Released 0.0 0.0 2022-06-15 en 79.847 [[['Netter Productions'], ['US']], [['Levantin... ['US'] ['en'] ['michigan', 'small town', 'massachusetts', 'h...
35933 tt15560626 Released 0.0 0.0 2022-01-21 ml 3.905 [[['Shane Nigam Films'], ['IN']], [['Plan T Fi... ['IN'] ['ml'] []
35934 tt11680472 Released 0.0 0.0 2022-06-17 pa 0.600 [[['Humble Motion Pictures'], ['']]] ['IN'] ['pa'] []

35935 rows × 11 columns

Dataset 7: language-codes.json

This dataset contains the language codes as the index, and its associated country.

In [8]:
language_codes = pd.DataFrame([pd.read_json("datasets/language-codes.json",  typ='series')]).T
language_codes
Out[8]:
0
aa Afar
ab Abkhazian
af Afrikaans
am Amharic
ar Arabic
... ...
zh-hk Chinese (Hong Kong SAR)
zh-mo Chinese (Macau SAR)
zh-sg Chinese (Singapore)
zh-tw Chinese (Taiwan)
zu Zulu

217 rows × 1 columns

Data Cleaning¶

For data cleaning, be clear in which dataset (or variables) are used, what has been done for missing data, how was merging performed, explanation of data transformation (if any). If data is calculated or summarized from the raw dataset, explain the rationale and steps clearly.
In [9]:
language_codes.insert(0,"code", language_codes.index)
language_codes.columns = ["code", "country"]
languages_codes = language_codes.reset_index(drop=True)
languages_codes
Out[9]:
code country
0 aa Afar
1 ab Abkhazian
2 af Afrikaans
3 am Amharic
4 ar Arabic
... ... ...
212 zh-hk Chinese (Hong Kong SAR)
213 zh-mo Chinese (Macau SAR)
214 zh-sg Chinese (Singapore)
215 zh-tw Chinese (Taiwan)
216 zu Zulu

217 rows × 2 columns

In [10]:
# IMDB contains 9 million entries, however, not all of them are movies. IMDB tells us the type of entry under "titleType".
# As we can see, most of the entries are TV episodes and only 617619 are movies.
# We will drop all the entries that are not movies. 
# We will also delete the titleType and endYear columns. Movies do not have an endYear, only tv episodes or other such entries have endYears.

print(imdb_metadata.titleType.value_counts())
imdb_movie_metadata = (imdb_metadata.loc[imdb_metadata.titleType == "movie"]).reset_index(drop=True)
del imdb_movie_metadata["titleType"]
del imdb_movie_metadata["endYear"]
imdb_movie_metadata
tvEpisode       6871485
short            882541
movie            617619
video            263832
tvSeries         229177
tvMovie          137074
tvMiniSeries      44596
tvSpecial         37840
videoGame         31760
tvShort           10628
tvPilot               2
Name: titleType, dtype: int64
Out[10]:
tconst primaryTitle originalTitle isAdult startYear runtimeMinutes genres
0 tt0000009 Miss Jerry Miss Jerry 0.0 1894.0 45.0 Romance
1 tt0000502 Bohemios Bohemios 0.0 1905.0 100.0 NaN
2 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 0.0 1906.0 70.0 Action,Adventure,Biography
3 tt0000591 The Prodigal Son L'enfant prodigue 0.0 1907.0 90.0 Drama
4 tt0000615 Robbery Under Arms Robbery Under Arms 0.0 1907.0 NaN Drama
... ... ... ... ... ... ... ...
617614 tt9916622 Rodolpho Teóphilo - O Legado de um Pioneiro Rodolpho Teóphilo - O Legado de um Pioneiro 0.0 2015.0 57 Documentary
617615 tt9916680 De la ilusión al desconcierto: cine colombiano... De la ilusión al desconcierto: cine colombiano... 0.0 2007.0 100 Documentary
617616 tt9916706 Dankyavar Danka Dankyavar Danka 0.0 2013.0 NaN Comedy
617617 tt9916730 6 Gunn 6 Gunn 0.0 2017.0 116 NaN
617618 tt9916754 Chico Albuquerque - Revelações Chico Albuquerque - Revelações 0.0 2013.0 49 Documentary

617619 rows × 7 columns

In [11]:
# here, we merge the IMDB datasets together based on tconst (IMDB id)

imdb = pd.merge(imdb_movie_metadata, imdb_ratings, how="left", on="tconst")
imdb = pd.merge(imdb, imdb_directors_writers, how="left", on="tconst")
imdb.info()
imdb.runtimeMinutes = pd.to_numeric(imdb.runtimeMinutes) #change type of runtimeMinutes to float
imdb
<class 'pandas.core.frame.DataFrame'>
Int64Index: 617619 entries, 0 to 617618
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          617619 non-null  object 
 1   primaryTitle    617619 non-null  object 
 2   originalTitle   617619 non-null  object 
 3   isAdult         617619 non-null  float64
 4   startYear       536060 non-null  float64
 5   runtimeMinutes  389711 non-null  object 
 6   genres          545941 non-null  object 
 7   averageRating   280695 non-null  float64
 8   numVotes        280695 non-null  float64
 9   directors       552359 non-null  object 
 10  writers         467056 non-null  object 
dtypes: float64(4), object(7)
memory usage: 56.5+ MB
Out[11]:
tconst primaryTitle originalTitle isAdult startYear runtimeMinutes genres averageRating numVotes directors writers
0 tt0000009 Miss Jerry Miss Jerry 0.0 1894.0 45.0 Romance 5.3 198.0 nm0085156 nm0085156
1 tt0000502 Bohemios Bohemios 0.0 1905.0 100.0 NaN 4.5 14.0 nm0063413 nm0063413,nm0657268,nm0675388
2 tt0000574 The Story of the Kelly Gang The Story of the Kelly Gang 0.0 1906.0 70.0 Action,Adventure,Biography 6.0 779.0 nm0846879 nm0846879
3 tt0000591 The Prodigal Son L'enfant prodigue 0.0 1907.0 90.0 Drama 4.7 19.0 nm0141150 nm0141150
4 tt0000615 Robbery Under Arms Robbery Under Arms 0.0 1907.0 NaN Drama 4.5 23.0 nm0533958 nm0092809,nm0533958
... ... ... ... ... ... ... ... ... ... ... ...
617614 tt9916622 Rodolpho Teóphilo - O Legado de um Pioneiro Rodolpho Teóphilo - O Legado de um Pioneiro 0.0 2015.0 57.0 Documentary NaN NaN nm9272491,nm9272490 nm9272490,nm9272491
617615 tt9916680 De la ilusión al desconcierto: cine colombiano... De la ilusión al desconcierto: cine colombiano... 0.0 2007.0 100.0 Documentary NaN NaN nm0652213 nm0652213,nm10538576
617616 tt9916706 Dankyavar Danka Dankyavar Danka 0.0 2013.0 NaN Comedy NaN NaN nm7764440 nm7933903
617617 tt9916730 6 Gunn 6 Gunn 0.0 2017.0 116.0 NaN 8.4 6.0 nm10538612 nm10538612
617618 tt9916754 Chico Albuquerque - Revelações Chico Albuquerque - Revelações 0.0 2013.0 49.0 Documentary NaN NaN nm9272490,nm8349149 nm8349149,nm9272490

617619 rows × 11 columns

In [12]:
# Dealing with missing values in IMDB datasets
# I drop rows with null entries in startYear, runtimeMinutes, genres, avgrating and numvotes
# This is movies have missing values in these columns typically because they are not so well known, or were released from a very long time ago, and we are not interested in analysing these movies.
# Also, I have a lot of data already and I do not need so much 

imdb = imdb.dropna(subset=["genres", "startYear", "runtimeMinutes", "averageRating", "numVotes"]).reset_index(drop=True)

# Use info and describe to check our dataset
imdb.info()
imdb.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246764 entries, 0 to 246763
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          246764 non-null  object 
 1   primaryTitle    246764 non-null  object 
 2   originalTitle   246764 non-null  object 
 3   isAdult         246764 non-null  float64
 4   startYear       246764 non-null  float64
 5   runtimeMinutes  246764 non-null  float64
 6   genres          246764 non-null  object 
 7   averageRating   246764 non-null  float64
 8   numVotes        246764 non-null  float64
 9   directors       244834 non-null  object 
 10  writers         221119 non-null  object 
dtypes: float64(5), object(6)
memory usage: 20.7+ MB
Out[12]:
isAdult startYear runtimeMinutes averageRating numVotes
count 246764.000000 246764.000000 246764.000000 246764.000000 2.467640e+05
mean 0.015436 1993.997208 94.364210 6.142329 3.994537e+03
std 0.123279 25.675571 143.554644 1.322219 3.643640e+04
min 0.000000 1894.000000 1.000000 1.000000 5.000000e+00
25% 0.000000 1977.000000 81.000000 5.400000 2.200000e+01
50% 0.000000 2004.000000 91.000000 6.300000 7.700000e+01
75% 0.000000 2014.000000 103.000000 7.000000 3.870000e+02
max 1.000000 2022.000000 51420.000000 10.000000 2.620772e+06
In [13]:
# Make sure that we do not include documentaries in our dataset, as they are not movies
imdb = imdb[imdb['genres'].str.contains('Documentary') == False]

# As seen from .describe(), there is something wrong with runtime minutes as the minimum time is 1min, which is clearly not a legitimate movie
# According to the Academy of Motion Picture Arts and Sciences, the American Film Institute and the British Film Institute, 
# a feature film is one that runs for more than 40 minutes
# Hence, we will drop all movies which are <= 40min
imdb = imdb[imdb['runtimeMinutes'] > 40]

# We are only interested in movies which are not obscure.
# Some movies in IMDB dataset only have a few votes, which can skew our results.
# Hence, we will drop all movies which have less than 1000 votes.

imdb = imdb[imdb['numVotes'] >= 1000]

imdb.info()
imdb = imdb.sort_values(by = "startYear").reset_index(drop=True)
imdb
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35935 entries, 16 to 246762
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          35935 non-null  object 
 1   primaryTitle    35935 non-null  object 
 2   originalTitle   35935 non-null  object 
 3   isAdult         35935 non-null  float64
 4   startYear       35935 non-null  float64
 5   runtimeMinutes  35935 non-null  float64
 6   genres          35935 non-null  object 
 7   averageRating   35935 non-null  float64
 8   numVotes        35935 non-null  float64
 9   directors       35842 non-null  object 
 10  writers         35658 non-null  object 
dtypes: float64(5), object(6)
memory usage: 3.3+ MB
Out[13]:
tconst primaryTitle originalTitle isAdult startYear runtimeMinutes genres averageRating numVotes directors writers
0 tt0002130 Dante's Inferno L'Inferno 0.0 1911.0 71.0 Adventure,Drama,Fantasy 7.0 2888.0 nm0078205,nm0209738,nm0655824 nm0019604
1 tt0002844 Fantômas: In the Shadow of the Guillotine Fantômas - À l'ombre de la guillotine 0.0 1913.0 54.0 Crime,Drama 6.9 2271.0 nm0275421 nm0019855,nm0275421,nm0816232
2 tt0003014 Ingeborg Holm Ingeborg Holm 0.0 1913.0 96.0 Drama 7.0 1226.0 nm0803705 nm0472236,nm0803705
3 tt0003037 Fantomas: The Man in Black Juve contre Fantômas 0.0 1913.0 61.0 Crime,Drama 6.9 1568.0 nm0275421 nm0019855,nm0275421,nm0816232
4 tt0003165 Fantômas: The Dead Man Who Killed Le mort qui tue 0.0 1913.0 90.0 Crime,Drama,Mystery 6.9 1233.0 nm0275421 nm0019855,nm0275421,nm0816232
... ... ... ... ... ... ... ... ... ... ... ...
35930 tt11497716 The Legend The Legend 0.0 2022.0 161.0 Action,Drama,Romance 2.6 1311.0 nm1504071,nm11226131 nm1474787
35931 tt21026586 Karan Arjun Karan Arjun 0.0 2022.0 106.0 Action,Thriller 7.9 1860.0 nm13757693 nm13746589
35932 tt8130518 Dark Cloud Dark Cloud 0.0 2022.0 81.0 Horror,Sci-Fi,Thriller 4.6 4211.0 nm5133487 nm7389445
35933 tt10160886 Sneakerella Sneakerella 0.0 2022.0 112.0 Comedy,Family,Musical 4.5 1331.0 nm0020491 nm4686553,nm0330719,nm2922750,nm1760850,nm2309894
35934 tt11680472 Posti Posti 0.0 2022.0 126.0 Comedy,Drama 9.2 1174.0 nm2378766 nm2378766

35935 rows × 11 columns

In [14]:
# imdb.to_excel("imdb_cleaned.xlsx")
In [15]:
# merge tmdb and imdb datasets
imdb.columns = ["imdb_id"] + list(imdb.columns[1:])
all_mdb = imdb.merge(tmdb, on = "imdb_id", how = "left")
all_mdb
Out[15]:
imdb_id primaryTitle originalTitle isAdult startYear runtimeMinutes genres averageRating numVotes directors ... status budget revenue release_date original_language popularity production_companies production_countries spoken_languages keywords
0 tt0002130 Dante's Inferno L'Inferno 0.0 1911.0 71.0 Adventure,Drama,Fantasy 7.0 2888.0 nm0078205,nm0209738,nm0655824 ... Released 0.0 0.0 1911-07-01 it 6.472 [[['Milano Films'], ['']]] ['IT'] [] ['hell', 'afterlife', 'purgatory', 'devil', 'd...
1 tt0002844 Fantômas: In the Shadow of the Guillotine Fantômas - À l'ombre de la guillotine 0.0 1913.0 54.0 Crime,Drama 6.9 2271.0 nm0275421 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 tt0003014 Ingeborg Holm Ingeborg Holm 0.0 1913.0 96.0 Drama 7.0 1226.0 nm0803705 ... Released 0.0 0.0 1913-10-27 sv 3.308 [[['Svenska Biografteatern'], ['SE']]] ['SE'] [] ['foster parents', 'widow', 'run away', 'reuni...
3 tt0003037 Fantomas: The Man in Black Juve contre Fantômas 0.0 1913.0 61.0 Crime,Drama 6.9 1568.0 nm0275421 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 tt0003165 Fantômas: The Dead Man Who Killed Le mort qui tue 0.0 1913.0 90.0 Crime,Drama,Mystery 6.9 1233.0 nm0275421 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35930 tt11497716 The Legend The Legend 0.0 2022.0 161.0 Action,Drama,Romance 2.6 1311.0 nm1504071,nm11226131 ... Released 0.0 0.0 2022-07-28 ta 1.121 [[['The Legend Saravana Stores Creations'], ['... ['IN'] ['ta'] []
35931 tt21026586 Karan Arjun Karan Arjun 0.0 2022.0 106.0 Action,Thriller 7.9 1860.0 nm13757693 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
35932 tt8130518 Dark Cloud Dark Cloud 0.0 2022.0 81.0 Horror,Sci-Fi,Thriller 4.6 4211.0 nm5133487 ... Released 0.0 0.0 2022-05-09 en 243.426 [] ['US'] ['en'] []
35933 tt10160886 Sneakerella Sneakerella 0.0 2022.0 112.0 Comedy,Family,Musical 4.5 1331.0 nm0020491 ... Released 0.0 0.0 2022-05-11 en 61.582 [[['Walt Disney Pictures'], ['US']], [['Jane S... ['US'] ['en'] ['musical']
35934 tt11680472 Posti Posti 0.0 2022.0 126.0 Comedy,Drama 9.2 1174.0 nm2378766 ... Released 0.0 0.0 2022-06-17 pa 0.600 [[['Humble Motion Pictures'], ['']]] ['IN'] ['pa'] []

35935 rows × 21 columns

In [16]:
# Dealing with missing values in TMDB dataset
all_mdb.info()

# NaN in "Status" column means that there was no result for that entry when we webscraped TMDB
# As there are only ~200 NaNs in "Status", we can drop these entries

all_mdb = all_mdb.dropna(subset = ["status"]).reset_index(drop=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35935 entries, 0 to 35934
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   imdb_id               35935 non-null  object 
 1   primaryTitle          35935 non-null  object 
 2   originalTitle         35935 non-null  object 
 3   isAdult               35935 non-null  float64
 4   startYear             35935 non-null  float64
 5   runtimeMinutes        35935 non-null  float64
 6   genres                35935 non-null  object 
 7   averageRating         35935 non-null  float64
 8   numVotes              35935 non-null  float64
 9   directors             35842 non-null  object 
 10  writers               35658 non-null  object 
 11  status                35776 non-null  object 
 12  budget                35776 non-null  float64
 13  revenue               35776 non-null  float64
 14  release_date          35762 non-null  object 
 15  original_language     35776 non-null  object 
 16  popularity            35776 non-null  float64
 17  production_companies  35776 non-null  object 
 18  production_countries  35776 non-null  object 
 19  spoken_languages      35776 non-null  object 
 20  keywords              35776 non-null  object 
dtypes: float64(8), object(13)
memory usage: 6.0+ MB
In [17]:
# Data cleaning the data webscraped from TMDB, turning strings into lists

# Turn genres, directors, writers into lists.

all_mdb.loc[:,"genres"] = all_mdb.loc[:,"genres"].str.split(",")
all_mdb.loc[:,"directors"] = all_mdb.loc[:,"directors"].str.split(",")
all_mdb.loc[:,"writers"] = all_mdb.loc[:,"writers"].str.split(",")

# I webscraped the following columns as lists, but
# Pandas stores lists as strings in a csv, which is what we read from, so we must convert it back to a list using literal_eval, which is safer than eval
# We do this for production companies, production countries, spoken languages, and keywords.

# As literal_eval can't process NaN values, we convert it to an empty list. After the eval, we convert the empty lists back to NaN.

all_mdb.loc[:, "production_companies"] = all_mdb.loc[:, "production_companies"].replace(np.nan, "[]").apply(lambda x: literal_eval(str(x)))
all_mdb.loc[:, "production_companies"].loc[all_mdb.loc[:, "production_companies"].apply(lambda x: x == [])] = np.nan

all_mdb.loc[:, "production_countries"] = all_mdb.loc[:, "production_countries"].replace(np.nan, "[]").apply(lambda x: literal_eval(str(x)))
all_mdb.loc[:, "production_countries"].loc[all_mdb.loc[:, "production_countries"].apply(lambda x: x == [])] = np.nan

all_mdb.loc[:, "spoken_languages"] = all_mdb.loc[:, "spoken_languages"].replace(np.nan, "[]").apply(lambda x: literal_eval(str(x)))
all_mdb.loc[:, "spoken_languages"].loc[all_mdb.loc[:, "spoken_languages"].apply(lambda x: x == [])] = np.nan

all_mdb.loc[:, "keywords"] = all_mdb.loc[:, "keywords"].replace(np.nan, "[]").apply(lambda x: literal_eval(str(x)))
all_mdb.loc[:, "keywords"].loc[all_mdb.loc[:, "keywords"].apply(lambda x: x == [])] = np.nan
C:\Users\kaiwen\AppData\Local\Temp\ipykernel_23184\270989945.py:16: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  all_mdb.loc[:, "production_companies"].loc[all_mdb.loc[:, "production_companies"].apply(lambda x: x == [])] = np.nan
C:\Users\kaiwen\AppData\Local\Temp\ipykernel_23184\270989945.py:19: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  all_mdb.loc[:, "production_countries"].loc[all_mdb.loc[:, "production_countries"].apply(lambda x: x == [])] = np.nan
C:\Users\kaiwen\AppData\Local\Temp\ipykernel_23184\270989945.py:22: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  all_mdb.loc[:, "spoken_languages"].loc[all_mdb.loc[:, "spoken_languages"].apply(lambda x: x == [])] = np.nan
C:\Users\kaiwen\AppData\Local\Temp\ipykernel_23184\270989945.py:25: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  all_mdb.loc[:, "keywords"].loc[all_mdb.loc[:, "keywords"].apply(lambda x: x == [])] = np.nan
In [18]:
all_mdb.info()
all_mdb.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35776 entries, 0 to 35775
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   imdb_id               35776 non-null  object 
 1   primaryTitle          35776 non-null  object 
 2   originalTitle         35776 non-null  object 
 3   isAdult               35776 non-null  float64
 4   startYear             35776 non-null  float64
 5   runtimeMinutes        35776 non-null  float64
 6   genres                35776 non-null  object 
 7   averageRating         35776 non-null  float64
 8   numVotes              35776 non-null  float64
 9   directors             35683 non-null  object 
 10  writers               35525 non-null  object 
 11  status                35776 non-null  object 
 12  budget                35776 non-null  float64
 13  revenue               35776 non-null  float64
 14  release_date          35762 non-null  object 
 15  original_language     35776 non-null  object 
 16  popularity            35776 non-null  float64
 17  production_companies  33062 non-null  object 
 18  production_countries  34858 non-null  object 
 19  spoken_languages      35247 non-null  object 
 20  keywords              27977 non-null  object 
dtypes: float64(8), object(13)
memory usage: 5.7+ MB
Out[18]:
imdb_id primaryTitle originalTitle isAdult startYear runtimeMinutes genres averageRating numVotes directors ... status budget revenue release_date original_language popularity production_companies production_countries spoken_languages keywords
0 tt0002130 Dante's Inferno L'Inferno 0.0 1911.0 71.0 [Adventure, Drama, Fantasy] 7.0 2888.0 [nm0078205, nm0209738, nm0655824] ... Released 0.0 0.0 1911-07-01 it 6.472 [[[Milano Films], []]] [IT] NaN [hell, afterlife, purgatory, devil, death, bas...
1 tt0003014 Ingeborg Holm Ingeborg Holm 0.0 1913.0 96.0 [Drama] 7.0 1226.0 [nm0803705] ... Released 0.0 0.0 1913-10-27 sv 3.308 [[[Svenska Biografteatern], [SE]]] [SE] NaN [foster parents, widow, run away, reunion, pov...
2 tt0003419 The Student of Prague Der Student von Prag 0.0 1913.0 85.0 [Drama, Fantasy, Horror] 6.4 2119.0 [nm0753233, nm0917467] ... Released 0.0 0.0 1913-08-22 de 2.453 [[[Deutsche Bioscope], [DE]]] [DE] NaN [prague, czech republic, mysterious man, silen...
3 tt0004707 Tillie's Punctured Romance Tillie's Punctured Romance 0.0 1914.0 82.0 [Comedy] 6.3 3531.0 [nm0784407, nm0071658] ... Released 0.0 0.0 1914-11-14 en 3.555 [[[Keystone Film Company], [US]]] [US] NaN [revenge, millionaire, silent film, curtain call]
4 tt0004465 The Perils of Pauline The Perils of Pauline 0.0 1914.0 199.0 [Action, Adventure, Drama] 6.3 1037.0 [nm0309163, nm0533289] ... Released 0.0 0.0 1914-03-31 en 1.515 [[[The General and Eclectic Film Company], []]] [US] NaN [silent film, movie serial, damsel in distress...

5 rows × 21 columns

We define successful movies as movies which revenue is at least 2.5x of their budget, revenue & budget > 0.

In [19]:
successful_movies = all_mdb.loc[(all_mdb.budget > 0) & (all_mdb.revenue > 0) & (all_mdb.revenue / all_mdb.budget >= 2.5)].reset_index(drop=True)
successful_movies.head()
Out[19]:
imdb_id primaryTitle originalTitle isAdult startYear runtimeMinutes genres averageRating numVotes directors ... status budget revenue release_date original_language popularity production_companies production_countries spoken_languages keywords
0 tt0005078 The Cheat The Cheat 0.0 1915.0 59.0 [Drama, Romance] 6.5 2572.0 [nm0001124] ... Released 17311.0 137365.0 1915-12-13 en 2.375 [[[Paramount], [US]], [[Jesse L. Lasky Feature... [US] NaN NaN
1 tt0004972 The Birth of a Nation The Birth of a Nation 0.0 1915.0 195.0 [Drama, History, War] 6.2 24669.0 [nm0000428] ... Released 100000.0 11000000.0 1915-02-08 en 11.033 [[[Epoch Film Co.], []]] [US] NaN [ku klux klan, southern usa, army, racist, sil...
2 tt0006333 20,000 Leagues Under the Sea 20,000 Leagues Under the Sea 0.0 1916.0 85.0 [Action, Adventure, Sci-Fi] 6.1 1802.0 [nm0665737] ... Released 200000.0 8000000.0 1916-12-24 en 2.897 [[[The Universal Film Mfg. Co.], []]] [BS, US] [en] [revenge, captain nemo]
3 tt0009369 Mickey Mickey 0.0 1918.0 93.0 [Comedy, Drama] 6.1 1124.0 [nm0428059, nm0949648] ... Released 250000.0 8000000.0 1918-08-01 en 1.098 [[[Mabel Normand Feature Film Company], []]] [US] [en] [new york city, horse race, fish out of water,...
4 tt0012349 The Kid The Kid 0.0 1921.0 68.0 [Comedy, Drama, Family] 8.3 125518.0 [nm0000122] ... Released 250000.0 2500000.0 1921-01-21 en 11.362 [[[Charles Chaplin Productions], []]] [US] NaN [angel, suicide attempt, fistfight, slapstick ...

5 rows × 21 columns

In [20]:
successful_movies.describe()
Out[20]:
isAdult startYear runtimeMinutes averageRating numVotes budget revenue popularity
count 3216.000000 3216.000000 3216.000000 3216.000000 3.216000e+03 3.216000e+03 3.216000e+03 3216.000000
mean 0.000933 1998.135572 112.077425 6.749627 1.614920e+05 2.918425e+07 1.492110e+08 41.580716
std 0.030533 19.711077 23.025145 0.999630 2.384920e+05 4.544593e+07 2.350687e+08 124.345361
min 0.000000 1915.000000 59.000000 1.300000 1.000000e+03 1.000000e+00 1.400000e+01 0.600000
25% 0.000000 1988.000000 96.000000 6.200000 2.066275e+04 3.000000e+06 1.902573e+07 10.142000
50% 0.000000 2004.000000 108.000000 6.800000 7.690500e+04 1.100000e+07 6.516250e+07 18.334500
75% 0.000000 2013.000000 124.000000 7.500000 2.002195e+05 3.325000e+07 1.737172e+08 39.056750
max 1.000000 2022.000000 321.000000 9.400000 2.592594e+06 3.800000e+08 2.847246e+09 4156.973000

However, as we see from the .describe(), the revenue and budget is slightly wrong. The min budget is 0, and the min revenue is 14, which is obviously wrong. For revenue, as 10 million is already the 25% percentile, we can drop revenue below 1 million. For budget, as 3 million is already the 1st percentile, we can drop budget below half a million.

In [21]:
successful_movies = successful_movies.loc[successful_movies.revenue >= 1000000].reset_index(drop=True)
successful_movies = successful_movies.loc[successful_movies.budget >= 500000].reset_index(drop=True)
successful_movies.describe()
Out[21]:
isAdult startYear runtimeMinutes averageRating numVotes budget revenue popularity
count 2948.0 2948.000000 2948.000000 2948.000000 2.948000e+03 2.948000e+03 2.948000e+03 2948.000000
mean 0.0 1999.452849 113.193012 6.753290 1.731187e+05 3.181980e+07 1.618810e+08 44.515838
std 0.0 18.062896 22.633454 0.980159 2.446566e+05 4.658079e+07 2.414239e+08 129.421131
min 0.0 1925.000000 68.000000 1.300000 1.000000e+03 5.000000e+05 1.277800e+06 0.722000
25% 0.0 1990.000000 97.000000 6.200000 2.712625e+04 4.000000e+06 2.624780e+07 11.215500
50% 0.0 2005.000000 108.500000 6.800000 8.805100e+04 1.300000e+07 7.627828e+07 19.938000
75% 0.0 2013.000000 125.000000 7.500000 2.157662e+05 3.762500e+07 1.875960e+08 42.200750
max 0.0 2022.000000 321.000000 9.200000 2.592594e+06 3.800000e+08 2.847246e+09 4156.973000
In [22]:
successful_movies
Out[22]:
imdb_id primaryTitle originalTitle isAdult startYear runtimeMinutes genres averageRating numVotes directors ... status budget revenue release_date original_language popularity production_companies production_countries spoken_languages keywords
0 tt0015864 The Gold Rush The Gold Rush 0.0 1925.0 95.0 [Adventure, Comedy, Drama] 8.2 110578.0 [nm0000122] ... Released 923000.0 4000000.0 1925-07-12 en 13.315 [[[United Artists], [US]], [[Charles Chaplin P... [US] NaN [river, gold, dance, worker, cabin, gold rush,...
1 tt0019344 Sadie Thompson Sadie Thompson 0.0 1928.0 97.0 [Drama] 7.2 2708.0 [nm0909825] ... Released 1000000.0 7000000.0 1928-01-07 en 0.954 [[[Gloria Swanson Pictures], []]] [US] NaN [suicide, south seas, missionary, religion, si...
2 tt0019777 The Cocoanuts The Cocoanuts 0.0 1929.0 96.0 [Adventure, Comedy, Musical] 6.8 7717.0 [nm0282984, nm0763798] ... Released 500000.0 1800000.0 1929-05-23 en 4.499 [[[Paramount], [US]]] [US] [en] [musical, jewel thief, pre-code, real estate]
3 tt0023196 Mata Hari Mata Hari 0.0 1931.0 89.0 [Crime, Drama, Romance] 6.6 3724.0 [nm0280463] ... Released 558000.0 2227000.0 1931-12-26 en 5.058 [[[Metro-Goldwyn-Mayer], [US]]] [US] [en] [spy, world war i, espionage, female spy, pre-...
4 tt0021749 City Lights City Lights 0.0 1931.0 87.0 [Comedy, Drama, Romance] 8.5 183052.0 [nm0000122] ... Released 1500000.0 4250000.0 1931-02-01 en 14.868 [[[United Artists], [US]], [[Charles Chaplin P... [US] NaN [suicide attempt, operation, blindness and imp...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2943 tt1877830 The Batman The Batman 0.0 2022.0 176.0 [Action, Crime, Drama] 7.9 561381.0 [nm0716257] ... Released 185000000.0 770836163.0 2022-03-01 en 860.769 [[[Warner Bros. Pictures], [US]], [[6th & Idah... [US] [en] [crime fighter, secret identity, nightclub, po...
2944 tt10698680 K.G.F: Chapter 2 K.G.F: Chapter 2 0.0 2022.0 168.0 [Action, Crime, Drama] 8.4 115241.0 [nm6073824] ... Released 13000000.0 90410749.0 2022-04-14 kn 13.101 [[[Excel Entertainment], [IN]], [[Vaaraahi Cha... [IN] [kn] [undercover, sequel, one man army, don, action...
2945 tt9419884 Doctor Strange in the Multiverse of Madness Doctor Strange in the Multiverse of Madness 0.0 2022.0 126.0 [Action, Adventure, Fantasy] 7.0 340344.0 [nm0000600] ... Released 200000001.0 955200000.0 2022-05-04 en 1310.241 [[[Marvel Studios], [US]], [[Kevin Feige Produ... [US] [cn, en, es] [magic, superhero, based on comic, alternate r...
2946 tt8115900 The Bad Guys The Bad Guys 0.0 2022.0 100.0 [Adventure, Animation, Comedy] 6.8 29665.0 [nm2010048] ... Released 80000000.0 245000000.0 2022-03-17 en 459.358 [[[Universal Pictures], [US]], [[DreamWorks An... [US] [en, da] [based on graphic novel, duringcreditsstinger]
2947 tt10648342 Thor: Love and Thunder Thor: Love and Thunder 0.0 2022.0 118.0 [Action, Adventure, Comedy] 6.7 150494.0 [nm0169806] ... Released 250000000.0 758004063.0 2022-07-06 en 4156.973 [[[Marvel Studios], [US]], [[Kevin Feige Produ... [US] [en] [ex-girlfriend, hero, greek mythology, sequel,...

2948 rows × 21 columns

EDA¶

For each research questions shortlisted, outline your methodology in answering them. Discuss interesting observations or results discovered. Please note to only show EDA that's relevant to answering the question at hand. If you have done any data modeling, include in this section.

Research Question 1: What are the best genres for a movie?¶

We first find what genres are more likely to be successful, or the percentage of successful movies in each genre.

In [23]:
# make two dataframe containing revenue and genre of movie, one with all movies and one with successful movies.
genres = pd.DataFrame([all_mdb.revenue, all_mdb.genres]).T.dropna()
genres = genres.loc[genres.revenue > 0].reset_index(drop=True)
genres = genres.explode('genres').reset_index(drop=True)

successful_movies_genres = pd.DataFrame([successful_movies.revenue, successful_movies.genres]).T.explode('genres').reset_index(drop=True)

genres.head()
Out[23]:
revenue genres
0 137365.0 Drama
1 137365.0 Romance
2 11000000.0 Drama
3 11000000.0 History
4 11000000.0 War
In [24]:
successful_movies_genres.head()
Out[24]:
revenue genres
0 4000000.0 Adventure
1 4000000.0 Comedy
2 4000000.0 Drama
3 7000000.0 Drama
4 1800000.0 Adventure
In [25]:
successful_genres_percentage = (successful_movies_genres.genres.value_counts() / genres.genres.value_counts() * 100).round(decimals=2).to_frame().sort_values(by="genres", ascending=False)
successful_genres_percentage = successful_genres_percentage.replace(np.nan, 0) 
#As NaN means there is division by zero error, also there are no Adult movies in successful_movies

genres_percentage = (genres.genres.value_counts() / genres.genres.count() * 100).round(decimals = 2)
# percentage of genres in all movies

fig = plt.figure(figsize=(15,12))
ax0 = fig.add_subplot(1, 2, 1)

ax0 = sns.barplot(y=genres_percentage.index.values ,x=genres_percentage.values, orient='h')
for i in ax0.containers:
    ax0.bar_label(i,)
ax0.set_title("Percentage of frequency of genres in all movies")
ax0.set_xlabel("Percentage of frequency (%)")
ax0.set_ylabel("Genres")

ax1 = fig.add_subplot(1, 2, 2) 
ax1 = sns.barplot(y=successful_genres_percentage.index.values ,x=successful_genres_percentage.genres, orient='h')
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_title("Percentage of successful movies for each genre")
ax1.set_xlabel("Percentage of successful movies (%)")
ax1.set_ylabel("Genres")
plt.show()

Next, we find how genre affects revenue, by plotting a pointplot of the mean revenue of each genre and then a boxplot of revenue by genre. (We take the logarithm of revenue as the revenue has many orders of magnitude.)

In [26]:
# Mean revenue by genre
genre_mean = genres.groupby("genres").revenue.mean().sort_values(ascending=False)
genre_mean
Out[26]:
genres
Adventure    1.598268e+08
Animation    1.460725e+08
Sci-Fi       1.331758e+08
Action       1.084569e+08
Fantasy      1.066539e+08
Family       8.922234e+07
Adult        7.166667e+07
Comedy       5.674864e+07
Thriller     5.457347e+07
Mystery      4.538046e+07
Crime        4.150252e+07
Biography    4.020940e+07
Music        3.992448e+07
History      3.864902e+07
Musical      3.694949e+07
Drama        3.673604e+07
Romance      3.665500e+07
Sport        3.609680e+07
Horror       3.484590e+07
Western      3.050453e+07
War          2.848186e+07
Film-Noir    3.780342e+06
Name: revenue, dtype: float64
In [27]:
plt.figure(figsize=(20,12)) 
sns.pointplot( y=np.log10(genre_mean.values.astype('float')), x=genre_mean.index.values)
plt.title("Pointplot of mean revenue by genre")
plt.show()

#Sort boxplot by descending median
median_order = genres.groupby(by=["genres"])["revenue"].median().sort_values(ascending=False).index
plt.figure(figsize=(20,12)) 
sns.boxplot( y=np.log10(genres["revenue"].astype('float')), x=genres["genres"], order = median_order)
plt.title("Boxplot of revenue by genre")
plt.show()

Research Question 2: What other factors affect a movie's revenue?¶

We will investigate quantitative factors like budget and runtime, and qualitative factors like original language, spoken language, and keywords in the movie.

Firstly, quantitative factors. Does increased budget result in increased revenue?

In [28]:
factor_revenue = all_mdb   # drop all with 0 revenue, as these may skew our results
factor_revenue.revenue = factor_revenue.revenue.replace(0,np.nan)
factor_revenue = factor_revenue.dropna(subset=["revenue"])
In [29]:
def reg_residual_subplot(x_name, y_name, x_df,y_df, x_lim, y_lim):
    # transparent = True or False
    fig = plt.figure(figsize=(20,4))
    ax0 = fig.add_subplot(1, 2, 1)
    
    if (len(x_lim) == 0):
        x_lim = (None, None)
    if (len(y_lim) == 0):
        y_lim = (None, None)

        
    sns.regplot(x=x_df.astype(float), y=y_df.astype(float), scatter_kws={'alpha':0.3}, line_kws={'color': 'red'}, ax=ax0)
    ax0.set_xlim(x_lim)
    ax0.set_ylim(y_lim)
    ax0.set_xlabel(x_name)
    ax0.set_ylabel(y_name)
    ax0.set_title("Linear Regression Plot of " + x_name + " against " + y_name)
    
    ax1 = fig.add_subplot(1, 2, 2)
    
    sns.residplot(x=x_df.astype(float), y=y_df.astype(float), scatter_kws={'alpha':0.3}, ax=ax1)
    ax1.set_xlim(x_lim)
    ax1.set_ylim(y_lim)
    ax1.set_xlabel(x_name)
    ax1.set_ylabel(y_name)
    ax1.set_title("Residual Plot of " + x_name + " against " + y_name)
    
    coef = pd.DataFrame(x_df.astype(float))
    coef.columns = ["x"]
    coef["y"] = y_df.astype(float)
    coef = coef.dropna()
    
    pearson_coef, p_value = stats.pearsonr(coef["x"], coef["y"])
    print("Pearson's Coefficient is " + str(pearson_coef) + " P-value is " + str(p_value))
    
    plt.show()
In [30]:
reg_residual_subplot("Budget", "Revenue", all_mdb.budget, all_mdb.revenue, (), ())
Pearson's Coefficient is 0.7467009218280032 P-value is 0.0
In [31]:
reg_residual_subplot("Budget", "Revenue", factor_revenue.budget, factor_revenue.revenue, (), ())
Pearson's Coefficient is 0.7467009218280032 P-value is 0.0
In [32]:
reg_residual_subplot("Budget", "Revenue", successful_movies.budget, successful_movies.revenue, (), ())
Pearson's Coefficient is 0.8524386941593549 P-value is 0.0

From this plot, we observe a weak positive correlation between budget and revenue. From the residual plot, we observe that the variance of the residuals decreases with budget, so prediction of revenue will be more accurate when budget is large, and less accurate with a smaller budget.

Successful movies have a stronger correlation between budget and revenue, as the points are more clustered around the line.

Next, we investigate runtime vs revenue.

In [33]:
reg_residual_subplot("Runtime", "Revenue", all_mdb.runtimeMinutes, all_mdb.revenue, (0,300), (0,1.0e9))
Pearson's Coefficient is 0.17891001036238946 P-value is 1.292560240450447e-79
In [34]:
reg_residual_subplot("Runtime", "Revenue", factor_revenue.runtimeMinutes, factor_revenue.revenue, (0,300), (0,1.0e9))
Pearson's Coefficient is 0.17891001036238946 P-value is 1.292560240450447e-79
In [35]:
reg_residual_subplot("Runtime", "Revenue", successful_movies.runtimeMinutes, successful_movies.revenue, (0,300),(0,1.5e9))
Pearson's Coefficient is 0.20170933194289647 P-value is 1.9362003135175291e-28

We observe an extremely weak correlation between runtime and revenue. The graphs show that movies with higher revenue tend to be around the 40min-200min cluster, which could be because audiences are less inclined to watch movies beyond 200 minutes as the movie would be too long.

Successful movies cluster at the 60min - 180min spot.

Now, we will investigate the qualitative factors - original language, spoken language, keywords

In [36]:
def plot_correlation_map(df):
    corr = df.corr()
    _ , ax = plt.subplots( figsize =(12,10))
    cmap = sns.diverging_palette(240, 10, as_cmap = True )
    _ = sns.heatmap(corr,cmap = cmap,square=True, cbar_kws={ 'shrink' : .9 }, ax=ax, annot = True, annot_kws = { 'fontsize' : 12 })
In [37]:
plot_correlation_map(all_mdb[['budget','revenue','runtimeMinutes','numVotes']])
In [38]:
def barchart_genre_subplot(title,x_name,y_name, df, factor,all_movies): 
    # Factor: "Original Language"
    # all_movies: True, is using all movies dataset, False, is using successful movies dataset
    
    genres = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror',
       'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
       'Sport', 'Thriller', 'War', 'Western', 'Reality-TV', 'Adult']
    
    languages = ["English","Hindi","French","Japanese","Spanish","Russian","German","Italian","Danish","Korean", "Chinese"]
    colours = ["tab:blue","tab:orange","tab:green","tab:red","tab:purple","tab:brown","tab:pink","tab:gray","tab:olive","tab:cyan","black"]

    colours = pd.DataFrame(list(zip(languages, colours)), columns =["country","colours"])
     
    nrow = 4
    ncol = 6

    fig = plt.figure(figsize=(20,10))
    gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0)
    axes = gs.subplots(sharex=True, sharey=True)
    
    count = 0

    for r in range(nrow):
        for c in range(ncol):
            
            if (factor == "Original Language"):
                a = df.loc[(genres[count])].dropna()
                if (title == "Percentage frequency of original language by genre in successful movies"):
                    a = a.loc[a.frequency > 5]
                elif (all_movies == False):
                    a = a.loc[a.frequency > 20]
                else:
                    a = a.head(3)
                if a.frequency.sum() == 0:
                    axes[r,c].set_title(genres[count])
                    axes[r,c].set_ylabel("") 
                    axes[r,c].set_xlabel("")            
                    count+=1
                    continue
                bar_colour = pd.merge(a, colours, on="country", how="left")
                a.T.plot(kind='bar',ax = axes[r,c],color = bar_colour.colours.tolist())

            axes[r,c].set_title(genres[count])
            axes[r,c].set_ylabel("") 
            axes[r,c].set_xlabel("")
            axes[r,c].get_legend().remove()
            
            for bars in axes[r,c].containers:
                axes[r,c].bar_label(bars)
            
            count+=1
            if (count == len(df.index.get_level_values(0).unique())):
                break
                
    for ax in axes.flat:
        ax.label_outer()

    lines = []
    labels = []
  
    for ax in fig.axes:
        Line, Label = ax.get_legend_handles_labels()
        lines.extend(Line)
        labels.extend(Label)
  
    labels_df = pd.DataFrame(list(zip(lines,labels)), columns =["lines","labels"])
    labels_df = labels_df.drop_duplicates(subset = "labels") 
    fig.legend(labels_df.lines, labels_df.labels, loc='upper right')
    
    fig.supxlabel(x_name)
    fig.supylabel(y_name)
    plt.suptitle(title)
        
    plt.show()    

I find the frequency of each original language by genre, for both successful and all movies. I also find the percentage frequency for both.

In [39]:
all_original_lang = pd.DataFrame([all_mdb.original_language, all_mdb.genres]).T.dropna().explode("genres").reset_index(drop=True)
all_original_lang.columns = ["code", "genres"]
all_original_lang = all_original_lang.merge(language_codes, how="left", on="code")
all_original_lang = all_original_lang.groupby("genres").country.value_counts().to_frame()
all_original_lang.columns = ["frequency"]

percent_all_original_lang = (all_original_lang['frequency'] / all_original_lang.groupby('genres')['frequency'].transform('sum') * 100).round(decimals=2).to_frame()

successful_original_lang = pd.DataFrame([successful_movies.original_language, successful_movies.genres]).T.dropna().explode("genres").reset_index(drop=True)
successful_original_lang.columns = ["code", "genres"]
successful_original_lang = successful_original_lang.merge(language_codes, how="left", on="code")
successful_original_lang = successful_original_lang.groupby("genres").country.value_counts().to_frame()
successful_original_lang.columns = ["frequency"]

percent_successful_original_lang = (successful_original_lang['frequency'] / successful_original_lang.groupby('genres')['frequency'].transform('sum') * 100).round(decimals=2).to_frame()
In [40]:
barchart_genre_subplot("Frequency of original language by genre in all movies", "Original Languages", "Frequency of language", all_original_lang, "Original Language", True)
In [41]:
barchart_genre_subplot("Frequency of original language by genre in successful movies", "Original Languages", "Frequency of language", successful_original_lang, "Original Language", False)
In [42]:
barchart_genre_subplot("Percentage frequency of original language by genre in all movies", "Original Languages", "Percentage frequency of language", percent_all_original_lang, "Original Language", True)
In [43]:
barchart_genre_subplot("Percentage frequency of original language by genre in successful movies", "Original Languages", "Percentage frequency of language", percent_successful_original_lang, "Original Language", False)
In [44]:
frequent_10_original_lang = all_mdb.loc[all_mdb.isin(all_mdb.original_language.value_counts().head(10).index.values).original_language]

# Change country codes to language name
language_codes.columns = ["original_language", "country"]
frequent_10_original_lang = pd.merge(frequent_10_original_lang, language_codes, how = "left", on="original_language")

# sort boxplots by descending median
median_order = frequent_10_original_lang.groupby("country")["revenue"].median().sort_values(ascending=False).index
sns.boxplot(x=frequent_10_original_lang.country, y=np.log10(frequent_10_original_lang.revenue).astype(float), order = median_order)
plt.xticks(rotation=90)
plt.xlabel("Language")
plt.show()

This graph is arranged in order of descending median, hence we can see that English has the highest median, is slightly right-skewed and has the biggest range and IQR, with many outliers to the bottom. Turkish has the smallest median. Hindi has the smallest IQR, and Tamil has the smallest range, with 0 outliers.

Next, we plot a boxplot of revenue over the top 10 languages with highest median, and compare the results.

In [45]:
#all_mdb.spoken_languages.dropna().reset_index(drop=True).to_frame().explode("spoken_languages")

top_10_original_lang = all_mdb.loc[all_mdb.isin(all_mdb.groupby("original_language")["revenue"].median().sort_values(ascending=False).head(10).index.values).original_language]

# Change country codes to language name
language_codes.columns = ["original_language", "country"]
top_10_original_lang = pd.merge(top_10_original_lang, language_codes, how = "left", on="original_language")

# sort boxplots by descending median
median_order = top_10_original_lang.groupby("country")["revenue"].median().sort_values(ascending=False).head(10).index
sns.boxplot(x=top_10_original_lang.country, y=np.log10(top_10_original_lang.revenue).astype(float), order = median_order)
plt.xticks(rotation=90)
plt.xlabel("Language")
plt.show()

We can clearly see that Abkhazian, Afrikaans, Malay are outliers as there are not enough data points to form a box on the boxplot, which skews our results. Hence, we remove them from the boxplot.

In [46]:
top_10_original_lang = all_mdb.loc[all_mdb.isin(all_mdb.groupby("original_language")["revenue"].median().sort_values(ascending=False).head(13).index.values).original_language]

# Change country codes to language name
language_codes.columns = ["original_language", "country"]
top_10_original_lang = pd.merge(top_10_original_lang, language_codes, how = "left", on="original_language")

# sort boxplots by descending median, remove the outliers
median_order = top_10_original_lang.groupby("country")["revenue"].median().sort_values(ascending=False).head(13).index.tolist()
median_order.remove("Abkhazian")
median_order.remove("Afrikaans")
median_order.remove("Malay")

sns.boxplot(x=top_10_original_lang.country, y=np.log10(top_10_original_lang.revenue).astype(float), order = median_order)
plt.xticks(rotation=90)
plt.xlabel("Original Language")
plt.show()

Comparing the boxplot using most frequent language vs this boxplot which uses the highest medians, we see that the frequency of an original language does not directly correlate with its revenue. Surprisingly, in this boxplot, English has the second highest median, after Chinese. However, this may be due to the fact that there are much less Chinese movies than English movies, hence Chinese movies have a smaller sample size.

Overall, we can conclude that English is the best or one of the best languages to use for a movie's original language to earn the most revenue.

Now, we move on to the next qualitative factor, spoken languages.

In [47]:
spoken_languages = all_mdb.revenue.to_frame()
spoken_languages["spoken_languages"] = all_mdb.spoken_languages
spoken_languages = spoken_languages.dropna().explode("spoken_languages").reset_index(drop=True)

# Change country codes to language name
language_codes.columns = ["spoken_languages", "country"]
spoken_languages = pd.merge(spoken_languages, language_codes, how = "left", on="spoken_languages")

spoken_languages
Out[47]:
revenue spoken_languages country
0 8000000.0 en English
1 8000000.0 en English
2 274827.0 en English
3 1000000.0 en English
4 2600000.0 en English
... ... ... ...
16034 3900000.0 en English
16035 3900000.0 ta Tamil
16036 1564696.0 en English
16037 15400000.0 en English
16038 15400000.0 ro Romanian

16039 rows × 3 columns

In [48]:
# sort boxplots by descending median
median_order = spoken_languages.groupby("country")["revenue"].median().sort_values(ascending=False).head(10).index.tolist()

sns.stripplot(x=spoken_languages.country, y=np.log10(spoken_languages.revenue).astype(float), order = median_order)
plt.xticks(rotation=90)
plt.xlabel("Spoken Language")
plt.show()

The top 10 languages with highest median all have less than 10 entries, indicating that these languages are quite obscure and are outliers which skew our data. Hence, we will not consider languages with less than 100 movie entries.

In [49]:
# sort boxplots by descending median
filtered_spoken_languages = spoken_languages.loc[spoken_languages.isin(spoken_languages.groupby("country")["revenue"].count().loc[spoken_languages.groupby("country")["revenue"].count() >= 100].index).country]
median_order = filtered_spoken_languages.groupby("country")["revenue"].median().sort_values(ascending=False).index.tolist()

sns.boxplot(x=filtered_spoken_languages.country, y=np.log10(filtered_spoken_languages.revenue).astype(float), order = median_order)
plt.xticks(rotation=90)
plt.xlabel("Spoken Language")
plt.show()

This graph is arranged in order of decreasing median, hence we can conclude that Chinese-speaking movies have the highest median revenue and Turkish-speaking movies have the lowest median revenue. The boxplots have many outliers towards the bottom, especially English, and English has the largest range. Hindi has the smallest IQR and also some outliers towards the top.

Next, we investigate the last qualitative factor, keywords.

In [50]:
keywords = all_mdb.copy()
keywords.keywords = keywords.keywords.dropna()
keywords = keywords.explode("keywords").reset_index(drop=True)
keywords.keywords.value_counts()[keywords.keywords.value_counts() > 500].to_frame()

# There are 21074 total keywords, but most of them (20251) are under 100 uses.
Out[50]:
keywords
based on novel or book 1965
woman director 1885
murder 1632
revenge 871
based on true story 719
new york city 697
biography 691
love 688
musical 586
sequel 580
police 556
friendship 547
parent child relationship 501
In [51]:
# Percentage Frequency of keywords in all movies
(keywords.keywords.value_counts()[keywords.keywords.value_counts() > 500] / keywords.keywords.count() * 100).round(decimals=2).to_frame()
Out[51]:
keywords
based on novel or book 1.09
woman director 1.04
murder 0.90
revenge 0.48
based on true story 0.40
new york city 0.39
biography 0.38
love 0.38
musical 0.32
sequel 0.32
police 0.31
friendship 0.30
parent child relationship 0.28
In [52]:
successful_keywords = successful_movies.copy()
successful_keywords.keywords = successful_keywords.keywords.dropna()
successful_keywords = successful_keywords.explode("keywords").reset_index(drop=True)

successful_keywords.keywords.value_counts().loc[successful_keywords.keywords.value_counts() >= 100].to_frame()
#There are 8559 total keywords, most of them (7948) are under 10 uses.
Out[52]:
keywords
based on novel or book 323
duringcreditsstinger 212
sequel 171
murder 164
new york city 145
aftercreditsstinger 132
woman director 122
revenge 120
based on true story 109
friendship 107
parent child relationship 101
In [53]:
# Percentage Frequency of keywords in successful movies
(successful_keywords.keywords.value_counts().loc[successful_keywords.keywords.value_counts() >= 100] / successful_keywords.keywords.count() * 100).round(decimals=2).to_frame()
Out[53]:
keywords
based on novel or book 1.05
duringcreditsstinger 0.69
sequel 0.55
murder 0.53
new york city 0.47
aftercreditsstinger 0.43
woman director 0.40
revenge 0.39
based on true story 0.35
friendship 0.35
parent child relationship 0.33
In [123]:
#Keywords with more than 50 uses in successful movies, ranked by mean revenue
good_keywords = successful_keywords.keywords.value_counts().loc[successful_keywords.keywords.value_counts() >= 50].index.tolist()
successful_keywords_revenue = successful_movies.loc[:,["revenue","keywords"]].explode("keywords").dropna().groupby("keywords").revenue.mean().sort_values(ascending=False).to_frame().reset_index()
successful_keywords_revenue.loc[successful_keywords_revenue .isin(good_keywords).keywords].head(10).reset_index(drop=True)
Out[123]:
keywords revenue
0 superhero 6.404721e+08
1 based on comic 6.013988e+08
2 magic 4.993211e+08
3 aftercreditsstinger 4.244748e+08
4 based on young adult novel 4.230694e+08
5 sequel 3.668738e+08
6 duringcreditsstinger 3.455082e+08
7 alien 3.204866e+08
8 spy 2.801209e+08
9 dystopia 2.447500e+08
In [54]:
genres_keywords = keywords.explode("genres").groupby("genres").keywords.value_counts().to_frame()
genres_keywords.columns = ["frequency"]
genres_keywords.head()
Out[54]:
frequency
genres keywords
Action martial arts 442
revenge 381
based on novel or book 287
murder 284
sequel 261
In [55]:
successful_genres_keywords = successful_keywords.explode("genres").groupby("genres").keywords.value_counts().to_frame()
successful_genres_keywords.columns = ["frequency"]
successful_genres_keywords.head()
Out[55]:
frequency
genres keywords
Action sequel 90
duringcreditsstinger 78
superhero 78
based on comic 77
martial arts 76
In [56]:
def wordcloud_subplot(df, title):
    nrow = 4
    ncol = 6
    
    genres = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror',
       'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
       'Sport', 'Thriller', 'War', 'Western', 'Reality-TV', 'Adult']
    
    fig = plt.figure(figsize=(16,10))
    gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0)
    axes = gs.subplots(sharex=True, sharey=True)
    
    count = 0

    for r in range(nrow):
        for c in range(ncol):
            
            genre = genres[count]
            wc = WordCloud(background_color="white")
            wc.generate_from_frequencies(df.loc[genre].sort_values(by="frequency",ascending=False).head(10).to_dict()["frequency"])
 
        #fig = plt.figure()
#fig.set_figwidth(14) # set width
#fig.set_figheight(18) # set height
 
            axes[r,c].imshow(wc, interpolation="bilinear")
            axes[r,c].axis("off")
                
            axes[r,c].set_title(genre)
            
            count+=1
            if (count == len(df.index.get_level_values(0).unique())):
                axes[r,c].axis("off")
                break
                
    for ax in axes.flat:
                ax.label_outer()
    
    plt.suptitle(title)
    
    plt.show()
In [57]:
wordcloud_subplot(genres_keywords, "Word Clouds of popular keywords by genre in all movies")
In [58]:
wordcloud_subplot(successful_genres_keywords, "Word Clouds of popular keywords by genre in successful movies")

Research Question 3: Who should you hire to produce a movie?¶

In this question, we investigate actors, directors, writers and production companies. For actors, directors, and writers, we determine who brought the most value to a movie, using a statistic called Value Above Replacement (VAR). VAR of a person = Average net profit from person / Average net profit of movies.

In [59]:
actor = imdb_cast_crew.loc[imdb_cast_crew.category == "actor"].loc[:,["tconst","nconst"]].reset_index(drop=True)
actor.columns = ["imdb_id", "nconst"]
actress = imdb_cast_crew.loc[imdb_cast_crew.category == "actress"].loc[:,["tconst","nconst"]].reset_index(drop=True)
actress.columns = ["imdb_id", "nconst"]

# Merge with All_MDB dataset to get budget, revenue, startYear for each entry
actor = pd.merge(actor,all_mdb, on="imdb_id").loc[:,["imdb_id","nconst","budget","revenue", "startYear"]]
actress = pd.merge(actress,all_mdb, on="imdb_id").loc[:,["imdb_id","nconst","budget","revenue", "startYear"]]

# drop entries with 0 budget - they are probably incorrect
actor.budget = actor.budget.replace(0,np.nan).dropna()
actor = actor.dropna().reset_index(drop=True)
actress.budget = actress.budget.replace(0,np.nan).dropna()
actress = actress.dropna().reset_index(drop=True)

# find the Net Profit: revenue - budget
actor["Net Profit"] = actor["revenue"] - actor["budget"]
actress["Net Profit"] = actress["revenue"] - actress["budget"]

# Merge with IMDB_names dataset to get the name of each actor, and their deathYear
actor = pd.merge(actor,imdb_names, on="nconst").loc[:,["imdb_id","nconst", "primaryName", "budget","revenue","Net Profit", "startYear", "deathYear"]]
actress = pd.merge(actress,imdb_names, on="nconst").loc[:,["imdb_id","nconst", "primaryName", "budget","revenue","Net Profit", "startYear", "deathYear"]]

# Drop entries where the movie is before 2000, and drop actors who are dead (with a non-NaN death year)
# This is as we want to see who we should hire, so the actors should ideally be alive and the movies they starred in should be recent
# Drop death year because it is all NaN values
actor = actor.loc[(actor.startYear >= 2000) & (actor.deathYear.isna())].loc[:,:"startYear"].reset_index(drop=True)
actress = actress.loc[(actress.startYear >= 2000) & (actress.deathYear.isna())].loc[:,:"startYear"].reset_index(drop=True)

# Only include actors who starred in more than 10 movies, so as to remove outliers which may skew the results
actor_count = actor.groupby(['primaryName'],  as_index=False)["Net Profit"].count()
actor = actor.loc[actor.isin(actor_count.loc[actor_count["Net Profit"] > 10].primaryName.values).primaryName].reset_index(drop=True)
actress_count = actress.groupby(['primaryName'],  as_index=False)["Net Profit"].count()
actress = actress.loc[actress.isin(actress_count.loc[actress_count["Net Profit"] > 10].primaryName.values).primaryName].reset_index(drop=True)

# Rank actors by VAR
actor_ranked = actor.groupby(['primaryName'],  as_index=False)["Net Profit"].mean().sort_values(by='Net Profit', ascending=False).reset_index(drop=True)
actor_ranked['VAR'] = (actor_ranked['Net Profit']/actor_ranked['Net Profit'].mean())
actress_ranked = actress.groupby(['primaryName'],  as_index=False)["Net Profit"].mean().sort_values(by='Net Profit', ascending=False).reset_index(drop=True)
actress_ranked['VAR'] = (actress_ranked['Net Profit']/actress_ranked['Net Profit'].mean())
# calculate net profit
# we only care about actors and actresses i guess
In [60]:
fig = plt.figure(figsize=(20,10))

ax0 = fig.add_subplot(1, 2, 1)
ax0 = sns.barplot(x=(actor_ranked.VAR.head(20).astype(float)).round(decimals=2), y=actor_ranked.primaryName.head(20),orient="h")
for i in ax0.containers:
    ax0.bar_label(i,)
ax0.set_ylabel("")
ax0.set_xlabel("")
ax0.set_title("Actors")
    
ax1 = fig.add_subplot(1, 2, 2)
ax1 = sns.barplot(x=(actress_ranked.VAR.head(20).astype(float)).round(decimals=2), y=actress_ranked.primaryName.head(20),orient="h")
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_ylabel("")
ax0.set_xlabel("")
ax1.set_title("Actresses")

plt.suptitle("VAR of top 20 actors by gender")
fig.supylabel("Name of actor")
fig.supxlabel("VAR")
plt.show()

On average, actors have a higher VAR than actresses (they contribute more to movie revenue), but Emma Watson has the highest VAR of 5.01%. Robert Downey Jr is the actor with the highest VAR at 4.44%.

In [61]:
def VAR(job, title): # job = writers or directors
    profession = job
    
    job = all_mdb.loc[:,["imdb_id", job, "budget", "revenue", "startYear"]].explode(job)
    job.columns = ["imdb_id", "nconst", "budget", "revenue", "startYear"]
    job.budget = job.budget.replace(0,np.nan)
    job = job.dropna(subset=["budget"])
    job["Net Profit"] = job["revenue"] - job["budget"]
    
    job = pd.merge(job, imdb_names, on="nconst").loc[:,["imdb_id","nconst", "primaryName", "budget","revenue","Net Profit", "startYear", "deathYear"]]
    job = job.loc[(job.startYear >= 2000) & (job.deathYear.isna())].loc[:,:"startYear"].reset_index(drop=True)
    
    job_count = job.groupby(['primaryName'],  as_index=False)["Net Profit"].count()
    job = job.loc[job.isin(job_count.loc[job_count["Net Profit"] > 10].primaryName.values).primaryName].reset_index(drop=True)

    job_ranked = job.groupby(['primaryName'],  as_index=False)["Net Profit"].mean().sort_values(by='Net Profit', ascending=False).reset_index(drop=True)
    job_ranked['VAR'] = (job_ranked['Net Profit']/job_ranked['Net Profit'].mean())
    
    # graph
    plt.figure(figsize=(20,10))

    ax = sns.barplot(x=(job_ranked.VAR.head(20).astype(float)).round(decimals=2), y=job_ranked.primaryName.head(20),orient="h")
    for i in ax.containers:
        ax.bar_label(i,)

    plt.ylabel("Name of " + profession)
    plt.xlabel("VAR")
    plt.title(title)
    
    return job_ranked
In [62]:
directors_ranked = VAR("directors", "VAR of top 20 directors")
In [63]:
writers_ranked = VAR("writers", "VAR of top 20 writers")
In [64]:
actor_ranked["Job"] = "Actors"
actress_ranked["Job"] = "Actresses"
directors_ranked["Job"] = "Directors"
writers_ranked["Job"] = "Writers"

job_ranked = pd.concat([actor_ranked, actress_ranked, directors_ranked, writers_ranked])

sns.boxplot(x = job_ranked.Job, y = job_ranked.VAR)
Out[64]:
<AxesSubplot:xlabel='Job', ylabel='VAR'>
In [65]:
sns.violinplot(x = job_ranked.Job, y = job_ranked.VAR)
Out[65]:
<AxesSubplot:xlabel='Job', ylabel='VAR'>

From the boxplot, we can see that directors have the highest median VAR, and writers have the lowest median VAR. All boxplots have outliers towards the top, especially for Actors and Actresses. All boxplots are left-skewed except for directors (directors is right-skewed). Writers have the biggest IQR, and directors have the smallest IQR. We observe that the violinplots of Actors and Actresses are very similar, except that Actresses have higher outliers towards the top (Emma Watson).

In [66]:
production_companies = all_mdb.loc[:,["production_companies", "revenue", "budget", "popularity", "averageRating"]].explode("production_companies").dropna().explode("production_companies").reset_index(drop=True).explode("production_companies")
companies = production_companies.loc[0::2]
countries = production_companies.loc[1::2]
countries.loc[:,"production_companies"] = countries.production_companies.replace("", np.nan)

production_companies = companies.copy()
production_companies.insert(1, "production_country", countries.production_companies.values)

# We only want to get the top 25 companies in terms of number of movies released, and drop the rest
companies_list = production_companies.groupby("production_companies", as_index=False).revenue.count().sort_values(by="revenue", ascending=False).reset_index(drop=True).head(25).production_companies.tolist()
production_companies = production_companies.loc[production_companies.isin(companies_list).production_companies].reset_index(drop=True)

companies_scatter = production_companies.groupby("production_companies", as_index=False).agg({'revenue':'mean','budget':'mean','popularity':'mean','averageRating':'mean'})
C:\Users\kaiwen\AppData\Local\Temp\ipykernel_23184\1100111065.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  countries.loc[:,"production_companies"] = countries.production_companies.replace("", np.nan)
In [67]:
#plt.figure(figsize=(15,15))
fig = plt.figure(figsize=(15,15))
# plotting scatter plot
sc = sns.regplot(x=companies_scatter.revenue,y=companies_scatter.popularity)

# Loop for annotation of all points
for i in range(len(companies_scatter.revenue)):
    plt.annotate(companies_scatter.production_companies[i], (companies_scatter.revenue[i] + 0.4, companies_scatter.popularity[i] + 0.4))

pearson_coef, p_value = stats.pearsonr(companies_scatter.revenue, companies_scatter.popularity)
print("Pearson's Coefficient is " + str(pearson_coef) + " P-value is " + str(p_value))
plt.show()
Pearson's Coefficient is 0.8327430006416182 P-value is 2.415699718774758e-07
In [68]:
sns.residplot(x=companies_scatter.revenue,y=companies_scatter.popularity)
plt.show()

Research Question 4: When should one release a movie?¶

We evaluate revenue by year, month and day. Firstly, by year.

In [69]:
# Dataframe of all movies' revenue and year released
revenue_year = pd.DataFrame([all_mdb.release_date, all_mdb.revenue]).T
revenue_year.loc[:,"revenue"] = revenue_year.loc[:,"revenue"].replace(0,np.nan)
revenue_year = revenue_year.dropna().reset_index(drop=True)
revenue_year.loc[:,"release_date"] = revenue_year.loc[:,"release_date"].str.slice(0,4).astype(int)
revenue_year

# Dataframe of successful movies' revenue and year released
successful_revenue_year = pd.DataFrame([successful_movies.release_date, successful_movies.revenue]).T
successful_revenue_year.loc[:,"revenue"] = successful_revenue_year.loc[:,"revenue"].replace(0,np.nan)
successful_revenue_year = successful_revenue_year.dropna().reset_index(drop=True)
successful_revenue_year.loc[:,"release_date"] = successful_revenue_year.loc[:,"release_date"].str.slice(0,4).astype(int)
successful_revenue_year


fig = plt.figure(figsize=(20,10))

ax0 = fig.add_subplot(1, 2, 1)
ax0 = sns.lineplot(x = revenue_year.release_date, y = np.log10(revenue_year.revenue.astype(float)))
ax0.set_title("Revenue of all movies by year")
ax0.set_xlabel("Year")

ax1 = fig.add_subplot(1, 2, 2) 
ax1 = sns.lineplot(x = successful_revenue_year.release_date, y = np.log10(successful_revenue_year.revenue.astype('float')))
ax1.set_title("Revenue of successful movies by year")
ax1.set_xlabel("Year")

plt.show()

There is a clear positive correlation between revenue of movies and year, and this correlation is more pronounced in the successful movies dataset.

Next, we evaluate the months where most movies were released, and the months where most successful movies were released.

In [70]:
all_month = all_mdb.release_date.dropna().str.slice(5,7).value_counts().sort_index().to_frame()
all_month.columns = ["Movies Released"]
all_month.index = ["Jan","Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
all_month.head()
Out[70]:
Movies Released
Jan 3154
Feb 2599
Mar 3010
Apr 2731
May 2589
In [71]:
successful_month = successful_movies.release_date.dropna().str.slice(5,7).value_counts().sort_index().to_frame()
successful_month.columns = ["Movies Released"]
successful_month.index = ["Jan","Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
successful_month.head()
Out[71]:
Movies Released
Jan 195
Feb 201
Mar 237
Apr 187
May 243
In [72]:
fig = plt.figure(figsize=(20,5))

ax0 = fig.add_subplot(1, 2, 1)
ax0 = sns.barplot(x = all_month.index.values.flatten(), y = all_month.values.flatten())
for i in ax0.containers:
    ax0.bar_label(i,)
ax0.set_title("(All movies) Number of movies released by month")
ax0.set_xlabel("Month")
ax0.set_ylabel("Number of movies released")

ax1 = fig.add_subplot(1, 2, 2)
ax1 = sns.barplot(x = successful_month.index.values.flatten(), y = successful_month.values.flatten())
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_title("(Successful movies) Number of movies released by month")
ax1.set_xlabel("Month")
ax1.set_ylabel("Number of movies released")

plt.show()

Next, we investigate the best weekday to release a movie, by plotting the distribution of movie releases by weekdays.

In [73]:
all_day = pd.to_datetime(all_mdb.release_date.dropna()).apply(lambda t: t.weekday()).value_counts().sort_index().to_frame()
all_day.columns = ["Movies Released"]
all_day.index = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
all_day
Out[73]:
Movies Released
Mon 1911
Tue 2553
Wed 5949
Thu 6880
Fri 13879
Sat 2958
Sun 1632
In [74]:
successful_day = pd.to_datetime(successful_movies.release_date.dropna()).apply(lambda t: t.weekday()).value_counts().sort_index().to_frame()
successful_day.columns = ["Movies Released"]
successful_day.index = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
successful_day
Out[74]:
Movies Released
Mon 94
Tue 181
Wed 678
Thu 658
Fri 1136
Sat 118
Sun 83
In [75]:
fig = plt.figure(figsize=(20,10))

ax0 = fig.add_subplot(2, 2, 1)
ax0 = sns.barplot(x = all_day.index.values.flatten(), y = all_day.values.flatten())
for i in ax0.containers:
    ax0.bar_label(i,)
ax0.set_title("(All movies) Number of movies released by weekday")
ax0.set_xlabel("Weekday")
ax0.set_ylabel("Number of movies released")

ax1 = fig.add_subplot(2, 2, 2)
ax1 = sns.barplot(x = successful_day.index.values.flatten(), y = successful_day.values.flatten())
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_title("(Successful movies) Number of movies released by weekday")
ax1.set_xlabel("Weekday")
ax1.set_ylabel("Number of movies released")

ax3 = fig.add_subplot(2, 2, 3)
ax3 = sns.barplot(x = all_day.index.values.flatten(), y = (all_day.values.flatten()/all_day.values.sum() * 100).round(decimals=2))
for i in ax3.containers:
    ax3.bar_label(i,)
ax3.set_title("(All movies) Percentage of movies released by weekday")
ax3.set_xlabel("Weekday")
ax3.set_ylabel("Percentage of movies released")

ax4 = fig.add_subplot(2, 2, 4)
ax4 = sns.barplot(x = successful_day.index.values.flatten(), y = (successful_day.values.flatten()/successful_day.values.sum() * 100).round(decimals=2))
for i in ax4.containers:
    ax4.bar_label(i,)
ax4.set_title("(Successful movies) Percentage of movies released by weekday")
ax4.set_xlabel("Weekday")
ax4.set_ylabel("Percentage of movies released")

fig.tight_layout()

plt.show()

We can see that the distribution of percentage of movies released by weekday is roughly the same in both successful movies and all movies. The graphs show a bell curve shape, with Friday by far the day with most number of movie releases with 38%, in both successful and all movies, followed by Thursday and Wednesday. The day with least overall and least successful movies released is Sunday, with 4.56% in all movies and 2.82% in successful movies.

This indicates a correlation between day of week and success, although it may not be causation.

Lastly, we investigate the best date of month to release a movie.

In [76]:
all_date = pd.to_datetime(all_mdb.release_date.dropna()).apply(lambda t: t.day).value_counts().sort_index().to_frame()
all_date.columns = ["Movies Released"]

successful_date = pd.to_datetime(successful_movies.release_date.dropna()).apply(lambda t: t.day).value_counts().sort_index().to_frame()
successful_date.columns = ["Movies Released"]
In [77]:
fig = plt.figure(figsize=(20,10))

ax0 = fig.add_subplot(2, 2, 1)
ax0 = sns.barplot(x = all_date.index.values.flatten(), y = all_date.values.flatten())
for i in ax0.containers:
    ax0.bar_label(i,)
ax0.set_title("(All movies) Number of movies released by date of month")
ax0.set_xlabel("Date of month")
ax0.set_ylabel("Number of movies released")

ax1 = fig.add_subplot(2, 2, 2)
ax1 = sns.barplot(x = successful_date.index.values.flatten(), y = successful_date.values.flatten())
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_title("(Successful movies) Number of movies released by date of month")
ax1.set_xlabel("Date of month")
ax1.set_ylabel("Number of movies released")

ax3 = fig.add_subplot(2, 2, 3)
ax3 = sns.barplot(x = all_date.index.values.flatten(), y = (all_date.values.flatten()/all_date.values.sum() * 100).round(decimals=2))
for i in ax3.containers:
    ax3.bar_label(i,)
ax3.set_title("(All movies) Percentage of movies released by date of month")
ax3.set_xlabel("Date of month")
ax3.set_ylabel("Percentage of movies released")

ax4 = fig.add_subplot(2, 2, 4)
ax4 = sns.barplot(x = successful_date.index.values.flatten(), y = (successful_date.values.flatten()/successful_date.values.sum() * 100).round(decimals=2))
for i in ax4.containers:
    ax4.bar_label(i,)
ax4.set_title("(Successful movies) Percentage of movies released by date of month")
ax4.set_xlabel("Date of month")
ax4.set_ylabel("Percentage of movies released")

fig.tight_layout()

plt.show()

We notice that the distribution of movie releases by date of month is different across all and successful movies.

For all movies, the 1st of month is by far the highest at 7.63% of all movies, and 31st of month is lowest at 1.72% (perhaps due to the fact that half of the months do not have the 31st), with the rest of the dates having a roughly equal distribution. The second lowest is 30th at 2.61%.

For successful movies, the 25th of month is the highest at 4.58%, with the 1st of month being second place at 4.51%. Again, the lowest is the 31st of month at 1.53%. The rest of the dates have a slight bell curve shape on the graph, but this could be due to the scale being 2x smaller than in all movies, so the rest of the dates also have roughly equal distribution.

Research Question 5: Does popularity correlate with revenue?¶

In this question, we investigate two types of popularity - one by IMDB rating, and one by TMDB's popularity.

IMDB rating is a simpler metric: it is the average rating of a movie by IMDB users, however, it does not account for other factors that make up popularity.

TMDB's popularity is a more complex metric, consisting of (according to its website), number of votes for the day, number of views for the day, number of users who marked it as a "favourite" for the day, number of users who added it to their "watchlist" for the day, release date, number of total votes, previous days score. As we can see, it is a more nuanced metric and hence a better metric of popularity.

Firstly, we investigate how popularity changes over years, then how popularity changes against revenue, by genre.

In [78]:
def popularity(factor, metric): # where factor is "Year", "Revenue". metric is "IMDB rating", "TMDB popularity", 

    nrow = 4
    ncol = 6
    
    fig = plt.figure(figsize=(16,10))
    gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0)
    axes = gs.subplots(sharex=True, sharey=True)
    
    count = 0

    for r in range(nrow):
        for c in range(ncol):
            
            if (factor == "Year"): 
                a = popularity_revenue.loc[popularity_revenue.genres == popularity_revenue.genres.unique()[count]].dropna(subset=["release_date"])
            
                if (metric == "IMDB rating"):
                    sns.scatterplot(x=a.release_date.astype(int),y=a.averageRating,ax=axes[r,c], alpha = 0.3)
                    axes[r,c].set_ylim(0, 10)
                elif (metric == "TMDB popularity"):
                    sns.scatterplot(x=a.release_date.astype(int),y=a.popularity,ax=axes[r,c], alpha = 0.3)
                    axes[r,c].set_ylim(0, 500) # limit TMDB popularity to 500 because the outliers skew data
                    
            elif (factor == "Revenue"):
                popularity_revenue.loc[:,"revenue"] = popularity_revenue.revenue.replace(0,np.nan)
                a = popularity_revenue.loc[popularity_revenue.genres == popularity_revenue.genres.unique()[count]].dropna(subset=["revenue"])
                a.revenue = np.log10((a.revenue).astype(float))
                a.popularity = a.popularity.astype(float)
                a.averageRating = a.averageRating.astype(float)
                
                if (metric == "IMDB rating"):
                    sns.regplot(x=a.revenue.astype(float),y=a.averageRating,ax=axes[r,c], scatter_kws={'alpha':0.3}, line_kws={'color': 'red'})
                    axes[r,c].set_ylim(0, 10)
                
                elif (metric == "TMDB popularity"): 
                    sns.regplot(x=a.revenue.astype(float),y=a.popularity,ax=axes[r,c], scatter_kws={'alpha':0.3}, line_kws={'color': 'red'})
                    axes[r,c].set_ylim(0, 500)
                
            axes[r,c].set_title(popularity_revenue.genres.unique()[count])
            axes[r,c].set_ylabel("") 
            axes[r,c].set_xlabel("")
            
            count+=1
            if (count == 23):
                break
                
    for ax in axes.flat:
                ax.label_outer()
    

    if (factor == "Year"):
        fig.supxlabel("Release Year")
        fig.supylabel(metric)
        plt.suptitle(metric +" of genres over time")
    elif (factor == "Revenue"):
        fig.supxlabel("Movie Revenue")
        fig.supylabel(metric)
        plt.suptitle(metric + " of genres against revenue")
    plt.show()
    
    
In [79]:
popularity_revenue = pd.DataFrame([all_mdb.popularity, all_mdb.averageRating, all_mdb.numVotes, all_mdb.genres, all_mdb.revenue, all_mdb.release_date]).T.explode("genres").reset_index(drop=True)
popularity_revenue.loc[:,"release_date"] = popularity_revenue.release_date.str.slice(0,4)

popularity("Year", "IMDB rating")

From these scatterplots of IMDB rating of genres over time, we can observe that over time, the range of IMDB ratings increase, and that older movies tend to have higher ratings (as the spread is smaller). This may be due to the fact that there are lesser older movies than newer movies, so older movies have a smaller sample size. By this graph, we can also see that IMDB ratings do not reflect the popularity of movies, as we would not expect older movies to have the same popularity as newer movies, and instead may better reflect the quality of movies (eg. how good the audience found it to be).

In [80]:
popularity("Year", "TMDB popularity")

From these scatterplots of TMDB popularity of genres over time, we observe that there is a clear exponential upward trend of popularity over time in nearly all genres, except for Western, Film Noir, Musical and Adult. This may be due to movies getting more and more "mainstream" and popular, and technological advancements in the movie scene making it more exciting to watch a movie. From this, we can also infer that TMDB popularity is a passable reflection of a movie's popularity in the real world.

In [81]:
popularity("Revenue", "IMDB rating")

From these linear regression plots of IMDB ratings against revenue by genre, we can see that there is no strong correlation between movie revenue and IMDB rating, as many points are just clustered together in 1 big cluster in the center.

In [82]:
popularity("Revenue", "TMDB popularity")

From these scatterplots of popularity of genres against revenue, we can see that there is a positive trend between movie revenue and popularity across most genres, although this trend seems to be more exponential instead of linear. This trend seems more pronounced in genres like Adventure, Drama, Fantasy, Animation and Sci-Fi.

In [83]:
reg_residual_subplot("Revenue", "Popularity", np.log10(popularity_revenue.revenue.astype(float)), popularity_revenue.popularity.astype(float), (), (0,500))
Pearson's Coefficient is 0.17577145298091046 P-value is 1.756177612781108e-191

From this plot, we observe a weak positive correlation between popularity against revenue. From the residual plot, we observe that the variance of the residuals increases with revenue, so prediction will be less accurate when revenue is large.

Results Findings & Conclusion¶

For each research question, summarize in 2-3 visualizations which will answer the question. Intrepret the results accordingly and give your observation and conclusion. The visualizations should be well presented (apply what you have learnt in Chapter 9 on data communication). The plots shown here could be an enhanced version of the EDA plots, or presented in another format.

This function produces a linear regression plot and residual plot, and prints the Pearson's coefficient and P-value.

In [84]:
def reg_residual_subplot(x_name, y_name, x_df,y_df, x_lim, y_lim, title):
    # transparent = True or False
    fig = plt.figure(figsize=(20,4))
    ax0 = fig.add_subplot(1, 2, 1)
    
    if (len(x_lim) == 0):
        x_lim = (None, None)
    if (len(y_lim) == 0):
        y_lim = (None, None)

        
    sns.regplot(x=x_df.astype(float), y=y_df.astype(float), scatter_kws={'alpha':0.3}, line_kws={'color': 'red'}, ax=ax0)
    ax0.set_xlim(x_lim)
    ax0.set_ylim(y_lim)
    ax0.set_xlabel(x_name)
    ax0.set_ylabel(y_name)
    ax0.set_title("Linear Regression Plot of " + y_name + " against " + x_name)
    
    ax1 = fig.add_subplot(1, 2, 2)
    
    sns.residplot(x=x_df.astype(float), y=y_df.astype(float), scatter_kws={'alpha':0.3}, ax=ax1)
    ax1.set_xlim(x_lim)
    ax1.set_ylim(y_lim)
    ax1.set_xlabel(x_name)
    ax1.set_ylabel(y_name)
    ax1.set_title("Residual Plot of " + y_name + " against " + x_name)
    
    coef = pd.DataFrame(x_df.astype(float))
    coef.columns = ["x"]
    coef["y"] = y_df.astype(float)
    coef = coef.dropna()
    
    pearson_coef, p_value = stats.pearsonr(coef["x"], coef["y"])
    print("Pearson's Coefficient is " + str(pearson_coef) + " P-value is " + str(p_value))
    
    fig.suptitle(title)
    
    plt.show()

This function produces a subplot containing bar plots of frequency of the inputted factor by genre.

In [85]:
def barchart_genre_subplot(title,x_name,y_name, df, factor,all_movies): 
    # Factor: "Original Language"
    # all_movies: True, is using all movies dataset, False, is using successful movies dataset
    
    genres = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror',
       'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
       'Sport', 'Thriller', 'War', 'Western', 'Reality-TV', 'Adult']
    
    languages = ["English","Hindi","French","Japanese","Spanish","Russian","German","Italian","Danish","Korean", "Chinese"]
    colours = ["tab:blue","tab:orange","tab:green","tab:red","tab:purple","tab:brown","tab:pink","tab:gray","tab:olive","tab:cyan","black"]

    colours = pd.DataFrame(list(zip(languages, colours)), columns =["country","colours"])
     
    nrow = 4
    ncol = 6

    fig = plt.figure(figsize=(20,10))
    gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0)
    axes = gs.subplots(sharex=True, sharey=True)
    
    count = 0

    for r in range(nrow):
        for c in range(ncol):
            
            if (factor == "Original Language"):
                a = df.loc[(genres[count])].dropna()
                if (title == "Percentage frequency of original language by genre in successful movies"):
                    a = a.loc[a.frequency > 5]
                elif (all_movies == False):
                    a = a.loc[a.frequency > 20]
                else:
                    a = a.head(3)
                if a.frequency.sum() == 0:
                    axes[r,c].set_title(genres[count])
                    axes[r,c].set_ylabel("") 
                    axes[r,c].set_xlabel("")            
                    count+=1
                    continue
                bar_colour = pd.merge(a, colours, on="country", how="left")
                a.T.plot(kind='bar',ax = axes[r,c],color = bar_colour.colours.tolist())

            axes[r,c].set_title(genres[count])
            axes[r,c].set_ylabel("") 
            axes[r,c].set_xlabel("")
            axes[r,c].get_legend().remove()
            
            for bars in axes[r,c].containers:
                axes[r,c].bar_label(bars)
            
            count+=1
            if (count == len(df.index.get_level_values(0).unique())):
                break
                
    for ax in axes.flat:
        ax.label_outer()

    lines = []
    labels = []
  
    for ax in fig.axes:
        Line, Label = ax.get_legend_handles_labels()
        lines.extend(Line)
        labels.extend(Label)
  
    labels_df = pd.DataFrame(list(zip(lines,labels)), columns =["lines","labels"])
    labels_df = labels_df.drop_duplicates(subset = "labels") 
    fig.legend(labels_df.lines, labels_df.labels, loc='upper right')
    
    fig.supxlabel(x_name)
    fig.supylabel(y_name)
    plt.suptitle(title)
        
    plt.show()    

This function generates a wordcloud subplot by genre.

In [167]:
def wordcloud_subplot(df, title):
    nrow = 4
    ncol = 6
    
    genres = ['Action', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Drama', 'Family', 'Fantasy', 'Film-Noir', 'History', 'Horror',
       'Music', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
       'Sport', 'Thriller', 'War', 'Western', 'Reality-TV', 'Adult']
    
    fig = plt.figure(figsize=(16,10))
    gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0)
    axes = gs.subplots(sharex=True, sharey=True)
    
    count = 0

    for r in range(nrow):
        for c in range(ncol):
            
            genre = genres[count]
            wc = WordCloud(background_color="white")
            wc.generate_from_frequencies(df.loc[genre].sort_values(by="frequency",ascending=False).head(10).to_dict()["frequency"])
 
            axes[r,c].imshow(wc, interpolation="bilinear")
            axes[r,c].axis("off")
                
            axes[r,c].set_title(genre)
            
            count+=1
            if (count == len(df.index.get_level_values(0).unique())):
                axes[r,c].axis("off")
                break
                
    for ax in axes.flat:
                ax.label_outer()
    
    plt.suptitle(title)
    
    plt.show()

This function plots a barplot of writers / directors ranked by Value Above Replacement (VAR).

In [168]:
def VAR(job, title): # job = writers or directors
    profession = job
    
    job = all_mdb.loc[:,["imdb_id", job, "budget", "revenue", "startYear"]].explode(job)
    job.columns = ["imdb_id", "nconst", "budget", "revenue", "startYear"]
    job.budget = job.budget.replace(0,np.nan)
    job = job.dropna(subset=["budget"])
    job["Net Profit"] = job["revenue"] - job["budget"]
    
    job = pd.merge(job, imdb_names, on="nconst").loc[:,["imdb_id","nconst", "primaryName", "budget","revenue","Net Profit", "startYear", "deathYear"]]
    job = job.loc[(job.startYear >= 2000) & (job.deathYear.isna())].loc[:,:"startYear"].reset_index(drop=True)
    
    job_count = job.groupby(['primaryName'],  as_index=False)["Net Profit"].count()
    job = job.loc[job.isin(job_count.loc[job_count["Net Profit"] > 10].primaryName.values).primaryName].reset_index(drop=True)

    job_ranked = job.groupby(['primaryName'],  as_index=False)["Net Profit"].mean().sort_values(by='Net Profit', ascending=False).reset_index(drop=True)
    job_ranked['VAR'] = (job_ranked['Net Profit']/job_ranked['Net Profit'].mean())
    
    # graph
    plt.figure(figsize=(20,10))

    ax = sns.barplot(x=(job_ranked.VAR.head(20).astype(float)).round(decimals=2), y=job_ranked.primaryName.head(20),orient="h")
    for i in ax.containers:
        ax.bar_label(i,)

    plt.ylabel("Name of " + profession)
    plt.xlabel("VAR")
    plt.title(title)
    
    return job_ranked

This function produces a scatterplot of factor by genre.

In [215]:
def popularity(factor, metric, title): # where factor is "Year", "Revenue". metric is "IMDB rating", "TMDB popularity", 

    nrow = 4
    ncol = 6
    
    fig = plt.figure(figsize=(16,10))
    gs = fig.add_gridspec(nrow, ncol, hspace=0.2, wspace=0)
    axes = gs.subplots(sharex=True, sharey=True)
    
    count = 0

    for r in range(nrow):
        for c in range(ncol):
            
            if (factor == "Year"): 
                a = popularity_revenue.loc[popularity_revenue.genres == popularity_revenue.genres.unique()[count]].dropna(subset=["release_date"])
            
                if (metric == "IMDB rating"):
                    sns.scatterplot(x=a.release_date.astype(int),y=a.averageRating,ax=axes[r,c], alpha = 0.3)
                    axes[r,c].set_ylim(0, 10)
                elif (metric == "TMDB popularity"):
                    sns.scatterplot(x=a.release_date.astype(int),y=a.popularity,ax=axes[r,c], alpha = 0.3)
                    axes[r,c].set_ylim(0, 500) # limit TMDB popularity to 500 because the outliers skew data
                    
            elif (factor == "Revenue"):
                popularity_revenue.loc[:,"revenue"] = popularity_revenue.revenue.replace(0,np.nan)
                a = popularity_revenue.loc[popularity_revenue.genres == popularity_revenue.genres.unique()[count]].dropna(subset=["revenue"])
                a.revenue = np.log10((a.revenue).astype(float))
                a.popularity = a.popularity.astype(float)
                a.averageRating = a.averageRating.astype(float)
                
                if (metric == "IMDB rating"):
                    sns.regplot(x=a.revenue.astype(float),y=a.averageRating,ax=axes[r,c], scatter_kws={'alpha':0.3}, line_kws={'color': 'red'})
                    axes[r,c].set_ylim(0, 10)
                
                elif (metric == "TMDB popularity"): 
                    sns.regplot(x=a.revenue.astype(float),y=a.popularity,ax=axes[r,c], scatter_kws={'alpha':0.3}, line_kws={'color': 'red'})
                    axes[r,c].set_ylim(0, 500)
                
            axes[r,c].set_title(popularity_revenue.genres.unique()[count])
            axes[r,c].set_ylabel("") 
            axes[r,c].set_xlabel("")
            
            count+=1
            if (count == 23):
                break
                
    for ax in axes.flat:
                ax.label_outer()
    

    if (factor == "Year"):
        fig.supxlabel("Release Year")
        fig.supylabel(metric)
        plt.suptitle("Popularity of genres over time")
    elif (factor == "Revenue"):
        fig.supxlabel("Movie Revenue")
        fig.supylabel("Popularity")
        plt.suptitle(title)
    plt.show()
    
    

Research Question 1: What are the best genres for a movie?¶

Firstly, we find what genres are more likely to be successful, or the percentage of successful movies in each genre.

In [86]:
fig = plt.figure(figsize=(15,12))
ax0 = fig.add_subplot(1, 2, 1)

ax0 = sns.barplot(y=genres_percentage.index.values ,x=genres_percentage.values, orient='h')
for i in ax0.containers:
    ax0.bar_label(i,)
ax0.set_title("Percentage of frequency of genres in all movies")
ax0.set_xlabel("Percentage of frequency (%)")
ax0.set_ylabel("Genres")

ax1 = fig.add_subplot(1, 2, 2) 
ax1 = sns.barplot(y=successful_genres_percentage.index.values ,x=successful_genres_percentage.genres, orient='h')
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_title("Percentage of successful movies for each genre")
ax1.set_xlabel("Percentage of successful movies (%)")
ax1.set_ylabel("Genres")

fig.suptitle("Genres with more movies do not correlate with success. \nAdventure, Family, Animation, Sci-Fi and Horror have the highest chance of being successful.")
plt.show()

The first bar chart shows the percentage of frequency of genres in all movies.

The second bar chart shows the percentage of successful movies for each genre. Film-Noir is distinctly first, with a success rate of 44.19%, and the top 5 is Film Noir, Adventure, Family, Musical and Animation. Adult is last at 0% of adult movies being successful, and the bottom 5 is Adult, History, War, Biography and Sport.

However, some of these results could be due to a smaller sample size. Film-Noir only has 43 movies, so this could skew the percentage of success and explain why it has the highest success rate. Similarly, "Musical" genre only has 156, potentially explaining why it is in the top 5. Disregarding these two genres, the top 5 most successful genres would be Adventure, Family, Animation, Sci-Fi and Horror (in order), which have sufficiently large sample sizes.

Comparing the first and second bar chart, we can also see that the frequency of a genre does not directly correlate to a movie's success, as the top 5 genres in the first bar graph are not in the top 5 genres in the second bar graph.

Next, we find the distribution of revenue for each genre.

In [87]:
plt.figure(figsize=(20,7)) 
sns.pointplot( y=np.log10(genre_mean.values.astype('float')), x=genre_mean.index.values)
plt.title("Adventure, Animation, Sci-Fi, Action, Fantasy have the highest average revenue.")
plt.xlabel("Genres")
plt.ylabel("Revenue (by powers of 10)")
plt.show()

median_order = genres.groupby(by=["genres"])["revenue"].median().sort_values(ascending=False).index.tolist()
median_order.remove("Adult")
plt.figure(figsize=(20,7)) 
sns.boxplot( y=np.log10(genres["revenue"].astype('float')), x=genres["genres"], order = median_order)
plt.xlabel("Genres")
plt.ylabel("Revenue (by powers of 10)")
plt.title("Adventure, Animation, Family, Action, Sci-Fi have the highest median revenue.")
plt.show()

From the pointplot, we see that Film-Noir has the lowest mean revenue by far, and the boxplot confirms that Film-Noir has the lowest mean, smallest and lowest IQR, and little outliers. Hence, we can conclude that Film-Noir is the worst genre for revenue.

From the pointplot, we can see that the mean of Crime, Biography, Music, History, Musical, Drama, Romance, Sport, Horror is around the same.

From the pointplot, we see that the genres with top 5 mean are Adventure, Animation, Sci-Fi, Action and Fantasy. From the boxplot, genres with top 5 median are Adventure, Animation, Family, Action and Sci-Fi, which is somewhat similar.

The boxplot has many outliers towards the bottom and most of the boxplots are not skewed.

Hence, we can conclude that Adventure is the genre which makes the most revenue, followed by Animation, Family, Action, Sci-Fi. This mostly agrees with our prior results on the percentage of successful movies for each genre.

In conclusion, Adventure is the best genre as it makes the most average and median revenue and has the highest chance of success. The other top 5 genres are Animation, Sci-Fi, Action, Family.

Research Question 2: What other factors affect a movie's revenue?¶

We investigate quantitative and qualitative factors that may affect a movie's revenue.

Firstly, quantitative factors.

In [88]:
reg_residual_subplot("Budget", "Revenue", all_mdb.budget, all_mdb.revenue, (), (), "There is a strong positive correlation between budget and revenue.")
Pearson's Coefficient is 0.7467009218280032 P-value is 0.0

As Pearson's Coefficient is 0.74, which is close to 1 and the P-value is 0, there is a large positive correlation between budget and revenue with a strong certainty in the result. This strong positive correlation can be explained by movies with more budget generally have better scripts, clothing, visuals and actors, which makes movies more engaging for audiences to watch, hence raking in more revenue.

We observe that the variance of the residuals increases with budget, so the correlation between budget and revenue is less accurate as budget increases. This means that beyond a certain budget, the budget yields diminishing returns on revenue.

In [89]:
reg_residual_subplot("Runtime", "Revenue", all_mdb.runtimeMinutes, all_mdb.revenue, (0,300), (0,1.0e9),"There is a weak positive correlation between budget and revenue.")
Pearson's Coefficient is 0.17891001036238946 P-value is 1.292560240450447e-79

As Pearson's Coefficient is 0.17, which is close to 1 and the P-value is 1.29e-79, which is close to 0, there is a weak positive correlation between budget and revenue with a strong certainty in the result.

The linear regression plot shows that movies with higher revenue tend to be around the 70min-170min cluster, which could be because audiences are less inclined to watch movies beyond 200 minutes as the movie would be too long.

This weak correlation can be explained. Audiences are generally willing to watch movies between 70min-170min, which is not too short but not too long, and have no preference for runtimes in this range.

For quantitative factors, budget has a strong positive correlation with revenue but runtime has a weak positive correlation with revenue.

Next, qualitative factors.

In [90]:
barchart_genre_subplot("Percentage frequency of original language by genre in all movies", "Original Languages", "Percentage frequency of language", percent_all_original_lang, "Original Language", True)
In [91]:
barchart_genre_subplot("Percentage frequency of original language by genre in successful movies", "Original Languages", "Percentage frequency of language", percent_successful_original_lang, "Original Language", False)

The qualitative factor we investigate here is original language of the movie.

We can see that in the subplot with all movies, English has the highest frequency, with the genre with highest frequency of English being Reality-TV with 100% and genre with lowest frequency of English being History at 49.67%. Japanese, Hindi, and French also appear with relatively high frequencies in certain genres.

However in the subplot with only successful movies, we see that English dominates even further, and is the main original language of all genres. Genre with highest frequency of English is Western at 92.86%, genre with lowest frequency of English is History at 67.14%. Hindi is the second most frequent language, with Musical having the highest frequency of Hindi at 19.61%. History appears to be a genre with more diverse languages, as it also has other languages like Chinese, Korean, and Hindi. Japanese makes up 9.6% of Animation, and Italian makes up 7.14% of Western.

As a result, we conclude that English is the best original language for a movie, which may be due to its global prominence (English being the "language" of today's globalised world). However, other languages like Japanese, Hindi, Chinese, Korean and Italian perform relatively well too for certain genres.

In [92]:
top_10_original_lang = all_mdb.loc[all_mdb.isin(all_mdb.groupby("original_language")["revenue"].median().sort_values(ascending=False).head(13).index.values).original_language]

# Change country codes to language name
language_codes.columns = ["original_language", "country"]

top_10_original_lang = pd.merge(top_10_original_lang, language_codes, how = "left", on="original_language")

# sort boxplots by descending median, remove the outliers
median_order = top_10_original_lang.groupby("country")["revenue"].median().sort_values(ascending=False).head(13).index.tolist()
median_order.remove("Abkhazian")
median_order.remove("Afrikaans")
median_order.remove("Malay")

plt.figure(figsize=(15,7))
sns.boxplot(x=top_10_original_lang.country, y=np.log10(top_10_original_lang.revenue).astype(float), order = median_order)
plt.xticks(rotation=90)
plt.xlabel("Original Language")
plt.title("Chinese, followed by English, is the original language with the highest median.")

plt.show()

English has the second highest median, after Chinese. However, this may be due to the fact that there are much less Chinese movies than English movies, hence Chinese movies have a smaller sample size.

Overall, we can conclude that English is the best or one of the best languages to use for a movie's original language to earn the most revenue.

There can be multiple spoken languages in a movie, but only 1 original language. Next, we investigate spoken languages in a movie.

In [93]:
# sort boxplots by descending median
filtered_spoken_languages = spoken_languages.loc[spoken_languages.isin(spoken_languages.groupby("country")["revenue"].count().loc[spoken_languages.groupby("country")["revenue"].count() >= 100].index).country]
median_order = filtered_spoken_languages.groupby("country")["revenue"].median().sort_values(ascending=False).index.tolist()

plt.figure(figsize=(15,7))
sns.boxplot(x=filtered_spoken_languages.country, y=np.log10(filtered_spoken_languages.revenue).astype(float), order = median_order)
plt.xticks(rotation=90)
plt.xlabel("Spoken Language")
plt.title("Chinese-speaking movies have the highest median revenue.")
plt.show()

This graph is arranged in order of decreasing median, hence we can conclude that Chinese-speaking movies have the highest median revenue and Turkish-speaking movies have the lowest median revenue. The boxplots have many outliers towards the bottom, especially English, and English has the largest range. Hindi has the smallest IQR and also some outliers towards the top.

We can conclude that the languages to speak in a movie to earn more revenue are Chinese, Latin, Japanese, Korean, Spanish and English. However, as Chinese, Latin and Korean have a small sample size, the results may not be so accurate as they may be skewed by a few movies with high revenue.

Lastly, the last qualitative factor, keywords in a movie.

In [166]:
good_keywords = successful_keywords.keywords.value_counts().loc[successful_keywords.keywords.value_counts() >= 50].index.tolist()
successful_keywords_revenue = successful_movies.loc[:,["revenue","keywords"]].explode("keywords").dropna().groupby("keywords").revenue.mean().sort_values(ascending=False).to_frame().reset_index()
successful_keywords_revenue = successful_keywords_revenue.loc[successful_keywords_revenue .isin(good_keywords).keywords].head(10).reset_index(drop=True)
successful_keywords_revenue
Out[166]:
keywords revenue
0 superhero 6.404721e+08
1 based on comic 6.013988e+08
2 magic 4.993211e+08
3 aftercreditsstinger 4.244748e+08
4 based on young adult novel 4.230694e+08
5 sequel 3.668738e+08
6 duringcreditsstinger 3.455082e+08
7 alien 3.204866e+08
8 spy 2.801209e+08
9 dystopia 2.447500e+08

These are the keywords in successful moves that made the most mean revenue. However, the frequencies of these keywords in successful movies is quite low, so the keywords have a small sample size and may not affect revenue too much.

The use of keywords vary by genre, as different genres have different plot types. Hence, we plot wordclouds of popular keywords by genre in successful movies.

In [164]:
wordcloud_subplot(successful_genres_keywords, "Word Clouds of popular keywords by genre in successful movies")

We should use the corresponding keywords for the genre our movie is in, to increase the chance of success. We should also base our movie on existing literature (eg. books, comics, real life story) to increase the chance of success, as we notice that this keyword appears frequently across all genres. Similarly, being a sequel increases the chance of a movie's success.

Research Question 3: Who should you hire to produce a movie? ¶

In this question, we investigate actors, directors, writers and production companies. For actors, directors, and writers, we determine who brought the most value to a movie, using a statistic called Value Above Replacement (VAR). VAR of a person = Average net profit from person / Average net profit of movies.

In [169]:
fig = plt.figure(figsize=(20,10))

ax0 = fig.add_subplot(1, 2, 1)
ax0 = sns.barplot(x=(actor_ranked.VAR.head(20).astype(float)).round(decimals=2), y=actor_ranked.primaryName.head(20),orient="h")
for i in ax0.containers:
    ax0.bar_label(i,)
ax0.set_ylabel("")
ax0.set_xlabel("")
ax0.set_title("Actors")
    
ax1 = fig.add_subplot(1, 2, 2)
ax1 = sns.barplot(x=(actress_ranked.VAR.head(20).astype(float)).round(decimals=2), y=actress_ranked.primaryName.head(20),orient="h")
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_ylabel("")
ax0.set_xlabel("")
ax1.set_title("Actresses")

plt.suptitle("VAR of top 20 actors by gender")
fig.supylabel("Name of actor")
fig.supxlabel("VAR")
plt.show()

Emma Watson has the highest VAR over both genders of 5.01%. Robert Downey Jr is the male actor with the highest VAR at 4.44%. You should aim to hire these actors/actresses when producing your movie, as these actors/actresses bring in a lot of value to a movie (VAR > 1).

In [171]:
directors_ranked = VAR("directors", "VAR of top 20 directors")
In [172]:
writers_ranked = VAR("writers", "VAR of top 20 writers")

Similarly, you should aim to hire these writers and directors.

But which one of these - writers, directors and actors - bring the most to a movie's success?

In [195]:
fig = plt.figure(figsize=(15,7))

ax0 = fig.add_subplot(1, 2, 1)
ax0 = sns.boxplot(x = job_ranked.Job, y = job_ranked.VAR)

ax1 = fig.add_subplot(1, 2, 2)
ax1 = sns.violinplot(x = job_ranked.Job, y = job_ranked.VAR)

fig.suptitle("Generally, Directors bring the most to a movie's success, as they have the highest median VAR.")
plt.show()

From the boxplot, we can see that directors have the highest median VAR. Directors also has the smallest IQR and is right-skewed, showing that the general quality and VAR of directors is high and varies less than actors/actresses and writers. As such, we can conclude that Directors bring the most to a movie's success, and one should focus on spending budget on hiring a directors.

Writers bring the least to a movie's success, as they have the smallest median VAR and also the largest IQR. Hence, one should focus least on spending budget on hiring a writer.

All boxplots have outliers towards the top, especially for Actors and Actresses. All boxplots are left-skewed except for directors (directors is right-skewed).

We observe that the violinplots of Actors and Actresses are very similar, except that Actresses have higher outliers towards the top (due to Emma Watson). Hence, we can conclude that gender plays no difference in how much success/revenue they would bring to the movie, and we should focus equally on spending budget on hiring an actor vs an actress.

Lastly, we investigate production companies.

In [196]:
#plt.figure(figsize=(15,15))
fig = plt.figure(figsize=(15,15))
# plotting scatter plot
sc = sns.regplot(x=companies_scatter.revenue,y=companies_scatter.popularity)

# Loop for annotation of all points
for i in range(len(companies_scatter.revenue)):
    plt.annotate(companies_scatter.production_companies[i], (companies_scatter.revenue[i] + 0.4, companies_scatter.popularity[i] + 0.4))

pearson_coef, p_value = stats.pearsonr(companies_scatter.revenue, companies_scatter.popularity)
print("Pearson's Coefficient is " + str(pearson_coef) + " P-value is " + str(p_value))
plt.show()
Pearson's Coefficient is 0.8327430006416182 P-value is 2.415699718774758e-07

We plot the mean average scores of the top 25 companies who released the most movies across Revenue and Popularity to find out which companies are underperforming in terms of revenue and popularity.

From the regression plot, we can see that Walt Disney Pictures is the best production companies in terms of revenue and popularity, scoring exceptionally high in revenue and popularity across their released movies. Companies such as CNC, BBC Films, Fox Searchlight Pictures are the worst production companies (out of the top 25), underperforming in both revenue and popularity.

Companies above the line under-perform in revenue, whereas companies under the line under-perform in popularity.

Ideally, we should seek to emulate the better production companies like Walt Disney Pictures, or hire them to produce our movie.

Research Question 4: When should one release a movie?¶

To answer this, we evaluate revenue by month, day of week and date of month. Firstly, by month.

In [199]:
fig = plt.figure(figsize=(20,5))

ax0 = fig.add_subplot(1, 2, 1)
ax0 = sns.barplot(x = all_month.index.values.flatten(), y = all_month.values.flatten())
for i in ax0.containers:
    ax0.bar_label(i,)
ax0.set_title("(All movies) Number of movies released by month")
ax0.set_xlabel("Month")
ax0.set_ylabel("Number of movies released")

ax1 = fig.add_subplot(1, 2, 2)
ax1 = sns.barplot(x = successful_month.index.values.flatten(), y = successful_month.values.flatten())
for i in ax1.containers:
    ax1.bar_label(i,)
ax1.set_title("(Successful movies) Number of movies released by month")
ax1.set_xlabel("Month")
ax1.set_ylabel("Number of movies released")

plt.show()

1st graph (All movies):

We can see that September and October are the months with most movie releases, and July is the month with the least releases. There is a general downward trend from January to July, then a concave downwards curve from July to November (increasing from July to Sep, decreasing from Sep to Nov), then an increase from November to December.

This may be due to dump months. Dump months are a real phenomenom in the movie industry, shaped by consumer demands, where studios release their lower-quality movies which they don't want to get too much attention for. This is as August and September is the end of the holidays for North America, and students go back to school, and people have lesser time on average to watch movies. Hence, there is an increase in movie releases from August to October.

2nd graph (Successful movies):

Surprisingly, this graph is very different from the 1st graph. December and June are the months with the most movie releases, and April is the month with the least movie releases. There is an upward trend from January to March, dropping at April, then increasing from April to June, then going on a downward trend from June to Nov, then a huge spike at December.

The spike in June can be explained by June is the summer holidays, where many people are free and have time to relax, increasing their chances of watching movies, increasing movie revenue. December is the winter holidays. Also, the cutoff for many award seasons (Academy Awards, Golden Globes etc) is the end of the year, so studios release their biggest hits near the end of the year, hoping that recency bias would get their movie more awards.

Generally, one would want to release movies during June, July or in December to maximise revenue.

Next, we investigate the best weekday to release a movie.

In [206]:
fig = plt.figure(figsize=(15,7))

ax3 = fig.add_subplot(1, 2, 1)
ax3 = sns.barplot(x = all_day.index.values.flatten(), y = (all_day.values.flatten()/all_day.values.sum() * 100).round(decimals=2))
for i in ax3.containers:
    ax3.bar_label(i,)
ax3.set_title("(All movies) Percentage of movies released by weekday")
ax3.set_xlabel("Weekday")
ax3.set_ylabel("Percentage of movies released")

ax4 = fig.add_subplot(1,2,2)
ax4 = sns.barplot(x = successful_day.index.values.flatten(), y = (successful_day.values.flatten()/successful_day.values.sum() * 100).round(decimals=2))
for i in ax4.containers:
    ax4.bar_label(i,)
ax4.set_title("(Successful movies) Percentage of movies released by weekday")
ax4.set_xlabel("Weekday")
ax4.set_ylabel("Percentage of movies released")

fig.suptitle("Friday is clearly the best day to release a movie.")
plt.show()

We can see that the distribution of percentage of movies released by weekday is roughly the same in both successful movies and all movies. The graphs show a bell curve shape, with Friday by far the day with most number of movie releases with 38%, in both successful and all movies, followed by Thursday and Wednesday. The day with least overall and least successful movies released is Sunday, with 4.56% in all movies and 2.82% in successful movies.

The reason so many successful movies are released on Friday is that Friday is the day before the weekend, releasing it on Friday allows people to watch it in the weekend, and the weekend is a time when people have no work and are more free, increasing the chances of them watching a movie.

Hence, the ideal week of day to release a movie is Friday, and the least ideal day is Sunday.

In [209]:
fig = plt.figure(figsize=(20,5))

ax3 = fig.add_subplot(1, 2, 1)
ax3 = sns.barplot(x = all_date.index.values.flatten(), y = (all_date.values.flatten()/all_date.values.sum() * 100).round(decimals=2))
for i in ax3.containers:
    ax3.bar_label(i,)
ax3.set_title("(All movies) Percentage of movies released by date of month")
ax3.set_xlabel("Date of month")
ax3.set_ylabel("Percentage of movies released")

ax4 = fig.add_subplot(1,2,2)
ax4 = sns.barplot(x = successful_date.index.values.flatten(), y = (successful_date.values.flatten()/successful_date.values.sum() * 100).round(decimals=2))
for i in ax4.containers:
    ax4.bar_label(i,)
ax4.set_title("(Successful movies) Percentage of movies released by date of month")
ax4.set_xlabel("Date of month")
ax4.set_ylabel("Percentage of movies released")

plt.suptitle("Release your movie on the 1st or 25th of the month.")

plt.show()

We notice that the distribution of movie releases by date of month is different across all and successful movies.

For all movies, the 1st of month is by far the highest at 7.63% of all movies, and 31st of month is lowest at 1.72% (perhaps due to the fact that half of the months do not have the 31st), with the rest of the dates having a roughly equal distribution. The second lowest is 30th at 2.61%.

For successful movies, the 25th of month is the highest at 4.58%, with the 1st of month being second place at 4.51%. Again, the lowest is the 31st of month at 1.53%. The rest of the dates have a slight bell curve shape on the graph, but this could be due to the scale being 2x smaller than in all movies, so the rest of the dates also have roughly equal distribution.

Hence, you should release your movie on the 1st or 25th of the month.

Research Question 5: Does popularity correlate with revenue?¶

In this question, we investigate popularity, which is TMDB's popularity metric.

TMDB's popularity is a complex metric, consisting of (according to its website), number of votes for the day, number of views for the day, number of users who marked it as a "favourite" for the day, number of users who added it to their "watchlist" for the day, release date, number of total votes, previous days score. As we can see, it is a more nuanced metric and hence a better metric of popularity than IMDB rating.

Firstly, we investigate how popularity changes over years, then how popularity changes against revenue, by genre.

In [216]:
popularity("Revenue", "TMDB popularity", "There is a positive correlation between revenue and popularity.")

From these scatterplots of popularity of genres against revenue, we can see that there is a positive trend between movie revenue and popularity across most genres, although this trend seems to be more exponential instead of linear. This trend seems more pronounced in genres like Adventure, Drama, Fantasy, Animation and Sci-Fi.

In [218]:
reg_residual_subplot("Revenue", "Popularity", np.log10(popularity_revenue.revenue.astype(float)), popularity_revenue.popularity.astype(float), (), (0,500),"")
Pearson's Coefficient is 0.17577145298091046 P-value is 1.756177612781108e-191

From this plot, we observe a positive correlation between popularity against revenue. As the pearson's coefficient is 0.175, which is close to 0, the correlation is weak, as the P-value is very small, the certainty in the result is strong. From the residual plot, we observe that the variance of the residuals increases with revenue, so the correlation between popularity and revenue is less accurate as revenue increases.

Recommendations or Further Works¶

State any recommendations, improvements or further works.

In conclusion, the success of a movie is complex and there are many factors that affect movie revenue. These trends show a correlation and not necessarily causation, as there are many factors in play. However, it is still important to take note of these correlations when producing and releasing a movie.

Further investigation can be done on other factors like the diversity of the movie cast and how many prior awards the cast has won. Instead of having revenue as simply total global gross, it could be further broken down into revenue by country, which would improve analysis of movie factors and make analysis more specific to each country's culture. It could also be broken down into revenue by platform (eg. box office, streaming sites). Sentiment analysis could be done on reviews to better gauge how viewers felt about the movie, and the reasons why.

References¶

Cite any references made, and links where you obtained the data. You may wish to read about how to use markdown in Jupyter notebook to make your report easier to read. https://www.ibm.com/docs/en/db2-event-store/2.0.0?topic=notebooks-markdown-jupyter-cheatsheet
  • https://www.oscars.org/sites/oscars/files/93aa_rules.pdf
  • https://developers.themoviedb.org/3/getting-started/popularity
  • https://tvtropes.org/pmwiki/pmwiki.php/Main/DumpMonths
In [ ]: