Kaggle has a number of datasets that one can download and play with. I recently found a dataset that contains detailed script data for all episodes of The Simpsons up to the beginning of season 28. There are three files with information about the characters, the episodes, and all lines of dialog. In particular, the dialog file is quite detailed. Scenes are defined, and the speaker and timestamp are given for each spoken line. In my previous post, I looked at the spoken words of all the characters to determine who had the greatest vocabularies or most sophisticated speech. Here, I will use the scene descriptions to study the relationships between characters.
You can download the dataset here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-whitegrid')
characters = pd.read_csv('simpsons_characters.csv', index_col='id').sort_index()
#episodes = pd.read_csv('simpsons_episodes.csv', index_col='id').sort_index()
lines = pd.read_csv('simpsons_script_lines.csv', index_col='id').sort_index()
Character relationships
I will define an interaction between two characters as any scene where the two characters both have a line in that scene. The strength of a relationship will be determined by the number of scene interactions they have.
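Here is a minimal sketch of that counting idea, using made-up character IDs (the full pass over the script lines appears further below):
import itertools
from collections import defaultdict
#hypothetical scene: three characters (made-up IDs 1, 2, and 8) each speak at least once
scene_speakers = {1, 2, 8}
interaction_counts = defaultdict(int)
for a, b in itertools.combinations(sorted(scene_speakers), 2):
    interaction_counts[(a, b)] += 1   #every unordered pair shares one more scene
print(dict(interaction_counts))   #{(1, 2): 1, (1, 8): 1, (2, 8): 1}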
Over the time covered by the dataset, there are 6,722 characters! With tens of thousands of relationships among them, I will simplify the calculations by only considering the fifty characters with the greatest number of lines. This will significantly cut down on the complexity of the graphs below, but will still include all of the most well-known characters. Here are the top 50:
import math
import itertools
from collections import defaultdict
#use only 50 characters to prevent overly complicated graph
short_list = lines.groupby('character_id').count().sort_values(by='normalized_text', ascending=False)
short_list_ids = [int(id) for id in short_list.index[:50]]
short_list_names = characters.loc[short_list_ids, 'name'].reset_index()
l = list(short_list_names.name)
list1 = l[:25]
list2 = l[25:]
for l1, l2 in zip(list1, list2):
    print('{:<30}{}'.format(l1, l2))
To determine the relationships, I will go through the lines file one line at a time. Each time a new scene is detected (marked by a non-speaking line), every possible pair from the set of characters who spoke in the scene that just ended has one added to its interaction count.
Below are two histograms showing how many character pairs have a given number of interactions. On the left is the full histogram: a few relationships consist of thousands of scene interactions, but they are rare. The vast majority consist of no more than a few dozen, as the zoomed-in histogram on the right shows.
graph_data = defaultdict(int)
speakers = []

def count_scene_pairs():
    #add one interaction for every unordered pair of characters in the current scene
    for i, j in itertools.combinations(set(speakers), 2):
        if i < j:
            graph_data[(i, j)] += 1
        else:
            graph_data[(j, i)] += 1

def process_line(x=None):
    global speakers
    if x is None:
        #called once at the end to flush the final scene
        count_scene_pairs()
        return
    if x['speaking_line'] == False:
        #a non-speaking line marks a scene boundary: tally the scene that just ended
        count_scene_pairs()
        speakers = []
    elif not math.isnan(x.character_id) and int(x.character_id) in short_list_ids:
        speakers.append(int(x.character_id))
lines.apply(lambda x: process_line(x), axis=1)
process_line()
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].hist(list(graph_data.values()), bins=100)
ax[0].set_xlabel('Number of scenes together')
ax[0].set_ylabel('Number of character pairs')
ax[1].hist(list(graph_data.values()), bins=100, range=(0, 100))
ax[1].set_xlabel('Number of scenes together')
ax[1].set_ylabel('Number of character pairs');
Let's look at a few of the top relationships along with the number of scene interactions.
graph_data_names = {(short_list_names.loc[short_list_names.id == k[0], 'name'].max(),
                     short_list_names.loc[short_list_names.id == k[1], 'name'].max()): v
                    for (k, v) in graph_data.items()}
relationships = sorted(graph_data_names.items(), key=lambda x: x[1], reverse=True)
for r in relationships[:25]:
    print('{:<30s}{:<30s}{}'.format(r[0][0], r[0][1], r[1]))
As we saw in the previous post, the Simpsons family members dominate this list, so let's filter out the Simpsons and see what other relationships stand out.
non_simpsons_relationships = [r for r in relationships
                              if 'Simpson' not in r[0][0] and 'Simpson' not in r[0][1]]
for r in non_simpsons_relationships[:25]:
    print('{:<30s}{:<30s}{}'.format(r[0][0], r[0][1], r[1]))
Of course, Lenny and Carl are always seen together, and Smithers is always at Mr. Burns' side ready to serve. Moe knows his main customers, Lenny, Carl, Barney (and Homer), quite well.
The problem with this analysis is that some relationships look important simply because certain characters appear far more often than others. Instead, let's normalize the strength of each relationship by dividing the number of shared scenes by the geometric mean of the number of scenes each character appears in individually. This gives a number related to the fraction of the time the two characters appear together.
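In symbols, if $n_{ab}$ is the number of scenes that characters $a$ and $b$ share, and $N_a$ and $N_b$ are the number of scenes each appears in overall, the normalized strength computed below is

$$w_{ab} = \frac{n_{ab}}{\sqrt{N_a N_b}},$$

which equals 1 only for a pair of characters who appear exclusively together.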
#enumerate scenes by negating speaking_line and taking cumulative sum
lines['location_number'] = lines['speaking_line'].transform(lambda x: not x).cumsum()
#group by character and count number of unique scenes
scene_counts = lines.groupby('character_id')[['location_number']].nunique()
#create a dictionary to look up the number of distinct scenes each character appears in
character_scenes = {}
def add(x):
    character_scenes[x.character_id] = x.location_number
scene_counts.reset_index().apply(lambda x: add(x), axis=1);
#normalize relationship strength by the geometric mean of the two characters' scene counts
graph_data_normalized = {k: v/math.sqrt(character_scenes[k[0]]*character_scenes[k[1]])
                         for k, v in graph_data.items()}
relationships_normalized = sorted(graph_data_normalized.items(), key=lambda x: x[1], reverse=True)
relationships_normalized_names = [((short_list_names.loc[short_list_names.id == k[0][0], 'name'].max(),
                                    short_list_names.loc[short_list_names.id == k[0][1], 'name'].max()), k[1])
                                  for k in relationships_normalized]
for r in relationships_normalized_names[:35]:
    print('{:<30s}{:<30s}{}'.format(r[0][0], r[0][1], r[1]))
This may be a better way of looking at the strength of relationships. Before, we knew that Lenny and Carl appeared together many times; here, we see that they really are always together. Patty and Selma move above Smithers and Mr. Burns, showing that they appear together a greater percentage of the time. I wouldn't say there are any big surprises in the list, though. I mean, when do you not see Lenny and Carl, Patty and Selma, Smithers and Mr. Burns, or Jimbo, Kearney, and Dolph together? The only minor surprise to me was that Homer and Marge came in at number 5 on the list: although they obviously appear together quite often, Homer interacts with so many other characters in locations like the nuclear power plant and Moe's Tavern that his shared-scene fraction with Marge is diluted.
We can visualize the character relationships using graphs in a couple of ways. First, we'll use Python's networkx package and display the most important characters and relationships in a circular layout. Then, we'll see the data using d3.js's force layout. For both, weak relationships are filtered out to prevent a jumble of lines.
The following graph shows the unnormalized relationship data. Edge widths are proportional to relationship strength (total number of scenes two characters have appeared in together). Here, the thickest lines represent the most important relationships to the show overall.
#keep only pairs with more than 100 shared scenes to reduce graph complexity
graph_data_small = {k: v for k, v in graph_data.items() if v > 100}
group_1 = ['Homer Simpson', 'Marge Simpson', 'Bart Simpson', 'Lisa Simpson']
group_2 = ['Seymour Skinner', 'Milhouse Van Houten', 'Nelson Muntz', 'Edna Krabappel-Flanders', 'Martin Prince', 'Otto Mann',
           'Ralph Wiggum', 'Jimbo Jones', 'Kearney Zzyzwicz', 'DOLPH', 'Groundskeeper Willie', 'Gary Chalmers']
group_3 = ['Waylon Smithers', 'Lenny Leonard', 'Carl Carlson', 'C. Montgomery Burns']
# create nodes list
#nodes = pd.read_csv('simpsons_characters.csv')
short_nodes = characters[characters.index.isin(short_list_ids)]
nodes_list = []
def add_node(x):
    global nodes_list
    if int(x['id']) in short_list.index:
        if x['name'] in group_1:
            group = 1
        elif x['name'] in group_2:
            group = 2
        elif x['name'] in group_3:
            group = 3
        else:
            group = 4
        nodes_list.append({'group': group, 'name': x['name']})
short_nodes = short_nodes.reset_index()
short_nodes.apply(lambda x: add_node(x), axis=1);
links_list = []
for (i, j), strength in graph_data_small.items():
    links_list.append({'source': int(short_nodes.loc[short_nodes.id == i].index[0]),
                       'target': int(short_nodes.loc[short_nodes.id == j].index[0]),
                       'value': strength})
import json
#data for force layout
json_prep = {"nodes":nodes_list, "links":links_list}
json_dump = json.dumps(json_prep, indent=1, sort_keys=True)
filename_out = 'simpsons_graph.json'
with open(filename_out, 'w') as json_out:
    json_out.write(json_dump)
#data for normalized relationships force layout
links_list = []
for (i, j), strength in graph_data_normalized.items():
    if strength > 0.1:
        links_list.append({'source': int(short_nodes.loc[short_nodes.id == i].index[0]),
                           'target': int(short_nodes.loc[short_nodes.id == j].index[0]),
                           'value': 100*strength})
json_prep = {"nodes":nodes_list, "links":links_list}
json_dump = json.dumps(json_prep, indent=1, sort_keys=True)
filename_out = 'simpsons_graph_norm.json'
with open(filename_out, 'w') as json_out:
    json_out.write(json_dump)
import networkx as nx
g = nx.Graph()
g.add_weighted_edges_from([(short_nodes.loc[short_nodes.id == k[0], 'name'].max(),
                            short_nodes.loc[short_nodes.id == k[1], 'name'].max(),
                            v/200)
                           for k, v in graph_data.items() if v >= 150], weight='weight')
pos = nx.circular_layout(g)
fig, ax = plt.subplots(figsize=(16, 16))
weights = [g[u][v]['weight'] for u,v in g.edges()]
nx.draw_circular(g, with_labels=False, node_size=1500, width=weights, ax=ax)
for k, v in pos.items():
    #push each label slightly outside the circle of nodes
    theta = math.atan2(v[1], v[0])
    r = math.sqrt(v[0]**2 + v[1]**2)
    r_new = 1.10*r
    x = r_new * math.cos(theta)
    y = r_new * math.sin(theta)
    plt.text(x, y, s=k, size=12, horizontalalignment='center')
Now, let's look at a graph using the normalized relationship data. Here, edge thickness is proportional to the tendency of two characters to appear together. In other words, these are the relationships that would be important to the characters themselves.
g_norm = nx.Graph()
g_norm.add_weighted_edges_from([(short_nodes.loc[short_nodes.id == k[0], 'name'].max(),
                                 short_nodes.loc[short_nodes.id == k[1], 'name'].max(),
                                 10*v)
                                for k, v in graph_data_normalized.items() if v >= 0.1], weight='weight')
pos = nx.circular_layout(g_norm)
fig, ax = plt.subplots(figsize=(16, 16))
weights = [g_norm[u][v]['weight'] for u,v in g_norm.edges()]
nx.draw_circular(g_norm, with_labels=False, node_size=1500, width=weights, ax=ax)
for k, v in pos.items():
    theta = math.atan2(v[1], v[0])
    r = math.sqrt(v[0]**2 + v[1]**2)
    r_new = 1.15*r
    x = r_new * math.cos(theta)
    y = r_new * math.sin(theta)
    plt.text(x, y, s=k, size=12, horizontalalignment='center')
Next, I will use a force layout from d3.js. I modified the code from this example. These graphs are cool because they are interactive. Hover over the nodes to see the characters they represent, and click and drag nodes around if that makes relationships easier to see. Nodes are given a charge, causing them to repel each other, except where edges hold them together. You can define the desired distance for each edge; in this case, edge lengths are inversely proportional to relationship strength. If a node snaps back into place after being moved, a strong force is holding it there. Due to the filtering, some nodes appear unconnected.
I defined four groups, each shown in a different color: the Simpsons family, people from Springfield Elementary, people from the power plant, and everyone else. We will see whether those groups tend to congregate in the graph.
This time, let's use weights from the normalized relationships.
In both graphs, Simpsons family members (darker blue) end up in the center, as would be expected. School characters (light blue) are closer to Bart and Lisa in the top graph, but less so in the bottom one, although each time you load the page, the nodes end up in a different position.
Final Thoughts
This was my second post analyzing The Simpsons dialog data. In the first, I looked at vocabulary and sophistication of speech. There were some interesting surprises there. In this one, I would say that most of the conclusions were intuitive for any regular viewer of the show. Personally, however, it was fun to work on this because I learned more about visualizing graphs.
I really enjoyed working with this dataset, as The Simpsons is one of my favorite shows. I will probably work on other projects for a while, but I may revisit this dataset in the future. There are many other questions one could ask, such as:
1) How do character relationships change over the course of the 28 seasons?
2) How many locations does a character appear at, and which are his favorites?
3) How many new characters are introduced per episode/per season?
4) How do viewership and IMDb ratings change over time?