Kaggle hosts many datasets that anyone can download and explore. I recently found one that contains detailed script data for every episode of The Simpsons up to the beginning of season 28. It has three files, covering the characters, the episodes, and every line of dialog. The dialog file is especially detailed: scenes are delimited, and each spoken line comes with its speaker and a timestamp. In my previous post, I looked at the words spoken by each character to determine who had the largest vocabulary and the most sophisticated speech. Here, I will use the scene information to study the relationships between characters.

You can download the dataset here.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

#on matplotlib 3.6 and later, this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

characters = pd.read_csv('simpsons_characters.csv', index_col='id').sort_index()
#episodes = pd.read_csv('simpsons_episodes.csv', index_col='id').sort_index()
lines = pd.read_csv('simpsons_script_lines.csv', index_col='id').sort_index()

Character relationships

I will define an interaction between two characters as any scene in which both characters have a line. The strength of a relationship is then the number of scene interactions the two characters have.

Over the time covered by the dataset, there are 6,722 characters! With tens of thousands of relationships among them, I will simplify the calculations by only considering the fifty characters with the greatest number of lines. This will significantly cut down on the complexity of the graphs below, but will still include all of the most well-known characters. Here are the top 50:

In [2]:
import math
import itertools
from collections import defaultdict

#use only 50 characters to prevent overly complicated graph
short_list = lines.groupby('character_id').count().sort_values(by='normalized_text', ascending=False)
short_list_ids = [int(id) for id in short_list.index[:50]]

short_list_names = characters.loc[short_list_ids, 'name'].reset_index()

l = list(short_list_names.name)
list1 = l[:25]
list2 = l[25:]

for l1, l2 in zip(list1, list2):
    print('{:<30}{}'.format(l1, l2))
Homer Simpson                 Groundskeeper Willie
Marge Simpson                 Mayor Joe Quimby
Bart Simpson                  Ralph Wiggum
Lisa Simpson                  Patty Bouvier
C. Montgomery Burns           Comic Book Guy
Moe Szyslak                   Otto Mann
Seymour Skinner               Martin Prince
Ned Flanders                  Announcer
Grampa Simpson                Jimbo Jones
Milhouse Van Houten           Lou
Chief Wiggum                  Sideshow Mel
Krusty the Clown              Professor Jonathan Frink
Nelson Muntz                  Fat Tony
Lenny Leonard                 Kearney Zzyzwicz
Apu Nahasapeemapetilon        Agnes Skinner
Waylon Smithers               Kirk Van Houten
Kent Brockman                 Snake Jailbird
Carl Carlson                  Cletus Spuckler
Edna Krabappel-Flanders       Troy McClure
Dr. Julius Hibbert            DOLPH
Selma Bouvier                 Crowd
Barney Gumble                 Todd Flanders
Rev. Timothy Lovejoy          Lionel Hutz
Sideshow Bob                  Rainier Wolfcastle
Gary Chalmers                 Narrator

To determine the relationships, I go through the lines file one line at a time, collecting the characters who speak. Each time a new scene is detected (signaled by a non-speaking line), every possible pair of characters who spoke in the scene that just ended has its interaction count increased by one.
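
To make the counting rule concrete, here is a minimal, self-contained sketch on a made-up script. The toy `script` list and the `count_interactions` helper are hypothetical and exist only for illustration; the real analysis works on the `lines` DataFrame instead. A `None` speaker stands in for a non-speaking line that marks a scene boundary.

```python
from collections import Counter
from itertools import combinations

def count_interactions(script):
    """Count, for each unordered pair of speakers, the scenes they share.

    `script` is a list of speaker names in order; None marks a
    non-speaking line, which is treated as a scene boundary.
    """
    counts = Counter()
    speakers = set()
    # Append a final None so the last scene is counted too.
    for speaker in list(script) + [None]:
        if speaker is None:
            # sorted() gives each pair one canonical (a, b) key with a < b.
            for pair in combinations(sorted(speakers), 2):
                counts[pair] += 1
            speakers = set()
        else:
            speakers.add(speaker)
    return counts

# Hypothetical toy script: three scenes separated by None.
script = ["Homer", "Marge", "Homer", None,
          "Homer", "Moe", "Lenny", None,
          "Marge", "Homer"]
interactions = count_interactions(script)
for pair, n in sorted(interactions.items()):
    print(pair, n)
```

Because the speakers of a scene are kept in a set, a pair is counted once per scene no matter how many lines each character has in it.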

Below are two histograms showing how many character pairs share a given number of scene interactions. The full histogram on the left shows that a few relationships consist of thousands of scene interactions. The vast majority consist of no more than a few dozen, as the zoomed-in histogram on the right shows.

In [4]:
graph_data = defaultdict(int)
speakers = set()

def count_scene_pairs(scene_speakers):
    #add one interaction for every pair of characters who spoke in the scene
    for i, j in itertools.combinations(sorted(scene_speakers), 2):
        #sorted() guarantees i < j, so each pair has a single key
        graph_data[(i, j)] += 1

def process_line(x):
    if not x['speaking_line']:
        #a non-speaking line marks the start of a new scene
        count_scene_pairs(speakers)
        speakers.clear()
    elif not math.isnan(x.character_id) and int(x.character_id) in short_list_ids:
        speakers.add(int(x.character_id))

lines.apply(process_line, axis=1)
#count the final scene, which has no non-speaking line after it
count_scene_pairs(speakers)

fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].hist(list(graph_data.values()), bins=100)
ax[0].set_xlabel('Number of scenes together')
ax[0].set_ylabel('Number of character pairs')
ax[1].hist(list(graph_data.values()), bins=100, range=(0, 100))
ax[1].set_xlabel('Number of scenes together')
ax[1].set_ylabel('Number of character pairs');

Let's look at a few of the top relationships along with the number of scene interactions.

In [5]:
graph_data_names = {(short_list_names.loc[short_list_names.id == k[0], 'name'].max(),
                     short_list_names.loc[short_list_names.id == k[1], 'name'].max()):v for (k, v) in graph_data.items()}
relationships = sorted(graph_data_names.items(), key=lambda x: x[1], reverse=True)

for r in relationships[:25]:
    print('{:<30s}{:<30s}{}'.format(r[0][0], r[0][1], r[1]))
Marge Simpson                 Homer Simpson                 4053
Homer Simpson                 Bart Simpson                  2904
Homer Simpson                 Lisa Simpson                  2660
Bart Simpson                  Lisa Simpson                  2529
Marge Simpson                 Lisa Simpson                  2199
Marge Simpson                 Bart Simpson                  2183
Bart Simpson                  Milhouse Van Houten           642
Homer Simpson                 Moe Szyslak                   641
Homer Simpson                 Ned Flanders                  492
Homer Simpson                 Lenny Leonard                 474
Homer Simpson                 C. Montgomery Burns           454
Lenny Leonard                 Carl Carlson                  404
Homer Simpson                 Carl Carlson                  392
Homer Simpson                 Grampa Simpson                384
Waylon Smithers               C. Montgomery Burns           382
Seymour Skinner               Bart Simpson                  361
Homer Simpson                 Chief Wiggum                  342
Bart Simpson                  Nelson Muntz                  331
Marge Simpson                 Ned Flanders                  263
Bart Simpson                  Grampa Simpson                258
Seymour Skinner               Lisa Simpson                  256
Lisa Simpson                  Milhouse Van Houten           254
Marge Simpson                 Grampa Simpson                250
Marge Simpson                 Moe Szyslak                   240
Lisa Simpson                  Grampa Simpson                232

As we saw in the previous post, the Simpsons family members dominate this list, so let's filter out the Simpsons and see what other relationships stand out.

In [6]:
non_simpsons_relationships = [r for r in relationships if 'Simpson' not in r[0][0] and 'Simpson' not in r[0][1]]
for r in non_simpsons_relationships[:25]:
    print('{:<30s}{:<30s}{}'.format(r[0][0], r[0][1], r[1]))
Lenny Leonard                 Carl Carlson                  404
Waylon Smithers               C. Montgomery Burns           382
Moe Szyslak                   Lenny Leonard                 215
Milhouse Van Houten           Nelson Muntz                  187
Chief Wiggum                  Lou                           184
Moe Szyslak                   Carl Carlson                  167
Patty Bouvier                 Selma Bouvier                 154
Moe Szyslak                   Barney Gumble                 149
Seymour Skinner               Gary Chalmers                 144
Seymour Skinner               Nelson Muntz                  118
Seymour Skinner               Edna Krabappel-Flanders       117
Seymour Skinner               Groundskeeper Willie          103
Seymour Skinner               Milhouse Van Houten           102
C. Montgomery Burns           Lenny Leonard                 92
Jimbo Jones                   Kearney Zzyzwicz              91
Jimbo Jones                   DOLPH                         87
Todd Flanders                 Ned Flanders                  80
Martin Prince                 Nelson Muntz                  80
Ned Flanders                  Moe Szyslak                   80
Ned Flanders                  Rev. Timothy Lovejoy          77
Kearney Zzyzwicz              DOLPH                         74
Seymour Skinner               Agnes Skinner                 71
C. Montgomery Burns           Carl Carlson                  69
Milhouse Van Houten           Ralph Wiggum                  69
Milhouse Van Houten           Martin Prince                 68

Of course, Lenny and Carl are always seen together, and Smithers is always at Mr. Burns' side ready to serve. Moe knows his main customers, Lenny, Carl, Barney (and Homer), quite well.

The problem with this analysis is that some relationships look important simply because the characters involved appear far more often than others. Instead, let's normalize the strength of each relationship by dividing the number of shared scenes by the geometric mean of the number of scenes in which each of the two characters appears. This gives us a number related to the fraction of the time the two characters appear together.
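
Here is a quick worked example of this normalization, which (as the code comments in the next cell note) divides by the geometric mean of the two characters' scene counts. The helper name and the numbers are made up for illustration.

```python
import math

def normalized_strength(shared_scenes, scenes_a, scenes_b):
    """Shared scenes divided by the geometric mean of each character's scene count."""
    return shared_scenes / math.sqrt(scenes_a * scenes_b)

# Made-up numbers: A appears in 100 scenes, B in 25, and they share 20.
score = normalized_strength(20, 100, 25)
print(score)  # 20 / sqrt(2500) = 20 / 50 = 0.4
```

In this idealized setting the shared count can never exceed either character's own count, so the score stays between 0 and 1, and it reaches 1 only when the two characters appear in exactly the same scenes.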

In [47]:
#enumerate scenes by negating speaking_line and taking cumulative sum
lines['location_number'] = lines['speaking_line'].transform(lambda x: not x).cumsum()

#group by character and count number of unique scenes
scene_counts = lines.groupby('character_id')[['location_number']].nunique()

import math

#create a dictionary to look up the number of scenes each character appears in
character_scenes = scene_counts['location_number'].to_dict()

#normalize relationship strength by dividing by number of scenes
#actually, divide by the geometric mean of the two characters' number of scenes
graph_data_normalized = {k:v/math.sqrt((character_scenes[k[0]]*character_scenes[k[1]])) for k, v in graph_data.items()}

relationships_normalized = sorted(graph_data_normalized.items(), key=lambda x: x[1], reverse=True)

relationships_normalized_names = [((short_list_names.loc[short_list_names.id == k[0][0], 'name'].max(),
                     short_list_names.loc[short_list_names.id == k[0][1], 'name'].max()), k[1]) for k in relationships_normalized]

for r in relationships_normalized_names[:35]:
    print('{:<30s}{:<30s}{}'.format(r[0][0], r[0][1], r[1]))
Lenny Leonard                 Carl Carlson                  0.6351871499745055
Patty Bouvier                 Selma Bouvier                 0.5356529839748795
Waylon Smithers               C. Montgomery Burns           0.4950995179681764
Jimbo Jones                   DOLPH                         0.49265895172303803
Marge Simpson                 Homer Simpson                 0.4750467517302158
Kearney Zzyzwicz              DOLPH                         0.4584438815757194
Jimbo Jones                   Kearney Zzyzwicz              0.43097963387009636
Chief Wiggum                  Lou                           0.4170545370435428
Bart Simpson                  Lisa Simpson                  0.415310136182364
Marge Simpson                 Lisa Simpson                  0.3613365226033065
Homer Simpson                 Bart Simpson                  0.3401679800702293
Homer Simpson                 Lisa Simpson                  0.33827454171457044
Marge Simpson                 Bart Simpson                  0.330407204490586
Seymour Skinner               Gary Chalmers                 0.29961013628753247
Bart Simpson                  Milhouse Van Houten           0.246988084265318
Moe Szyslak                   Lenny Leonard                 0.23368267629860418
Todd Flanders                 Ned Flanders                  0.2325306311157376
Milhouse Van Houten           Nelson Muntz                  0.23103998297006587
Moe Szyslak                   Barney Gumble                 0.22132515495122104
Martin Prince                 Nelson Muntz                  0.20568611284392388
Moe Szyslak                   Carl Carlson                  0.20147889884435466
Seymour Skinner               Edna Krabappel-Flanders       0.18745812008219764
Seymour Skinner               Groundskeeper Willie          0.17797017752190164
Homer Simpson                 Moe Szyslak                   0.176310767385596
Seymour Skinner               Agnes Skinner                 0.1759939951496216
Homer Simpson                 Lenny Leonard                 0.1699050897069899
Bart Simpson                  Nelson Muntz                  0.16079235663200667
Homer Simpson                 Carl Carlson                  0.15596931103910494
Homer Simpson                 Ned Flanders                  0.15349263593397242
Ned Flanders                  Rev. Timothy Lovejoy          0.14465917274517306
Seymour Skinner               Nelson Muntz                  0.14383293513940035
Milhouse Van Houten           Martin Prince                 0.13846099982517154
Seymour Skinner               Bart Simpson                  0.13701842585445598
Homer Simpson                 C. Montgomery Burns           0.12926210002750535
Milhouse Van Houten           Kirk Van Houten               0.1259768027819117

This may be a better way of looking at the strength of relationships. Before, we knew that Lenny and Carl appeared together many times, but here, we see that they really are always together. Patty and Selma move above Smithers and Mr. Burns, showing that they appear together a greater percentage of the time. I wouldn't say there are any big surprises in the list, though. I mean, when do you not see Lenny and Carl, Patty and Selma, Smithers and Mr. Burns, or Jimbo, Kearney, and Dolph together? The only minor surprise to me was that Homer and Marge came in at number 5 on the list. Although they obviously appear together quite often, Homer interacts with so many other characters in other locations such as the nuclear power plant and Moe's Tavern.
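
The normalization above relies on numbering scenes by taking a cumulative sum over the negated `speaking_line` flag, then counting the distinct scene numbers per character. A plain-Python sketch of the same idea, using `itertools.accumulate` rather than pandas and entirely made-up flags and speakers:

```python
from itertools import accumulate

# Made-up flags: True for a spoken line, False for a scene-setting line.
speaking_line = [False, True, True, False, True, False, True, True]
# Made-up speakers aligned with the flags (None on non-speaking lines).
speakers = [None, "Homer", "Marge", None, "Homer", None, "Homer", "Homer"]

# Each non-speaking line contributes 1, so the running total increases
# exactly at scene boundaries and stays constant within a scene.
scene_number = list(accumulate(int(not flag) for flag in speaking_line))
print(scene_number)  # [1, 1, 1, 2, 2, 3, 3, 3]

# Collect the distinct scenes in which each character speaks.
scenes_per_character = {}
for who, scene in zip(speakers, scene_number):
    if who is not None:
        scenes_per_character.setdefault(who, set()).add(scene)

scene_counts = {who: len(s) for who, s in scenes_per_character.items()}
print(scene_counts)  # {'Homer': 3, 'Marge': 1}
```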

We can visualize the character relationships using graphs in a couple of ways. First, we'll use Python's networkx package and display the most important characters and relationships in a circular layout. Then, we'll see the data using d3.js's force layout. For both, weak relationships are filtered out to prevent a jumble of lines.

The following graph shows the unnormalized relationship data. Edge widths are proportional to relationship strength (total number of scenes two characters have appeared in together). Here, the thickest lines represent the most important relationships to the show overall.

In [42]:
#keep only pairs with more than 100 shared scenes to lower complexity of graph
graph_data_small = {k:v for k, v in graph_data.items() if v > 100}

group_1 = ['Homer Simpson', 'Marge Simpson', 'Bart Simpson', 'Lisa Simpson']
group_2 = ['Seymour Skinner', 'Milhouse Van Houten', 'Nelson Muntz', 'Edna Krabappel-Flanders', 'Martin Prince', 'Otto Mann',
          'Ralph Wiggum', 'Jimbo Jones', 'Kearney Zzyzwicz', 'DOLPH', 'Groundskeeper Willie', 'Gary Chalmers']
group_3 = ['Waylon Smithers', 'Lenny Leonard', 'Carl Carlson', 'C. Montgomery Burns']

# create nodes list
#nodes = pd.read_csv('simpsons_characters.csv')
short_nodes = characters[characters.index.isin(short_list_ids)]
nodes_list = []

def add_node(x):
    global nodes_list
    if int(x['id']) in short_list_ids:
        if x['name'] in group_1:
            group = 1
        elif x['name'] in group_2:
            group = 2
        elif x['name'] in group_3:
            group = 3
        else:
            group = 4
        nodes_list.append({'group':group, 'name':x['name']})

short_nodes = short_nodes.reset_index()
short_nodes.apply(lambda x: add_node(x), axis=1);

links_list = []
for (i, j), strength in graph_data_small.items():
    links_list.append({'source':int(short_nodes.loc[short_nodes.id==i].index[0]),
                       'target':int(short_nodes.loc[short_nodes.id==j].index[0]),
                       'value':strength})

import json

#data for force layout
json_prep = {"nodes":nodes_list, "links":links_list}
json_dump = json.dumps(json_prep, indent=1, sort_keys=True)
filename_out = 'simpsons_graph.json'
with open(filename_out, 'w') as json_out:
    json_out.write(json_dump)

#data for normalized relationships force layout
links_list = []
for (i, j), strength in graph_data_normalized.items():
    if strength > 0.1:
        links_list.append({'source':int(short_nodes.loc[short_nodes.id==i].index[0]),
                           'target':int(short_nodes.loc[short_nodes.id==j].index[0]),
                           'value':100*strength})
json_prep = {"nodes":nodes_list, "links":links_list}
json_dump = json.dumps(json_prep, indent=1, sort_keys=True)
filename_out = 'simpsons_graph_norm.json'
with open(filename_out, 'w') as json_out:
    json_out.write(json_dump)

import networkx as nx

g = nx.Graph()

g.add_weighted_edges_from([(short_nodes.loc[short_nodes.id == k[0], 'name'].max(), short_nodes.loc[short_nodes.id == k[1], 'name'].max(), v/200) for k, v in graph_data.items() if v >= 150], weight='weight')

pos = nx.circular_layout(g)
fig, ax = plt.subplots(figsize=(16, 16))
weights = [g[u][v]['weight'] for u,v in g.edges()]
nx.draw_circular(g, with_labels=False, node_size=1500, width=weights, ax=ax)
for k, v in pos.items():
    theta = math.atan2(v[1], v[0])
    r = math.sqrt(v[0]**2 + v[1]**2)
    r_new = 1.10*r
    x = r_new * math.cos(theta)
    y = r_new * math.sin(theta)
    plt.text(x, y, s=k, size=12, horizontalalignment='center')
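
A small aside on the label-placement loop above: converting each node position to polar coordinates, scaling the radius, and converting back gives the same result as multiplying the x and y coordinates by the same factor, assuming the layout is centered at the origin. The stdlib-only sketch below checks this on points placed evenly on a unit circle. The `circular_positions` helper is a hypothetical stand-in for a circular layout, not the networkx function.

```python
import math

def circular_positions(names):
    """Place names evenly on the unit circle (stand-in for a circular layout)."""
    n = len(names)
    return {name: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i, name in enumerate(names)}

def label_position_polar(x, y, scale=1.10):
    """Push a point outward from the origin via polar coordinates."""
    theta = math.atan2(y, x)
    r = math.hypot(x, y)
    return scale * r * math.cos(theta), scale * r * math.sin(theta)

def label_position_scaled(x, y, scale=1.10):
    """Same result, obtained by scaling the Cartesian coordinates directly."""
    return scale * x, scale * y

pos = circular_positions(["Homer", "Marge", "Bart", "Lisa", "Moe"])
for name, (x, y) in pos.items():
    px, py = label_position_polar(x, y)
    sx, sy = label_position_scaled(x, y)
    # The two approaches agree up to floating-point rounding.
    assert math.isclose(px, sx, abs_tol=1e-12)
    assert math.isclose(py, sy, abs_tol=1e-12)
    print(f"{name:<6} label at ({sx:+.3f}, {sy:+.3f})")
```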

Now, let's look at a graph using the normalized relationship data. Here, edge thickness is proportional to the tendency of two characters to appear together. In other words, these are the relationships that would be important to the characters themselves.

In [43]:
g_norm = nx.Graph()

g_norm.add_weighted_edges_from([(short_nodes.loc[short_nodes.id == k[0], 'name'].max(), short_nodes.loc[short_nodes.id == k[1], 'name'].max(), 10*v) for k, v in graph_data_normalized.items() if v >= 0.1], weight='weight')

pos = nx.circular_layout(g_norm)
fig, ax = plt.subplots(figsize=(16, 16))
weights = [g_norm[u][v]['weight'] for u,v in g_norm.edges()]
nx.draw_circular(g_norm, with_labels=False, node_size=1500, width=weights, ax=ax)
for k, v in pos.items():
    theta = math.atan2(v[1], v[0])
    r = math.sqrt(v[0]**2 + v[1]**2)
    r_new = 1.15*r
    x = r_new * math.cos(theta)
    y = r_new * math.sin(theta)
    plt.text(x, y, s=k, size=12, horizontalalignment='center')

Next, I will use a force layout from d3.js. I modified the code from this example. These layouts are cool because they are interactive. Mouse over the nodes to see the characters they represent. You can also click and drag nodes around if that makes the relationships easier to see. Nodes are given a charge, which makes them repel each other, while the edges hold connected nodes together. You can define the desired length of each edge, and in this case, edge lengths are inversely proportional to relationship strength. If a node snaps back into place after being moved, a strong force is holding it there. Because of the filtering, some nodes appear unconnected.

I defined groups consisting of the Simpsons family, people from Springfield Elementary, people from the power plant, and everyone else with different colors. We will see if those groups tend to congregate in the graph.
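
For reference, the JSON written earlier for the force layout has a `nodes` list and a `links` list, where each link's `source` and `target` are positions in the `nodes` list and each node carries its `group`. Here is a toy sketch of that structure. The strengths and the `index_of` helper are made up for illustration.

```python
import json

# Toy nodes; 'group' identifies which of the four groups a character is in.
nodes = [
    {"name": "Homer Simpson", "group": 1},
    {"name": "Bart Simpson", "group": 1},
    {"name": "Milhouse Van Houten", "group": 2},
]
# Map each name to its position in the nodes list.
index_of = {node["name"]: i for i, node in enumerate(nodes)}

# Made-up relationship strengths keyed by character-name pairs.
strengths = {("Homer Simpson", "Bart Simpson"): 34.0,
             ("Bart Simpson", "Milhouse Van Houten"): 24.7}

# Links refer to nodes by position, not by name.
links = [{"source": index_of[a], "target": index_of[b], "value": v}
         for (a, b), v in strengths.items()]

graph_json = json.dumps({"nodes": nodes, "links": links}, indent=1, sort_keys=True)
print(graph_json)
```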

This time, let's use weights from the normalized relationships.

In both graphs, Simpsons family members (darker blue) end up in the center, as would be expected. School characters (light blue) are closer to Bart and Lisa in the top graph, but less so in the bottom one, although each time you load the page, the nodes end up in a different position.

Final Thoughts

This was my second post analyzing The Simpsons dialog data. In the first, I looked at vocabulary and sophistication of speech. There were some interesting surprises there. In this one, I would say that most of the conclusions were intuitive for any regular viewer of the show. Personally, however, it was fun to work on this because I learned more about visualizing graphs.

I really enjoyed working with this dataset, as The Simpsons is one of my favorite shows. I will probably work on some other projects for a while, but maybe I will revisit this dataset in the future. There are many other questions one could ask, such as:

1) How do character relationships change over the course of the 28 seasons?
2) In how many locations does each character appear, and which are their favorites?
3) How many new characters are introduced per episode/per season?
4) How do viewership and IMDb ratings change over time?

