Kaggle has a number of datasets that one can download and play with. I was pleasantly surprised to find one that contains detailed script data for all episodes of The Simpsons up to the beginning of season 28. There are three files with information about the characters, the episodes, and all lines of dialog. The dialog file in particular is quite detailed: scenes are defined, and the speaker and timestamp are given for each spoken line. This is much more detailed than other datasets I have looked at in the past for other shows, some of which have all spoken words but no speaker attached, making analysis much more difficult.

There is a lot that can be done with this dataset, but in this post, I'd like to look at the speech of the various characters and determine whether there are significant differences in their vocabularies or sophistication of their speech.

You can download the dataset here.

In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-whitegrid')

characters = pd.read_csv('simpsons_characters.csv', index_col='id').sort_index()
#episodes = pd.read_csv('simpsons_episodes.csv', index_col='id').sort_index()
lines = pd.read_csv('simpsons_script_lines.csv', index_col='id').sort_index()

First, let's see what we can get from the characters file.

In [32]:
characters.head()
Out[32]:
name normalized_name gender
id
1 Marge Simpson marge simpson f
2 Homer Simpson homer simpson m
3 Seymour Skinner seymour skinner m
4 JANEY janey f
5 Todd Flanders todd flanders m

A pretty simple file that gives us character names and genders. It turns out that there are an astounding 6,722 characters! Of course, most of these are minor characters who appear only briefly. Looking more closely at the file, you will also notice many variant characters based on the main characters. For example, here are some interesting versions of Homer:

In [34]:
#characters[characters['normalized_name'].str.contains('homer')][:30]
characters.loc[[125, 625, 1011, 1085, 2408], ['name']]
Out[34]:
name
id
125 Homer's Canyon Echo
625 Homer-Ape
1011 Homer's Bloody Skull
1085 Evil Homer
2408 Homer's Stomach

For this study, I am only going to look at the top 30-40 characters in terms of number of spoken lines or words, and I will only consider words spoken by the main character and not any variant characters.

Dialog

The simpsons_script_lines.csv file contains all spoken lines and scenes from every episode. Here are the first few lines (not all columns are shown).

In [35]:
lines.drop(['number', 'timestamp_in_ms', 'raw_text', 'normalized_text'], axis=1).head()
Out[35]:
episode_id speaking_line character_id location_id raw_character_text raw_location_text spoken_words word_count
id
1 1 False NaN 1.0 NaN Street NaN NaN
2 1 False NaN 2.0 NaN Car NaN NaN
3 1 True 1.0 2.0 Marge Simpson Car Ooo, careful, Homer. 3.0
4 1 True 2.0 2.0 Homer Simpson Car There's no time to be careful. 6.0
5 1 True 2.0 2.0 Homer Simpson Car We're late. 2.0

In this file, non-speaking lines establish scene locations, and each is followed by the spoken lines of the characters in that scene. Timestamps are provided, along with the character name, the spoken words, and a word count. A normalized_text column is also provided, in which all punctuation has been stripped from the spoken words and all words are lowercase.
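To see what the normalization looks like, you can compare the raw and normalized text for a single line (a quick check rather than one of the original cells):

lines.loc[3, ['spoken_words', 'normalized_text']]

For line 3, Marge's "Ooo, careful, Homer." should come out as something like "ooo careful homer". With that in mind, let's determine the most prolific speakers.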

In [36]:
lines.drop(['number', 'timestamp_in_ms', 'raw_text'], axis=1).groupby('raw_character_text').count().sort_values(by='normalized_text', ascending=False)[:10]
Out[36]:
episode_id speaking_line character_id location_id raw_location_text spoken_words normalized_text word_count
raw_character_text
Homer Simpson 29850 29850 29850 29780 29780 27920 27917 27920
Marge Simpson 14165 14165 14165 14141 14141 13199 13197 13199
Bart Simpson 13780 13780 13780 13766 13766 13016 13014 13016
Lisa Simpson 11505 11505 11505 11488 11488 10772 10770 10772
C. Montgomery Burns 3172 3172 3172 3172 3172 3087 3086 3087
Moe Szyslak 2864 2864 2864 2862 2862 2810 2810 2810
Seymour Skinner 2443 2443 2443 2438 2438 2390 2390 2390
Ned Flanders 2146 2146 2146 2142 2142 2058 2057 2058
Grampa Simpson 1886 1886 1886 1881 1881 1807 1807 1807
Chief Wiggum 1836 1836 1836 1836 1836 1796 1796 1796

There is an interesting discrepancy depending on which column we use for counting. For Homer, for example, you get anywhere from 27917 lines up to 29850. The discrepancy must result from some NA values in certain columns.

In [37]:
lines.loc[(lines.raw_character_text == 'Homer Simpson') & (lines.word_count.isnull()), ['raw_text', 'raw_character_text', 'spoken_words']][:5]
Out[37]:
raw_text raw_character_text spoken_words
id
123 Homer Simpson: (SHRIEKS) Homer Simpson NaN
187 Homer Simpson: (SHUDDERS) Homer Simpson NaN
324 Homer Simpson: (GROANS LIKE HE'S HAVING A HEAR... Homer Simpson NaN
328 Homer Simpson: (GROAN) Homer Simpson NaN
612 Homer Simpson: (GRUNTS) Homer Simpson NaN

Now we see where the difference comes from. Homer has a lot of lines that are just grunts and groans with no spoken words! Let's go ahead and count those anyway.

In [38]:
fig, ax = plt.subplots(figsize=(12, 8))
lines.groupby('raw_character_text')['raw_text'].count().sort_values(ascending=False)[:40].plot(kind='bar', ax=ax)
ax.set_title('Number of lines for top 40 characters')
ax.set_xlabel('');

With nearly 30,000 lines, Homer is by far the most prolific speaker. He is followed by his three immediate family members, all of whom have many times as many lines as any other character. If we look at number of words (no grunts allowed), we see a similar picture.

In [39]:
fig, ax = plt.subplots(figsize=(12, 8))
lines.groupby('raw_character_text')['word_count'].sum().sort_values(ascending=False)[:40].plot(kind='bar', ax=ax)
ax.set_title('Number of words for top 40 characters')
ax.set_xlabel('');

Either way you look at it, Homer absolutely dominates the dialog in The Simpsons. All four of the main family characters are well above Mr. Burns, who comes in at fifth place. In fact, the main family speaks 604,467 words, which accounts for more than 46% of the total 1.3M words!
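For reference, here is one way to check those numbers directly from the lines table (a quick sanity check rather than one of the original cells; it assumes the four names match raw_character_text exactly, as in the tables above):

family = ['Homer Simpson', 'Marge Simpson', 'Bart Simpson', 'Lisa Simpson']
family_words = lines.loc[lines['raw_character_text'].isin(family), 'word_count'].sum()
total_words = lines['word_count'].sum()
print(family_words, family_words / total_words)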

Sophistication of Speech

Any long-time viewer of The Simpsons knows how central Homer is to the show. The more interesting question to me is whether there are big differences in the sophistication of speech or vocabulary size between the characters. Many measures have been devised to estimate the difficulty of reading text. Two of the best known are the Flesch Reading Ease and the Flesch-Kincaid Grade Level. Although these were designed for written text, I am not worried about their absolute accuracy for spoken dialog here; rather, I want to see how the characters compare to each other.
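For reference, the standard formulas (the same ones used in the code below) are:

Flesch Reading Ease = 206.835 - 1.015*(total words / total sentences) - 84.6*(total syllables / total words)
Flesch-Kincaid Grade Level = 0.39*(total words / total sentences) + 11.8*(total syllables / total words) - 15.59

A higher Reading Ease score means easier text, while the grade level roughly corresponds to the U.S. school grade needed to understand it.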

These measures require counts of words, sentences, and syllables. Word counts are provided, and sentence counts are easy to obtain. For syllable counts, I'm using a well-known method that counts the vowel sounds from the Carnegie Mellon Pronouncing Dictionary. Words not found in the dictionary are given 1 syllable, which will sometimes be incorrect, but I don't expect it to affect the results significantly.

In [41]:
from nltk.corpus import cmudict
prondict = cmudict.dict()

def numsyllables(word):
    # Count vowel sounds: in the CMU dictionary, vowel phonemes carry a stress
    # digit (e.g. 'AH0'), so counting phonemes that contain a digit counts syllables.
    try:
        pron = prondict[word][0]
        return len([s for s in pron if any(char.isdigit() for char in s)])
    except KeyError:
        # Words missing from the dictionary are assumed to have one syllable.
        return 1

def total_sylls(x):
    return sum(numsyllables(word) for word in x['normalized_text'].split(' '))

lines['syllable count'] = lines.dropna(subset=['normalized_text']).apply(total_sylls, axis=1)
lines['sentence count'] = lines.dropna(subset=['normalized_text'])['spoken_words'].str.count(r'\.')

counts = lines.groupby('raw_character_text')[['word_count', 'sentence count', 'syllable count']].sum()
counts = counts.sort_values(by='word_count', ascending=False)

counts['Flesch readability'] = 206.835 - 1.015*counts['word_count']/counts['sentence count'] - 84.6*counts['syllable count']/counts['word_count']
counts['Flesch-Kincaid grade'] = 0.39*counts['word_count']/counts['sentence count'] + 11.8*counts['syllable count']/counts['word_count'] - 15.59

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
counts['Flesch readability'][:30].sort_values().plot(kind='bar', ax=ax[0])
ax[0].set_title('Flesch readability')
ax[0].set_xlabel('')
counts['Flesch-Kincaid grade'][:30].sort_values().plot(kind='bar', ax=ax[1])
ax[1].set_title('Flesch-Kincaid grade level')
ax[1].set_xlabel('');

Here, I have plotted the Flesch readability and the Flesch-Kincaid grade level for the 30 characters with the most total words spoken. The two measures have an inverse relationship: a higher readability score should correspond to a lower grade level. All characters would be rated as highly readable. By either measure, three characters stand apart from the rest - Kent Brockman, the Announcer, and Mayor Joe Quimby. We see a bit more variability in the grade level. While most characters fall somewhere in the 1st-3rd grade range, Kent Brockman reaches almost a 5th-grade level. A few other characters are above the 3rd-grade level, including the Announcer, Mayor Joe Quimby, Gary Chalmers, Dr. Hibbert, Principal Skinner, Prof. Frink, and Comic Book Guy. Within the Simpson family, Homer and Bart come in at the lower end of the range, while Marge and Lisa fare a bit better. Nelson Muntz comes in last among the top 30 most-heard characters. No surprise there.

Vocabulary

Now we will look at the vocabularies of the different characters. A bit of processing is required. First, the lines data is grouped by character, and for each character all of their spoken lines are joined into one long string. That string is tokenized into a list of individual words, and English stop words are removed to simplify the calculations later. From this we get the total words spoken, the number of unique words, and the vocabulary size as a fraction of total words.

The following table shows the top 10 characters for total words.

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

all_speech = lines.dropna(subset=['normalized_text']).groupby('raw_character_text')['spoken_words'].agg(' '.join)

stop_words = set(stopwords.words('english'))

def proc(x):
    tokenizer = RegexpTokenizer(r'\w+')
    # Note: the stop-word check happens before lowercasing, so capitalized stop
    # words at the start of a sentence (e.g. 'But', 'I') slip through the filter.
    return [word.lower() for word in tokenizer.tokenize(x) if word not in stop_words]

all_speech = pd.DataFrame(all_speech.apply(lambda x: proc(x)))

all_speech['total words'] = all_speech['spoken_words'].apply(len)
all_speech['vocab size'] = all_speech['spoken_words'].apply(lambda x: len(set(x)))
all_speech['vocab:total ratio'] = all_speech['vocab size'] / all_speech['total words']

all_speech.sort_values(by='total words', ascending=False, inplace=True)

all_speech.head(10)
Out[2]:
spoken_words total words vocab size vocab:total ratio
raw_character_text
Homer Simpson [there, time, careful, we, late, hey, norman, ... 175490 15435 0.087954
Marge Simpson [ooo, careful, homer, sorry, excuse, us, pardo... 76699 9602 0.125191
Bart Simpson [jingle, bells, batman, smells, robin, laid, a... 71209 9515 0.133621
Lisa Simpson [but, i, really, want, pony, i, really, really... 63080 10165 0.161145
C. Montgomery Burns [hello, i, proud, announce, able, increase, sa... 23117 6155 0.266254
Moe Szyslak [what, matter, homer, somebody, leave, lumpa, ... 21254 4716 0.221888
Seymour Skinner [wasn, wonderful, and, santas, many, lands, pr... 17804 4932 0.277016
Ned Flanders [just, hold, horses, son, hey, hey, simpson, d... 14964 4062 0.271451
Krusty the Clown [kill, well, right, go, ahead, there, someone,... 13588 3899 0.286944
Chief Wiggum [well, secret, city, siege, graffiti, vandal, ... 12693 3483 0.274403

The list looks a bit different if we sort by vocab size.

In [31]:
all_speech.sort_values(by='vocab size', ascending=False).head(10)
Out[31]:
spoken_words total words vocab size vocab:total ratio
raw_character_text
Homer Simpson [there, time, careful, we, late, hey, norman, ... 175490 15435 0.087954
Lisa Simpson [but, i, really, want, pony, i, really, really... 63080 10165 0.161145
Marge Simpson [ooo, careful, homer, sorry, excuse, us, pardo... 76699 9602 0.125191
Bart Simpson [jingle, bells, batman, smells, robin, laid, a... 71209 9515 0.133621
C. Montgomery Burns [hello, i, proud, announce, able, increase, sa... 23117 6155 0.266254
Seymour Skinner [wasn, wonderful, and, santas, many, lands, pr... 17804 4932 0.277016
Moe Szyslak [what, matter, homer, somebody, leave, lumpa, ... 21254 4716 0.221888
Kent Brockman [good, evening, springfield, krusty, clown, be... 10232 4117 0.402365
Ned Flanders [just, hold, horses, son, hey, hey, simpson, d... 14964 4062 0.271451
Krusty the Clown [kill, well, right, go, ahead, there, someone,... 13588 3899 0.286944

Homer has, by far, the greatest number of unique words, but mostly because he dominates the dialog of The Simpsons.

This plot shows vocab size vs. total words.

In [45]:
fig, ax = plt.subplots()
ax.scatter(all_speech['total words'], all_speech['vocab size'])
ax.set_xlabel('Total Words')
ax.set_ylabel('Vocab Size');

There is a definite correlation between total words spoken and unique vocabulary. Homer, of course, is the point in the upper right corner of the plot.
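To put a number on that correlation (a quick check rather than one of the original cells):

# Pearson correlation between total words spoken and vocabulary size;
# given the heavy skew, a rank or log-log correlation might be more appropriate.
all_speech[['total words', 'vocab size']].corr()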

The following plot shows the vocab:total ratio vs. total words for characters with more than 500 total words spoken.

In [46]:
fig, ax = plt.subplots()
ax.scatter(all_speech.loc[all_speech['total words']>500, 'total words'],
            all_speech.loc[all_speech['total words']>500, 'vocab:total ratio'])
ax.set_xlabel('Total Words')
ax.set_ylabel('Vocab:Total Ratio');

The unique vocab to total words ratio would be expected to decline with more words because additional words are more likely to have been used before.

Homer has the greatest vocabulary, mostly because he speaks so much more than any other character. Given his Flesch-Kincaid score, however, it wouldn't seem right to crown him the vocab king. Perhaps a better way to measure vocabulary as it relates to sophistication of speech is to take into account the number of words spoken. Instead of summarizing the size of each character's vocabulary with a single number, let's plot a curve showing how many unique words have been used vs. the total number of words spoken. I remove English stop words and cap the calculation at 20,000 total words, both because most characters don't surpass that number and because the set calculations for much longer lists were taking a long time.
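As an aside, the cumulative unique-word curve can also be built in a single pass with a running set, which avoids rebuilding a set for every prefix. This is just a sketch of an alternative (vocab_curve is a hypothetical helper, not the code used in the next cell, which keeps the original prefix-set approach):

def vocab_curve(words, limit=20000):
    # Running count of unique words seen after each of the first `limit` words.
    seen = set()
    curve = []
    for w in words[:limit]:
        seen.add(w)
        curve.append(len(seen))
    return curve

With something like this, the 20,000-word cap becomes mostly a readability choice rather than a performance one.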

In [26]:
# Cumulative unique-word counts for the top 30 characters (by total words spoken)
all_speech['new vocab curve'] = all_speech['spoken_words'].iloc[:30].apply(lambda x: [len(set(x[:i])) for i in range(min(len(x), 20000))])

colors = plt.cm.nipy_spectral(np.linspace(0, 1, 10))

fig, ax = plt.subplots(figsize=(12, 12))
ax.set_xlabel('Total Words')
ax.set_ylabel('Unique Words')
for i in range(30):
    plt.plot(all_speech['new vocab curve'].iloc[i], color=colors[i%10], label=str(all_speech.index[i]));
legend = ax.legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')

In the early range, up to ~5,000 words, there appear to be three groups. In the lowest group we find some of the most important characters, including Homer, Marge, Bart, and Moe. I am not surprised that Homer has the shallowest vocab curve, but I am a bit surprised that Marge runs neck-and-neck with him most of the way; I expected her to be significantly ahead. Another surprise was that Moe begins to overtake every main Simpson family member except Lisa. Bart is also well ahead of Homer and Marge, but not Lisa, of course. Lisa falls in the middle group early on but ends with a respectable vocabulary. Within the middle group, Mr. Burns and Principal Skinner are at the top, and at the 17,500-word mark they easily have the greatest vocabularies. The high fliers early on are the Announcer, Kent Brockman, Comic Book Guy, Prof. Frink, and Mayor Joe Quimby. All except Kent Brockman have far fewer total words than the other characters, but they add new words at the greatest pace. Kent Brockman really stands out, adding new words at a faster rate all the way up to the 10,000-word mark. We saw Kent Brockman, the Announcer, and Mayor Quimby before, in the Flesch measures, so it is no surprise to see them in the highest group here. Prof. Frink doesn't surprise me either, but Comic Book Guy sure does!

I think this curve is a better way of measuring sophistication of speech because it lets you compare characters on an equal-word basis: the slope of the curve represents the rate at which a character introduces new words.

This graph only covers up to 20,000 words per character. For the main Simpsons family members, this only accounts for the early seasons. Over the course of a show that has been on air for this long, however, characters can develop, and the writing can change. It would be interesting to check if Homer is getting any smarter over time. Let's break down Homer's 175,000 words into 10 groups of 17,500 and plot them.

In [27]:
# Cumulative unique-word curves for 10 consecutive blocks of 17,500 words each
Homer_words = []
for j in range(10):
    Homer_words.append([len(set(all_speech.loc['Homer Simpson', 'spoken_words'][17500*j:17500*j+i])) for i in range(17500)])

fig, ax = plt.subplots(figsize=(8, 8))
ax.set_xlabel('Total Words')
ax.set_ylabel('Unique Words')
for i in range(len(Homer_words)):
    plt.plot(Homer_words[i], color=colors[i%10], label='Homer group '+str(i+1));
legend = ax.legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')

This is actually quite interesting because there is a definite difference between groups, and there is a pretty clear upward trend in the early groups. It would appear that Homer is indeed getting smarter over time! What about some of the other characters?

In [49]:
Marge_words = []
for j in range(5):
    Marge_words.append([len(set(all_speech.loc['Marge Simpson', 'spoken_words'][15000*j:15000*j+i])) for i in range(15000)])

colors = plt.cm.nipy_spectral(np.linspace(0, 1, 5))

fig, ax = plt.subplots(figsize=(8, 6))
ax.set_xlabel('Total Words')
ax.set_ylabel('Unique Words')
for i in range(len(Marge_words)):
    plt.plot(Marge_words[i], color=colors[i%10], label='Marge group '+str(i+1));
legend = ax.legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')

Marge shows a similar pattern. Let's check Bart and Lisa to be sure.

In [16]:
Bart_words = []
for j in range(7):
    Bart_words.append([len(set(all_speech.loc['Bart Simpson', 'spoken_words'][10000*j:10000*j+i])) for i in range(10000)])
    
Lisa_words = []
for j in range(6):
    Lisa_words.append([len(set(all_speech.loc['Lisa Simpson', 'spoken_words'][10000*j:10000*j+i])) for i in range(10000)])

colors = plt.cm.nipy_spectral(np.linspace(0, 1, 7))

fig, ax = plt.subplots(2, 1, figsize=(8, 12))
ax[0].set_xlabel('Total Words')
ax[0].set_ylabel('Unique Words')
for i in range(len(Bart_words)):
    ax[0].plot(Bart_words[i], color=colors[i%10], label='Bart group '+str(i+1));
legend = ax[0].legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')
ax[1].set_xlabel('Total Words')
ax[1].set_ylabel('Unique Words')
for i in range(len(Lisa_words)):
    ax[1].plot(Lisa_words[i], color=colors[i%10], label='Lisa group '+str(i+1));
legend = ax[1].legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')

Except for a small anomaly where Lisa's second group is below her first, it seems to be the general case that all the Simpsons characters become a bit more sophisticated after the first few seasons.

Final Thoughts

I really enjoyed working with this dataset, as The Simpsons is one of my favorite shows. The dialog data is really rich, with a lot of interesting things to discover. Many of the results confirmed my suspicions, such as the fact that Homer spoke the most, but there were also some surprises. I would have expected Prof. Frink to come out on top for new vocabulary rate, but in fact the clear winner was Kent Brockman. I never would have suspected that Comic Book Guy would be so sophisticated. Given that I didn't watch the show in the first few seasons, I didn't expect to see such a difference in new vocab rates over time. Maybe I should go back and watch some of the early episodes to see if it is really noticeable.

One of the great things about this dataset is that it defines each scene before the lines of dialog of the characters in it. I hope to use that to follow this up with an analysis of character relationships.

