Kaggle has a number of datasets that one can download and play with. I was pleasantly surprised to find one that contains detailed script data for all episodes of The Simpsons up to the beginning of season 28. There are three files with information about the characters, the episodes, and all lines of dialog. The dialog file in particular is quite detailed: scenes are defined, and the speaker and timestamp are given for each spoken line. This is much more detailed than other datasets I have looked at in the past for other shows, some of which have all spoken words but no speaker attached, making analysis much more difficult.

There is a lot that can be done with this dataset, but in this post, I'd like to look at the speech of the various characters and determine whether there are significant differences in their vocabularies or sophistication of their speech.

You can download the dataset here.

In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-whitegrid')

characters = pd.read_csv('simpsons_characters.csv', index_col='id').sort_index()
#episodes = pd.read_csv('simpsons_episodes.csv', index_col='id').sort_index()
lines = pd.read_csv('simpsons_script_lines.csv', index_col='id').sort_index()

First, let's see what we can get from the characters file.

In [32]:
characters.head()
Out[32]:
name normalized_name gender
id
1 Marge Simpson marge simpson f
2 Homer Simpson homer simpson m
3 Seymour Skinner seymour skinner m
4 JANEY janey f
5 Todd Flanders todd flanders m

A pretty simple file that gives us character names and genders. It turns out that there are an astounding 6,722 characters! Of course, most of these are minor characters who appear only briefly. Looking more closely at the file, you will also notice many variant characters based on the main characters. For example, here are some interesting versions of Homer:

In [34]:
#characters[characters['normalized_name'].str.contains('homer')][:30]
characters.loc[[125, 625, 1011, 1085, 2408], ['name']]
Out[34]:
name
id
125 Homer's Canyon Echo
625 Homer-Ape
1011 Homer's Bloody Skull
1085 Evil Homer
2408 Homer's Stomach

For this study, I am only going to look at the top 30-40 characters in terms of number of spoken lines or words, and I will only consider words spoken by the main character and not any variant characters.

Dialog

The simpsons_script_lines.csv file contains all spoken lines and scenes from every episode. Here are the first few lines (not all columns are shown).

In [35]:
lines.drop(['number', 'timestamp_in_ms', 'raw_text', 'normalized_text'], axis=1).head()
Out[35]:
episode_id speaking_line character_id location_id raw_character_text raw_location_text spoken_words word_count
id
1 1 False NaN 1.0 NaN Street NaN NaN
2 1 False NaN 2.0 NaN Car NaN NaN
3 1 True 1.0 2.0 Marge Simpson Car Ooo, careful, Homer. 3.0
4 1 True 2.0 2.0 Homer Simpson Car There's no time to be careful. 6.0
5 1 True 2.0 2.0 Homer Simpson Car We're late. 2.0

In this file, non-speaking lines establish scene locations, and each is followed by the spoken lines of the characters in that scene. Timestamps are provided, along with the character name, the spoken words, and a word count. A normalized_text column is also provided, in which all punctuation has been stripped from the spoken words and all words are lowercase.
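To see what the normalization looks like, you can compare the raw and normalized text for a single line (a quick check rather than one of the original cells):

lines.loc[3, ['spoken_words', 'normalized_text']]

For line 3, Marge's "Ooo, careful, Homer." should come out as something like "ooo careful homer". With that in mind, let's determine the most prolific speakers.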

In [36]:
lines.drop(['number', 'timestamp_in_ms', 'raw_text'], axis=1).groupby('raw_character_text').count().sort_values(by='normalized_text', ascending=False)[:10]
Out[36]:
episode_id speaking_line character_id location_id raw_location_text spoken_words normalized_text word_count
raw_character_text
Homer Simpson 29850 29850 29850 29780 29780 27920 27917 27920
Marge Simpson 14165 14165 14165 14141 14141 13199 13197 13199
Bart Simpson 13780 13780 13780 13766 13766 13016 13014 13016
Lisa Simpson 11505 11505 11505 11488 11488 10772 10770 10772
C. Montgomery Burns 3172 3172 3172 3172 3172 3087 3086 3087
Moe Szyslak 2864 2864 2864 2862 2862 2810 2810 2810
Seymour Skinner 2443 2443 2443 2438 2438 2390 2390 2390
Ned Flanders 2146 2146 2146 2142 2142 2058 2057 2058
Grampa Simpson 1886 1886 1886 1881 1881 1807 1807 1807
Chief Wiggum 1836 1836 1836 1836 1836 1796 1796 1796

There is an interesting discrepancy depending on which column we use for counting. For Homer, for example, you get anywhere from 27917 lines up to 29850. The discrepancy must result from some NA values in certain columns.

In [37]:
lines.loc[(lines.raw_character_text == 'Homer Simpson') & (lines.word_count.isnull()), ['raw_text', 'raw_character_text', 'spoken_words']][:5]
Out[37]:
raw_text raw_character_text spoken_words
id
123 Homer Simpson: (SHRIEKS) Homer Simpson NaN
187 Homer Simpson: (SHUDDERS) Homer Simpson NaN
324 Homer Simpson: (GROANS LIKE HE'S HAVING A HEAR... Homer Simpson NaN
328 Homer Simpson: (GROAN) Homer Simpson NaN
612 Homer Simpson: (GRUNTS) Homer Simpson NaN

Now we see where the difference comes from. Homer has a lot of lines that are just grunts and groans with no spoken words! Let's go ahead and count those anyway.

In [38]:
fig, ax = plt.subplots(figsize=(12, 8))
lines.groupby('raw_character_text')['raw_text'].count().sort_values(ascending=False)[:40].plot(kind='bar', ax=ax)
ax.set_title('Number of lines for top 40 characters')
ax.set_xlabel('');

With nearly 30,000 lines, Homer is by far the most prolific speaker. He is followed by his three immediate family members, all of whom have many times as many lines as any other character. If we look at number of words (no grunts allowed), we see a similar picture.

In [39]:
fig, ax = plt.subplots(figsize=(12, 8))
lines.groupby('raw_character_text')['word_count'].sum().sort_values(ascending=False)[:40].plot(kind='bar', ax=ax)
ax.set_title('Number of words for top 40 characters')
ax.set_xlabel('');

Either way you look at it, Homer absolutely dominates the dialog in The Simpsons. All four of the main family characters are well above Mr. Burns, who comes in at fifth place. In fact, the main family speaks 604,467 words, which accounts for more than 46% of the total 1.3M words!
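For reference, here is one way to check those numbers directly from the lines table (a quick sanity check rather than one of the original cells; it assumes the four names match raw_character_text exactly, as in the tables above):

family = ['Homer Simpson', 'Marge Simpson', 'Bart Simpson', 'Lisa Simpson']
family_words = lines.loc[lines['raw_character_text'].isin(family), 'word_count'].sum()
total_words = lines['word_count'].sum()
print(family_words, family_words / total_words)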

Sophistication of Speech

Any long-time viewer of The Simpsons knows how central Homer is to the show. The more interesting question to me is whether there are big differences in the sophistication of speech or vocabulary size between the characters. Many measures have been devised to estimate the difficulty of reading text. Two of the best known are the Flesch Reading Ease and the Flesch-Kincaid Grade Level. Although these were designed for written text, I am not worried about their absolute accuracy for spoken dialog here; rather, I want to see how the characters compare to each other.
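For reference, the standard formulas (the same ones used in the code below) are:

Flesch Reading Ease = 206.835 - 1.015*(total words / total sentences) - 84.6*(total syllables / total words)
Flesch-Kincaid Grade Level = 0.39*(total words / total sentences) + 11.8*(total syllables / total words) - 15.59

A higher Reading Ease score means easier text, while the grade level roughly corresponds to the U.S. school grade needed to understand it.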

These measures require counts of words, sentences, and syllables. Word counts are provided, and sentence counts are easy to obtain. For syllable counts, I'm using a well-known method that counts the vowel sounds from the Carnegie Mellon Pronouncing Dictionary. Words not found in the dictionary are given 1 syllable, which will sometimes be incorrect, but I don't expect it to affect the results significantly.

In [41]:
from nltk.corpus import cmudict
prondict = cmudict.dict()

def numsyllables(word):
    # Count vowel sounds: in the CMU dictionary, vowel phonemes carry a stress
    # digit (e.g. 'AH0'), so counting phonemes that contain a digit counts syllables.
    try:
        pron = prondict[word][0]
        return len([s for s in pron if any(char.isdigit() for char in s)])
    except KeyError:
        # Words missing from the dictionary are assumed to have one syllable.
        return 1

def total_sylls(x):
    return sum(numsyllables(word) for word in x['normalized_text'].split(' '))

lines['syllable count'] = lines.dropna(subset=['normalized_text']).apply(total_sylls, axis=1)
lines['sentence count'] = lines.dropna(subset=['normalized_text'])['spoken_words'].str.count(r'\.')

counts = lines.groupby('raw_character_text')[['word_count', 'sentence count', 'syllable count']].sum()
counts = counts.sort_values(by='word_count', ascending=False)

counts['Flesch readability'] = 206.835 - 1.015*counts['word_count']/counts['sentence count'] - 84.6*counts['syllable count']/counts['word_count']
counts['Flesch-Kincaid grade'] = 0.39*counts['word_count']/counts['sentence count'] + 11.8*counts['syllable count']/counts['word_count'] - 15.59

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
counts['Flesch readability'][:30].sort_values().plot(kind='bar', ax=ax[0])
ax[0].set_title('Flesch readability')
ax[0].set_xlabel('')
counts['Flesch-Kincaid grade'][:30].sort_values().plot(kind='bar', ax=ax[1])
ax[1].set_title('Flesch-Kincaid grade level')
ax[1].set_xlabel('');

Here, I have plotted the Flesch readability and the Flesch-Kincaid grade level for the 30 characters with the most total words spoken. The two measures have an inverse relationship: a higher readability score should correspond to a lower grade level. All characters would be rated as highly readable. By either measure, three characters stand apart from the rest - Kent Brockman, the Announcer, and Mayor Joe Quimby. We see a bit more variability in the grade level. While most characters fall somewhere in the 1st-3rd grade range, Kent Brockman reaches almost a 5th-grade level. A few other characters are above the 3rd-grade level, including the Announcer, Mayor Joe Quimby, Gary Chalmers, Dr. Hibbert, Principal Skinner, Prof. Frink, and Comic Book Guy. Within the Simpson family, Homer and Bart come in at the lower end of the range, while Marge and Lisa fare a bit better. Nelson Muntz comes in last among the top 30 most-heard characters. No surprise there.

Vocabulary

Now we will look at the vocabularies of the different characters. A bit of processing is required. First, the lines data is grouped by character, and for each character all of their spoken lines are joined into one long string. That string is tokenized into a list of individual words, and English stop words are removed to simplify the calculations later. From this we get the total words spoken, the number of unique words, and the vocabulary size as a fraction of total words.

The following table shows the top 10 characters for total words.

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

all_speech = lines.dropna(subset=['normalized_text']).groupby('raw_character_text')['spoken_words'].agg(' '.join)

stop_words = set(stopwords.words('english'))

def proc(x):
    tokenizer = RegexpTokenizer(r'\w+')
    # Note: the stop-word check happens before lowercasing, so capitalized stop
    # words at the start of a sentence (e.g. 'But', 'I') slip through the filter.
    return [word.lower() for word in tokenizer.tokenize(x) if word not in stop_words]

all_speech = pd.DataFrame(all_speech.apply(lambda x: proc(x)))

all_speech['total words'] = all_speech['spoken_words'].apply(len)
all_speech['vocab size'] = all_speech['spoken_words'].apply(lambda x: len(set(x)))
all_speech['vocab:total ratio'] = all_speech['vocab size'] / all_speech['total words']

all_speech.sort_values(by='total words', ascending=False, inplace=True)

all_speech.head(10)
Out[2]:
spoken_words total words vocab size vocab:total ratio
raw_character_text
Homer Simpson [there, time, careful, we, late, hey, norman, ... 175490 15435 0.087954
Marge Simpson [ooo, careful, homer, sorry, excuse, us, pardo... 76699 9602 0.125191
Bart Simpson [jingle, bells, batman, smells, robin, laid, a... 71209 9515 0.133621
Lisa Simpson [but, i, really, want, pony, i, really, really... 63080 10165 0.161145
C. Montgomery Burns [hello, i, proud, announce, able, increase, sa... 23117 6155 0.266254
Moe Szyslak [what, matter, homer, somebody, leave, lumpa, ... 21254 4716 0.221888
Seymour Skinner [wasn, wonderful, and, santas, many, lands, pr... 17804 4932 0.277016
Ned Flanders [just, hold, horses, son, hey, hey, simpson, d... 14964 4062 0.271451
Krusty the Clown [kill, well, right, go, ahead, there, someone,... 13588 3899 0.286944
Chief Wiggum [well, secret, city, siege, graffiti, vandal, ... 12693 3483 0.274403

The list looks a bit different if we sort by vocab size.

In [31]:
all_speech.sort_values(by='vocab size', ascending=False).head(10)
Out[31]:
spoken_words total words vocab size vocab:total ratio
raw_character_text
Homer Simpson [there, time, careful, we, late, hey, norman, ... 175490 15435 0.087954
Lisa Simpson [but, i, really, want, pony, i, really, really... 63080 10165 0.161145
Marge Simpson [ooo, careful, homer, sorry, excuse, us, pardo... 76699 9602 0.125191
Bart Simpson [jingle, bells, batman, smells, robin, laid, a... 71209 9515 0.133621
C. Montgomery Burns [hello, i, proud, announce, able, increase, sa... 23117 6155 0.266254
Seymour Skinner [wasn, wonderful, and, santas, many, lands, pr... 17804 4932 0.277016
Moe Szyslak [what, matter, homer, somebody, leave, lumpa, ... 21254 4716 0.221888
Kent Brockman [good, evening, springfield, krusty, clown, be... 10232 4117 0.402365
Ned Flanders [just, hold, horses, son, hey, hey, simpson, d... 14964 4062 0.271451
Krusty the Clown [kill, well, right, go, ahead, there, someone,... 13588 3899 0.286944

Homer has, by far, the greatest number of unique words, but mostly because he dominates the dialog of The Simpsons.

This plot shows vocab size vs. total words.

In [45]:
fig, ax = plt.subplots()
ax.scatter(all_speech['total words'], all_speech['vocab size'])
ax.set_xlabel('Total Words')
ax.set_ylabel('Vocab Size');

There is a definite correlation between total words spoken and unique vocabulary. Homer, of course, is the point in the upper right corner of the plot.
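To put a number on that correlation (a quick check rather than one of the original cells):

# Pearson correlation between total words spoken and vocabulary size;
# given the heavy skew, a rank or log-log correlation might be more appropriate.
all_speech[['total words', 'vocab size']].corr()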

The following plot shows the vocab:total ratio vs. total words for characters with more than 500 total words spoken.

In [46]:
fig, ax = plt.subplots()
ax.scatter(all_speech.loc[all_speech['total words']>500, 'total words'],
            all_speech.loc[all_speech['total words']>500, 'vocab:total ratio'])
ax.set_xlabel('Total Words')
ax.set_ylabel('Vocab:Total Ratio');

The unique vocab to total words ratio would be expected to decline with more words because additional words are more likely to have been used before.

Homer has the greatest vocabulary, mostly because he speaks so much more than any other character. Given his Flesch-Kincaid score, however, it wouldn't seem right to crown him the vocab king. Perhaps a better way to measure vocabulary as it relates to sophistication of speech is to take into account the number of words spoken. Instead of summarizing the size of each character's vocabulary with a single number, let's plot a curve showing how many unique words have been used vs. the total number of words spoken. I remove English stop words and cap the calculation at 20,000 total words, both because most characters don't surpass that number and because the set calculations for much longer lists were taking a long time.
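As an aside, the cumulative unique-word curve can also be built in a single pass with a running set, which avoids rebuilding a set for every prefix. This is just a sketch of an alternative (vocab_curve is a hypothetical helper, not the code used in the next cell, which keeps the original prefix-set approach):

def vocab_curve(words, limit=20000):
    # Running count of unique words seen after each of the first `limit` words.
    seen = set()
    curve = []
    for w in words[:limit]:
        seen.add(w)
        curve.append(len(seen))
    return curve

With something like this, the 20,000-word cap becomes mostly a readability choice rather than a performance one.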

In [26]:
# Cumulative unique-word counts for the top 30 characters (by total words spoken)
all_speech['new vocab curve'] = all_speech['spoken_words'].iloc[:30].apply(lambda x: [len(set(x[:i])) for i in range(min(len(x), 20000))])

colors = plt.cm.nipy_spectral(np.linspace(0, 1, 10))

fig, ax = plt.subplots(figsize=(12, 12))
ax.set_xlabel('Total Words')
ax.set_ylabel('Unique Words')
for i in range(30):
    plt.plot(all_speech['new vocab curve'].iloc[i], color=colors[i%10], label=str(all_speech.index[i]));
legend = ax.legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')

In the early range, up to ~5,000 words, there appear to be three groups. In the lowest group we find some of the most important characters, including Homer, Marge, Bart, and Moe. I am not surprised that Homer has the shallowest vocab curve, but I am a bit surprised that Marge runs neck-and-neck with him most of the way; I expected her to be significantly ahead. Another surprise was that Moe begins to overtake every main Simpson family member except Lisa. Bart is also well ahead of Homer and Marge, but not Lisa, of course. Lisa falls in the middle group early on but ends with a respectable vocabulary. Within the middle group, Mr. Burns and Principal Skinner are at the top, and at the 17,500-word mark they easily have the greatest vocabularies. The high fliers early on are the Announcer, Kent Brockman, Comic Book Guy, Prof. Frink, and Mayor Joe Quimby. All except Kent Brockman have far fewer total words than the other characters, but they add new words at the greatest pace. Kent Brockman really stands out, adding new words at a faster rate all the way up to the 10,000-word mark. We saw Kent Brockman, the Announcer, and Mayor Quimby before, in the Flesch measures, so it is no surprise to see them in the highest group here. Prof. Frink doesn't surprise me either, but Comic Book Guy sure does!

I think this curve is a better way of measuring sophistication of speech because it lets you compare characters on an equal-word basis: the slope of the curve represents the rate at which a character introduces new words.

This graph only covers up to 20,000 words per character. For the main Simpsons family members, this only accounts for the early seasons. Over the course of a show that has been on air for this long, however, characters can develop, and the writing can change. It would be interesting to check if Homer is getting any smarter over time. Let's break down Homer's 175,000 words into 10 groups of 17,500 and plot them.

In [27]:
# Cumulative unique-word curves for 10 consecutive blocks of 17,500 words each
Homer_words = []
for j in range(10):
    Homer_words.append([len(set(all_speech.loc['Homer Simpson', 'spoken_words'][17500*j:17500*j+i])) for i in range(17500)])

fig, ax = plt.subplots(figsize=(8, 8))
ax.set_xlabel('Total Words')
ax.set_ylabel('Unique Words')
for i in range(len(Homer_words)):
    plt.plot(Homer_words[i], color=colors[i%10], label='Homer group '+str(i+1));
legend = ax.legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')

This is actually quite interesting because there is a definite difference between groups, and there is a pretty clear upward trend in the early groups. It would appear that Homer is indeed getting smarter over time! What about some of the other characters?

In [49]:
Marge_words = []
for j in range(5):
    Marge_words.append([len(set(all_speech.loc['Marge Simpson', 'spoken_words'][15000*j:15000*j+i])) for i in range(15000)])

colors = plt.cm.nipy_spectral(np.linspace(0, 1, 5))

fig, ax = plt.subplots(figsize=(8, 6))
ax.set_xlabel('Total Words')
ax.set_ylabel('Unique Words')
for i in range(len(Marge_words)):
    plt.plot(Marge_words[i], color=colors[i%10], label='Marge group '+str(i+1));
legend = ax.legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')

Marge shows a similar pattern. Let's check Bart and Lisa to be sure.

In [16]:
Bart_words = []
for j in range(7):
    Bart_words.append([len(set(all_speech.loc['Bart Simpson', 'spoken_words'][10000*j:10000*j+i])) for i in range(10000)])
    
Lisa_words = []
for j in range(6):
    Lisa_words.append([len(set(all_speech.loc['Lisa Simpson', 'spoken_words'][10000*j:10000*j+i])) for i in range(10000)])

colors = plt.cm.nipy_spectral(np.linspace(0, 1, 7))

fig, ax = plt.subplots(2, 1, figsize=(8, 12))
ax[0].set_xlabel('Total Words')
ax[0].set_ylabel('Unique Words')
for i in range(len(Bart_words)):
    ax[0].plot(Bart_words[i], color=colors[i%10], label='Bart group '+str(i+1));
legend = ax[0].legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')
ax[1].set_xlabel('Total Words')
ax[1].set_ylabel('Unique Words')
for i in range(len(Lisa_words)):
    ax[1].plot(Lisa_words[i], color=colors[i%10], label='Lisa group '+str(i+1));
legend = ax[1].legend(fontsize='x-large', frameon=True, bbox_to_anchor=(1, 1));
legend.get_frame().set_facecolor('lightgrey')

Except for a small anomaly where Lisa's second group is below her first, it seems to be the general case that all the Simpsons characters become a bit more sophisticated after the first few seasons.

Final Thoughts

I really enjoyed working with this dataset, as The Simpsons is one of my favorite shows. The dialog data is really rich, with a lot of interesting things to discover. Many of the results confirmed my suspicions, such as the fact that Homer spoke the most, but there were also some surprises. I would have expected Prof. Frink to come out on top for new vocabulary rate, but in fact the clear winner was Kent Brockman. I never would have suspected that Comic Book Guy would be so sophisticated. Given that I didn't watch the show in the first few seasons, I didn't expect to see such a difference in new vocab rates over time. Maybe I should go back and watch some of the early episodes to see if it is really noticeable.

One of the great things about this dataset is that it defines each scene before the lines of dialog of the characters in it. I hope to use that to follow this up with an analysis of character relationships.

