For this post, I decided to compete in the April /r/dataisbeautiful contest. Each month on this subreddit, people battle to see who can make the most out of a provided dataset. This month’s contest dataset is all nine seasons’ worth of scripts from the TV show The Office.
Here’s the trick with these contests and with data analysis in general: you don’t have to use all of the data.
Let’s focus on some specific aspect of what we can find there.
I thought it would be interesting to see how conversations between characters evolve over the seasons. Surprisingly, I couldn’t find many different ways to visualize this information, and ultimately decided that watching the progression of word clouds from season to season would be best. While implementing this, I found that creating nine word clouds for each pair of characters was too dense. To solve this, I grouped the seasons into three chunks of three seasons each.
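To make that grouping concrete, here’s a minimal sketch (the chunk boundaries are the ones described above; the helper name is my own, not from the dataset):

```python
def season_group(season):
    """Map seasons 1-9 into three chunks: (1-3), (4-6), (7-9)."""
    return (int(season) - 1) // 3 + 1

# Seasons 1-9 fall into groups 1, 1, 1, 2, 2, 2, 3, 3, 3
groups = [season_group(s) for s in range(1, 10)]
```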
The most difficult part of this analysis comes immediately: preparing the data. I have found this is often the case with data analysis: getting the data into a coherent shape is the most frustrating component. Upon inspection of the data, we see that each line of each episode is presented along with its speaker and season. A key piece of information is missing, though: which character was being spoken to.
We can only infer that from the lines. So the first conversation of the series begins with Michael talking to Jim:
- Michael: All right Jim. Your quarterlies look very good. How are things at the library?
- Jim: Oh, I told you. I couldn’t close it. So…
- Michael: So you’ve come to the master for guidance? Is this what you’re saying, grasshopper?
- Jim: Actually, you called me in here, but yeah.
- Michael: All right. Well, let me show you how it’s done.
- Michael: [on the phone] Yes, I’d like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. [quick cut scene] All right. Done deal. Thank you very much, sir. You’re a gentleman and a scholar. Oh, I’m sorry. OK. I’m sorry. My mistake. [hangs up] That was a woman I was talking to, so… She had a very low voice. Probably a smoker, so… [Clears throat] So that’s the way it’s done.
From these first lines, can we create some logic for identifying conversations between individuals? I would say lines 1-5 are between Michael and Jim. Line 6 is separate; it’s not between any individuals. Given this, plus a few other examples from the data, we can say that a conversation happens when one individual speaks and then another individual speaks. If the next speaker is different from the last, then those two characters are having a conversation. This is especially true if it goes on for a few lines.
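As a rough sketch of that rule (using a toy transcript echoing the scene above rather than the real dataset, and a helper name of my own choosing), pairing each line with the next and keeping only the changes of speaker looks like this:

```python
# Toy transcript: (speaker, line) tuples echoing the scene above
lines = [
    ("Michael", "All right Jim. Your quarterlies look very good."),
    ("Jim", "Oh, I told you. I couldn't close it. So..."),
    ("Michael", "So you've come to the master for guidance?"),
    ("Jim", "Actually, you called me in here, but yeah."),
    ("Michael", "All right. Well, let me show you how it's done."),
    ("Michael", "[on the phone] Yes, I'd like to speak to your office manager..."),
]

def conversation_pairs(lines):
    """Yield (speaker, listener) wherever the next speaker differs."""
    for (speaker, _), (next_speaker, _) in zip(lines, lines[1:]):
        if speaker != next_speaker:
            yield (speaker, next_speaker)

# Lines 1-5 alternate between Michael and Jim; the final transition
# (Michael followed by Michael) produces no pair.
pairs = list(conversation_pairs(lines))
```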
Now is this reality? Are conversations between two individuals just periods when one person speaks and then another person speaks? The cynical among us might say so.
Regardless, with this data we don’t know exactly when two people are talking to each other; we’ll have to use what we have. Let’s look at the next few lines:
- Michael: I’ve, uh, I’ve been at Dunder Mifflin for 12 years, the last four as Regional Manager. If you want to come through here… See we have the entire floor. So this is my kingdom, as far as the eye can see. This is our receptionist, Pam. Pam! Pam-Pam! Pam Beesly. Pam has been with us for… forever. Right, Pam?
- Pam: Well. I don’t know.
- Michael: If you think she’s cute now, you should have seen her a couple of years ago. [growls]
- Pam: What?
- Michael: Any messages?
- Pam: Uh, yeah. Just a fax.
- Michael: Oh! Pam, this is from Corporate. How many times have I told you? There’s a special filing cabinet for things from corporate.
- Pam: You haven’t told me.
- Michael: It’s called the wastepaper basket! Look at that! Look at that face.
- Michael: People say I am the best boss. They go, ‘God we’ve never worked in a place like this before. You’re hilarious.’ ‘And you get the best out of us.’ [shows the camera his WORLD’S BEST BOSS mug] I think that pretty much sums it up. I found it at Spencer Gifts.
So the conversation between Michael and Pam runs from the middle of line 7 to line 15. Line 16 is Michael speaking into the camera; we don’t want that included in the conversation between Pam and Michael.
OK, we have seen enough; let’s get coding.
import pandas as pd

# The latin1 encoding will preserve the apostrophes.
# Otherwise they are converted to unicode
df = pd.read_csv('office.csv', encoding='latin1')

# Cut down the dataframe to the essentials
df = df[['season', 'line_text', 'speaker']]
df.head()
Here’s the difficult part: attributing conversations. We’ve gotten our csv into a dataframe and we know the current speaker. What we need to do is look ahead and behind at the next and last speakers. Once we have that data, we can make some judgments about who is speaking to whom. To accomplish this task we’ll make use of iterrows and generators: we’ll get the current and next speakers, and before we iterate over the next row, we’ll set the last speaker.
What’s also important here is to set our valid flag. For a line to be valid, the next speaker cannot be the current speaker and the last speaker cannot be the current speaker. The Office often has characters talking in a room by themselves. This might be interesting data for identifying character traits, but we won’t use it in this tutorial.
I also started putting in some logic to identify longer conversations to make this analysis more robust, but I don’t use it in any concrete way. Feel free to pick up the thread where I left off.
row_iterator = df.iterrows()
last_idx, last = next(row_iterator)  # take the first item from row_iterator
last_speaker = ""

for idx, row in row_iterator:
    current_speaker = last['speaker']
    next_speaker = row['speaker']
    valid = (next_speaker != current_speaker) and (last_speaker != current_speaker)

    df.at[last_idx, 'last_speaker'] = last_speaker
    df.at[last_idx, 'current_speaker'] = current_speaker
    df.at[last_idx, 'next_speaker'] = next_speaker
    df.at[last_idx, 'valid'] = valid

    # An A, B, A pattern of speakers hints at a longer conversation
    longer_conversation = bool(last_speaker == next_speaker)
    df.at[idx, 'longer_conversation'] = longer_conversation

    last = row
    last_idx = idx
    last_speaker = current_speaker  # combats awkward transitions
df = df[df['valid'] == True]
df = df[['season', 'line_text', 'last_speaker', 'current_speaker', 'next_speaker']]
df.head()
We’ve cleaned our data a fair amount. Let’s create some dataframes, each with a specific purpose. The first one here identifies places where Michael is speaking to Jim; the one after identifies where Jim speaks to Michael. We expect that an employee’s tone and word choice toward their boss would differ from the tone and word choice a boss uses with their reports.
michael_to_jim = df[(df['current_speaker'] == "Michael") & (df['next_speaker'] == "Jim")]
jim_to_michael = df[(df['current_speaker'] == "Jim") & (df['next_speaker'] == "Michael")]
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
tokenizer = RegexpTokenizer(r'\w+')
stopWords = set(stopwords.words('english'))
names = ['michael', 'jim', 'pam', 'oh', 'dwight', 'good', 'know',
         'ok', 'okay', 'go', 'ye', "-", "_", "s"]

dic = defaultdict(list)  # maps season -> list of cleaned words
for index, row in michael_to_jim.iterrows():  # or jim_to_michael.iterrows(), etc.
    # Bringing strings down to lower case is a best practice
    string = row['line_text'].lower()
    tokens = tokenizer.tokenize(string)
    text = [word for word in tokens if word not in stopWords]
    #text = [p_stemmer.stem(word) for word in text]
    for tex in text:
        if tex not in names:
            dic[str(row['season'])].append(tex)
This second block of code cleans the text: it lowercases each line, tokenizes it, and strips stop words and character names before collecting the remaining words by season.
import random

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def grey_color_func(word, font_size, position, orientation,
                    random_state=None, **kwargs):
    # Recolor every word to a random shade of grey
    return "hsl(0, 0%%, %d%%)" % random.randint(20, 60)

def gen_word(text, titl=""):
    wordcloud = WordCloud(margin=4,
                          random_state=1,
                          background_color="white")
                          #mask=mask,
                          #contour_width=4, contour_color='white')
    wordcloud.generate(text)
    plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=1),
               interpolation="bilinear")
    plt.axis("off")
    #plt.suptitle(titl)
    plt.show()
And finally, we combine the cleaned text for the last chunk of seasons and generate the word cloud.
# Combine the cleaned words from seasons 7-9 into a single string
text = " ".join(dic["7"] + dic["8"] + dic["9"])
gen_word(text)
I took these word clouds into Google Drawings and created the final product: