Word clouds are to natural language processing (NLP) as pie charts are to statistical analyses: Celebrated by outsiders but not commonly used in practice. In certain circumstances, they are the right tool to use. I find them useful when just starting to explore textual data and I’m looking for a snapshot of the text. They can also serve as a helpful foundation for other NLP tasks (if you want to learn more about NLP see our other post about it here). Used or not, they are fun to review. This post will construct word clouds out of interesting posts across various subreddits.
Setting up is the annoying part
To complete this tutorial you should have Python 2.7 installed, though 3.x will work too with some adjustments. You’ll need to install praw, the library we’ll use to connect to Reddit, as well as wordcloud, which we’ll use to generate the images. You’ll also need to install matplotlib for plotting and NLTK for the NLP manipulation we’ll complete in the latter portions of this tutorial. You should be able to pip install praw, for example, with little trouble.
You’ll also need to get some Reddit API credentials. The process for obtaining these keys is here.
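As an alternative to pasting credentials directly into your script, praw can also read them from a praw.ini file. Here is a sketch of what that file might look like — the site name bot1 is arbitrary and the values are placeholders for your own keys:

```ini
[bot1]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
user_agent=wordcloud tutorial by /u/your_username
```

With that file in place, praw.Reddit("bot1") picks up the credentials without hard-coding them in your code.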
Let’s get our hands dirty
Import the libraries we’ll be using for this task and connect to Reddit via praw.
import matplotlib.pyplot as plt  # We'll use this later

# First let's connect to Reddit
import praw

client_id = "You'll need to get your"
client_secret = "own credentials from Reddit"
user_agent = "and paste them here"

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent)
Hey look now we’re connected to Reddit. Super cool. Let’s see what we can do.
for submission in reddit.subreddit('blogging').hot(limit=10):
    print submission.title
Weekly /r/Blogging Discussion - Check Out My Blog Post!
Looking for guest posters!
1 Month into Blogging - What I Learned About Ranking on Google
[Question] Would you pay to turn your blog into a podcast?
Guest Posting Opportunity
Blog site with best footnote feature
Getting obsessed with website design too much and suddenly loose interest in writing
Blog post on second result in google search, next steps?
How niche is too niche?
Can i create a blog about... everything?
Great so we can connect to Reddit. On the blogging subreddit, I can grab the top 10 hottest posts. This also means I’ll be able to dive into those posts and grab the comments as well as inspect the posts and comments for meta-data such as the number of upvotes on each.
To begin I’d like to find a submission to analyze. Let’s start with something simple. I searched “success” on the /r/blogging subreddit and this is the top post: https://www.reddit.com/r/Blogging/comments/7pwmzr/after_1_year_my_blog_is_meeting_my_simple/
I’m going to use that submission to create our first word-cloud.
from praw.models import MoreComments

# Top of /r/blogging
submission = reddit.submission(url='https://www.reddit.com/r/Blogging/comments/7pwmzr/after_1_year_my_blog_is_meeting_my_simple/')

# Upvotes on the first comment, just for fun
print submission.comments[0].ups

# Alright let's collect all the comments
raw_comments = []
for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    raw_comments.append(top_level_comment.body)
This code allows us to grab all the comments in this thread as unicode strings in a list. Thankfully there aren’t too many comments on this thread so I’ll just share them with you here. This is what the raw_comments variable looks like:
[u"A chunk of good news always give me motivation. Thanks for this. \n\nMy gf and i are planning to do a collab on a baby website. She's both a programmer and a writer while I'll be doing all the graphics the website needs. \n\nI hope we end up meeting our goal (like you guys) as we are really putting efforts to do all the planning like what's the blog's goal, what niche are we in, demographics, website wireframe, UI/UX, etc. More power! \n", u'What are you sources of income?', u'What helped you most about your blog success ?', u'Very nice, grats!', u'That\u2019s great! Can I ask, how do you promote your site? Or do you most rely on search engine optimization for traffic? ', u'Congrats! It is a lot tougher than people make it out to be. Hopefully your site grows a lot more in 2018! What tactic has worked best for growing traffic steadily?', u'Congrats. Thanks for sharing', u"Fantastic milestone, well done. I'm interested in your approach to SEO for growth rather than social media - did you hit a critical mass in the number of articles you had when you noticed you were coming up in searches more often?", u'This is very inspiring as I\u2019m new to blogging and finding my way around it as well.']
Rushing to the end
If you’re impatient like me you can rush off to the end and make the word cloud right now. Let’s do that and then double back afterwards to think about what we’ve done.
from wordcloud import WordCloud

# This line stuffs all the comments from our list into one string
text = " ".join([comment for comment in raw_comments])

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
The word cloud library requires the input text to be one long string, hence the creation of the text variable. Passing that data into an instance of the WordCloud class, I’m able to generate the cloud. Finally, I display it using the matplotlib library.
It’s fine. But is this really what we want? The data in raw_comments is very raw. If you squint at the word cloud you’ll see capitalization matters, which it shouldn’t. It also retains the punctuation and other odd characters of the raw data. This doesn’t feel correct.
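To see why case matters for a frequency-based visual like this, here is a tiny stand-alone sketch — the comment strings are made up — showing how lowercasing merges counts:

```python
from collections import Counter

# Made-up stand-ins for a couple of raw comments
raw = ["Blogging is hard work", "blogging is rewarding work"]

# Without normalization, "Blogging" and "blogging" count as different words
raw_counts = Counter(" ".join(raw).split())

# Lowercasing first merges them into a single, larger count
clean_counts = Counter(" ".join(raw).lower().split())

print(raw_counts["Blogging"])    # 1
print(raw_counts["blogging"])    # 1
print(clean_counts["blogging"])  # 2
```

A word cloud built on the un-normalized counts would split one word's weight across two entries, which is exactly what we see squinting at the image above.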
Backtracking and thinking about it more
The word cloud we generated before doesn’t feel correct because we left so much slop in the dataset we fed into it. We’ll use some standard techniques to clean this data: We are going to tokenize the string and then find the stems of each word within the sentence. We’ll also remove the stop words while we’re at it.
What does all of that mean? The following example should make it clear. We’ll take a generic sentence, tokenize it and then find the stems of those tokens.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# If this is your first time using NLTK, you may need to run
# nltk.download('punkt') and nltk.download('stopwords') once
stopWords = set(stopwords.words('english'))
stemmer = PorterStemmer()

string = "Learning to tokenize is an important part of NLP! There's a lot of goodness in these tools."
string = str.lower(string)  # Bringing strings down to lower case is a best practice
tokens = word_tokenize(string)
print "Here are the raw tokens"
print tokens
Here are the raw tokens
['learning', 'to', 'tokenize', 'is', 'an', 'important', 'part', 'of', 'nlp', '!', 'there', "'s", 'a', 'lot', 'of', 'goodness', 'in', 'these', 'tools', '.']
stems = [stemmer.stem(token) for token in tokens if token not in stopWords]
print "Now we find the stem words in this sentence which are", stems
Now we find the stem words in this sentence which are [u'learn', u'token', u'import', 'part', 'nlp', '!', "'s", 'lot', u'good', u'tool', '.']
So let’s get back to our visual now that we’ve done some cleaning of the data.
comments = []
for sentence in raw_comments:
    str_sentence = sentence.encode('ascii', 'ignore')  # Avoiding errors in unicode to string conversion
    lowered = str.lower(str_sentence)
    words_tokenized = word_tokenize(lowered)
    for token in words_tokenized:
        stemmed_token = stemmer.stem(token)
        if stemmed_token not in stopWords:
            comments.append(stemmed_token)
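The cleaned tokens still need to be joined back into the single long string WordCloud expects before regenerating the image. A minimal sketch of that last step — the token list here is a made-up stand-in for the real comments list:

```python
# Made-up stand-in for the cleaned, stemmed tokens built above
comments = ["blog", "success", "traffic", "seo", "post"]

# WordCloud wants one long string, so join the tokens back together
text = " ".join(comments)
print(text)  # blog success traffic seo post
```

From here, the earlier WordCloud().generate(text) and matplotlib calls produce the cleaned cloud.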
OK, so there’s our word cloud. You can go into further detail and remove more of that punctuation and other non-words, but I don’t think they are affecting this visual too much. Notice how the stemming process removed the ‘e’ on ‘website’. That is a side effect of stemming that allows the root word to be grouped with its derivatives.
The word cloud on the right was created from a post on how many words each character on The Office spoke each season. Small tweaks were made to reduce the influence of the most common words. This required passing the max_font_size parameter into the function. Here’s how that line changed.
wordcloud = WordCloud(max_font_size=60).generate(text)
From the entrepreneur subreddit I pulled the top post for “blogging”. I also went further and started altering the colors of the wordcloud. Andreas Mueller (creator of this library) has some code achieving this goal here. To get started you’ll need to add the following right before we generate these clouds.
import random

def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)

default_colors = wordcloud.to_array()
plt.imshow(wordcloud.recolor(color_func=grey_color_func, random_state=3), interpolation="bilinear")
plt.axis("off")
plt.show()
Now that we have the process down we can put it through its paces
The following three clouds are from the blogging, data science, and relationships subreddits, respectively. I went further and manipulated the grey_color_func function to return more colors beyond the black and greys.
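One way to get beyond blacks and greys is to randomize the hue as well as the lightness. This is a hypothetical variation — the function name is mine — that keeps the same signature wordcloud expects of a color_func:

```python
import random

def multi_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    # Vary the hue across the full color wheel instead of pinning it at 0
    return "hsl(%d, 80%%, 50%%)" % random.randint(0, 360)

# wordcloud calls a function like this once per word during recolor()
print(multi_color_func("blog", 12, (0, 0), None))
```

Swapping this in for grey_color_func in the recolor() call above gives each word a random saturated color.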
In these clouds, we see opportunities for improvement. There are sloppy data points that should be cleaned, like the “n’t” in the first image. We see words that should perhaps be combined into one, like “site” and “post”. The stem “thi” is present in each of these word clouds; it is almost certainly “this”, which survives because our loop stems each token before checking it against the stop-word list — “this” becomes “thi” and no longer matches the stop word.
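The “thi” leak most likely comes from operation order: stemming before stop-word filtering mangles “this” so it no longer matches the stop list. A toy sketch — the one-line stemmer is a crude stand-in for NLTK’s Porter stemmer — shows how the order changes the result:

```python
def toy_stem(word):
    # Crude stand-in for a real stemmer: strip a trailing "s"
    return word[:-1] if word.endswith("s") else word

stopWords = set(["this", "is", "a"])
tokens = ["this", "is", "a", "blogs", "test"]

# Stemming first: "this" -> "thi", which no longer matches the stop list
stem_first = [toy_stem(t) for t in tokens if toy_stem(t) not in stopWords]
print(stem_first)  # ['thi', 'i', 'blog', 'test']

# Filtering first removes "this" before it can be mangled
filter_first = [toy_stem(t) for t in tokens if t not in stopWords]
print(filter_first)  # ['blog', 'test']
```

Filtering stop words before stemming, rather than after, would keep stems like “thi” out of the clouds.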
I think we learn less from looking at any one cloud in isolation than from comparing and contrasting between them. The clouds related to blogging and data science share a similar vocabulary. The words “post” and “great” are prominent in each, as are words related to work output such as “content” and “work.” The final cloud, from the relationships subreddit, is very different. The tone and vocabulary reflect that submission, with words like “love”, “thank”, “loss”, and “time” appearing most prominently.
Conclusion: Word clouds are a place to start
At this point you might be asking “this is all well and good but what have we actually learned?” and that’s a great question to ask. We’ve learned how to connect to Reddit and retrieve data. We’ve learned how to manipulate that textual data into the format this library needs for processing. We’ve learned how to clean that data and ultimately we derived a process for automatically generating word clouds from Reddit submission comments.
I won’t say word clouds should be a necessary part of your NLP toolkit, but in some circumstances, they are a place to start. You could do worse when exploring your textual data than generating a word cloud. I think viewing the differences in diction between the subreddits is fairly interesting. Creating a word cloud helps you explore your data. It helps you get it into a state that lends itself to further exploration and training.