YouTube is the second most visited site on the internet. That being the case, it’s fertile ground for data mining. There’s plenty of data made publicly available: views, likes, keywords about the videos, a whole festival of comments for nearly every video with over 100 views, and other gems waiting to be discovered. It’s curious, then, that there appear to be so few blog posts on the subject of collecting this data. We’ll change that with this post. Here we use Python to pull YouTube video statistics into a pandas dataframe. We’ll review the highest-viewed videos and channels for the search query “data science”. By the end of this journey we’ll produce the following visual:
Set yourself up
In a Jupyter notebook, make sure you can import the following libraries with no issues. If you run into trouble, pip install google-api-python-client pandas seaborn should cover all three (google-api-python-client is the package that provides apiclient).
from apiclient.discovery import build
import pandas as pd
import seaborn as sns
The next step is to grab YouTube API credentials. The next lines in our program make use of the API key, and you won’t be able to move forward without one.
DEVELOPER_KEY = "Very secret key here"
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

YOUTUBE = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                developerKey=DEVELOPER_KEY)
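A quick hygiene note: rather than pasting the key directly into your notebook, you can read it from an environment variable. A minimal sketch, assuming you’ve exported a variable named YOUTUBE_API_KEY (the name is my own choice, not anything the API requires):

import os

# Read the key from the environment instead of hard-coding it;
# YOUTUBE_API_KEY is a hypothetical name you'd export in your shell first
DEVELOPER_KEY = os.environ["YOUTUBE_API_KEY"]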
And now here’s the meat of the program. Let’s publish it first, then walk through it.
def youtube_search(query, max_results=50, order="relevance"):
    """
    Arguments
        query: search query
        max_results: maximum number of results to return
        order: how YouTube sorts the results
    Returns
        (dict) lists of video data keyed by column name
    """
    search_response = YOUTUBE.search().list(
        q=query,
        type="video",
        order=order,
        maxResults=max_results,
        part="id,snippet",  # part signifies the different types of data you want
    ).execute()

    title = []
    channelId = []
    channelTitle = []
    categoryId = []
    videoId = []
    viewCount = []
    commentCount = []
    favoriteCount = []
    tags = []

    for search_result in search_response.get("items", []):
        if search_result["id"]["kind"] == "youtube#video":
            title.append(search_result['snippet']['title'])
            videoId.append(search_result['id']['videoId'])

            # Fetch this video's statistics and snippet in one call
            response = YOUTUBE.videos().list(
                part='statistics,snippet',
                id=search_result['id']['videoId']).execute()
            snippet = response['items'][0]['snippet']
            statistics = response['items'][0]['statistics']

            channelId.append(snippet['channelId'])
            channelTitle.append(snippet['channelTitle'])
            categoryId.append(snippet['categoryId'])
            favoriteCount.append(int(statistics['favoriteCount']))
            viewCount.append(int(statistics['viewCount']))
            # Here you could go further and get likes/dislikes
            # commentCount stays a string ('0' if the API omits it);
            # this matters when we describe the dataframe below
            commentCount.append(statistics.get('commentCount', '0'))
            tags.append(snippet.get('tags', []))

    return {'channel_title': channelTitle,
            'video_title': title,
            'id': channelId,
            'views': viewCount,
            'tags': tags,
            'category': categoryId,
            'videoid': videoId,
            'commentcount': commentCount,
            'favoritecount': favoriteCount}
Here’s the rundown of what’s going on in this code.
We connect to YouTube through the global variable YOUTUBE using our API credentials. We use that global to run a search on a query of our choosing; search_response is the list of items YouTube returns. Before looping through search_response we instantiate a set of lists where we’ll collect views, comment counts, and other information from each video.
Now we can start collecting data.
For every video in search_response we gather its video ID, then ask YouTube to pull statistics on that video (the videos are still the ones returned from our initial search). For each video, we append its different pieces to the appropriate lists. These lists are the values of the dictionary we return at the end of the function. The convenient part of organizing our data collection in this manner is that we can send the data straight on to a pandas dataframe.
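To see why that hand-off is so convenient, here’s a toy illustration of the same pattern, with made-up numbers rather than real API output:

toy = {'channel_title': ['Channel A', 'Channel B'], 'views': [100, 250]}
pd.DataFrame(data=toy)  # each key becomes a column; each list, that column's values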
data = youtube_search("data science")
df = pd.DataFrame(data=data)
df.describe()
The results from the dataframe describe show us a few things: 1. there are 50 rows in this table (our max_results parameter); 2. views and favoritecount are the only numeric columns, because commentcount was left as strings (check back in the function for why); 3. the average video had 70,000+ views! while the median couldn’t break 24,000. Some videos were watched A LOT more than others. This data has outliers.
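If you’d rather have commentcount counted among the numeric columns, pandas can coerce it after the fact. A one-line sketch using pd.to_numeric, where errors='coerce' turns anything unparseable into NaN:

df['commentcount'] = pd.to_numeric(df['commentcount'], errors='coerce')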
Top videos
In two lines we can see which videos drew the most views for ‘data science’:
top_videos = df[['channel_title', 'video_title', 'views', 'tags']]
top_videos.sort_values('views', ascending=False)
So this three-year-old video by David Langer on data science in R has been viewed almost 700,000 times. Great job, David! The indefatigable Siraj Raval is in second place with an introduction to data science in Python, at over 300,000 views. Given that David hasn’t published a video in years, my prediction is that he will be surpassed sometime in 2018 or 2019 for the top video on data science. I don’t have firm ground to base that on, however. It’s entirely possible that David has reached some critical velocity with this video and can’t be caught.
Top channels
It’s similarly straightforward to view the top channels for data science videos. The groupby method lets us group by channel and sum the views across all the videos we returned.
analysis = df.groupby(['channel_title'])['views'].sum().reset_index()
analysis = analysis.sort_values(ascending=False, by='views')
analysis.head()
You should start being skeptical.
Isn’t it curious that David Langer and Siraj Raval again hold first and second? Doesn’t that seem a bit odd? Remember back to how we pulled this data: we limited the results to 50, so it’s less coincidence than construction that David and Siraj are again in first and second. Capping ourselves at 50 results means we are leaving behind the vast majority of the long tail. And here’s another disclaimer: the median number of views in our video dataset was barely 25,000. At that rate it would take a YouTuber roughly 28 well-received videos (and 25,000 is a lot of views) to catch up to one of David Langer’s. Granted, it is a very popular video.
Ultimately I don’t think it’s odd that David and Siraj are in the top spots. Success breeds success on YouTube: the videos that are watched are more likely to be recommended, which makes them more likely to be watched…
Let’s plot some data
If you followed along with the first lines of code in this blog post then you have the plotting library Seaborn ready to go. Seaborn and Matplotlib are both fantastic Python visualization libraries; I personally enjoy Seaborn’s aesthetic and ease of use, so let’s go with that.
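One aside before plotting: if the chart doesn’t render in your notebook, the usual Jupyter magic (or an explicit show, in a plain script) takes care of it:

%matplotlib inline
# or, outside a notebook:
# import matplotlib.pyplot as plt
# plt.show()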
analysis['views_per_hundredthou'] = analysis['views'] / 100000
analysis = analysis[analysis['views_per_hundredthou'] > 1.]
Those two lines condense our dataset: we’re now dealing only with channels that have received over 100,000 views, and each unit on our graph will represent 100,000 views. Now let’s plot:
analysis = analysis.sort_values('views_per_hundredthou', ascending=False)
ax = sns.barplot(x="channel_title", y="views_per_hundredthou", data=analysis,
                 palette=sns.color_palette("BrBG", 12))
for item in ax.get_xticklabels():
    item.set_rotation(60)
ax.set_title('Data science Youtube stars')
ax.set(xlabel='Channel', ylabel='Hundred thousand views')
sns.despine()
The sort_values line orders the data so the plot is orderly. The next line uses Seaborn to construct the visual: the analysis dataframe is passed to “data”, with “channel_title” on the x-axis and “views_per_hundredthou” on the y-axis. I always spend a lot of time choosing colors for graphs, and this time was no different. Seaborn has an assortment of palettes to choose from (optional, of course); I landed on the brown-to-blue-green “BrBG”, from which I pulled 12 colors, one for each YouTube channel. Colors chosen, we rotate the x-axis labels so they don’t overlap. Finally, we write a professional-sounding title and display the chart:
That’s it! We’ve visualized YouTube statistics.
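If you’d like to keep the chart as an image file, the Axes that Seaborn returns can hand you its parent matplotlib figure, which knows how to save itself (the filename here is just an example):

fig = ax.get_figure()
fig.savefig('data_science_youtube_stars.png', bbox_inches='tight')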
Homework
There’s a lot more data here to mine and visualize. Personally, I’m interested in sifting through the video comments and seeing what lurks there. It would also be great to expand our results past 50; at a minimum, that will make the program take longer to finish, since we issue one videos().list call per search result.
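On that last point, the API caps maxResults at 50 per call, but each search response carries a nextPageToken you can feed into the next call. Here’s a minimal sketch of that loop, reusing the YOUTUBE global from earlier (the function name and page count are my own choices):

def search_pages(query, pages=3, order="relevance"):
    # Follow nextPageToken to collect up to pages * 50 search results
    items, token = [], None
    for _ in range(pages):
        params = dict(q=query, type="video", order=order,
                      maxResults=50, part="id,snippet")
        if token:
            params["pageToken"] = token
        response = YOUTUBE.search().list(**params).execute()
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if not token:  # no more pages to fetch
            break
    return items

Each of those items could then go through the same statistics lookup that youtube_search performs.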
Special thanks to
The groundwork for this blog post could not have been laid without the work of Sudharsan Asaithambi, especially his tutorial on manipulating YouTube data with Python.