Transcript and code
YouTube is a phenomenon. Now more than ever, regular people can get online, upload a video, and find other people to watch it. It’s close to being a meritocracy, where individuals who can make and edit videos can upload them and score huge hits. Having recently started a channel, I’ve been interested in this ecosystem. In this video, I’m going to scrape some information about the largest YouTube channels and dive into the data. Come follow along.
OK, so this website, https://socialblade.com/youtube/, has information on the sizes of YouTube channels and users that I think is trustworthy. I’d like to grab this table here and bring it into pandas to do some analysis.
To start, I need to bring in the libraries I’m going to use. I’ll start with pandas as usual, then requests to grab the HTML from the URL, and BeautifulSoup to parse the HTML once I have it.
import pandas as pd
import requests
from bs4 import BeautifulSoup
Now I like to store the URL in a string. Then I’ll use the requests library to retrieve the HTML and take a look at it.
url = "http://socialblade.com/youtube/"
html = requests.get(url)
The html is a large document so I’ll just look at the first 500 characters.
print(html.text[:500])
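A request can fail or come back with an error page, so it can help to check before parsing. The following is a minimal sketch, not part of the original video: the helper name fetch_html and the 10 second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails.

    fetch_html and the default timeout are hypothetical choices for this sketch.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions instead of parsing an error page.
        response.raise_for_status()
    except requests.RequestException as exc:
        print("Request failed: {}".format(exc))
        return None
    return response.text


# Usage (needs network access):
# page = fetch_html("http://socialblade.com/youtube/")
# if page is not None:
#     print(page[:500])
```

Returning None keeps the rest of the script simple: it can skip parsing when nothing usable came back.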
Then I’ll send that HTML over to BeautifulSoup, because I don’t want to have to parse it directly myself with Python. I do that with the next line.
soup = BeautifulSoup(html.text, "html.parser")
Now we have to investigate the HTML and see how the data we’re looking to pull is structured, so we can figure out how to parse it. This takes some trial and error, and it’s typically the longest part of data analysis: understanding and then cleaning your data.
I use BeautifulSoup to pull all the HTML divs containing the table data I’m interested in. Then I iterate over those divs and convert the information into dictionaries. Remember, the end goal here is to put this YouTube data neatly into a pandas dataframe so we can analyze it. Pandas dataframes can be constructed easily from lists of dictionaries, and that is one of my favorite methods for getting data into a dataframe.
# Each "table-body" div holds one row of the table.
bodies = soup.find_all("div", {"class": "table-body"})

def prepare_table_row(row):
    # Keep the text of every child element, skipping bare newline nodes.
    lst = [i.text for i in row if i != u'\n']
    return dict(rank=int(lst[0]),
                grade=str(lst[1]),
                channel=str(lst[2]),
                videos=float(lst[3].replace(',', '')),
                subscribers=float(lst[4].replace(',', '')),
                views=float(lst[5].replace(',', '')))
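To see the row-to-dictionary idea without hitting the live site, here is a small sketch that runs on a made-up HTML fragment. SAMPLE_HTML, parse_count, and row_to_dict are hypothetical names, and the fragment only approximates the structure described above; the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one "table-body" div with six cells, as assumed above.
SAMPLE_HTML = """
<div class="table-body">
  <div>1</div>
  <div>A+</div>
  <div><a href="/youtube/user/example">Example Channel</a></div>
  <div>1,234</div>
  <div>5,678,900</div>
  <div>12,345,678,900</div>
</div>
"""


def parse_count(text):
    """Convert a string such as '5,678,900' into a float."""
    return float(text.strip().replace(",", ""))


def row_to_dict(row):
    """Turn one table row into a dictionary of cleaned values."""
    # Only direct child divs are cells; this skips whitespace-only text nodes.
    cells = [cell.get_text(strip=True)
             for cell in row.find_all("div", recursive=False)]
    link = row.find("a", href=True)
    return {
        "rank": int(cells[0]),
        "grade": cells[1],
        "channel": cells[2],
        "videos": parse_count(cells[3]),
        "subscribers": parse_count(cells[4]),
        "views": parse_count(cells[5]),
        "link": link["href"] if link else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [row_to_dict(r) for r in soup.find_all("div", {"class": "table-body"})]
print(rows[0])
```

Working against a fixed fragment like this makes it easier to debug the parsing step before pointing it at the real page.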
And now, because this function cleans each row of the raw data table, I can iterate over the raw data in the HTML and put it into a list.
# The order of data in this table is:
# channel rank, grade, channel, videos, subscribers, views
data = []
# For each table row in the body of table data
for tr in bodies:
    datum = prepare_table_row(tr)
    # Also let's grab the YouTube link
    for a in tr.find_all('a', href=True):
        datum['link'] = a['href']
    data.append(datum)
Now we can push all this data to a pandas dataframe.
df = pd.DataFrame(data)
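As a quick illustration of building a dataframe from a list of dictionaries, here is a sketch with invented numbers. The two channels and the views_per_video column are assumptions for the example, not data from the site.

```python
import pandas as pd

# Invented rows in the same shape as the scraped dictionaries.
rows = [
    {"rank": 1, "channel": "Channel A", "videos": 100.0,
     "subscribers": 5000.0, "views": 200000.0},
    {"rank": 2, "channel": "Channel B", "videos": 50.0,
     "subscribers": 8000.0, "views": 150000.0},
]

# Each dictionary becomes a row; the keys become the column names.
df = pd.DataFrame(rows)

# A derived column: average views per uploaded video.
df["views_per_video"] = df["views"] / df["videos"]

# Sort so the channel with the most subscribers comes first.
top = df.sort_values("subscribers", ascending=False)
print(top[["channel", "subscribers", "views_per_video"]])
```

Once the data is in this shape, sorting and derived columns are one-liners, which is why I like the list-of-dictionaries route.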
From here there’s a bunch of things you can do. What I’m going to do is import seaborn and visualize some of this information.
import seaborn as sns
ax = sns.barplot(x=df["subscribers"], y=df["channel"])
ax = sns.barplot(x=df["views"], y=df["channel"])
Finally, what’s interesting here is that I don’t really know many of these YouTube celebrities. Who are they and what do they do? Surely individuals with a significant number of viewers on YouTube should have their own Wikipedia pages, right? With that information, we can use one of my favorite libraries to learn a bit more about who these people are.
import wikipedia
print(wikipedia.search('JuegaGerman'))
print(wikipedia.summary('JuegaGerman', sentences=5))
Germán Alejandro Garmendia Aranis (Spanish pronunciation: [xerˈman aleˈxandɾo ɣarˈmendja aˈɾanis]; born 25 April 1990,, known by his YouTube channels HolaSoyGerman. and JuegaGerman, is a Chilean YouTuber, comedian, and writer. He has produced a variety of songs together with his band Ancud, all available on YouTube and Spotify. His book titled #Chupa el perro came out at multiple stores of Latin America and Spain on 28 April 2016. His channels are currently the 4th and 16th most subscribed on YouTube, having a combined total of over 60.8 million subscribers.
Now let’s learn about everyone in this list.
for index, row in df.iterrows():
    print(wikipedia.summary(row['channel'], sentences=5))
    print("\n")
Felix Arvid Ulf Kjellberg ( SHEL-burg; Swedish: [ˈfeːlɪks ²ɕɛlːˌbærj] ( listen); born 24 October 1989), known online as PewDiePie ( PEW-dee-py), is a Swedish YouTube personality, web-based comedian and video producer. He is known for his Let's Play commentaries and vlogs, as well as his following on YouTube. Born in Gothenburg, Sweden, PewDiePie originally pursued a degree in industrial economics and technology management at Chalmers University of Technology. In 2010, during his time at the university, he registered a YouTube account under the name PewDiePie. The following year, he dropped out of Chalmers after becoming disinterested with his degree field, much to the dismay of his parents.
T-Series is an Indian music company, founded by Gulshan Kumar in the 1980s. It is primarily known for Bollywood music soundtracks. It is also engaged in film production and distribution. In the 1990s, T-Series released many of the best-selling Bollywood soundtrack albums, including Nadeem–Shravan's Aashiqui (1990), the best-selling Indian soundtrack album of all time. Currently, the T-Series YouTube channel is the most-viewed YouTube channel in the world, with nearly 1.7 billion monthly views, as of December 2017.
Justin Drew Bieber (; born March 1, 1994) is a Canadian singer and songwriter. After a talent manager discovered him through his YouTube videos covering songs in 2008 and signed to RBMG, Bieber released his debut EP, My World, in late 2009. It was certified platinum in the U.S. He became the first artist to have seven songs from a debut record chart on the Billboard Hot 100. Bieber released his first full-length studio album, My World 2.0, in 2010. It debuted at or near number one in several countries, was certified triple platinum in the U.S., and contained his single "Baby".
Justin Drew Bieber (; born March 1, 1994) is a Canadian singer and songwriter. After a talent manager discovered him through his YouTube videos covering songs in 2008 and signed to RBMG, Bieber released his debut EP, My World, in late 2009. It was certified platinum in the U.S. He became the first artist to have seven songs from a debut record chart on the Billboard Hot 100. Bieber released his first full-length studio album, My World 2.0, in 2010. It debuted at or near number one in several countries, was certified triple platinum in the U.S., and contained his single "Baby".
Germán Alejandro Garmendia Aranis (Spanish pronunciation: [xerˈman aleˈxandɾo ɣarˈmendja aˈɾanis]; born 25 April 1990,, known by his YouTube channels HolaSoyGerman. and JuegaGerman, is a Chilean YouTuber, comedian, and writer. He has produced a variety of songs together with his band Ancud, all available on YouTube and Spotify. His book titled #Chupa el perro came out at multiple stores of Latin America and Spain on 28 April 2016. His channels are currently the 4th and 16th most subscribed on YouTube, having a combined total of over 60.8 million subscribers.
KondZilla, (stage name of Konrad Cunha Dantas) is a Brazilian music video producer, director and screenwriter. He has directed over 300 music videos.. He started his career by directing the live in Charlie Brown Jr.'s concert film "Música Popular Caiçara". KondZilla has worked with musical artists including Racionais MCs, Charlie Brown Jr, MC Guime, Preta Gil, Tati Zaqui, Karol Conká, Tropkillaz, Arnaldo Saccomani, MC Nego Blue, DJ Marlboro, MC Boy do Charmes, MC Pedrinho, MC Bola, Pikeno & Menor, Hungria Hip-Hop, among many others. One of his works, the video for “Tombei” by Karol Conká, was nominated for Best Music Video at the 2015 Multishow Awards.
Edward Christopher Sheeran, (; born 17 February 1991) is an English singer, songwriter, guitarist, record producer, and actor. Sheeran was born in Halifax, West Yorkshire, and raised in Framlingham, Suffolk. He attended the Academy of Contemporary Music in Guildford as an undergraduate from the age of 18 in 2009. In early 2011, Sheeran independently released the extended play, No. 5 Collaborations Project.
Robyn Rihanna Fenty (; 20 February 1988) is a Bajan-born singer, songwriter and actress. Born in Saint Michael, Barbados and raised in Bridgetown, during 2003 she recorded demo tapes under the direction of record producer Evan Rogers and signed a recording contract with Def Jam Recordings after auditioning for its then-president, hip hop producer and rapper Jay Z. In 2005, Rihanna rose to fame with the release of her debut studio album Music of the Sun and its follow-up A Girl like Me (2006), which charted on the top 10 of the US Billboard 200 and respectively produced the successful singles "Pon de Replay", "SOS" and "Unfaithful". She assumed creative control for her third studio album Good Girl Gone Bad (2007) and adopted a public image as a sex symbol, while reinventing her music. Its lead single "Umbrella" became an international breakthrough in her career, as she won her first Grammy Award at the 50th Annual Grammy Awards in 2008. After releasing four consecutive platinum studio albums, including the Grammy Award winner Unapologetic (2012), she was recognized as an icon of today's music.
Dude Perfect is an American sports entertainment group which routinely uploads new video content to various YouTube channels. The group consists of twins Coby and Cory Cotton, Garrett Hilbert, Cody Jones, and Tyler Toney, all of whom are former high school basketball players and college roommates at Texas A&M University. The members of the group hold several Guinness World Records. Their YouTube videos have garnered over 4.8 billion views total and their flagship channel, "Dude Perfect," has over 28 million subscribers as of March 2018. The channel is the 7th most subscribed channel overall and the most subscribed sports channel on YouTube.
Rubén Doblas Gundersen (Spanish: [ruˈβen ˈdoβlaz ˈɣundeɾsen]; born on 13 February 1990), better known by his pseudonym El Rubius or elrubiusOMG (), is a Spanish YouTube personality whose channel primarily consists of gameplays and vlogs. His channel currently has over 6.2 billion views and 28 million subscribers, making it the 7th most subscribed on YouTube, the second most subscribed channel in the Spanish language, and the most subscribed YouTube channel in Spain. A tweet from Rubius's official Twitter account was the most retweeted tweet in the world for the year 2016. The tweet was retweeted more than 1.3 million times.
...
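A lookup by channel name will not always find a page, and a single failed lookup would stop a loop like the one above. Here is a hedged sketch of a more forgiving loop. The names collect_summaries and fake_summarize are hypothetical, and the lookup function is passed in so the example runs without network access; with the wikipedia library you could pass something like lambda name: wikipedia.summary(name, sentences=5) instead.

```python
def collect_summaries(names, summarize):
    """Call summarize(name) for each name, recording failures instead of stopping.

    Returns (summaries, failed): a dict of name -> summary text, and a list
    of (name, exception class name) pairs for lookups that raised.
    """
    summaries = {}
    failed = []
    for name in names:
        try:
            summaries[name] = summarize(name)
        except Exception as exc:  # deliberately broad for this sketch
            failed.append((name, type(exc).__name__))
    return summaries, failed


def fake_summarize(name):
    """Stand-in for a real lookup; fails for one made-up channel name."""
    if name == "Unknown Channel":
        raise LookupError("no page found")
    return "Summary of " + name


summaries, failed = collect_summaries(["PewDiePie", "Unknown Channel"],
                                      fake_summarize)
print(summaries)
print(failed)
```

Keeping the failures in a list makes it easy to go back and search for those channels by hand afterwards.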
Well, that’s all for today. Thanks for stopping by and viewing this video. If there is other data you’d like me to look at, please leave a note in the comments below. Thanks!