I was recently sent an email about transforming tweet data and presenting it in a simple way to represent stats about tweets by a certain category. I thought I would share how to do this:
A tweet is basically composed of text, hashtags (text prefixed with #), mentions (text prefixed with @), and hyperlinks (text that follows some form of the pattern “http://_____.__”). We want to count these by some grouping – in this case we will group by user/character.
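To make those pieces concrete, here is a minimal sketch that counts them in a single tweet. The tweet is made up for illustration, and the URL pattern is a deliberately simple assumption rather than a full URL grammar.

```r
library(stringr)

# A made-up tweet, used only for illustration
tweet <- "Hey @elmo, C is for #cookie and #crumbs! http://example.com/cookies"

hashtags <- str_count(tweet, "#\\w+")         # text prefixed with #
mentions <- str_count(tweet, "@\\w+")         # text prefixed with @
urls     <- str_count(tweet, "https?://\\S+") # simple URL pattern (assumption)

c(hashtags = hashtags, mentions = mentions, urls = urls)
```

Since `\\w` already covers digits, `#\\w+` matches the same text as the longer `#(\\d|\\w)+` form used later on.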
I prepared a sample data set containing some made up tweets by Sesame Street characters. You can download it by clicking: Sesame Street Faux Tweets.
Fire up R then load up our tweets into a dataframe:
# Load tweets and convert to dataframe
tweets <- read.csv('sesamestreet.csv', stringsAsFactors=FALSE)
tweets <- as.data.frame(tweets)
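If you cannot get hold of the CSV, a small stand-in dataframe is enough to follow along. This is only a sketch: the column names `character` and `text` are taken from the code in this post, the first two tweets are the Big Bird examples quoted in the comments, and the third tweet is invented.

```r
# Hypothetical stand-in for sesamestreet.csv (not the real data set)
tweets <- data.frame(
  character = c("Big Bird", "Big Bird", "Cookie Monster"),
  text = c(
    "I miss @oscarthegrouch, he's a nice #guy",
    "The letter of the day is #B for #Bird #Big and #Banana",
    "Me want #cookie http://example.com/cookies"  # invented tweet
  ),
  stringsAsFactors = FALSE
)

# Inspect the structure before moving on
str(tweets)
```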
We will use 3 libraries: stringr for string manipulation, dplyr for chaining, and ggplot2 for some graphs.
# Libraries
library(dplyr)
library(ggplot2)
library(stringr)
We now want to create the summaries and store them in a list or dataframe of their own. We will use dplyr to do the grouping, and stringr with some regex to apply filters on our tweets. If you do not know what is in the tweets dataframe go ahead and run head(tweets) to get an idea before moving forward.
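gluten <- tweets %>%
  group_by(character) %>%
  summarise(total=length(text),                           # tweets per character
            hashtags=sum(str_count(text,"#(\\d|\\w)+")),  # text prefixed with #
            mentions=sum(str_count(text,"@(\\d|\\w)+")),  # text prefixed with @
            # URL-like text; the ^ anchor means at most one match per tweet
            urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)"))
  )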
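The code above starts with the variable we will store our summary in … I called it “gluten” for no particular reason. It then groups the tweets by character and, for each character, counts the tweets and sums the hashtag, mention, and URL matches across them.
Phew. Now let’s see what that outputs by typing “gluten” into the console: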
  character      total hashtags mentions urls
1 Big Bird           2        5        1    0
2 Cookie Monster     3        2        1    1
3 Earnie & Bert      4        0        4    0
Which is exactly what we would get if we opened up the CSV file and counted by hand.
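One thing worth checking by hand is the URL count. The sketch below runs the URL pattern from the summarise() call on two invented strings and compares it with a simpler, unanchored pattern. The simpler pattern is my own assumption for comparison, not something the original data was tested against.

```r
library(stringr)

# Pattern copied from the summarise() call in this post
post_pattern <- "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)"
# A simpler, unanchored alternative (assumption, for comparison only)
simple_pattern <- "https?://\\S+"

# Invented examples
two_links <- "See http://example.com/a and http://example.org/b"
no_links  <- "Cookies 24/7 please"

# Anchored pattern: at most one match per string, and any "/" can trigger it
str_count(two_links, post_pattern)
str_count(no_links, post_pattern)

# Unanchored pattern: counts each link separately
str_count(two_links, simple_pattern)
str_count(no_links, simple_pattern)
```

Because of the leading `^`, the post's pattern behaves more like "does this tweet look like it has a URL" than "how many URLs are there", so tweets with several links or with a stray slash may be worth a second look.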
We can now create simple plots using the following code:
# Plots
ggplot(data=gluten) +
  aes(x=reorder(character,-total), y=total) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Tweets") + xlab("Character") +
  ggtitle("Characters Ranked by Total Tweets")

ggplot(data=gluten) +
  aes(x=reorder(character,-hashtags), y=hashtags) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Hashtags") + xlab("Character") +
  ggtitle("Characters Ranked by Total Hash Tags")

ggplot(data=gluten) +
  aes(x=reorder(character,-mentions), y=mentions) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Mentions") + xlab("Character") +
  ggtitle("Characters Ranked by Total Mentions")

ggplot(data=gluten) +
  aes(x=reorder(character,-urls), y=urls) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("URLs") + xlab("Character") +
  ggtitle("Characters Ranked by Total URLs")
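Since the four plots differ only in the column they use, the repetition could be folded into a small helper. This is a rough sketch: `plot_ranked` is a hypothetical name of my own, and the example dataframe simply re-types the summary values printed above so the block runs on its own.

```r
library(ggplot2)

# Hypothetical helper: one ranked bar chart for a chosen count column
plot_ranked <- function(data, column, label) {
  data$value <- data[[column]]  # copy the chosen column under a fixed name
  ggplot(data, aes(x = reorder(character, -value), y = value)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("Character") +
    ylab(label) +
    ggtitle(paste("Characters Ranked by Total", label))
}

# Summary values copied from the console output shown earlier
gluten <- data.frame(
  character = c("Big Bird", "Cookie Monster", "Earnie & Bert"),
  total = c(2, 3, 4), hashtags = c(5, 2, 0),
  mentions = c(1, 1, 4), urls = c(0, 1, 0)
)

p <- plot_ranked(gluten, "hashtags", "Hashtags")
p
```

Calling `plot_ranked(gluten, "mentions", "Mentions")` and so on would then give the other charts.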
Admittedly they’re not the prettiest plots, I got lazy ^_^’
Enjoy! If you have any questions, leave a comment!!
Hi there, I really like your posts, they’re always insightful. I just wonder what you could do if you wanted to have each separate character’s text attached to your gluten data frame in the end. Is there a means to have a column of stacked text for each separate character inside the data frame? Basically you’d have for the first row (Big Bird, 2, 5, 1, 0, “I miss @oscarthegrouch, he’s a nice #guy” “The letter of the day is #B for #Bird #Big and #Banana”) where the text is stacked in the same entry.
thanks
Mikael
Hi Mikael,
Thanks!! Very kind of you.
Yes, in principle you can, by using the mutate() function from the dplyr package and the paste() function. I will try it out tonight and post an update soon 🙂 If you solve it before that, please do share!
Salem
Hey, thank you very much for your quick reply. I think I solved it, but I’d really like to get your angle anyway. Here’s what seems to work for me:
gluten <- tweets %>%
  group_by(character) %>%
  summarise(total=length(text),
            hashtags=sum(str_count(text,"#(\\d|\\w)+")),
            mentions=sum(str_count(text,"@(\\d|\\w)+")),
            urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")),
            text=toString(text)
  )
Mikael
Yup! That works :]
Here is a solution that appends the totals to each tweet row, which means you can pivot the result in Excel.
# %>% replaces the older %.% operator, which current dplyr releases no longer provide
gluten <- tweets %>%
  group_by(character,text) %>%
  mutate(total=length(text),
         hashtags=sum(str_count(text,"#(\\d|\\w)+")),
         mentions=sum(str_count(text,"@(\\d|\\w)+")),
         urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")))
Another solution is to show the totals for each character and list every tweet alongside those totals:
# Uses %>% in place of the older %.% chain operator
gluten <- tweets %>%
  group_by(character) %>%
  mutate(total=length(text),
         hashtags=sum(str_count(text,"#(\\d|\\w)+")),
         mentions=sum(str_count(text,"@(\\d|\\w)+")),
         urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")))
and then your solution works perfectly to put them in a list :]
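For anyone comparing the two approaches, here is a small sketch of the difference between summarise() and mutate() after group_by(). It assumes a tiny stand-in dataframe: the two Big Bird tweets are the ones quoted earlier in this thread, the Cookie Monster tweet is invented, and the names `per_character` and `per_tweet` are just for illustration.

```r
library(dplyr)
library(stringr)

# Stand-in data: two tweets quoted in this thread plus one invented tweet
tweets <- data.frame(
  character = c("Big Bird", "Big Bird", "Cookie Monster"),
  text = c(
    "I miss @oscarthegrouch, he's a nice #guy",
    "The letter of the day is #B for #Bird #Big and #Banana",
    "Me want #cookie http://example.com/cookies"  # invented tweet
  ),
  stringsAsFactors = FALSE
)

# summarise(): one row per character, with the tweets stacked into one string
per_character <- tweets %>%
  group_by(character) %>%
  summarise(total = n(),
            hashtags = sum(str_count(text, "#\\w+")),
            text = toString(text))

# mutate(): every tweet row is kept and the character totals are repeated
per_tweet <- tweets %>%
  group_by(character) %>%
  mutate(total = n(),
         hashtags = sum(str_count(text, "#\\w+"))) %>%
  ungroup()

per_character
per_tweet
```

The summarised version suits a compact table, while the mutate version keeps one row per tweet, which is the shape a pivot table wants.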
Hi! I was wondering if you could upload the code again, I can’t seem to download it… Thank you so much for the post!