Counting Tweets in R – Substrings, Chaining, and Grouping

Posted by Salem on February 8, 2015

I was recently sent an email asking how to transform tweet data and present simple stats about the tweets, grouped by a certain category. I thought I would share how to do this:

A tweet is basically composed of text, hashtags (text prefixed with #), mentions (text prefixed with @), and lastly hyperlinks (text that follows some form of the pattern “http://_____.__”). We want to count these by some grouping – in this case we will group by user/character.
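To make those pieces concrete, here is a minimal sketch that counts them in a single tweet. The tweet itself is invented for illustration, and the three patterns are simplified assumptions (a hashtag or mention is a # or @ followed by word characters, a hyperlink is http:// or https:// followed by non-space characters).

```r
library(stringr)

# A made-up tweet, invented purely for illustration
tweet <- "Me love #cookies and #milk! Thanks @bigbird http://example.com/recipe"

# "#" or "@" followed by one or more word characters (letters, digits, underscore)
n_hashtags <- str_count(tweet, "#\\w+")
n_mentions <- str_count(tweet, "@\\w+")

# Simplified URL pattern (an assumption): a scheme, then everything up to the next space
n_urls <- str_count(tweet, "https?://\\S+")

c(hashtags = n_hashtags, mentions = n_mentions, urls = n_urls)
```

str_count() returns one count per input string, which is what lets us sum the counts per group later on.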

I prepared a sample data set containing some made up tweets by Sesame Street characters. You can download it by clicking: Sesame Street Faux Tweets.

Fire up R then load up our tweets into a dataframe:

# Load tweets and convert to dataframe
tweets<-read.csv('sesamestreet.csv', stringsAsFactors=FALSE)
tweets <- as.data.frame(tweets)
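If the CSV is not at hand, a small stand-in can be read from text instead. This is only a sketch: the layout (a header row with the columns character and text) is an assumption based on how the columns are used in this post, the two rows are the Big Bird tweets quoted in the comments, and the name tweets_demo is made up so that it does not overwrite tweets.

```r
# Assumed CSV layout: a header row, then one quoted tweet per line
csv_lines <- c(
  'character,text',
  '"Big Bird","I miss @oscarthegrouch, he\'s a nice #guy"',
  '"Big Bird","The letter of the day is #B for #Bird #Big and #Banana"'
)

# read.csv() accepts the file contents directly through its text argument
tweets_demo <- read.csv(text = paste(csv_lines, collapse = "\n"),
                        stringsAsFactors = FALSE)

str(tweets_demo)   # structure: column names and types
head(tweets_demo)  # first few rows

# read.csv() already returns a data frame, so as.data.frame() changes nothing here
identical(as.data.frame(tweets_demo), tweets_demo)
```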

We will use 3 libraries: stringr for string manipulation, dplyr for chaining, and ggplot2 for some graphs.

# Libraries
library(dplyr)
library(ggplot2)
library(stringr)

We now want to create the summaries and store them in a list or dataframe of their own. We will use dplyr to do the grouping, and stringr with some regex to apply filters on our tweets. If you do not know what is in the tweets dataframe go ahead and run head(tweets) to get an idea before moving forward.

gluten <- tweets %>%
	group_by(character) %>%
	summarise(total=length(text), 
		    hashtags=sum(str_count(text,"#(\\d|\\w)+")),
		    mentions=sum(str_count(text,"@(\\d|\\w)+")),
		    urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)"))
	)

The code above starts with the variable we will store our list in … I called it “gluten” for no particular reason.

We will apply transformations to our tweets dataframe. The “%>%” operator (the pipe, which dplyr makes available) passes the result of each step into the next command.
We group by the first column “character” using the function group_by().
We then create summary stats by using the function summarise() – note the American spelling, summarize(), will work too 😛
We create a summary called total, which is equal to the number of tweets in each group (i.e. the length of the text column once it has been grouped).
We then count the hashtags by using the regex “#(\\d|\\w)+” and the function str_count() from the stringr package; sum() then adds up the per-tweet counts within each group. If this regex does not make sense, you can use many tools online to explain it.
We repeat the same step for mentions and urls.
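The steps above can be sketched on a tiny data frame. The two Big Bird tweets are the ones quoted in the comments; the Cookie Monster tweet is invented. The sketch also makes two substitutions that are not in the code above: n() instead of length(text), and simplified patterns (“#\\w+”, “@\\w+”, “https?://\\S+”). Since \\w already matches digits, “#\\w+” should match the same text as “#(\\d|\\w)+”.

```r
library(dplyr)
library(stringr)

# Stand-in data: two tweets quoted in the comments plus one invented tweet
demo <- data.frame(
  character = c("Big Bird", "Big Bird", "Cookie Monster"),
  text = c("I miss @oscarthegrouch, he's a nice #guy",
           "The letter of the day is #B for #Bird #Big and #Banana",
           "Me want #cookies now! http://example.com/cookies"),
  stringsAsFactors = FALSE
)

# Piped form: each step hands its result to the next one
piped <- demo %>%
  group_by(character) %>%
  summarise(total    = n(),                               # rows per group
            hashtags = sum(str_count(text, "#\\w+")),
            mentions = sum(str_count(text, "@\\w+")),
            urls     = sum(str_count(text, "https?://\\S+")))

# The same computation written as nested calls, without the pipe
nested <- summarise(group_by(demo, character),
                    total    = n(),
                    hashtags = sum(str_count(text, "#\\w+")),
                    mentions = sum(str_count(text, "@\\w+")),
                    urls     = sum(str_count(text, "https?://\\S+")))

all.equal(as.data.frame(piped), as.data.frame(nested))
```

One thing to keep in mind about the URL pattern used in the post: it is anchored with “^”, so str_count() can find it at most once per tweet. The unanchored pattern in this sketch counts every link in a tweet, which may or may not be what you want.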

Phew. Now let’s see what that outputs by typing “gluten” into the console:

       character total hashtags mentions urls
1       Big Bird     2        5        1    0
2 Cookie Monster     3        2        1    1
3  Earnie & Bert     4        0        4    0

These counts are exactly what we would get if we opened up the CSV file and tallied the tweets by hand.

We can now create simple plots using the following code:

 
# Plots 
# Each plot ranks the characters by one metric; the y-axis label names that metric
ggplot(data=gluten)+aes(x=reorder(character,-total),y=total)+geom_bar(stat="identity")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+ylab("Tweets")+xlab("Character")+ggtitle("Characters Ranked by Total Tweets")
ggplot(data=gluten)+aes(x=reorder(character,-hashtags),y=hashtags)+geom_bar(stat="identity")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+ylab("Hashtags")+xlab("Character")+ggtitle("Characters Ranked by Total Hashtags")
ggplot(data=gluten)+aes(x=reorder(character,-mentions),y=mentions)+geom_bar(stat="identity")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+ylab("Mentions")+xlab("Character")+ggtitle("Characters Ranked by Total Mentions")
ggplot(data=gluten)+aes(x=reorder(character,-urls),y=urls)+geom_bar(stat="identity")+theme(axis.text.x = element_text(angle = 90, hjust = 1))+ylab("URLs")+xlab("Character")+ggtitle("Characters Ranked by Total URLs")
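Since the four plots differ only in the column they use and in their labels, one way to cut the repetition is a small helper function. This is just a sketch: plot_metric() and gluten_demo are made-up names, and gluten_demo simply retypes the summary table printed earlier so the example runs on its own.

```r
library(ggplot2)

# Hypothetical helper (not part of the original code): one ranked bar chart per metric
plot_metric <- function(df, metric, label) {
  # Copy the chosen column into a fixed name so aes() can refer to it directly
  d <- data.frame(character = df$character, value = df[[metric]])
  ggplot(d, aes(x = reorder(character, -value), y = value)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("Character") +
    ylab(label) +
    ggtitle(paste("Characters Ranked by Total", label))
}

# The summary table from the console output above, retyped by hand
gluten_demo <- data.frame(
  character = c("Big Bird", "Cookie Monster", "Earnie & Bert"),
  total     = c(2, 3, 4),
  hashtags  = c(5, 2, 0),
  mentions  = c(1, 1, 4),
  urls      = c(0, 1, 0)
)

p <- plot_metric(gluten_demo, "hashtags", "Hashtags")
```

Calling print(p) draws the chart, and the other three plots follow the same call with "total", "mentions", or "urls" and a matching label.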

Admittedly they’re not the prettiest plots, I got lazy ^_^’

Enjoy! If you have any questions, leave a comment!!

Category: Code

Tags: dplyr, ggplot2, R, stringr

6 Comments on “Counting Tweets in R – Substrings, Chaining, and Grouping”

February 23, 2015 by Mikael Eskenazi
Hi there, I really like your posts, they’re always insightful. I just wonder what you could do if you wanted to have each separate character’s text attached to your gluten data frame in the end. Is there a way to have a column of stacked text for each separate character inside the data frame? Basically you’d have for the first row (Big Bird, 2, 5, 1, 0, “I miss @oscarthegrouch, he’s a nice #guy””The letter of the day is #B for #Bird #Big and #Banana”) where the text is stacked in the same entry.
thanks
Mikael

- February 23, 2015 by Salem
  Hi Mikael,
  
  Thanks!! Very kind of you.
  
  Yes in principle you can by using the mutate() function from the dplyr package and paste() function. I will try it out tonight and post an update soon 🙂 if you solve it before that please do share!
  
  Salem
  
  - February 24, 2015 by Mikael Eskenazi
    Hey, thank you very much for your quick reply. I think I figured it out, but I’d really like to get your angle anyway. Here’s what seems to work for me:
    
    gluten <- tweets %>%
    group_by(character) %>%
    summarise(total=length(text),
    hashtags=sum(str_count(text,"#(\\d|\\w)+")),
    mentions=sum(str_count(text,"@(\\d|\\w)+")),
    urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")),
    text=toString(text)
    )
    
    Mikael
    
    - February 24, 2015 by Salem
      Yup! That works :]
      
      Here is a solution that appends totals to each tweet row – which means you can pivot with Excel.
      
      gluten <- tweets %>%
      group_by(character,text) %>%
      mutate(total=length(text),
      hashtags=sum(str_count(text,"#(\\d|\\w)+")),
      mentions=sum(str_count(text,"@(\\d|\\w)+")),
      urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")))
      
      Another solution is to show totals for each character and list every tweet with those totals
      
      gluten <- tweets %>%
      group_by(character) %>%
      mutate(total=length(text),
      hashtags=sum(str_count(text,"#(\\d|\\w)+")),
      mentions=sum(str_count(text,"@(\\d|\\w)+")),
      urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")))
      
      and then your solution works perfectly to put them in a list :]
      
February 28, 2015 by เซซามิน
Wonderful goods from you, man. I have understand your stuff previous to and you
are just too fantastic. I actually like what you’ve acquired here, certainly like what
you are stating and the way in which you say it.

You make it enjoyable and you still care for to keep it smart.
I cant wait to read much more from you. This is really a tremendous web site.

June 30, 2016 by Tatiana
Hi! I was wondering if you could upload the code again, I can’t seem to download it… Thank you so much for the post!

Salem Marafi
