Harvesting Tweets with R

Problem Definition

http://www.theispot.com/whatsnew/2012/2/brucie-rosch-twitter-data.htm

If you are a regular Twitter user, sooner or later you’ll want to collect tweets in bulk for one reason or another. I wanted to run some sentiment analysis on a large dataset, and Twitter seemed like the obvious source, but Twitter’s API only returns up to 1500 tweets at a time.

After looking around the forums I couldn’t find a reasonable solution. One way around the limit is to build up a database over time by storing the tweets you want locally. Since I was going to analyse the data in R, I wanted to collect it with R as well.

To do this, we can use the twitteR package to communicate with Twitter and save our data.

Once you have the script in place, you can schedule it with a cron job to run every hour, day, week, or at whatever interval you see fit.

Connecting to Twitter

We will be using three libraries in our R script. Let’s load them into our environment:

library(twitteR)
library(RCurl) 
library(ROAuth)

RCurl needs to know where to find a CA certificate bundle for SSL connections (especially if you are on Windows). We set that with the following line of code:

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
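If the certificate bundle cannot be found, the connection tends to fail later with a less obvious error, so it can help to check the path first. This is a minimal sketch, not part of the original script; the variable name cert_path is my own.

```r
# Look up the CA bundle that ships with RCurl.
# system.file() returns "" when the file (or the package) is missing.
cert_path <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

if (nzchar(cert_path) && file.exists(cert_path)) {
  message("Using CA bundle at: ", cert_path)
  options(RCurlOptions = list(cainfo = cert_path))
} else {
  message("cacert.pem not found; check that RCurl is installed")
}
```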

Let’s set up our API variables to connect to Twitter.

reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

You will need to get an API key and API secret from Twitter’s developer site, https://apps.twitter.com/app/new. Make sure you give read, write, and direct message permissions to your newly created application.

apiKey <- "YOUR-KEY"
apiSecret <- "YOUR-SECRET"
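If the script is going to run unattended, you may prefer not to hard-code the keys in it. One option, sketched here, is to read them from environment variables. The names TWITTER_API_KEY and TWITTER_API_SECRET are hypothetical; use whatever names you set in your own environment.

```r
# Read the keys from the environment, falling back to the placeholders.
# TWITTER_API_KEY / TWITTER_API_SECRET are made-up names for illustration.
apiKey    <- Sys.getenv("TWITTER_API_KEY", unset = "YOUR-KEY")
apiSecret <- Sys.getenv("TWITTER_API_SECRET", unset = "YOUR-SECRET")

if (identical(apiKey, "YOUR-KEY") || identical(apiSecret, "YOUR-SECRET")) {
  message("API key or secret not set; still using the placeholder values")
}
```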

Let’s put everything together and try to authorize our application.

twitCred <- OAuthFactory$new(
   consumerKey=apiKey,
   consumerSecret=apiSecret,
   requestURL=reqURL,
   accessURL=accessURL,
   authURL=authURL)

Now let’s connect by doing a handshake …

twitCred$handshake()

You will get a message to follow a link and get a confirmation code to input into your R console. The message will look like this:

To enable the connection, please direct your web browser to: http://api.twitter.com/oauth/authorize?oauth_token=4Ybnjkljkljlst5cvO5t3nqd8bhhGqTL3nQ When complete, record the PIN given to you and provide it here:

Now we can save our credentials for next time!

registerTwitterOAuth(twitCred)
save(list="twitCred", file="credentials")
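The saved file is what should let a scheduled run skip the PIN step: a later session can load() it instead of calling handshake() again. The sketch below shows only the save and load round trip, using a plain list as a stand-in for the real twitCred object so that it runs without a Twitter connection. Whether a restored credential is still accepted depends on your twitteR and ROAuth versions, so treat this as an assumption to test.

```r
# Stand-in for the real OAuth object (placeholder only, not a valid credential)
twitCred <- list(consumerKey = "YOUR-KEY", note = "placeholder object")

# Save it the same way the script does, here to a temporary location
cred_file <- file.path(tempdir(), "credentials")
save(list = "twitCred", file = cred_file)

# Simulate a fresh session: drop the object, then restore it from disk
rm(twitCred)
load(cred_file)  # recreates twitCred in the current environment

# In the real script the next step would be:
# registerTwitterOAuth(twitCred)
```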

Harvesting Tweets

Now that we are connected we can put in our queries:

  • We will use a comma separated string of key words we want to track.
  • Once we define our string of key words, we split the string and make a list.
  • We will also need a variable to hold our tweets.

I chose key words related to Kuwait.

query <- "kuwaiti,kuwait,#kuwait,#q8"
query <- unlist(strsplit(query,","))
tweets <- list()
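Note that strsplit() keeps any spaces around the commas, so a string typed as "kuwaiti, kuwait" would produce the keyword " kuwait" with a leading space. A small sketch of a more forgiving split follows; the names raw_query and keywords are my own.

```r
# A query string with stray spaces and a trailing comma
raw_query <- "kuwaiti, kuwait ,#kuwait,#q8,"

# Split on commas, trim whitespace, and drop empty entries
keywords <- trimws(unlist(strsplit(raw_query, ",")))
keywords <- keywords[nzchar(keywords)]

print(keywords)
```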

Now we are ready to ask Twitter for tweets matching our key words. The following block of code loops through our vector of key words; in this case it loops 4 times, once for each of our 4 key words.

We use twitteR’s searchTwitter() function, which takes the query as its first parameter. We also supply two additional parameters: n, the maximum number of tweets to return, and geocode, a latitude, longitude and radius (in our example, an 80 mile radius around Kuwait City).
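The geocode value is a single string in the form "latitude,longitude,radius". If you want to change the location or radius, it can be easier to build that string from separate values. A minimal sketch, with variable names of my own choosing:

```r
# Coordinates and radius used in this post (Kuwait City, 80 miles)
lat <- 29.3454657
lon <- 47.9969453
radius_miles <- 80

# Assemble the "lat,lon,radius" string that searchTwitter() expects
geocode <- sprintf("%.7f,%.7f,%dmi", lat, lon, as.integer(radius_miles))

print(geocode)
```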

for (i in seq_along(query)) {
  result <- searchTwitter(query[i], n = 1500, geocode = '29.3454657,47.9969453,80mi')
  tweets <- c(tweets, result)
  tweets <- unique(tweets)
}
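Because the same tweet can match more than one key word, the loop collects and de-duplicates as it goes. The idea can be sketched with plain lists standing in for tweets, matching on the id field. The stand-in records are invented for illustration; real searchTwitter() results are status objects, not lists.

```r
# Two made-up result batches that share one tweet (id "2")
batch_a <- list(list(id = "1", text = "first"),
                list(id = "2", text = "second"))
batch_b <- list(list(id = "2", text = "second"),
                list(id = "3", text = "third"))

# Combine the batches, then keep the first occurrence of each id
combined <- c(batch_a, batch_b)
ids <- vapply(combined, function(rec) rec$id, character(1))
deduped <- combined[!duplicated(ids)]

length(deduped)
```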

That’s it, we have our data. All that needs to be done now is save it.
write.csv() cannot append to an existing CSV file, and appending alone would not remove tweets we have already saved, so what we will do is:

  • Check if there is a file called tweets.csv and read the data in there
  • Merge the data already in the CSV file with our new data
  • Remove any duplicates in the data
  • Save the file as tweets.csv again

# Create a placeholder for the previously saved tweets
existing <- NULL

# Check if tweets.csv exists; read id as text so large tweet IDs are not rounded
if (file.exists("tweets.csv")) {
  existing <- read.csv("tweets.csv", colClasses = c(id = "character"))
}

# Merge the data in the file with our new tweets
df <- do.call("rbind", lapply(tweets, as.data.frame))
df <- rbind(df, existing)

# Remove duplicates
df <- df[!duplicated(df[c("id")]), ]

# Save
write.csv(df, file = "tweets.csv", row.names = FALSE)
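To see the read, merge, de-duplicate and save cycle without calling Twitter, here is a small sketch that runs it twice on made-up data frames and a temporary file. The helper name run_harvest and the two-column layout are assumptions for illustration; real tweet data frames have many more columns.

```r
# Temporary CSV so the sketch does not touch a real tweets.csv
csv_path <- file.path(tempdir(), "tweets_demo.csv")
if (file.exists(csv_path)) file.remove(csv_path)

# Hypothetical helper: merge new rows with the saved file and rewrite it
run_harvest <- function(new_rows, path) {
  existing <- NULL
  if (file.exists(path)) {
    # Read every column as text so ids are compared as strings
    existing <- read.csv(path, colClasses = "character")
  }
  merged <- rbind(new_rows, existing)
  merged <- merged[!duplicated(merged["id"]), ]
  write.csv(merged, file = path, row.names = FALSE)
  merged
}

# Two made-up harvests that overlap on id "2"
first  <- data.frame(id = c("1", "2"), text = c("a", "b"), stringsAsFactors = FALSE)
second <- data.frame(id = c("2", "3"), text = c("b", "c"), stringsAsFactors = FALSE)

run_harvest(first, csv_path)
result <- run_harvest(second, csv_path)
nrow(result)
```

After the second run the file holds each id once, even though id "2" appeared in both harvests.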

Code

For your convenience, the code in one block:

 
# Load libraries
library(twitteR)
library(RCurl) 
library(ROAuth)
 
# SSL Certificate
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
 
# API URLs
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
 
# API Keys from https://apps.twitter.com/app/new 
apiKey <- "YOUR-KEY"
apiSecret <- "YOUR-SECRET"
 
# Connect to Twitter to get credentials
 
twitCred <- OAuthFactory$new(
   consumerKey=apiKey,
   consumerSecret=apiSecret,
   requestURL=reqURL,
   accessURL=accessURL,
   authURL=authURL)
 
# Twitter Handshake - you will need to get the PIN after this
twitCred$handshake()
 
# Optionally save credentials for later
registerTwitterOAuth(twitCred)
save(list="twitCred", file="credentials")
 
 
# Set up the query
query <- "kuwaiti,kuwait,#kuwait,#q8"
query <- unlist(strsplit(query,","))
tweets <- list()

# Loop through the keywords and store results
for (i in seq_along(query)) {
  result <- searchTwitter(query[i], n = 1500, geocode = '29.3454657,47.9969453,80mi')
  tweets <- c(tweets, result)
  tweets <- unique(tweets)
}
 
# Create a placeholder for the previously saved tweets
existing <- NULL

# Check if tweets.csv exists; read id as text so large tweet IDs are not rounded
if (file.exists("tweets.csv")) {
  existing <- read.csv("tweets.csv", colClasses = c(id = "character"))
}

# Merge the data in the file with our new tweets
df <- do.call("rbind", lapply(tweets, as.data.frame))
df <- rbind(df, existing)

# Remove duplicates
df <- df[!duplicated(df[c("id")]), ]

# Save
write.csv(df, file = "tweets.csv", row.names = FALSE)
 
# Done!

7 Comments on “Harvesting Tweets with R”

  1.  by  Vasco

    Hello,

    I want to know if you have the same script but to use the REST API to take the last 7 days of tweets with keywords.

    Many thanks.

    Best Regards,
    Vasco

  2.  by  mubarak ahmad

    Hello sir,
    I am trying to run your code in R. When I enter the code
    “twitCred$handshake()” it gives the error “Error: Authorization Required”. Kindly help me solve this problem.
    Many thanks.

  3.  by  alejandro

    Hey, thanks for sharing. I’m not able to use a cron job because it requires manual authentication each time. Any clues?

  4.  by  N S Manjunath

    SM,
    Very lucid piece of writing! I enjoyed reading through your posts and I was able to reverse engineer a portion of your code for some Tweets analysis. Thank you so much for making it accessible to beginners of data analytics using R.
