Post Date Apr 28

Collaborative Filtering with Python

Collaborative Filtering

To start, I have to say that it is really heartwarming to get feedback from readers, so thank you for the engagement. This post is a response to a request that followed the collaborative filtering with R post.

The approach used in that post required the use of loops on several occasions.
Loops in R are infamous for being slow. In fact, it is probably best to avoid them altogether.
One way to avoid loops in R is not to use R (mind: #blow). We can use Python, which is flexible and performs better than R for this particular scenario.
For the record, I am still learning Python. This is the first script I have written in Python.

Refresher: The Last.FM dataset

The data set contains information about users, gender, age, and which artists they have listened to on Last.FM.
In our case we only use Germany’s data and transform the data into a frequency matrix.

We will use this to complete 2 types of collaborative filtering:

  • Item Based: uses similarities between items’ consumption histories
  • User Based: considers similarities between users’ consumption histories and item similarities

We begin by downloading our dataset:

Fire up your terminal and launch your favourite IDE. I use IPython and Notepad++.

Lets load the libraries we will use for this exercise (pandas and scipy)

# --- Import Libraries --- #
import pandas as pd
from scipy.spatial.distance import cosine

We then want to read our data file.

# --- Read Data --- #
data = pd.read_csv('data.csv')

If you want to check out the data set you can do so using data.head():

 
data.head(6).ix[:,2:8]
 
   abba  ac/dc  adam green  aerosmith  afi  air
0     0      0           0          0    0    0
1     0      0           1          0    0    0
2     0      0           0          0    0    0
3     0      0           0          0    0    0
4     0      0           0          0    0    0
5     0      0           0          0    0    0

Item Based Collaborative Filtering

Reminder: In item based collaborative filtering we do not care about the user column.
So we drop the user column (don’t worry, we’ll get them back later)

# --- Start Item Based Recommendations --- #
# Drop any column named "user"
data_germany = data.drop('user', 1)

Before we calculate our similarities we need a place to store them. We create a variable called data_ibs which is a Pandas Data Frame (… think of this as an excel table … but it’s vegan with super powers …)

# Create a placeholder dataframe listing item vs. item
data_ibs = pd.DataFrame(index=data_germany.columns,columns=data_germany.columns)

Now we can start filling in the similarities. We will use cosine similarity.
In R we had to write our own function to achieve this the way we wanted; in Python, the SciPy library has a function that does it without any customization.
In essence, the cosine similarity of two columns is the sum product of the two columns divided by the product of the square roots of each column’s sum of squares: cos(A, B) = sum(A*B) / (sqrt(sum(A^2)) * sqrt(sum(B^2))).

This is a fancy way of saying “loop through every pair of columns, and apply the function to each pair”.

# Lets fill in those empty spaces with cosine similarities
# Loop through the columns
for i in range(0,len(data_ibs.columns)) :
    # Loop through the columns for each column
    for j in range(0,len(data_ibs.columns)) :
      # Fill in placeholder with cosine similarities
      data_ibs.ix[i,j] = 1-cosine(data_germany.ix[:,i],data_germany.ix[:,j])

With our similarity matrix filled out, we can look for each item’s “neighbours” by looping through ‘data_ibs’, sorting each column in descending order, and grabbing the names of the top 10 most similar items.

# Create a placeholder for the closest neighbours to each item
data_neighbours = pd.DataFrame(index=data_ibs.columns,columns=range(1,11))
 
# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(0,len(data_ibs.columns)):
    data_neighbours.ix[i,:10] = data_ibs.ix[0:,i].order(ascending=False)[:10].index
 
# --- End Item Based Recommendations --- #

Done!

data_neighbours.head(6).ix[:6,2:4]
 
                                      2                3              4
a perfect circle                   tool            dredg       deftones
abba                            madonna  robbie williams  elvis presley
ac/dc             red hot chili peppers        metallica    iron maiden
adam green               the libertines      the strokes   babyshambles
aerosmith                            u2     led zeppelin      metallica
afi                funeral for a friend     rise against   fall out boy

User Based Collaborative Filtering

The process for creating a User Based recommendation system is as follows:

  • Have an Item Based similarity matrix at your disposal (we do…wohoo!)
  • Check which items the user has consumed
  • For each item the user has consumed, get the top X neighbours
  • Get the consumption record of the user for each neighbour.
  • Calculate a similarity score using some formula
  • Recommend the items with the highest score

Lets begin.

We first need a formula. We use the sum of the product of 2 vectors (lists, if you will) containing the purchase history and the item similarity figures. We then divide that figure by the sum of the similarities in the respective vector.
The function looks like this:

# --- Start User Based Recommendations --- #
 
# Helper function to get similarity scores
def getScore(history, similarities):
   return sum(history*similarities)/sum(similarities)

The rest is a matter of applying this function to the data frames in the right way.
We start by creating a variable to hold our similarity data.
This is basically the same as our original data but with nothing filled in except the headers.

# Create a place holder matrix for similarities, and fill in the user name column
data_sims = pd.DataFrame(index=data.index,columns=data.columns)
data_sims.ix[:,:1] = data.ix[:,:1]

We now loop through the rows and columns filling in empty spaces with similarity scores.

Note that we score items that the user has already consumed as 0, because there is no point recommending it again.

#Loop through all rows, skip the user column, and fill with similarity scores
for i in range(0,len(data_sims.index)):
    for j in range(1,len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]
 
        if data.ix[i][j] == 1:
            data_sims.ix[i][j] = 0
        else:
            product_top_names = data_neighbours.ix[product][1:10]
            product_top_sims = data_ibs.ix[product].order(ascending=False)[1:10]
            user_purchases = data_germany.ix[user,product_top_names]
 
            data_sims.ix[i][j] = getScore(user_purchases,product_top_sims)

We can now produce a matrix of User Based recommendations as follows:

# Get the top songs
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.ix[0:,0] = data_sims.ix[:,0]

Instead of having the matrix filled with similarity scores, however, it would be nice to see the song names.
This can be done with the following loop:

# Instead of top song scores, we want to see names
for i in range(0,len(data_sims.index)):
    data_recommend.ix[i,1:] = data_sims.ix[i,:].order(ascending=False).ix[1:7,].index.transpose()
# Print a sample
print data_recommend.ix[:10,:4]

Done! Happy recommending ;]

   user                      1                      2                3
0     1         flogging molly               coldplay        aerosmith
1    33  red hot chili peppers          kings of leon        peter fox
2    42                 oomph!            lacuna coil        rammstein
3    51            the subways              the kooks  franz ferdinand
4    62           jack johnson                incubus       mando diao
5    75             hoobastank             papa roach           sum 41
6   130      alanis morissette  the smashing pumpkins        pearl jam
7   141           machine head        sonic syndicate          caliban
8   144                editors              nada surf      the strokes
9   150                placebo            the subways     eric clapton
10  205             in extremo          nelly furtado        finntroll

Entire Code

 
# --- Import Libraries --- #
 
import pandas as pd
from scipy.spatial.distance import cosine
 
# --- Read Data --- #
data = pd.read_csv('data.csv')
 
# --- Start Item Based Recommendations --- #
# Drop any column named "user"
data_germany = data.drop('user', 1)
 
# Create a placeholder dataframe listing item vs. item
data_ibs = pd.DataFrame(index=data_germany.columns,columns=data_germany.columns)
 
# Lets fill in those empty spaces with cosine similarities
# Loop through the columns
for i in range(0,len(data_ibs.columns)) :
    # Loop through the columns for each column
    for j in range(0,len(data_ibs.columns)) :
      # Fill in placeholder with cosine similarities
      data_ibs.ix[i,j] = 1-cosine(data_germany.ix[:,i],data_germany.ix[:,j])
 
# Create a placeholder for the closest neighbours to each item
data_neighbours = pd.DataFrame(index=data_ibs.columns,columns=range(1,11))
 
# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(0,len(data_ibs.columns)):
    data_neighbours.ix[i,:10] = data_ibs.ix[0:,i].order(ascending=False)[:10].index
 
# --- End Item Based Recommendations --- #
 
# --- Start User Based Recommendations --- #
 
# Helper function to get similarity scores
def getScore(history, similarities):
   return sum(history*similarities)/sum(similarities)
 
# Create a place holder matrix for similarities, and fill in the user name column
data_sims = pd.DataFrame(index=data.index,columns=data.columns)
data_sims.ix[:,:1] = data.ix[:,:1]
 
#Loop through all rows, skip the user column, and fill with similarity scores
for i in range(0,len(data_sims.index)):
    for j in range(1,len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]
 
        if data.ix[i][j] == 1:
            data_sims.ix[i][j] = 0
        else:
            product_top_names = data_neighbours.ix[product][1:10]
            product_top_sims = data_ibs.ix[product].order(ascending=False)[1:10]
            user_purchases = data_germany.ix[user,product_top_names]
 
            data_sims.ix[i][j] = getScore(user_purchases,product_top_sims)
 
# Get the top songs
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.ix[0:,0] = data_sims.ix[:,0]
 
# Instead of top song scores, we want to see names
for i in range(0,len(data_sims.index)):
    data_recommend.ix[i,1:] = data_sims.ix[i,:].order(ascending=False).ix[1:7,].index.transpose()
 
# Print a sample
print data_recommend.ix[:10,:4]

Reference

Post Date Feb 8

Counting Tweets in R – Substrings, Chaining, and Grouping

Sesame Street - Computer Capers

I was recently sent an email about transforming tweet data and presenting it in a simple way to show stats about tweets by a certain category. I thought I would share how to do this:

A tweet is basically composed of text, hash tags (text prefixed with #), mentions (text prefixed with @), and lastly hyperlinks (text that follows some form of the pattern “http://_____.__”). We want to count these by some grouping – in this case we will group by user/character.

I prepared a sample data set containing some made up tweets by Sesame Street characters. You can download it by clicking: Sesame Street Faux Tweets.

Fire up R then load up our tweets into a dataframe:

# Load tweets and convert to dataframe
tweets<-read.csv('sesamestreet.csv', stringsAsFactors=FALSE)
tweets <- as.data.frame(tweets)

We will use 3 libraries: stringr for string manipulation, dplyr for chaining, and ggplot2 for some graphs.

# Libraries
library(dplyr)
library(ggplot2)
library(stringr)

We now want to create the summaries and store them in a list or dataframe of their own. We will use dplyr to do the grouping, and stringr with some regex to apply filters on our tweets. If you do not know what is in the tweets dataframe go ahead and run head(tweets) to get an idea before moving forward.

gluten <- tweets %>%
	group_by(character) %>%
	summarise(total=length(text), 
		    hashtags=sum(str_count(text,"#(\\d|\\w)+")),
		    mentions=sum(str_count(text,"@(\\d|\\w)+")),
		    urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)"))
	)

The code above starts with the variable we will store our list in … I called it “gluten” for no particular reason.

  1. We will apply transformations to our tweets dataframe. dplyr knows to step into the next command because we use the pipe operator “%>%” to indicate this.
  2. We group by the first column “character” using the function group_by().
  3. We then create summary stats by using the function summarise() – note the American spelling will work too 😛
  4. We create a summary called total which is equal to the number of tweets (i.e. length of the list that has been grouped)
  5. We then count the hashtags by using the regex expression “#(\\d|\\w)+” and the function str_count() from the stringr package (see the short example after this list). If this regex does not make sense, you can use many tools online to explain it.
  6. We repeat the same step for mentions and urls
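
A quick illustration of how str_count() behaves with the hashtag and mention patterns (the tweet text below is made up, not taken from the Sesame Street data set):

# str_count() returns the number of times a pattern matches in each string
library(stringr)
str_count("Me want #cookies #nomnom @elmo", "#(\\d|\\w)+")  # 2 hashtags
str_count("Me want #cookies #nomnom @elmo", "@(\\d|\\w)+")  # 1 mention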

Phew. Now lets see what that outputs by typing “gluten” into the console:

       character total hashtags mentions urls
1       Big Bird     2        5        1    0
2 Cookie Monster     3        2        1    1
3  Earnie & Bert     4        0        4    0

Which is exactly what we would see if we opened up the CSV file.

We can now create simple plots using the following code:

 
# Plots 
ggplot(data=gluten)+aes(x=reorder(character,-total),y=total)+geom_bar(stat="identity") +theme(axis.text.x = element_text(angle = 90, hjust = 1))+ylab("Tweets")+xlab("Character")+ggtitle("Characters Ranked by Total Tweets")
ggplot(data=gluten)+aes(x=reorder(character,-hashtags),y=hashtags)+geom_bar(stat="identity")  +theme(axis.text.x = element_text(angle = 90, hjust = 1))+ylab("Tweets")+xlab("Character")+ggtitle("Characters Ranked by Total Hash Tags")
ggplot(data=gluten)+aes(x=reorder(character,-mentions),y=mentions)+geom_bar(stat="identity")  +theme(axis.text.x = element_text(angle = 90, hjust = 1))+ylab("Tweets")+xlab("Character")+ggtitle("Characters Ranked by Total Mentions")
ggplot(data=gluten)+aes(x=reorder(character,-urls),y=urls)+geom_bar(stat="identity")  +theme(axis.text.x = element_text(angle = 90, hjust = 1))+ylab("Tweets")+xlab("Character")+ggtitle("Characters Ranked by Total URLs ")

Admittedly they’re not the prettiest plots, I got lazy ^_^’

Enjoy! If you have any questions, leave a comment!!

Post Date Jun 11

Why we die – Mortality in Kuwait with R

Patient Boogie

I came across Randy Olson’s (editor at @DataIsBeautiful) tweet about causes of death in the US.
I thought I would replicate the information here using R and localize it for Kuwait – nothing fancy.

I will keep the code to the end because I think the results are quite interesting. The data comes from the UN Data portal.

Here’s Randy Olson’s Tweet before we get going:

Causes of Death in the General Population

I always had the impression that the biggest issues with mortality in Kuwait were car accidents. Perhaps this is a bias introduced by media coverage. I always thought that if someone managed to survive their entire lives in Kuwait without dying in an accident, then only one other thing could get them ~ and that was cancer. This is not far from the truth:

General Causes of Death in Kuwait

What did catch me off guard is the number of people who die from circulatory system disease and heart disease. The numbers are not only large, they trump both the accident and cancer figures. Interestingly enough, respiratory system diseases start to show up in 2006, just as circulatory and pulmonary problems become more prevalent.

I thought that surely this is confined to a particular demographic group.
So I decided to split the data by gender and age.

Causes of Death by Gender

Looking at the gender differences, the first eye-popping fact is that fewer women seem to be dying … this is misleading because the population is generally biased towards men. There are about 9 men for every 5 women in Kuwait.

Cause of Death by Gender

The other eye-popping item is accidents. Fewer women pass away from accidents compared to men – far fewer! Is this indicative that women are safer drivers than their counterparts? Perhaps. In some nations this figure would indeed be zero because of social and legal constraints … it’s not necessarily good news … but it does stand out!

Proportionally there is a higher rate of mortality due to cancer in the female population vs. the male population.

Lastly, men seem to be more susceptible to death from heart diseases and circulatory system diseases. This might make you wonder why. Heart diseases and circulatory system diseases are exacerbated by sedentary lifestyles, poor diets, and other factors such as the consumption of tobacco. We have already looked at obesity in Kuwait … perhaps a deeper dive might shed some light on this matter.

For now, lets move on … surely younger people do not suffer from heart diseases …

Causes of Death by Age Group

This one confirms that if an accident does not get you before you’re 25 then the rest of the diseases are coming your way.
People fall victim to circulatory, respiratory, and heart diseases at extremely young ages. In fact, what we see here is that beyond the age of 40, the mortality rates for these three diseases look much the same across age groups.

On the other hand, accident mortalities go down as people shift to older age groups, displaced by cancer. What is terribly depressing about this graph is the number of people below the age of 19 who die in accidents. These might be just numbers, but in reality they are very real names to families.

Age

Take-aways

The graphs were just a fun way to play with R. What we can take away is that of the 5 main causes of mortality in Kuwait – cancer, heart diseases, circulatory diseases, respiratory diseases, and accidents – 4 are addressable through policy, regulation, and raising public awareness for social/behavioural impact.

Code

You can download the Kuwait dataset here or from the UN’s Data Portal.

Load up R and run the code below – fully commented for your geeky enjoyment:

# In Libraries we Trust
library(ggplot2)
library(dplyr)
 
# Read Data
data.mortality<-read.csv('kuwait.csv')
 
##################
#  Data Munging  #
##################
 
# Change Year into a categorical variable
data.mortality$Year<-as.factor(data.mortality$Year)
 
# Rename our columns
names(data.mortality)<-c("Country","Year","Area","Sex","Age","Cause","Record","Reliability","Source.Year","Value","Value.Notes")
 
# Data cleaning, here we groups some items together for simplification.
# Eg. All forms of neoplasms are grouped under "Cancer". 
 
data.mortality$Cause<-gsub(pattern=", ICD10","",x=data.mortality$Cause)
data.mortality[grep(pattern="eoplasm",x=data.mortality$Cause),]$Cause<-"Cancer"
data.mortality[grep(pattern="ccidents",x=data.mortality$Cause),]$Cause<-"Accidents"
data.mortality[grep(pattern="not elsewhere classified",x=data.mortality$Cause),]$Cause<-"Unknown"
data.mortality[grep(pattern="External causes",x=data.mortality$Cause),]$Cause<-"Unknown"
data.mortality[grep(pattern="Congenital malformations",x=data.mortality$Cause),]$Cause<-"Congenital malformations"
data.mortality[grep(pattern="Certain conditions originating in the perinatal period",x=data.mortality$Cause),]$Cause<-"Perinatal period conditions"
data.mortality[grep(pattern="Endocrine, nutritional and metabolic",x=data.mortality$Cause),]$Cause<-"Endocrine, Nutritional & Metabolic"
data.mortality[grep(pattern="Diseases of the respiratory",x=data.mortality$Cause),]$Cause<-"Respiratory Disease"
data.mortality[grep(pattern="Diseases of the circulatory system",x=data.mortality$Cause),]$Cause<-"Circulatory System"
data.mortality[grep(pattern="Hypertensive diseases",x=data.mortality$Cause),]$Cause<-"Cerebral"
data.mortality[grep(pattern="Ischaemic heart diseases",x=data.mortality$Cause),]$Cause<-"Heart Diseases"
data.mortality[grep(pattern="Cerebrovascular diseases",x=data.mortality$Cause),]$Cause<-"Cerebral"
 
 
#########################
# Data Transformations  #
#########################
 
# Use the dplyr library to group items from the original data set
# We want a general understanding of causes of death
# We subset out All causes to avoid duplication
# We subset out age groups that cause duplication
# We group by Country, Year and Cause
# We create a summary variable "Persons" that is the sum of the incidents
# We sort by Cause for pretty graphs
 
kw.general <- data.mortality %.%
  subset(!(Cause %in% "All causes")) %.%
  subset(Country %in% "Kuwait") %.%
  subset(!(Age %in% c("Total","Unknown","85 +","95 +","1","2","3","4","0"))) %.%
  group_by(Country) %.%
  group_by(Year) %.%
  group_by(Cause) %.%
  summarise(Persons=sum(Value)) %.%
  arrange(Cause) 
 
# We want an understanding of causes of death by age group
# We subset out All causes to avoid duplication
# We subset out age groups that cause duplication
# We group by Country, Year, Age and Cause
# We create a summary variable "Persons" that is the sum of the incidents
# We sort by Cause for pretty graphs
 
kw.age<-data.mortality %.%
  subset(!(Cause %in% "All causes")) %.%
  subset(Country %in% "Kuwait") %.%
  subset(!(Age %in% c("Total","Unknown","85 +","95 +","1","2","3","4","0"))) %.%
  group_by(Country) %.%
  group_by(Year) %.%
  group_by(Age) %.%
  group_by(Cause) %.%
  summarise(Persons=sum(Value)) %.%
  arrange(Cause)
 
# We reorder our age groups manually for pretty graphs
kw.age$Age<-(factor(kw.age$Age,levels(kw.age$Age)[c(1,2,6,9,12,3,15,4,5,7,8,10,11,13,14,16:28)]))
 
 
# We want an understanding of causes of death by gender
# We subset out All causes to avoid duplication
# We subset out age groups that cause duplication
# We group by Country, Year, Gender and Cause
# We create a summary variable "Persons" that is the sum of the incidents
# We sort by Cause for pretty graphs
 
kw.sex<-data.mortality %.%
  subset(!(Cause %in% "All causes")) %.%
  subset(Country %in% "Kuwait") %.%
  subset(!(Age %in% c("Total","Unknown","85 +","95 +","1","2","3","4","0"))) %.%
  group_by(Country) %.%
  group_by(Year) %.%
  group_by(Sex) %.%
  group_by(Cause) %.%
  summarise(Persons=sum(Value)) %.%
  arrange(Cause)
 
 
########################################
# Graphing, Plotting, Dunking Cookies  #
########################################
 
# We will limit our graphs by number of persons each incidents cause
# The main reason is because we do not want a graph that looks like a chocolate chip cookie
PersonLimit <- 200
 
# Plot the General data
General<-ggplot(subset(kw.general,Persons>=PersonLimit), aes(x = Year, y = Persons)) +
  geom_bar(aes(fill=Cause),stat='identity', position="stack")+
  ggtitle(paste("Causes of Death in Kuwait\n\n(Showing Causes Responsible for the Death of ",PersonLimit," Persons or More)\n\n"))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.text=element_text(size=14),
        axis.text=element_text(size=12),
        axis.title=element_text(size=12))+ 
  scale_fill_brewer( palette = "RdBu")+
  geom_text(aes(label = Persons,y=Persons,ymax=Persons), size = 3.5,  vjust = 1.5, position="stack",color=I("#000000")) 
 
 
# Reset the Person Limit
PersonLimit <- 150
 
# Plot the Gender data faceted by Gender
Gender<-ggplot(subset(kw.sex,Persons>=PersonLimit), aes(x = Year, y = Persons)) +
  geom_bar(aes(fill=Cause),stat='identity', position="stack")+
  facet_wrap(~Sex)+
  ggtitle(paste("Causes of Death in Kuwait by Gender\n\n(Showing Causes Responsible for the Death of ",PersonLimit," Persons or More)\n\n"))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.text=element_text(size=14),
        axis.text=element_text(size=12),
        axis.title=element_text(size=12)
  )+ 
  scale_fill_brewer( palette = "RdBu" )+
  geom_text(aes(label = Persons,y=Persons,ymax=Persons), size = 3.5,  vjust = 1.5, position="stack",color=I("#000000")) 
 
# Reset the Person Limit
PersonLimit <- 30
 
# Plot the Age group data faceted by Age
Age<-ggplot(subset(kw.age,Persons>=PersonLimit), aes(x = Year, y = Persons)) +
  geom_bar(aes(fill=Cause),stat='identity', position="stack")+
  facet_wrap(~Age)+
  ggtitle(paste("Causes of Death in Kuwait by Age Group\n\n(Showing Causes Responsible for the Death of ",PersonLimit," Persons or More)\n\n"))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.text=element_text(size=14),
        axis.text=element_text(size=12),
        axis.title=element_text(size=12)
  )+ 
  scale_fill_brewer( palette = "RdBu" ) 
 
# Save all three plots
ggsave(filename="General.png",plot=General,width=12,height=10)
ggsave(filename="Age.png",plot=Age,width=12,height=10)
ggsave(filename="Gender.png",plot=Gender,width=12,height=10)

Post Date Jun 10

Price Elasticity with R


Scenario

We covered Price Elasticity in an accompanying post. In this post we will look at how we can use this information to analyse our own product and cross product elasticity.

The scenario is as follows:

You are the owner of a corner mom and pop shop that sells eggs and cookies. You sometimes put a poster on your storefront advertising either your fresh farm eggs or your delicious chocolate chip cookies. You are particularly concerned with the sales of eggs – your beautiful farm chickens would be terribly sad if they knew that their eggs were not doing so well.

Over a one month period, you collect information on sales of your eggs and the different prices you set for your product.

Data Time

We are using a modified version of Data Apple‘s data set.

You can download the supermarket data set here. In it you will find:

  • Sales: the total eggs sold on the given day
  • Price Eggs: the price of the eggs on the day
  • Ad Type: the type of poster – 0 is the egg poster, 1 is the cookie poster.
  • Price Cookies: the price of cookies on the day

Lets fire up R and load up the data

# Load data and output summary stats
sales.data<-read.csv('supermarket.csv')
#Look at column names and their classes
sapply(sales.data,class)

The output shows each column and its type:

        Sales    Price.Eggs       Ad.Type Price.Cookies 
    "integer"     "numeric"     "integer"     "numeric"

Since Ad.Type is a categorical variable, lets go ahead and change that and output the summary statistics of our dataset.

# Change Ad Type to factor
sales.data$Ad.Type<-as.factor(sales.data$Ad.Type)
summary(sales.data)

From the results we find that:

  • Sales of eggs ranged between 18 and 46
  • Price of eggs ranged between $3.73 and $4.77
  • We showed the egg poster 15 times and the cookies poster 15 times
  • Price of cookies ranged between $4 and $4.81

     Sales        Price.Eggs   Ad.Type Price.Cookies 
 Min.   :18.0   Min.   :3.73   0:15    Min.   :4.00  
 1st Qu.:25.2   1st Qu.:4.35   1:15    1st Qu.:4.17  
 Median :28.5   Median :4.48           Median :4.33  
 Mean   :30.0   Mean   :4.43           Mean   :4.37  
 3rd Qu.:33.8   3rd Qu.:4.67           3rd Qu.:4.61  
 Max.   :46.0   Max.   :4.77           Max.   :4.81

Right now we want to see if we can predict the relationship between Sales of Eggs, and everything else.

Regression … the Granddaddy of Prediction

We now want to run a regression and then do some diagnostics on our model before getting to the good stuff.

We can run the entire regression or add each variable to see the impact on the regression model. Since we have few predictors lets choose the latter option for fun.

# Create models (mtable() comes from the memisc package)
library(memisc)
m1<-lm(formula=Sales~Price.Eggs,data=sales.data)
m2<-update(m1,.~.+Ad.Type)
m3<-update(m2,.~.+Price.Cookies)
mtable(m1,m2,m3)

The results are pasted below. We end up with a model “m3” that has statistically significant predictors. Our model is:

Sales of Eggs = 137.37 – (16.12)Price.Eggs + 4.15 (Ad.Type) – (8.71)Price.Cookies

We look at our R-squared and see that the regression explains 88.6% of the variance in the data. We also have a lower residual standard error (sigma = 2.611) than the other models we generated.

We can actually get better results by transforming our independent and dependent variables (e.g. LN(Sales)) but this will suffice for demonstrating how we can use regressions to calculate price elasticity.

Calls:
m1: lm(formula = Sales ~ Price.Eggs, data = sales.data)
m2: lm(formula = Sales ~ Price.Eggs + Ad.Type, data = sales.data)
m3: lm(formula = Sales ~ Price.Eggs + Ad.Type + Price.Cookies, data = sales.data)
 
------------------------------------------------
                    m1         m2         m3    
------------------------------------------------
(Intercept)     115.366*** 101.571*** 137.370***
Price.Eggs      -19.286*** -16.643*** -16.118***
Ad.Type: 1/0                 4.195**    4.147***
Price.Cookies                          -8.711***
------------------------------------------------
R-squared          0.722      0.793      0.886  
adj. R-squared     0.712      0.778      0.872  
sigma              3.924      3.444      2.611  
...  
------------------------------------------------

Diagnostics

We need to test for the following assumptions whenever we do regression analysis:

1. The relationship is linear
2. The errors have the same variance
3. The errors are independent of each other
4. The errors are normally distributed

First, we can address some of these points by creating plots of the model in R.

linearity

The plot on the left shows that the residuals (errors) have no pattern in variance (Good!).

The red line is concerning because it shows some curvature, indicating that perhaps the relationship is not entirely linear (hmmm…).

On the right we see that the errors are acceptably normally distributed (they are around the straight line … Good!).

To generate this plot you can use the R code below:

# Linearity Plots
par(mfrow=c(1,2))
plot(m3)
 
# Reset grid
par(mfrow=c(1,1))

So we want to look deeper into the linearity issues. We will look at multicollinearity first:

# Multi-collinearity
library(car)
vif(m3) # variance inflation factors 
sqrt(vif(m3)) > 2 # problem?

The code above will show if any of the variables have multicollinearity issues that could undermine the model’s integrity. Generally we want the square root of the VIF to be less than 2; our values are around 1, so we are good on this front.

   Price.Eggs       Ad.Type Price.Cookies 
     1.195107      1.189436      1.006566

We can generate component + residual plots (partial residual plots) to assess non-linearity:

# Diagnosis: Nonlinearity
crPlots(m3)

We see that there are definitely some issues with linearity, but not to an extent that is a cause for concern for the purpose of this demonstration. So we keep calm and move on.

Lastly, we want to test the independence of the residuals using the Durbin-Watson test:

# Diagnosis: Non-independence of Errors 
durbinWatsonTest(m3)

The output shows that there are no autocorrelation issues in the model:

 lag Autocorrelation D-W Statistic p-value
   1      0.06348248      1.792429   0.458
 Alternative hypothesis: rho != 0

So we are clear to move forward with Price Elasticity and Cross Product Price Elasticity!

Price Elasticity

We now have our model:
Sales of Eggs = 137.37 – (16.12)Price.Eggs + 4.15 (Ad.Type) – (8.71)Price.Cookies

Own Price Elasticity

To calculate Price Elasticity of Demand we use the formula:
PE = (ΔQ/ΔP) * (P/Q)

(ΔQ/ΔP) is determined by the coefficient -16.12 in our regression formula.
To determine (P/Q) we will use the mean Price (4.43) and mean Sales (30).

Therefore we have PE = -16.12 * 4.43/30 = -2.38

This means that a 1% increase in the price of eggs will decrease egg sales by about 2.38%.

Cross Price Elasticity

To calculate Cross Price Elasticity of Demand we are essentially looking for how the price of cookies impacts the sales of eggs. So we use the formula:

CPEcookies = (ΔQ/ΔPcookies) * (Pcookies/Q)

We know from our regression that (ΔQ/ΔPcookies) is the coefficient of Price of Cookies (-8.71).
We use the mean price of cookies and mean sales for the rest of the formula giving (4.37/30)

CPEcookies = -8.71 * (4.37/30) = -1.27

This means that a 1% increase in the price of cookies will decrease the sales of eggs by about 1.27%.
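
As a quick sanity check, here is the same arithmetic in R, using the rounded coefficients and means quoted above (the full code listing at the end of the post computes the exact values from the model object):

# Quick sanity check with the rounded figures quoted above
-16.12 * (4.43 / 30)   # own-price elasticity of eggs, about -2.38
-8.71  * (4.37 / 30)   # cross-price elasticity with respect to cookies, about -1.27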

Interpretation

We now know that eggs and cookies are complementary goods in this scenario. Since you only sell two products, one explanation could be that people who come in for cookies and eggs would rather get both elsewhere if the price is too high.

Also, it means that if you had to choose between a price cut on cookies or eggs, go with cookies!

Next steps

You are now in an ideal situation where you can run an optimization function to set the right price for both cookies and eggs.
This is out of the scope of this post, but if you’re interested in doing that check out R’s optim() function ~ or leave a comment below 😛
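
For the curious, here is a minimal sketch of what such an optimisation could look like. It is not part of the original analysis: it assumes the m3 model and sales.data from above, maximises predicted egg revenue only, fixes the poster to the egg advert, and constrains both prices to the ranges observed in the data.

# A minimal sketch (not part of the original analysis): find the egg and
# cookie prices that maximise predicted egg revenue from m3, with the egg
# poster showing and both prices kept within their observed ranges.
egg.revenue <- function(p) {
  newdata <- data.frame(Price.Eggs = p[1],
                        Ad.Type = factor(0, levels = c(0, 1)),
                        Price.Cookies = p[2])
  -(p[1] * predict(m3, newdata))  # negative because optim() minimises
}
optim(par = c(4.43, 4.37), fn = egg.revenue, method = "L-BFGS-B",
      lower = c(3.73, 4.00), upper = c(4.77, 4.81))$par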

Bonus

Can you figure out what to do with the Ads?

I’ve included some ideas in the complete code below:

###################################
#   Warm-up ... data and stuff    #
###################################
 
# Load data and output summary stats
sales.data<-read.csv('supermarket.csv')
 
#Look at column names and their classes
sapply(sales.data,class)
 
# Change Ad.Type to factor and print Summary Stats
sales.data$Ad.Type<-as.factor(sales.data$Ad.Type)
summary(sales.data)
 
####################
#  Create Models   #
####################
 
# Load required library
library(memisc) 
 
# Create models
m1<-lm(formula=Sales~Price.Eggs,data=sales.data)
m2<-update(m1,.~.+Ad.Type)
m3<-update(m2,.~.+Price.Cookies)
mtable(m1,m2,m3)
 
####################
#   DIAGNOSTICS    #
####################
 
# Linearity Plots
par(mfrow=c(1,2))
plot(m3)
par(mfrow=c(1,1))
 
# Multi-collinearity
library(car)
vif(m3) # variance inflation factors 
sqrt(vif(m3)) > 2 # problem?
 
# Diagnosis: Nonlinearity
crPlots(m3)
 
# Diagnosis: Non-independence of Errors 
# We want a D-W Statistic close to 2
durbinWatsonTest(m3)
 
#########################
#   Price Elasticity    #
#########################
 
# Calculate Price Elasticity
PE<-as.numeric(m3$coefficients["Price.Eggs"] * mean(sales.data$Price.Eggs)/mean(sales.data$Sales))
CPEcookies<-as.numeric(m3$coefficients["Price.Cookies"] * mean(sales.data$Price.Cookies)/mean(sales.data$Sales))
 
# Print Results 
PE
CPEcookies
 
 
####################################
#   BONUS - What about the ads?    #
####################################
 
# Subset the data
sales.adEggs <- subset(sales.data,Ad.Type==0)
sales.adCookies <- subset(sales.data,Ad.Type==1)
 
# Diagnostic on subsets' means and if they are different ... they are. 
wilcox.test(x=sales.adCookies$Sales,y=sales.adEggs$Sales,exact=F,paired=T)
 
# On average, does the advert for eggs generate higher sales for eggs?
mean(sales.adEggs$Sales) >= mean(sales.adCookies$Sales)

Post Date May 31

Overweight in Kuwait – Food Supply with R

Cookie Monster Food Pyramid

Disclaimer

I cannot say this enough … I have no idea about nutrition. In fact, if one were to ask me for a dietary plan it would consist of 4 portions of cookies and 20 portions of coffee. The reason I am doing this is to practice and demonstrate how to do some analysis with R on interesting issues.

The Problem: Food! (…delicious)

We already looked at the obesity in Kuwait using BMI data. What we found was that indeed, irrespective of gender, Kuwait ranks as one of the top countries with a high proportion of the population deemed obese (BMI >= 30).

The question we did not address is what might be the cause of this obesity. Following in the paths of the WHO’s report on “Diet, food supply, and obesity in the Pacific” (download here) we hope to emulate a very small subset of correlations between the dietary patterns and the prevalence of obesity in Kuwait.

Since this is only a blog post, we will only focus on the macro-nutrient level. Nevertheless, after completing this level we will be well equipped to dive into details with R (feel free to contact me if you want to do this!)

Conclusions First! … Plotting and Stuff

Lets start by plotting a graph to see if we notice any patterns (click it for a much larger image).

Food Supply - Kuwait vs World

What does this graph tell us?

  • The black line represents Kuwait figures.
  • The red dots represent the average around the world
  • The blue bars span the 20th to 80th percentiles of the observations, with the medians marked by a blue diamond.

Lets make some observations about the plot. For brevity we will only highlight things relevant to obesity being mindful of:

  • the fact that I know nothing about nutrition
  • that there are other interesting things going on in the plot.

Cereals

Cereal supply has been increasing since 1993. This means that the average Kuwaiti consumes more cereals every day than 80% of people around the world. The implication, as explained in Cordain (1999), is that people who consume high amounts of cereals are “affected by disease and dysfunction directly attributable to the consumption of these foods”.

Fruits

Fruits & Starchy Roots both show trends that are below average. Both are important sources of fiber. Fiber helps slim you down, but is also important for the prevention of other types of diseases, such as one prevalent in Kuwaiti males: colorectal cancer.

The consumption of vegetables perhaps serves as a balancing factor for this dietary nutrient: the average Kuwaiti consumes more vegetables than 80% of the people around the world.

Veg Oils

Vegetable Oils is my favourite … the only thing that stopped the rise of vegetable oil supply in Kuwait was the tragic war of 1990. In 2009 the food supply per capita in Kuwait was well over the 80th percentile for the rest of the world. Vegetable oils not only cause but actively promote obesity. We don’t love Mazola so much anymore now, do we?

Sugars

Sugars & Sweeteners have already made an appearance in another post, and here we have more data. The findings are similar: the average Kuwaiti consumes much more sugar than the average person around the world. This not only contributes to obesity but also to other diseases such as diabetes.

Data Gathering

So lets get going … where’s the data at?!

The Food and Agriculture Organization of the United Nations has a pretty awesome web portal that contains a lot of rich data. We are interested in data sets concerned with “Food Supply”. Particularly we want to look at:

… because there are only two types of food in the world: Vegetable and Not Vegetable … (cookies are vegetables)

So lets go ahead and download data sets that we want. You will find the selection for Food Supply of Crops. You will want to make sure all fields match:

  • Select all countries.
  • Use aggregated items
  • Leave out grand totals from the aggregated items
  • Choose the years you are interested in; I chose 1980 to 2009.
  • All checkboxes checked
  • Drop-down: output table, no thousands separator, no separator and 2 places for decimals.

Crops

Click download and you’re set for Crops!
Now repeat for the Food Supply of Livestock.

Making CSV Files

Now we have 2 XLS files that we can work with. The irony is that these are not Excel files at all … they are HTML files.
So the first thing we need to do is rename the 2 downloaded files to “first.html” and “second.html” – it does not matter which one is first or second.

You will notice that the files are huge. So expect the next step to take time.

We will be using the XML library to read the files: first.html and second.html
We know that the files do not contain any headers and we do not want any white spaces.
We will make sure that the data is read into a data frame so we can work with data frame functions.

# Load libraries
library(XML)
 
# Read HTML Tables 
first<-readHTMLTable('first.html',header=F,trim=T,as.data.frame=T)
second<-readHTMLTable('second.html',header=F,trim=T,as.data.frame=T)
 
# Make sure that the data is of class data.frame
first<-as.data.frame(first)
second<-as.data.frame(second)

We now have our data frames and if we look into the data we can figure out what our headers should be.
So lets rename the headers now:

# Make header names
headers<-c("Domain","Country","CountryCode","Item","ItemCode","Element","ElementCode","Year","Unit","Value","Flag","FlagDescription")
names(first)<-headers
names(second)<-headers

Great! Now we can finish up by writing CSV files that are smaller in size and faster to read the next time we want to run the analysis:

# Write the CSV file for future use
write.csv(first,'first.csv')
write.csv(second,'second.csv')

Data Munging

We are now ready to play with the data. We will be using 2 libraries: dplyr to manage the data and ggplot2 for plotting. Load them up:

# Load libraries 
library(dplyr)
library(ggplot2)

Lets pretend we are starting from scratch and want to read in the data from the CSV files we created.
You will notice read.csv adds a column so we will also need to rename our headers.

# Read files
first<-read.csv('first.csv')
second<-read.csv('second.csv')
 
# Set header names
headers<-c("ID","Domain","Country","CountryCode","Item","ItemCode","Element","ElementCode","Year","Unit","Value","Flag","FlagDescription")
names(first)<-headers
names(second)<-headers

We now want to combine the Livestock and Crops data in the two files. This can be easily done with the rbind() function:

data<-rbind(first,second)

Great now we need to stop and think about what we want to do with this data.

We want to compare Kuwait’s nutritional information with some sort of summary data about the rest of the world. We can therefore break this up into 2 parts: Kuwait and Other.

Let us deal with Kuwait first and extract the calculated data for grams consumed per capita per day. If you look into the dataset you will know which filters to apply. We will apply those filters using the subset function:

# Extract Kuwait Data
data.kuwait<-subset(data,data$Country %in% "Kuwait" & 
                      data$"FlagDescription" %in% "Calculated data" &
                      data$"ElementCode" %in% "646")

Now we want to extract the same information for every other country except Kuwait. That is easy enough: just copy and paste the code from above and negate the Kuwait filter:

# Extract Other Countries data
data.other<-subset(data,!(data$Country %in% "Kuwait") & 
                      data$"FlagDescription" %in% "Calculated data" &
                      data$"ElementCode" %in% "646")

We need to do a bit more work with the other countries’ data. We said we want summary information and right now we only have raw data for each country. We will use the chaining mechanism in ‘dplyr’ to create summary data such as the mean, median, and upper/lower quantiles. We will group the data by Year and we will then group it by Item (Nutrient).

# Create summary data
data.other.summary<- data.other %.%  
  group_by(Year) %.%
  group_by(Item) %.%
  summarise(mean=mean(Value),
            median=median(Value),
            lowerbound=quantile(Value, probs=.20),
            upperbound=quantile(Value, probs=.80))

This is the same method we used when looking at sugar grams consumed per capita per day in an earlier post.

Great, our data is ready for some exploration.

GG Plotting and Stuff!

The ggplot code looks like a big ugly hairball … but I will explain it line by line.

# Plot
ggplot(data = data.other.summary, aes(x = Year, y=median))+
  geom_errorbar(aes(ymin = lowerbound, ymax = upperbound),colour = 'blue', width = 0.4) +
  stat_summary(fun.y=median, geom='point', shape=5, size=3, color='blue')+
  geom_line(data=data.kuwait, aes(Year, Value))+
  geom_point(data=data.other.summary, aes(Year, mean),color=I("red"))+
  ggtitle('Supply (g/capita/day) - Kuwait vs. World')+
  xlab('Year') + ylab('g/capita/day')+
  facet_wrap(~Item,scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  • Line 1: We start by creating a plot using the data set for the rest of the world and plot median values by year.
  • Line 2: We then overlay the blue bars by passing in our summary stats lowerbound and upperbound calculated in the previous step.
  • Line 3: We then pass the median values and set the shape to 5 – diamond.
  • Line 4: We plot the black line using the Kuwait data set’s values by year.
  • Line 5: We plot the averages in red from the rest of the world data set.
  • Line 6: We set the title of the graph
  • Line 7: We set the labels of the axes
  • Line 8: We use the facet_wrap() function to tell GG Plot that we want one graph per Item in our data set. We set the scales to “free” so that we get visible graphs (Eggs weigh less than Chicken!)
  • Line 9: We set our theme elements – I just changed the angle of the X axis items

Entire Code

# Load libraries
library(XML)
 
# Read HTML Tables 
first<-readHTMLTable('first.html',header=F,trim=T,as.data.frame=T)
second<-readHTMLTable('second.html',header=F,trim=T,as.data.frame=T)
 
# Make sure that the data is of class data.frame
first<-as.data.frame(first)
second<-as.data.frame(second)
 
# Make header names
headers<-c("Domain","Country","CountryCode","Item","ItemCode","Element","ElementCode","Year","Unit","Value","Flag","FlagDescription")
names(first)<-headers
names(second)<-headers
 
# Write the CSV file for future use
write.csv(first,'first.csv')
write.csv(second,'second.csv')
 
 
# Load libraries 
library(dplyr)
library(ggplot2)
 
# Read files
first<-read.csv('first.csv')
second<-read.csv('second.csv')
 
# Set headers for each data frame
headers<-c("ID","Domain","Country","CountryCode","Item","ItemCode","Element","ElementCode","Year","Unit","Value","Flag","FlagDescription")
names(first)<-headers
names(second)<-headers
 
# Combine data frames
data<-rbind(first,second)
 
# Check the Element Codes 
print(unique(data[,c("Element","ElementCode")]))
 
# Extract Kuwait Data
data.kuwait<-subset(data,data$Country %in% "Kuwait" & 
                      data$"FlagDescription" %in% "Calculated data" &
                      data$"ElementCode" %in% "646") 
 
# Extract Other Countries Data
data.other<-subset(data,!(data$Country %in% "Kuwait") & 
                      data$"FlagDescription" %in% "Calculated data" &
                      data$"ElementCode" %in% "646") 
 
# Create summary data
data.other.summary<- data.other %.%  
  group_by(Year) %.%
  group_by(Item) %.%
  summarise(mean=mean(Value),
            median=median(Value),
            lowerbound=quantile(Value, probs=.20),
            upperbound=quantile(Value, probs=.80))
 
# Plotting
ggplot(data = data.other.summary, aes(x = Year, y=median))+
  geom_errorbar(aes(ymin = lowerbound, ymax = upperbound),colour = 'blue', width = 0.4) +
  stat_summary(fun.y=median, geom='point', shape=5, size=3, color='blue')+
  geom_line(data=data.kuwait, aes(Year, Value))+
  geom_point(data=data.other.summary, aes(Year, mean),color=I("red"))+
  ggtitle(paste(' Supply (g/capita/day) - Kuwait vs. World'))+
  xlab('Year') + ylab('g/capita/day')+facet_wrap(~Item,scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Post Date May 29

Overweight in Kuwait – BMI with R


Inspiration

Let me open with this: Faisal Al-Basri brings a lot of laughter into my life. It’s nothing short of a pleasure watching the political, social, and general satire on his instagram account. One particular post recently caught my attention – entitled: “Kuwaiti Women 2nd most obese world wide”.

In his own words:

(Rough) translation: “You have embarrassed us in front of the world. They came to take a survey, *gasp* suck your bellies in! Baby showers, weddings, graduations, divorces … guests compete on who can make yummier cakes … and in the end it’s just biscuits with double cream on top! …”

Sarcasm aside, he does a much better report than some of the newspapers in Kuwait seem to – here’s all the Arab Times had to say about this.

Well … that was insightful! For a country that is spending billions on a healthcare budget, you would think a story like this might get a little bit more love. Faisal puts some sort of analysis in place; his hypothesis is that we lead a sedentary life style with a poor diet … he’s probably right! … but we will get to that in a different study.

The Study

The study comes from a visualisation that can be found on DailyMail.co.uk (and conveniently pasted right here!)

BMI

There are several concerns that arise when you look at this visualisation:

The first is that the indicator of obesity is the BMI, and there are several criticisms of that indicator. I know nothing about nutrition so I will stop right here and move on.

The second is that it uses the mean BMI for each country. “So what?” might the average Joe ask. Well, the problem is that the arithmetic mean is a measure of central tendency that can be misleading. If there are more people on one end of the distribution, the mean is skewed … and therefore another measure of central tendency might be better (like the median). Moreover, the mean is sensitive to population size … so a small number of outliers (whales in our case) will sway the mean far more in a country with a small population than in a larger one.
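
A toy example in R makes the point (the numbers here are made up):

# One heavy outlier drags the mean upwards but barely moves the median
bmi <- c(22, 24, 25, 26, 45)
mean(bmi)    # 28.4
median(bmi)  # 25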

Lets do the numbers

So instead of bemoaning the beautiful infographics and extensive studies … lets take a look at the figures ourselves … use R!

We can find some information about BMI by gender in the World Health Organization’s database. The data sets used show the proportion of a population that is over a certain BMI bracket. This is great; it overcomes the issues with the averages we already discussed.

Since we are dealing with obesity, we want to download the data sets labelled:

Download the files into a working directory and lets get started!

Conclusions First!

Let us take a look at the results before we get into the code. You can find all three plots that we’re going to make side by side here.

Round 1: Adult

Lets first look at the % of adults that are obese in every population. We can see Kuwait comes in position 13, after the USA, UAE, Saudi Arabia, Egypt, Bahrain, and the rest.

obese-adults

This means that, given the population of Kuwait, the ratio of obese people is less than the same ratio in 12 other countries. Is this good news? We’re still in the top 20! … but we’re not #1 (yet)

Round 2: Males

So the men had a great chuckle at these statistics I’m sure … well, let’s see how our brothers at the sufra are doing. Proportionally, they are about the same as the general population: 29% … except they rank 10th now because, as a ratio, our men weigh in heavier.

obese-males

Round 3: Moment of truth, Ladies

Immediately we can see that the heavy set ladies in Kuwait are not #1 in the world (awww!) In fact, they are ranked in the 15th position – lower than the men and the general population. The proportion of females that are obese is only slightly higher than that of men.

obese-females

So we can conclude that, although the mean might show that Kuwait is #1 in the world, this is far from the truth when we look at the proportions, which might in fact be a more representative indicator of the spread and centrality of BMI.

Time for a cookie?

The Code

Lets look at how we produced the graphs. If you haven’t already, download the data sets (here or from the WHO’s website):

Now read this data into R:

# Read Data
obese.adults<-read.csv(file="BMIadults%obese(-=30.0).csv",stringsAsFactors=F)
obese.male<-read.csv(file="BMImales%obese(-=30.0).csv",stringsAsFactors=F)
obese.female<-read.csv(file="BMIfemales%obese(-=30.0).csv",stringsAsFactors=F)

Great! We will need a few libraries so lets get those loaded:

# Load Libraries
library(gridExtra)
library(reshape)
library(ggplot2)

Now we want to play with our data before we can plot it.
We will pick the top 20 countries ordered by BMI value and create each plot accordingly.

# Select Top X Countries
topx<-20
 
# Create Plot 1: % Adults Obese
# Melt the data
  p1<-melt(obese.adults) 
# Select only the most recent data 
  p1<-p1[p1$variable == "Most.recent",]
# Sort the data
  p1<-p1[order(p1$value,decreasing=T),]
# Remove any empty rows
  p1<-p1[!is.na(p1$value),] 
# Select Top X countries
  p1<-head(p1,topx)
 
# Find Kuwait in the Top X Countries to highlight
  kuwait<-nrow(p1)-which(reorder(p1$"country...year",p1$value)=="Kuwait",T,T)+1
 
plot1<-ggplot(data=p1,aes(y=reorder(p1$"country...year",p1$value),x=p1$value))+
  geom_point(color=I("black"))+
  ylab("")+xlab("% Obese")+ggtitle("BMI Adults % Obese (>= 30)")+
  scale_x_continuous(limits = c(0, 100),
                     breaks = seq(0, 100, 10))+
  theme_minimal()+
  geom_hline(yintercept=kuwait,linetype="dotted",colour="red")

That was the first plot; now we repeat this 2 more times.

# Create Plot 2: Females % Obese
p2<-melt(obese.female)
p2<-p2[p2$variable == "Most.recent",]
p2<-p2[order(p2$value,decreasing=T),]
p2<-p2[!is.na(p2$value),]
p2<-head(p2,topx)
 
kuwait<-nrow(p2)-which(reorder(p2$"country...year",p2$value)=="Kuwait",T,T)+1
 
plot2<-ggplot(data=p2,aes(y=reorder(p2$"country...year",p2$value),x=p2$value))+
  geom_point(color=I("violet"))+
  ylab("")+xlab("% Obese")+ggtitle("BMI Females % Obese (>= 30)")+
  scale_x_continuous(limits = c(0, 100),
                     breaks = seq(0, 100, 10))+
  theme_minimal()+
  geom_hline(yintercept=kuwait,linetype="dotted",colour="red")

Finally Plot 3

# Create Plot 3: % Males Obese
 
p3<-melt(obese.male)
p3<-p3[p3$variable == "Most.recent",]
p3<-p3[order(p3$value,decreasing=T),]
p3<-p3[!is.na(p3$value),]
p3<-head(p3,topx)
 
kuwait<-nrow(p3)-which(reorder(p3$"country...year",p3$value)=="Kuwait",T,T)+1
 
plot3<-ggplot(data=p3,aes(y=reorder(substring(p3$"country...year",first=0,last=20),p3$value),x=p3$value))+
  geom_point(color=I("blue"))+
  ylab("")+xlab("% Obese")+ggtitle("BMI Males % Obese (>= 30)")+
  scale_x_continuous(limits = c(0, 100),
                     breaks = seq(0, 100, 10))+
  theme_minimal()+
  geom_hline(yintercept=kuwait,linetype="dotted",colour="red")

Lastly, we want to create our 3 plots in one image so we use the gridExtra library here:

grid.arrange(plot1,plot2,plot3)

That’s it! You can download the entire code here:
Obesity.R

Next Steps

I will probably do a follow-up on this post by looking at reasons why obesity might be such an issue in Kuwait. Namely, I will look at consumption patterns among gender groups. Keep in mind, I do this to practice R and less so to do some sort of ground breaking study … most of this stuff can be done with a pen and paper.

Post Date May 13

Kuwait Sugar Consumption with R

Kuwait Sugar

I was doing Facebook’s Udacity Exploratory Data Analysis course, and one of the tasks is to use GapMinder.org’s data to create a visualisation.

So I picked sugar consumption per capita per day; diabetes is a subject that has always been close and interesting to me. The graph above shows Kuwait’s sugar consumption relative to world averages.

Interesting that Kuwait has always been above the median and peaked in the 80’s – maybe that’s why obesity is such a prevalent issue nowadays. The drop in 1990 is perhaps due to a lack of data, or perhaps due to an exodus of people. Nevertheless it seems that Kuwait has a long way to go in health awareness insofar as sugar consumption is concerned. Sugar consumption is lower today but still above the world median.

Want to reproduce this? Here’s the code – make sure to download the data set first:

# Data Cleansing
data<-read.csv(file="gapminder.csv",na.strings="NA",row.names=1, header=T, blank.lines.skip=T)
data<-(data.frame(t(data)))
data<-data.frame(year=c(1961:2004), data)
View(data)
 
# Data Pivot
library(reshape)
data.melt<-melt(data,id='year',variable_name="country")
data.melt<-subset(data.melt, value!='NA')
data.melt.kuwait<-subset(data.melt, country=='Kuwait')
data.melt.other<-subset(data.melt, country!='Kuwait')
 
# Grouping data 
library('dplyr')
year_group <- group_by(data.melt.other, year)
wmsub.year_group <- summarise(year_group,
                              mean=mean(value),
                              median=median(value),
                              lowerbound=quantile(value, probs=.25),
                              upperbound=quantile(value, probs=.75))
 
# Plotting
library(ggplot2)
ggplot(data = wmsub.year_group, aes(x = year, y=median))+
  geom_errorbar(aes(ymin = lowerbound, ymax = upperbound),colour = 'blue', width = 0.4) +
  stat_summary(fun.y=median, geom='point', shape=5, size=3, color='blue')+
  geom_line(data=data.melt.kuwait, aes(year, value, color=country))+
  ggtitle('Sugar (g)  - capita per day - Kuwait vs. World')+
  xlab('Year') + ylab('Hours')

Post Date May 11

Install R Package automatically … (if it is not installed)

R Libraries

This is perhaps my favourite helper function in R.

usePackage <- function(p) 
{
  if (!is.element(p, installed.packages()[,1]))
    install.packages(p, dep = TRUE)
  require(p, character.only = TRUE)
}

Why? Well …

Have you ever tried to run a line of R code that needed a special function from a special library?
The code stops and you think, “Damn … ok I’ll just load the library” – and you do.
But you get an error message because you don’t have the library installed … ya! #FacePalm

After the momentary nuisance you then have to:

  • Type in the command to install the package
  • Wait for the package to be installed
  • Load the library
  • Re-run the code
  • Eat a cookie

Well that’s why this is my favourite helper function. I just use “usePackage” instead of “library” or “install.packages”. It checks whether the library exists on the machine you are working on; if not, the library is installed and then loaded into your workspace environment. Enjoy!

You can even use this as part of your R profile :)
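
For example (a quick sketch – the package names here are just placeholders):

# Installs the package the first time, then simply loads it on every later run
usePackage("ggplot2")
usePackage("reshape")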

Post Date May 4

Tweet Bayes Classification: Excel and R

mandrill-classification

Foreword

Data Smart
I started reading Data Smart by John Foreman. The book is a great read because of Foreman’s humorous style of writing. What sets this book apart from the other data analysis books I have come across is that it focuses on the techniques rather than the tools – everything is accomplished through the use of a spreadsheet program (e.g. Excel).

So as a personal project to learn more about data analysis and its applications, I will be reproducing exercises in the book both in Excel and R. The blog posts will be structured as follows: 0) I will always repeat this paragraph in every post, 1) Introducing the Concept and Data Set, 2) Doing the Analysis with Excel, 3) Reproducing the results with R.

If you like the stuff here, please consider buying the book.

Shortcuts

This post is long, so here is a shortcut menu:

  • Excel
  • R


Tweet Classification

The problem we are going to tackle here is one of classification. We want to know if blocks of texts can be classified into groups based on some model that we build. Specifically we are going to work with a small data set of tweets that have been pre-classified to make a model that will classify new tweets that we have not seen before.

We will be using a Naïve Bayes Classification model. This technique is useful not just for tweets, it can be used for any type of natural language classification: for example e-mail/sms spam filters, loan application risk screening, blog domain classification.

The case is about Mandrill.com, who are monitoring tweets with the tag “Mandrill”. We want to be able to classify tweets automagically into two categories: “relevant” and “irrelevant”.

The problem is that “Mandrill” is not necessarily a unique word. I searched Twitter for Mandrill and got 5 tweets that are relevant and 2 that are not … and the rest do not fit in my screenshot. There are many irrelevant ones beyond that … so do we just get trolled by the rest of the mandrill tweets?

Nope! Lets get started.

Download the vanilla excel files here:
Mandrill – Vanilla

Tweet Classification with Naive Bayes Classifier: Excel

In your Excel file you will have 3 tabs each containing tweets. The first tab contains 150 relevant tweets, the second tab labeled “AboutOther” contains irrelevant tweets, and the last tab has tweets we will use to test the model we are about to develop.

Step 1: Data Cleaning

We will need to do the following for data in the first two tabs. Repeat everything in step 1 for the “AboutMandrilApp” and “AboutOther” tabs.

The first thing we need to do is make everything lowercase. This is done using the LOWER() formula. Place the formula in cell B2 and drag it down to cell B151.

=LOWER(A2)

Next we need to swap punctuation with spaces. Specifically we swap these characters with spaces: “. “, “: “, “?”, “!”, “;”, “,”. Note that the space after the full-stop and the colon is important. We need to add six columns in each tab using the SUBSTITUTE() function. Place the following six formulas in cells C2, D2, E2, F2, G2, and H2 and drag them down to row 151:

=SUBSTITUTE(B2,". "," ")
=SUBSTITUTE(C2,": "," ")
=SUBSTITUTE(D2,"?"," ")
=SUBSTITUTE(E2,"!"," ")
=SUBSTITUTE(F2,";"," ")
=SUBSTITUTE(G2,","," ")

You can skip this step by downloading the file with this step completed: Mandrill – Step 1

Step 2: Word Counting

Tokenization just means splitting up our sentences (in this case tweets) word by word. Naturally words are separated by spaces, so we will be using ” ” to split our words. The goal is to end up with every word on its own row. This means we need to figure out how many words we have per tweet to understand how many rows we will need.

We can count the number of words per line by counting the number of spaces in a sentence and adding 1. For example: “The red fox runs fast” has 4 spaces and 5 words. To count spaces we subtract the number of characters in the sentence without spaces from the number of characters in the same sentence with spaces. In our example, that is 21 – 17, which is 4 spaces … and therefore 5 words! This is easily achieved with the following formula copied into column I:

=IF(LEN(TRIM(H2))=0,0,LEN(TRIM(H2))-LEN(SUBSTITUTE(H2," ",""))+1)

We then see that the maximum number of words is 28 … so we will be safe and take 30 words per sentence. We have 150 tweets, and 30 words per tweet, so we will need 4500 rows for all the words.
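
As an aside, the same space-counting trick is a one-liner in R (the R half of this post uses the tm package instead, so this is just a sanity check):

# Count words by counting spaces and adding 1
tweet <- "The red fox runs fast"
nchar(tweet) - nchar(gsub(" ", "", tweet)) + 1   # returns 5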

Great! We need 2 new tabs now: one called MandrillTokens and the other OtherTokens. Label cell A1 in each of these tabs “Tweet”. Next, copy cells H2:H151 from the MandrillApp tab. Highlight cells A2 to A4501 in the MandrillTokens tab and paste only the values into the cells. This can be accomplished by using the Paste Special option in the Edit menu. You will notice that the tweets repeat after row 151, which is good since we need to use the same tweet at least 30 times. Repeat this for OtherTokens.

We are now ready to Tokenize! (Wohoo!!)

You can skip this step by downloading the file with this step completed: Mandrill – Step 2

Step 3: Tokenization

Now that we have our tweets set up we need to count words. Since we know words exist between spaces, we need to look for spaces (pretty ironic!). We start by simply placing 0 in cells B2:B151 in MandrillTokens since we start at the very first word – position 0.

We then jump to cell B152, where we can count that the first space comes after 7 letters. To get Excel to do this we use the FIND() formula to find the first space after position 0 (the value in cell B2). Now what if there are no more spaces left in the tweet? FIND() would return an error, so in that case we just want the length of the tweet plus 1. This is accomplished with this formula in cell B152:

=IFERROR(FIND(" ",A152,B2+1),LEN(A152)+1)

You can now safely copy this formula all the way down to cell B4501, giving us all the space positions for all the tweets. Repeat this step for the OtherTokens tab.

We can now finally extract each word. We want to get the word between spaces and we now have the position of each space. So in cell C2 we would use the formula:

=IFERROR(MID(A2,B2+1,B152-B2-1),".")

Which takes the word that starts at letter 1 (cell B2 + 1) and runs up to, but not including, the next space (a length of cell B152 – cell B2 – 1). If there is no word there, we just place a dot. Copy and paste this formula all the way down to cell C4501 and repeat the step for the OtherTokens tab.

In column D we want to count the length of each word so place the following formula in D2 and copy it down for the rest of the cells:

=LEN(C2)

Great, now we’re ready to build our model!

You can skip this step by downloading the file with this step completed: Mandrill – Step 3

Step 4: Model Building

Lets create a pivot table from the information in MandrillTokens in a new tab called MandrillProbability.
Set up your pivot table with “Length” in report filter, “Word” in Row labels, and “Count of Word” in values.

Mandrill Pivot

We also want to remove any words that are less than 4 characters long. We do this using the report filter drop-down menu and unchecking 0,1,2, and 3.

Mandrill Pivot Filter

Now we need to do what is called additive smoothing: this protects your model against misrepresenting words that rarely appear – e.g. bit.ly urls are unique and so rarely appear. It is a very complicated way of saying “add 1 to everything” … in column C add 1 to each of the counts in column B and then sum column C up! Your sheet should now look like this:

mandrill additive smoothing

Now we can calculate the probabilities by dividing each cell in column C by the total of column C:

=C5/$C$828

The first thing you will notice is that the probabilities are tiny. In larger data sets they will be even tinier and perhaps Excel will not be able to handle them. So we take the natural log of the numbers using the LN() function in column E:

=LN(D5)
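
To see the whole column pipeline in one place, here it is on a toy example in R (the word counts are made up, purely for illustration):

# Hypothetical counts for three tokens
counts   <- c(mandrill = 40, email = 12, banana = 0)
smoothed <- counts + 1                  # additive smoothing: add 1 to everything
probs    <- smoothed / sum(smoothed)    # the probability column
log(probs)                              # the natural log column, i.e. LN()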

Repeat all these steps for the AboutOther tab and create an Other Probability tab.

You can skip this step by downloading the file with this step completed: Mandrill – Step 4

That’s it believe it or not! You have a model.

Step 5: Using Bayes and the MAP model

So we are at a stage where we have probabilities for words in the “relevant” tweets and in the “irrelevant” tweets. Now what? We will now use the Maximum A Posteriori (MAP) rule, which basically says: given a tweet, if the probability of it belonging to “relevant” is higher than the probability of it belonging to “irrelevant”, then classify it as “relevant”.
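
For the mathematically inclined, this is roughly the decision rule we are about to implement – naive Bayes with the class prior dropped, since the spreadsheet only sums the log word probabilities:

\hat{c} = \arg\max_{c \in \{\text{relevant},\ \text{irrelevant}\}} \sum_{w \in \text{tweet}} \ln P(w \mid c)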

Lets demonstrate with the “TestTweets” tab. First you need to process the tweets again like we did before:

=SUBSTITUTE(B2,". "," ")
=SUBSTITUTE(C2,": "," ")
=SUBSTITUTE(D2,"?"," ")
=SUBSTITUTE(E2,"!"," ")
=SUBSTITUTE(F2,";"," ")
=SUBSTITUTE(G2,","," ")

Then create a tab called “Predictions”. In cell A1 input the label “Number” and in A2 to A21 put in the numbers 1 through 20. Label cell B1 “Prediction”. Copy and paste the values from the “TestTweets” column into column C starting at C2.

Now we need to tokenize the tweets in column C. It is a lot easier this time around. All you have to do is highlight cells C2 to C21 and select “Text to Columns” from the “Data” menu. You will see a pop-up screen; click Next and then select “Spaces”:

mandrill-spaces

Click next and in the final screen input $D$2 as the destination cell. That’s it! All tokenized and ready for predictions … which is going to get messy :)

What we need to do is take each token and look up the probabilities in the “Mandrill Probability” tab and in the “Other Probability” tab. Start in cell D25 and use VLookUp() as follows:

=VLOOKUP(D2,'Mandrill Probability'!$A$5:$E$827,5,FALSE)

Which says look up the value of D2 in the range of cells in the Mandrill Probability tab, and if you find the token, return the value in column 5 of the Mandrill Probability tab – which is the probability.

The problem here is that if you do not find the word you have a case of a “rare word” … which we need to replace with the natural log of 1 / total smoothed tokens. So our formula will now use an IF() statement and the ISNA() function, which returns TRUE if the lookup fails (the word was not found) and FALSE otherwise. We also want to throw away tokens that are shorter than 4 letters long. So we need yet another IF() statement … and here is the monstrosity:

=IF(LEN(D2)<=3,0,IF(ISNA(VLOOKUP(D2,'Mandrill Probability'!$A$5:$E$827,5,FALSE)), LN(1/'Mandrill Probability'!$C$828),VLOOKUP(D2,'Mandrill Probability'!$A$5:$E$827,5,FALSE)))

Now you can drag this down to D44 and across to AI44, filling the whole block D25:AI44.

Repeat this in cells D48 through AI67 for the ‘Other Probability’ tab:

=IF(LEN(D2)<=3,0,IF(ISNA(VLOOKUP(D2,'Other Probability'!$A$5:$E$810,5,FALSE)), LN(1/'Other Probability'!$C$811),VLOOKUP(D2,'Other Probability'!$A$5:$E$810,5,FALSE)))

Almost there, now we need to sum up each row. In cell B25 input:

=SUM(D25:AI25)

Copy this down to B67. Now we need to compare whether the top set of sums is greater than the bottom set of sums in our predictions. In cell B2, input the comparison:

=IF(B25>B48,"APP","OTHER")

Drag this down to B21 to get the predictions!

mandrill-predictions

Of the 20 predictions, the model only makes a mistake on 1. That’s a pretty good error rate.

All you need to do is repeat the steps for new tweets in the last tab. The model will deteriorate over time and therefore it is important to continuously update the model for better prediction.

You can skip this step by downloading the file with this step completed: Mandrill – Step 5

Tweet Classification with Naive Bayes Classifier: R

Step 1: Data Cleaning
We start off with our vanilla data set separated into 3 different files:

Download those files into your working directory and fire up R.

The first thing we need to do is load up the libraries we will be using and read our data. We also add the appropriate columns for “App” and “Other”:

# Libraries
library(tm)
 
# Collect data
tweets.mandrill<-read.csv('Mandrill.csv',header=T)
tweets.other<-read.csv('Other.csv',header=T)
tweets.test<-read.csv('Test.csv',header=T)
 
# Add "App" to mandrill tweets, and "Other" to others. 
tweets.mandrill["class"]<-rep("App",nrow(tweets.mandrill))
tweets.other["class"]<-rep("Other",nrow(tweets.mandrill))

We then need a helper function to replace all the stuff we do not want in the tweets. To do this we use a function called gsub() within our helper function.

# Helper Function
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]"," ",x)
  x <- gsub("[:]+[ ]"," ",x)
  x <- gsub("[?]"," ",x)
  x <- gsub("[!]"," ",x)
  x <- gsub("[;]"," ",x)
  x <- gsub("[,]"," ",x)
  x
}

Let us do the text transformations on our tweets now:

# Do our punctuation stuff
tweets.mandrill$Tweet <- replacePunctuation(tweets.mandrill$Tweet)
tweets.other$Tweet <- replacePunctuation(tweets.other$Tweet)
tweets.test$Tweet <- replacePunctuation(tweets.test$Tweet)

Step 2: Word Counting
We do not need to count words for the R version. Pat yourself on the back for making a smart decision … not too much though, this will get messy soon, I promise :)

mandrill-successKid

Step 3: Tokenization
We are going to use the ‘tm’ library to tokenize our training and test data. First we will create a Corpus – which is just a funky word for a set of text stuff … in this case tweets!

# Create corpus
tweets.mandrill.corpus <- Corpus(VectorSource(as.vector(tweets.mandrill$Tweet)))
tweets.other.corpus <- Corpus(VectorSource(as.vector(tweets.other$Tweet)))
tweets.test.corpus <- Corpus(VectorSource(as.vector(tweets.test$Tweet)))

Great! Now we want to get a term document matrix – another fancy phrase for a table that shows how many times each word was mentioned in each document.

# Create term document matrix
tweets.mandrill.matrix <- t(TermDocumentMatrix(tweets.mandrill.corpus,control = list(wordLengths=c(4,Inf))));
tweets.other.matrix <- t(TermDocumentMatrix(tweets.other.corpus,control = list(wordLengths=c(4,Inf))));
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus,control = list(wordLengths=c(4,Inf))));

This is what the term document matrix looks like:

    Terms
Docs which whom wisest with wordpress words
  10     0    0      0    2         0     0
  11     0    0      0    0         0     0
  12     0    0      0    0         0     0
  13     0    0      0    0         0     0
  14     0    0      0    1         0     0
  15     0    0      0    0         0     0

Great! That’s just what we need; we can count words in documents and totals later!

We can move on to building our model now.

Step 4: Model Building
We want to create the probabilities table.

  • We start by counting the number of times each word has been used
  • We then need to do the “additive smoothing” step
  • We then calculate the probabilities using the smoothed frequencies
  • Lastly we need to take the natural log of the probabilities

To save us time, we will write a little helper function that will do this for any document matrix we might have.

probabilityMatrix <-function(docMatrix)
{
  # Sum up the term frequencies
  termSums<-cbind(colnames(as.matrix(docMatrix)),as.numeric(colSums(as.matrix(docMatrix))))
  # Add one
  termSums<-cbind(termSums,as.numeric(termSums[,2])+1)
  # Calculate the probabilities
  termSums<-cbind(termSums,(as.numeric(termSums[,3])/sum(as.numeric(termSums[,3]))))
  # Calculate the natural log of the probabilities
  termSums<-cbind(termSums,log(as.numeric(termSums[,4])))
  # Add pretty names to the columns
  colnames(termSums)<-c("term","count","additive","probability","lnProbability")
  termSums
}

Great, now we can just use that for our training sets “App” and “Other”.

tweets.mandrill.pMatrix<-probabilityMatrix(tweets.mandrill.matrix)
tweets.other.pMatrix<-probabilityMatrix(tweets.other.matrix)

We can output these to files and compare them to the Excel file:

write.csv(file="mandrillProbMatrix.csv",tweets.mandrill.pMatrix)
write.csv(file="otherProbMatrix.csv",tweets.mandrill.pMatrix)

Step 5: Using Bayes and the MAP model

We now want to use the MAP model to compare each of the test tweets with the two probability matrices.
We will write another function to help us out here. We start thinking about what we want to do with each tweet:

  • Get each test tweet’s words and look them up in a probability matrix
  • We want to know how many words were not found
  • We want to sum up the log probabilities of the words we do find
  • For the words we do not find, we want to add ln(1/sum of the smoothed counts) each time
  • Add the results of the last 2 steps together

Lets go:

getProbability <- function(testChars,probabilityMatrix)
{
  charactersFound<-probabilityMatrix[probabilityMatrix[,1] %in% testChars,"term"]
  # Count how many words were not found in the probability matrix
  charactersNotFound<-length(testChars)-length(charactersFound)
  # Sum the log probabilities of the words that were found
  charactersFoundSum<-sum(as.numeric(probabilityMatrix[probabilityMatrix[,1] %in% testChars,"lnProbability"]))
  # We use ln(1/total smoothed words) for words not found
  charactersNotFoundSum<-charactersNotFound*log(1/sum(as.numeric(probabilityMatrix[,"additive"])))
  #This is our probability
  prob<-charactersFoundSum+charactersNotFoundSum 
  prob
}

Now we can use this function for every tweet we have with a for loop. Lets see how this is done:

# Get the matrix
tweets.test.matrix<-as.matrix(tweets.test.matrix)
# A holder for classification 
classified<-NULL
 
for(documentNumber in 1:nrow(tweets.test.matrix))
{
  # Extract the test words
  tweets.test.chars<-names(tweets.test.matrix[documentNumber,tweets.test.matrix[documentNumber,] %in% 1])
  # Get the probabilities
  mandrillProbability <- getProbability(tweets.test.chars,tweets.mandrill.pMatrix)
  otherProbability <- getProbability(tweets.test.chars,tweets.other.pMatrix)
  # Add it to the classification list
  classified<-c(classified,ifelse(mandrillProbability>otherProbability,"App","Other"))
}

Brilliant! We can take a look at our classification alongside our tweets:

View(cbind(classified,tweets.test$Tweet))

Very much like our Excel model, we are off by 1/20.
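
As a quick sanity check (the actual vector below is hypothetical – you would type in the true class of each test tweet by hand), you can cross-tabulate the predictions:

# Hypothetical: the true class of each of the 20 test tweets, typed in by hand
# actual <- c("App", "Other", "App", ...)
# table(actual, classified)

# Without labels, you can at least see how many tweets landed in each class:
table(classified)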

Entire Code

library(tm)
 
# Collect data
tweets.mandrill<-read.csv('Mandrill.csv',header=T)
tweets.other<-read.csv('Other.csv',header=T)
tweets.test<-read.csv('Test.csv',header=T)
 
tweets.mandrill["class"]<-rep("App",nrow(tweets.mandrill))
tweets.other["class"]<-rep("Other",nrow(tweets.mandrill))
 
# Helper Function
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]"," ",x)
  x <- gsub("[:]+[ ]"," ",x)
  x <- gsub("[?]"," ",x)
  x <- gsub("[!]"," ",x)
  x <- gsub("[;]"," ",x)
  x <- gsub("[,]"," ",x)
  x
}
 
# Do our punctuation stuff
tweets.mandrill$Tweet <- replacePunctuation(tweets.mandrill$Tweet)
tweets.other$Tweet <- replacePunctuation(tweets.other$Tweet)
tweets.test$Tweet <- replacePunctuation(tweets.test$Tweet)
 
# Create corpus
tweets.mandrill.corpus <- Corpus(VectorSource(as.vector(tweets.mandrill$Tweet)))
tweets.other.corpus <- Corpus(VectorSource(as.vector(tweets.other$Tweet)))
tweets.test.corpus <- Corpus(VectorSource(as.vector(tweets.test$Tweet)))
 
# Create term document matrix
tweets.mandrill.matrix <- t(TermDocumentMatrix(tweets.mandrill.corpus,control = list(wordLengths=c(4,Inf))));
tweets.other.matrix <- t(TermDocumentMatrix(tweets.other.corpus,control = list(wordLengths=c(4,Inf))));
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus,control = list(wordLengths=c(4,Inf))));
 
# Probability Matrix
probabilityMatrix <-function(docMatrix)
{
  # Sum up the term frequencies
  termSums<-cbind(colnames(as.matrix(docMatrix)),as.numeric(colSums(as.matrix(docMatrix))))
  # Add one
  termSums<-cbind(termSums,as.numeric(termSums[,2])+1)
  # Calculate the probabilities
  termSums<-cbind(termSums,(as.numeric(termSums[,3])/sum(as.numeric(termSums[,3]))))
  # Calculate the natural log of the probabilities
  termSums<-cbind(termSums,log(as.numeric(termSums[,4])))
  # Add pretty names to the columns
  colnames(termSums)<-c("term","count","additive","probability","lnProbability")
  termSums
}
 
tweets.mandrill.pMatrix<-probabilityMatrix(tweets.mandrill.matrix)
tweets.other.pMatrix<-probabilityMatrix(tweets.other.matrix)
 
 
#Predict
 
# Get the test matrix
# Get words in the first document
 
getProbability <- function(testChars,probabilityMatrix)
{
  charactersFound<-probabilityMatrix[probabilityMatrix[,1] %in% testChars,"term"]
  # Count how many words were not found in the probability matrix
  charactersNotFound<-length(testChars)-length(charactersFound)
  # Sum the log probabilities of the words that were found
  charactersFoundSum<-sum(as.numeric(probabilityMatrix[probabilityMatrix[,1] %in% testChars,"lnProbability"]))
  # We use ln(1/total smoothed words) for words not found
  charactersNotFoundSum<-charactersNotFound*log(1/sum(as.numeric(probabilityMatrix[,"additive"])))
  #This is our probability
  prob<-charactersFoundSum+charactersNotFoundSum 
  prob
}
 
# Get the matrix
tweets.test.matrix<-as.matrix(tweets.test.matrix)
 
# A holder for classification 
classified<-NULL
 
for(documentNumber in 1:nrow(tweets.test.matrix))
{
  # Extract the test words
  tweets.test.chars<-names(tweets.test.matrix[documentNumber,tweets.test.matrix[documentNumber,] %in% 1])
  # Get the probabilities
  mandrillProbability <- getProbability(tweets.test.chars,tweets.mandrill.pMatrix)
  otherProbability <- getProbability(tweets.test.chars,tweets.other.pMatrix)
  # Add it to the classification list
  classified<-c(classified,ifelse(mandrillProbability>otherProbability,"App","Other"))
}
 
View(cbind(classified,tweets.test$Tweet))

Post Date Apr 30

Customer Segmentation: Excel and R

Source: http://laughingsquid.com/wp-content/uploads/girl-scout-cookies-20110901-182357.jpg

Foreword

Data Smart
I started reading Data Smart by John Foreman. The book is a great read because of Foreman’s humorous style of writing. What sets this book apart from the other data analysis books I have come across is that it focuses on the techniques rather than the tools – everything is accomplished through the use of a spreadsheet program (e.g. Excel).

So as a personal project to learn more about data analysis and its applications, I will be reproducing exercises in the book both in Excel and R. The blog posts will be structured as follows: 0) I will always repeat this paragraph in every post, 1) Introducing the Concept and Data Set, 2) Doing the Analysis with Excel, 3) Reproducing the results with R.

If you like the stuff here, please consider buying the book.

Shortcuts

This post is long, so here is a shortcut menu:

  • Excel
  • R


Customer Segmentation

Customer segmentation is as simple as it sounds: grouping customers by their characteristics – and why would you want to do that? To better serve their needs!

So how does one go about segmenting customers? One method we will look at is an unsupervised machine learning method called k-means clustering. Unsupervised learning means finding structure in the data without any pre-existing labels to guide you … so you want to discover groups rather than predict a known outcome.

Our example today is to do with e-mail marketing. We use the dataset from Chapter 2 on Wiley’s website – download a vanilla copy here

What we have are offers sent via email, and transactions based on those offers. What we want to do with K-Means clustering is classify customers based on offers they consume. Simplystatistics.org have a nice animation of what this might look like:

KMeans Animation

If we plot a given set of data by its dimensions, we can identify groups of similar points (customers) within the dataset by looking at how those points gather around two or more cluster centers. The k in k-means is just the number of clusters you choose to identify; naturally this would be more than one cluster.
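
If you want to see the idea on its own before we grind through the spreadsheet, here is a tiny sketch using base R’s kmeans() on made-up 2D data (note that later in this post we cluster with pamk() from the fpc package, not kmeans()):

set.seed(42)
# Two made-up blobs of "customers" described by two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
fit <- kmeans(toy, centers = 2)   # k = 2 clusters
fit$cluster                       # the cluster each point was assigned to
fit$centers                       # the cluster centres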

Great, we’re ready to start.

K-Means Clustering – Excel

First what we need to do is create a transaction matrix. That means, we need to put the offers we mailed out next to the transaction history of each customer. This is easily achieved with a pivot table.

Step 1: Pivot & Copy

kmeans-pivot

Go to your “transactions” tab and create a pivot table with the settings shown in the image to the right.

Once you have that you will have the names of the customers in columns, and the offer numbers in rows with a bunch of 1’s indicating that the customers have made a transaction based on the given offer.

Great. Copy and paste the pivot table (leaving out the grand totals column and row) into a new tab called “Matrix”.

Now copy the table from the “OfferInformation” tab and paste it to the left of your new Matrix table.

Done – You have your transaction matrix! This is what it should look like:

kmeans-MatrixScreen

If you would like to skip this step, download this file with Step 1 completed.

Step 2: Distances and Clusters

We will use k = 4 indicating that we will use 4 clusters. This is somewhat arbitrary, but the number you pick should be representative of the number of segments you can handle as a business. So 100 segments does not make sense for an e-mail marketing campaign.

Lets go ahead and add 4 columns – each representing a cluster.

We need to calculate how far away each customer is from the cluster’s mean. To do this we could use many distance measures, one of which is the Euclidean distance: the straight-line distance between two points, calculated with the Pythagorean theorem.
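
Just to make the formula concrete, here is the same calculation on made-up numbers in R – the Excel array formula below does exactly this:

# Euclidean distance between a customer's offer history and a cluster centre
customer <- c(1, 0, 0, 1, 1)              # hypothetical 0/1 transaction vector
centre   <- c(0.2, 0.1, 0.0, 0.8, 0.5)    # hypothetical cluster centre
sqrt(sum((customer - centre)^2))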

To do this in Excel we will use a feature called multi-cell array formulas, which will make our lives a lot easier. For “Adam” insert the following formula to calculate the distance from Cluster 1 and then press CTRL+SHIFT+ENTER (if you just press Enter, Excel won’t know what to do):

=SQRT(SUM((L$2:L$33-$H$2:$H$33)^2))

Excel will automatically add braces “{}” around your formula, indicating that it will evaluate the formula as an array.
Now do the same for clusters 2, 3 and 4.

=SQRT(SUM((L$2:L$33-$I$2:$I$33)^2))
=SQRT(SUM((L$2:L$33-$J$2:$J$33)^2))
=SQRT(SUM((L$2:L$33-$K$2:$K$33)^2))

Now copy Adam’s 4 cells, highlight all the “distance from cluster …” cells for the rest of the customers, and paste the formula. You now have the distances for all customers.

Before moving on we need to calculate the cluster with the minimum distance. To do this we will add 2 rows: the first is “Minimum Distance” which for Adam is:

=MIN(L34:L37)

Do the same for the remaining customers. Then add another row labelled “Assigned Cluster” with the formula:

=MATCH(L38,L34:L37,0)

That’s all you need. Now lets go find the optimal cluster centers!

If you would like to skip this step, download this file with Step 2 completed.

Step 3: Solving for optimal cluster centers

We will use “Solver” to help us calculate the optimal centers. This step is pretty easy now that you have most of your model set up. The optimal centers will allow us to have the minimum distance for each customer – and therefore we are minimizing the total distance.

To do this, we first must know what the total distance is. So add a row under the “Assigned Cluster” and calculate the sum of the “Minimum Distance” row using the formula:

=SUM(L38:DG38)

We are now ready to use solver (found in Tools -> Solver). Set up your solver to match the following screenshot.
Note that we change the solver to “evolutionary” and set options of convergence to 0.00001, and timeout to 600 seconds.

KMeans Solver Screen

When ready, press solve. Go make yourself a coffee, this will take about 10 minutes.

Note: Foreman is able to achieve 140; I only reached 144 before my computer timed out. This is largely due to my laptop’s processing power, I believe. Nevertheless, the solution is largely acceptable.

I think it’s a good time to appropriately name your tab “Clusters”.

If you would like to skip this step, download this file with Step 3 completed.

Step 4: Top deals by clusters
Now that we have our clusters – and essentially our segments – we want to find the top deals for each segment. That means we want to calculate, for each offer, how many were consumed by each segment.

This is easily achieved through the SUMIF function in excel. First lets set up a new tab called “Top Deals”. Copy and paste the “Offer Information” tab into your new Top Deals tab and add 4 empty columns labelled 1,2,3, and 4.

The SUMIF formula uses 3 arguments: the range of data you want to evaluate, the criteria to check that data against, and finally the range to sum up when the criteria is met. In our case we will use the row of assigned clusters as the range to evaluate, the cluster number as the criteria, and the number of transactions for every offer as the range to sum.
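
If the argument order is confusing, here is the same logic on toy vectors in R (numbers made up purely for illustration):

# Hypothetical: each customer's assigned cluster, and whether they took Offer 1
assigned.cluster    <- c(1, 2, 1, 4, 3)
offer1.transactions <- c(1, 0, 1, 0, 1)

# SUMIF equivalent: Offer 1 transactions made by customers in cluster 1
sum(offer1.transactions[assigned.cluster == 1])   # returns 2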

This is done using the following formula in the cell H2 (Offer 1, Cluster 1):

=SUMIF(Clusters!$L$39:$DG$39,'Top Deals'!H$1,Clusters!$L2:$DG2)

Drag this across (or copy and paste into) cells H2 to K33. You now have your top deals! It’s not very clear though so add some conditional formatting by highlighting H2:K33 and selecting format -> conditional formatting. I added the following rule:

kmeans - conditional formatting

That’s it! Foreman goes into details about how to judge whether 4 clusters are enough but this is beyond the scope of the post. We have 4 clusters and our offers assigned to each cluster. The rest is interpretation. One observation that stands out is that cluster 4 was the only cluster that consumed Offer 22.

Another is that cluster 1 has a clear preference for Pinot Noir offers, so we will push Pinot Noir offers to: Anderson, Bell, Campbell, Cook, Cox, Flores, Jenkins, Johnson, Moore, Morris, Phillips, Smith. You could even use cosine similarity or apriori recommendations to go a step further with Cluster 1.

We are done with Excel, now lets see how to do this with R!

If you would like to skip this step, download this file with Step 4 completed.

K-Means Clustering – R

We will go through the same steps we did. We start with our vanilla files – provided in CSV format:

Download these files into a folder where you will work from in R. Fire up RStudio and lets get started.

Step 1: Pivot & Copy

First we want to read our data. This is simple enough:

# Read offers and transaction data
offers<-read.csv(file="OfferInformation.csv")
transactions<-read.csv(file="Transactions.csv")

We now need to combine these 2 files to get a frequency matrix – equivalent to pivoting in Excel. This can be done using the reshape library in R. Specifically we will use the melt and cast functions.

We first melt the 2 columns of the transaction data. This will create data that we can pivot: customer, variable, and value. We only have 1 variable here – Offer.

We then want to cast this data by putting value first (the offer number) in rows, customer names in the columns. This is done by using R’s style of formula input: Value ~ Customers.

We then want to count each occurrence of customers in the row. This can be done by using a function that takes customer names as input, counts how many there are, and returns the result. Simply: function(x) length(x)

Lastly, we want to combine the data from offers with our new transaction matrix. This is done using cbind (column bind) which glues stuff together automagically.

Lots of explanations for 3 lines of code!

#Load Library
library(reshape)
 
# Melt transactions, cast offer by customers
pivot<-melt(transactions[1:2])
pivot<-(cast(pivot,value~Customer.Last.Name,fill=0,fun.aggregate=function(x) length(x)))
 
# Bind to offers, we remove the first column of our new pivot because it's redundant. 
pivot<-cbind(offers,pivot[-1])

We can output the pivot table into a new CSV file called pivot.

write.csv(file="pivot.csv",pivot)

Step 2: Clustering
To cluster the data we will use only the columns starting from “Adams” until “Young”.
We will use the fpc library to run the clustering with 4 clusters – the pamk() function we call below is a more robust, medoid-based relative of k-means.
To use the algorithm we will need to transpose the transaction matrix with t().

That’s all you need: 4 lines of code!

# Load library
library(fpc)
 
# Only use customer transaction data and we will rotate the matrix
cluster.data<-pivot[,8:length(pivot)]
cluster.data<-t(cluster.data)
 
# Run the clustering using pamk (a more robust, medoid-based relative of k-means) with 4 clusters. 
cluster.kmeans<-pamk(cluster.data,k=4)
 
# Use this to view the clusters
View(cluster.kmeans$pamobject$clustering)

Step 3: Solving for Cluster Centers

This is not a necessary step in R! Pat yourself on the back, get another cup of tea or coffee, and move on to Step 4.

Step 4: Top deals by clusters

To get the top deals we will have to do a little bit of data manipulation. First we need to combine our clusters and transactions. Notably, the lengths of the ‘tables’ holding transactions and clusters are different, so we need a way to merge the data … so we use the merge() function and give our columns sensible names:

#Merge Data
cluster.deals<-merge(transactions[1:2],cluster.kmeans$pamobject$clustering,by.x = "Customer.Last.Name", by.y = "row.names")
colnames(cluster.deals)<-c("Name","Offer","Cluster")

We then want to repeat the pivoting process to get Offers in rows and clusters in columns counting the total number of transactions for each cluster. Once we have our pivot table we will merge it with the offers data table like we did before:

# Melt, cast, and bind
cluster.pivot<-melt(cluster.deals,id=c("Offer","Cluster"))
cluster.pivot<-cast(cluster.pivot,Offer~Cluster,fun.aggregate=length)
cluster.topDeals<-cbind(offers,cluster.pivot[-1])

We can then reproduce the Excel version by writing to a CSV file:

write.csv(file="topdeals.csv",cluster.topDeals,row.names=F)

Note
It’s important to note that cluster 1 in Excel does not correspond to cluster 1 in R. It’s just the way the algorithms run. Moreover, the allocation of clusters might differ slightly because of the nature of the k-means algorithm. However, your insights will be the same; in R we also see that cluster 3 prefers Pinot Noir and cluster 4 has a strong preference for Offer 22.

Entire Code

# Read data
offers<-read.csv(file="OfferInformation.csv")
transactions<-read.csv(file="Transactions.csv")
 
# Create transaction matrix
library(reshape)
pivot<-melt(transactions[1:2])
pivot<-(cast(pivot,value~Customer.Last.Name,fill=0,fun.aggregate=function(x) length(x)))
pivot<-cbind(offers,pivot[-1])
 
write.csv(file="pivot.csv",pivot)
 
# Cluster 
library(fpc)
cluster.data<-pivot[,8:length(pivot)]
cluster.data<-t(cluster.data)
cluster.kmeans<-pamk(cluster.data,k=4)
 
# Merge Data
cluster.deals<-merge(transactions[1:2],cluster.kmeans$pamobject$clustering,by.x = "Customer.Last.Name", by.y = "row.names")
colnames(cluster.deals)<-c("Name","Offer","Cluster")
 
# Get top deals by cluster
cluster.pivot<-melt(cluster.deals,id=c("Offer","Cluster"))
cluster.pivot<-cast(cluster.pivot,Offer~Cluster,fun.aggregate=length)
cluster.topDeals<-cbind(offers,cluster.pivot[-1])
 
write.csv(file="topdeals.csv",cluster.topDeals,row.names=F)