
Collaborative Filtering with Python


To start, I have to say that it is really heartwarming to get feedback from readers, so thank you for the engagement. This post is a response to a reader request that followed the post on collaborative filtering with R.

The approach used in that post required the use of loops on several occasions.
Loops in R are infamous for being slow. In fact, it is probably best to avoid them altogether.
One way to avoid loops in R is to not use R (mind: #blown). We can use Python, which is flexible and performs better than R for this particular scenario.
For the record, I am still learning Python. This is the first script I have written in Python.

Refresher: The Last.FM dataset

The data set contains information about users, gender, age, and which artists they have listened to on Last.FM.
In our case we only use Germany’s data and transform the data into a frequency matrix.

We will use this to complete two types of collaborative filtering:

  • Item Based: uses similarities between the items' consumption histories
  • User Based: considers similarities between users' consumption histories and item similarities

We begin by downloading our dataset:

Fire up your terminal and launch your favourite IDE. I use IPython and Notepad++.

Let's load the libraries we will use for this exercise (pandas and scipy):

# --- Import Libraries --- #
import pandas as pd
from scipy.spatial.distance import cosine

We then want to read our data file.

# --- Read Data --- #
data = pd.read_csv('data.csv')

If you want to check out the data set you can do so using data.head():

 
data.head(6).iloc[:,2:8]
 
   abba  ac/dc  adam green  aerosmith  afi  air
0     0      0           0          0    0    0
1     0      0           1          0    0    0
2     0      0           0          0    0    0
3     0      0           0          0    0    0
4     0      0           0          0    0    0
5     0      0           0          0    0    0

Item Based Collaborative Filtering

Reminder: In item based collaborative filtering we do not care about the user column.
So we drop the user column (don't worry, we'll get it back later).

# --- Start Item Based Recommendations --- #
# Drop any column named "user"
data_germany = data.drop('user', axis=1)

Before we calculate our similarities we need a place to store them. We create a variable called data_ibs which is a pandas DataFrame (… think of this as an Excel table … but it's vegan with super powers …)

# Create a placeholder dataframe listing item vs. item
data_ibs = pd.DataFrame(index=data_germany.columns,columns=data_germany.columns)

Now we can start to fill in the similarities. We will use cosine similarity.
In R we needed to write a function to achieve this the way we wanted. In Python, the SciPy library has a function that allows us to do this without customization.
In essence, the cosine similarity takes the sum of the products of the first and second columns, then divides that by the product of the square roots of the sums of squares of each column.

This is a fancy way of saying “loop through each column, and apply a function to it and the next column”.
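In symbols, for two item columns a and b of the frequency matrix, that is:

$$ \text{similarity}(a, b) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}} $$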

# Let's fill in those empty spaces with cosine similarities
# Loop through the columns
for i in range(0, len(data_ibs.columns)):
    # Loop through the columns for each column
    for j in range(0, len(data_ibs.columns)):
        # Fill in placeholder with cosine similarities
        data_ibs.iloc[i, j] = 1 - cosine(data_germany.iloc[:, i], data_germany.iloc[:, j])
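As an aside, the whole matrix can also be built without the double loop. Here is a sketch of a vectorized alternative, assuming scikit-learn is installed (the rest of the post does not depend on it):

# Vectorized sketch (assumes scikit-learn is available)
# cosine_similarity works on rows, so transpose to make items the rows
from sklearn.metrics.pairwise import cosine_similarity
data_ibs = pd.DataFrame(cosine_similarity(data_germany.T),
                        index=data_germany.columns,
                        columns=data_germany.columns)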

With our similarity matrix filled out we can look for each item's "neighbours" by looping through data_ibs, sorting each column in descending order, and grabbing the names of the top 10 songs.

# Create a placeholder for the 10 closest neighbours to each item
data_neighbours = pd.DataFrame(index=data_ibs.columns,columns=range(1,11))

# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(0,len(data_ibs.columns)):
    data_neighbours.iloc[i,:10] = data_ibs.iloc[:,i].sort_values(ascending=False)[:10].index
 
# --- End Item Based Recommendations --- #

Done!

data_neighbours.head(6).loc[:,2:4]
 
                                      2                3              4
a perfect circle                   tool            dredg       deftones
abba                            madonna  robbie williams  elvis presley
ac/dc             red hot chili peppers        metallica    iron maiden
adam green               the libertines      the strokes   babyshambles
aerosmith                            u2     led zeppelin      metallica
afi                funeral for a friend     rise against   fall out boy
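To look up the neighbours of a single artist, you can index this table by name ('abba' here is just an example taken from the sample above):

print(data_neighbours.loc['abba'])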

User Based Collaborative Filtering

The process for creating a User Based recommendation system is as follows:

  • Have an Item Based similarity matrix at your disposal (we do…wohoo!)
  • Check which items the user has consumed
  • For each item the user has consumed, get the top X neighbours
  • Get the consumption record of the user for each neighbour.
  • Calculate a similarity score using some formula
  • Recommend the items with the highest score

Let's begin.

We first need a formula. We use the sum of the products of two vectors (lists, if you will) containing the purchase history and the item similarity figures. We then divide that figure by the sum of the similarities in the respective vector.
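In equation form, for a purchase history vector h and a similarity vector s over an item's neighbours:

$$ \text{score}(h, s) = \frac{\sum_j h_j s_j}{\sum_j s_j} $$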
The function looks like this:

# --- Start User Based Recommendations --- #
 
# Helper function to get similarity scores
def getScore(history, similarities):
    return sum(history*similarities)/sum(similarities)
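As a quick sanity check, here is a toy call with made-up numbers: a user who listened to the first and third of three neighbours, with hypothetical similarities 0.9, 0.5 and 0.4 (this reuses the pandas import from above):

# Toy example (hypothetical values)
history = pd.Series([1, 0, 1])
similarities = pd.Series([0.9, 0.5, 0.4])
print(getScore(history, similarities))  # (0.9 + 0.4) / 1.8 ≈ 0.72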

The rest is a matter of applying this function to the data frames in the right way.
We start by creating a variable to hold our similarity data.
This is basically the same as our original data but with nothing filled in except the headers.

# Create a place holder matrix for similarities, and fill in the user name column
data_sims = pd.DataFrame(index=data.index,columns=data.columns)
data_sims.iloc[:,:1] = data.iloc[:,:1]

We now loop through the rows and columns filling in empty spaces with similarity scores.

Note that we score items that the user has already consumed as 0, because there is no point recommending it again.

# Loop through all rows, skip the user column, and fill with similarity scores
for i in range(0,len(data_sims.index)):
    for j in range(1,len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]

        if data.iloc[i, j] == 1:
            data_sims.iloc[i, j] = 0
        else:
            product_top_names = data_neighbours.loc[product][1:10]
            product_top_sims = data_ibs.loc[product].sort_values(ascending=False)[1:10]
            user_purchases = data_germany.loc[user, product_top_names]

            data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)

We can now produce a matrix of User Based recommendations as follows:

# Get the top songs
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.iloc[:,0] = data_sims.iloc[:,0]

Instead of having the matrix filled with similarity scores, however, it would be nice to see the song names.
This can be done with the following loop:

# Instead of top song scores, we want to see names
for i in range(0,len(data_sims.index)):
    data_recommend.iloc[i,1:] = data_sims.iloc[i,1:].sort_values(ascending=False).index[:6]
# Print a sample
print(data_recommend.iloc[:11,:4])

Done! Happy recommending ;]

   user                      1                      2                3
0     1         flogging molly               coldplay        aerosmith
1    33  red hot chili peppers          kings of leon        peter fox
2    42                 oomph!            lacuna coil        rammstein
3    51            the subways              the kooks  franz ferdinand
4    62           jack johnson                incubus       mando diao
5    75             hoobastank             papa roach           sum 41
6   130      alanis morissette  the smashing pumpkins        pearl jam
7   141           machine head        sonic syndicate          caliban
8   144                editors              nada surf      the strokes
9   150                placebo            the subways     eric clapton
10  205             in extremo          nelly furtado        finntroll

Entire Code

 
# --- Import Libraries --- #
 
import pandas as pd
from scipy.spatial.distance import cosine
 
# --- Read Data --- #
data = pd.read_csv('data.csv')
 
# --- Start Item Based Recommendations --- #
# Drop any column named "user"
data_germany = data.drop('user', axis=1)
 
# Create a placeholder dataframe listing item vs. item
data_ibs = pd.DataFrame(index=data_germany.columns,columns=data_germany.columns)
 
# Let's fill in those empty spaces with cosine similarities
# Loop through the columns
for i in range(0, len(data_ibs.columns)):
    # Loop through the columns for each column
    for j in range(0, len(data_ibs.columns)):
        # Fill in placeholder with cosine similarities
        data_ibs.iloc[i, j] = 1 - cosine(data_germany.iloc[:, i], data_germany.iloc[:, j])
 
# Create a placeholder for the 10 closest neighbours to each item
data_neighbours = pd.DataFrame(index=data_ibs.columns,columns=range(1,11))
 
# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(0,len(data_ibs.columns)):
    data_neighbours.iloc[i,:10] = data_ibs.iloc[:,i].sort_values(ascending=False)[:10].index
 
# --- End Item Based Recommendations --- #
 
# --- Start User Based Recommendations --- #
 
# Helper function to get similarity scores
def getScore(history, similarities):
    return sum(history*similarities)/sum(similarities)
 
# Create a place holder matrix for similarities, and fill in the user name column
data_sims = pd.DataFrame(index=data.index,columns=data.columns)
data_sims.iloc[:,:1] = data.iloc[:,:1]
 
# Loop through all rows, skip the user column, and fill with similarity scores
for i in range(0,len(data_sims.index)):
    for j in range(1,len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]

        if data.iloc[i, j] == 1:
            data_sims.iloc[i, j] = 0
        else:
            product_top_names = data_neighbours.loc[product][1:10]
            product_top_sims = data_ibs.loc[product].sort_values(ascending=False)[1:10]
            user_purchases = data_germany.loc[user, product_top_names]

            data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)
 
# Get the top songs
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.iloc[:,0] = data_sims.iloc[:,0]
 
# Instead of top song scores, we want to see names
for i in range(0,len(data_sims.index)):
    data_recommend.iloc[i,1:] = data_sims.iloc[i,1:].sort_values(ascending=False).index[:6]
 
# Print a sample
print(data_recommend.iloc[:11,:4])



Counting Tweets in R – Substrings, Chaining, and Grouping


I was recently sent an email about transforming tweet data and presenting simple stats about tweets grouped by some category. I thought I would share how to do this:

A tweet is basically composed of text, hashtags (text prefixed with #), mentions (text prefixed with @), and lastly hyperlinks (text that follows some form of the pattern “http://_____.__”). We want to count these by some grouping – in this case we will group by user/character.

I prepared a sample data set containing some made up tweets by Sesame Street characters. You can download it by clicking: Sesame Street Faux Tweets.

Fire up R then load up our tweets into a dataframe:

# Load tweets and convert to dataframe
tweets<-read.csv('sesamestreet.csv', stringsAsFactors=FALSE)
tweets <- as.data.frame(tweets)

We will use 3 libraries: stringr for string manipulation, dplyr for chaining, and ggplot2 for some graphs.

# Libraries
library(dplyr)
library(ggplot2)
library(stringr)

We now want to create the summaries and store them in a list or dataframe of their own. We will use dplyr to do the grouping, and stringr with some regex to apply filters on our tweets. If you do not know what is in the tweets dataframe go ahead and run head(tweets) to get an idea before moving forward.

gluten <- tweets %>%
	group_by(character) %>%
	summarise(total=length(text), 
		    hashtags=sum(str_count(text,"#(\\d|\\w)+")),
		    mentions=sum(str_count(text,"@(\\d|\\w)+")),
		    urls=sum(str_count(text,"^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)"))
	)

The code above starts with the variable we will store our list in … I called it “gluten” for no particular reason.

  1. We apply transformations to our tweets dataframe. dplyr knows to chain into the next command because we use the pipe operator “%>%”.
  2. We group by the first column “character” using the function group_by().
  3. We then create summary stats using the function summarise() – note the American spelling will work too 😛
  4. We create a summary called total which is equal to the number of tweets (i.e. the length of the list that has been grouped).
  5. We then count the hashtags by using the regex “#(\\d|\\w)+” and the function str_count() from the stringr package. If this regex does not make sense, you can use many tools online to explain it (there is also a quick example right after this list).
  6. We repeat the same step for mentions and URLs.
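As a quick check, here is the hashtag pattern applied to a made-up tweet (the tweet text is purely hypothetical):

# Two hashtags, so str_count() should return 2
str_count("Me want #cookies and #milk @elmo", "#(\\d|\\w)+")
# [1] 2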

Phew. Now let's see what that outputs by typing “gluten” into the console:

       character total hashtags mentions urls
1       Big Bird     2        5        1    0
2 Cookie Monster     3        2        1    1
3  Earnie & Bert     4        0        4    0

Which is exactly what we would see if we opened up the CSV file.

We can now create simple plots using the following code:

 
# Plots
ggplot(data=gluten) + aes(x=reorder(character,-total), y=total) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Tweets") + xlab("Character") + ggtitle("Characters Ranked by Total Tweets")

ggplot(data=gluten) + aes(x=reorder(character,-hashtags), y=hashtags) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Hash Tags") + xlab("Character") + ggtitle("Characters Ranked by Total Hash Tags")

ggplot(data=gluten) + aes(x=reorder(character,-mentions), y=mentions) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Mentions") + xlab("Character") + ggtitle("Characters Ranked by Total Mentions")

ggplot(data=gluten) + aes(x=reorder(character,-urls), y=urls) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("URLs") + xlab("Character") + ggtitle("Characters Ranked by Total URLs")

Admittedly they’re not the prettiest plots; I got lazy ^_^’

Enjoy! If you have any questions, leave a comment!!


Why we die – Mortality in Kuwait with R


I came across Randy Olson’s (editor at @DataIsBeautiful) tweet about causes of death in the US.
I thought I would replicate the information here using R and localize it for Kuwait – nothing fancy.

I will keep the code to the end because I think the results are quite interesting. The data source is the UN Data portal.


Causes of Death in the General Population

I always had the impression that the biggest issue with mortality in Kuwait was car accidents. Perhaps this is a bias introduced by media coverage. I always thought that if someone managed to survive their entire life in Kuwait without dying in an accident, then only one other thing could get them ~ and that was cancer. This is not far from the truth:

General Causes of Death in Kuwait

What did catch me off guard is the number of people who die from circulatory system diseases and heart disease. The numbers are not only large, they trump both the accident and cancer figures. Interestingly enough, respiratory system diseases start to show up in 2006, just as circulatory and pulmonary problems become more prevalent.

I thought that surely this is concentrated within a particular demographic group.
So I decided to split the data by gender and age.

Causes of Death by Gender

Looking at the gender differences, the first eye-popping fact is that fewer women seem to be dying … this is misleading because the population is heavily biased towards men. There are about 9 men for every 5 women in Kuwait.

Cause of Death by Gender

The other eye-popping item is accidents. Fewer women pass away from accidents compared to men – a lot fewer! Is this an indication that women are safer drivers than their counterparts? Perhaps. In some nations this figure would indeed be zero because of social and legal constraints … it’s not necessarily good news … but it does stand out!

Proportionally there is a higher rate of mortality due to cancer in the female population vs. the male population.

Lastly, men seem to be more susceptible to death from heart disease and circulatory system diseases. This might make you ask why. Heart and circulatory system diseases are exacerbated by sedentary lifestyles, poor diets, and other factors such as tobacco consumption. We have already looked at obesity in Kuwait … perhaps a deeper dive might shed some light on this matter.

For now, let’s move on … surely younger people do not suffer from heart disease …

Causes of Death by Age Group

This one confirms that if an accident does not get you before you’re 25, then the rest of the diseases are coming your way.
People fall victim to circulatory, respiratory, and heart diseases at extremely young ages. In fact, what we see here is that after the age of 40 the mortality rates for these three diseases are similar across age groups.

On the other hand, accident mortalities go down in the older age groups but are displaced by cancer. What is terribly depressing about this graph is the number of people below the age of 19 who die in accidents. These might be just numbers, but in reality they are very real names to families.

Cause of Death by Age Group

Take-aways

The graphs were just a fun way to play with R. What we can take away is that of the 5 main causes of mortality in Kuwait – Cancer, Heart Diseases, Circulatory Diseases, Respiratory Diseases, and Accidents – 4 of them are addressable through policy, regulation, and raising public awareness for social/behavioural impact.

Code

You can download the Kuwait dataset here or from the UN’s Data Portal.

Load up R and run the code below – fully commented for your geeky enjoyment:

# In Libraries we Trust
library(ggplot2)
library(dplyr)
 
# Read Data
data.mortality<-read.csv('kuwait.csv')
 
##################
#  Data Munging  #
##################
 
# Change Year into a categorical variable
data.mortality$Year<-as.factor(data.mortality$Year)
 
# Rename our columns
names(data.mortality)<-c("Country","Year","Area","Sex","Age","Cause","Record","Reliability","Source.Year","Value","Value.Notes")
 
# Data cleaning: here we group some items together for simplification.
# E.g. all forms of neoplasms are grouped under "Cancer".
 
data.mortality$Cause<-gsub(pattern=", ICD10","",x=data.mortality$Cause)
data.mortality[grep(pattern="eoplasm",x=data.mortality$Cause),]$Cause<-"Cancer"
data.mortality[grep(pattern="ccidents",x=data.mortality$Cause),]$Cause<-"Accidents"
data.mortality[grep(pattern="not elsewhere classified",x=data.mortality$Cause),]$Cause<-"Unknown"
data.mortality[grep(pattern="External causes",x=data.mortality$Cause),]$Cause<-"Unknown"
data.mortality[grep(pattern="Congenital malformations",x=data.mortality$Cause),]$Cause<-"Congenital malformations"
data.mortality[grep(pattern="Certain conditions originating in the perinatal period",x=data.mortality$Cause),]$Cause<-"Perinatal period conditions"
data.mortality[grep(pattern="Endocrine, nutritional and metabolic",x=data.mortality$Cause),]$Cause<-"Endocrine, Nutritional & Metabolic"
data.mortality[grep(pattern="Diseases of the respiratory",x=data.mortality$Cause),]$Cause<-"Respiratory Disease"
data.mortality[grep(pattern="Diseases of the circulatory system",x=data.mortality$Cause),]$Cause<-"Circulatory System"
data.mortality[grep(pattern="Hypertensive diseases",x=data.mortality$Cause),]$Cause<-"Cerebral"
data.mortality[grep(pattern="Ischaemic heart diseases",x=data.mortality$Cause),]$Cause<-"Heart Diseases"
data.mortality[grep(pattern="Cerebrovascular diseases",x=data.mortality$Cause),]$Cause<-"Cerebral"
 
 
#########################
# Data Transformations  #
#########################
 
# Use the dplyr library to group items from the original data set
# We want a general understanding of causes of death
# We subset out All causes to avoid duplication
# We subset out age groups that cause duplication
# We group by Country, Year and Cause
# We create a summary variable "Persons" that is the sum of the incidents
# We sort by Cause for pretty graphs
 
kw.general <- data.mortality %>%
  subset(!(Cause %in% "All causes")) %>%
  subset(Country %in% "Kuwait") %>%
  subset(!(Age %in% c("Total","Unknown","85 +","95 +","1","2","3","4","0"))) %>%
  group_by(Country, Year, Cause) %>%
  summarise(Persons=sum(Value)) %>%
  arrange(Cause)
 
# We want an understanding of causes of death by age group
# We subset out All causes to avoid duplication
# We subset out age groups that cause duplication
# We group by Country, Year, Age and Cause
# We create a summary variable "Persons" that is the sum of the incidents
# We sort by Cause for pretty graphs
 
kw.age <- data.mortality %>%
  subset(!(Cause %in% "All causes")) %>%
  subset(Country %in% "Kuwait") %>%
  subset(!(Age %in% c("Total","Unknown","85 +","95 +","1","2","3","4","0"))) %>%
  group_by(Country, Year, Age, Cause) %>%
  summarise(Persons=sum(Value)) %>%
  arrange(Cause)
 
# We reorder our age groups manually for pretty graphs
kw.age$Age<-(factor(kw.age$Age,levels(kw.age$Age)[c(1,2,6,9,12,3,15,4,5,7,8,10,11,13,14,16:28)]))
 
 
# We want an understanding of causes of death by gender
# We subset out All causes to avoid duplication
# We subset out age groups that cause duplication
# We group by Country, Year, Gender and Cause
# We create a summary variable "Persons" that is the sum of the incidents
# We sort by Cause for pretty graphs
 
kw.sex <- data.mortality %>%
  subset(!(Cause %in% "All causes")) %>%
  subset(Country %in% "Kuwait") %>%
  subset(!(Age %in% c("Total","Unknown","85 +","95 +","1","2","3","4","0"))) %>%
  group_by(Country, Year, Sex, Cause) %>%
  summarise(Persons=sum(Value)) %>%
  arrange(Cause)
 
 
########################################
# Graphing, Plotting, Dunking Cookies  #
########################################
 
# We will limit our graphs to causes responsible for at least a minimum number of persons
# The main reason is that we do not want a graph that looks like a chocolate chip cookie
PersonLimit <- 200
 
# Plot the General data
General<-ggplot(subset(kw.general,Persons>=PersonLimit), aes(x = Year, y = Persons)) +
  geom_bar(aes(fill=Cause),stat='identity', position="stack")+
  ggtitle(paste("Causes of Death in Kuwait\n\n(Showing Causes Responsible for the Death of ",PersonLimit," Persons or More)\n\n"))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.text=element_text(size=14),
        axis.text=element_text(size=12),
        axis.title=element_text(size=12))+ 
  scale_fill_brewer( palette = "RdBu")+
  geom_text(aes(label = Persons,y=Persons,ymax=Persons), size = 3.5,  vjust = 1.5, position="stack",color=I("#000000")) 
 
 
# Reset the Person Limit
PersonLimit <- 150
 
# Plot the Gender data faceted by Gender
Gender<-ggplot(subset(kw.sex,Persons>=PersonLimit), aes(x = Year, y = Persons)) +
  geom_bar(aes(fill=Cause),stat='identity', position="stack")+
  facet_wrap(~Sex)+
  ggtitle(paste("Causes of Death in Kuwait by Gender\n\n(Showing Causes Responsible for the Death of ",PersonLimit," Persons or More)\n\n"))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.text=element_text(size=14),
        axis.text=element_text(size=12),
        axis.title=element_text(size=12)
  )+ 
  scale_fill_brewer( palette = "RdBu" )+
  geom_text(aes(label = Persons,y=Persons,ymax=Persons), size = 3.5,  vjust = 1.5, position="stack",color=I("#000000")) 
 
# Reset the Person Limit
PersonLimit <- 30
 
# Plot the Age group data faceted by Age
Age<-ggplot(subset(kw.age,Persons>=PersonLimit), aes(x = Year, y = Persons)) +
  geom_bar(aes(fill=Cause),stat='identity', position="stack")+
  facet_wrap(~Age)+
  ggtitle(paste("Causes of Death in Kuwait by Age Group\n\n(Showing Causes Responsible for the Death of ",PersonLimit," Persons or More)\n\n"))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        legend.text=element_text(size=14),
        axis.text=element_text(size=12),
        axis.title=element_text(size=12)
  )+ 
  scale_fill_brewer( palette = "RdBu" ) 
 
# Save all three plots
ggsave(filename="General.png",plot=General,width=12,height=10)
ggsave(filename="Age.png",plot=Age,width=12,height=10)
ggsave(filename="Gender.png",plot=Gender,width=12,height=10)