In the summer of 2016, I took a class titled “Text Mining.” One of the units was on sentiment analysis with the AFINN sentiment dictionary. In short, we wrote Python scripts to assign a ‘mood score’ to strings of text.

Twitter feeds are text strings I'm extremely familiar with. Below is the code I used to pull my brother-in-law’s tweets. I still have a lot of Python to learn, but the Text Mining class gave me a fantastic grasp of how to use Python, pandas, and NumPy for data analysis.

Mitchell was supportive of this experiment.

Here is the presentation we used to show our findings:


Setting up our Python environment

Our professor gave us a function that turns the Tweet objects Tweepy returns into a pandas data frame. I was a little surprised that Tweepy didn’t already have a function for this, but I suppose a data frame isn’t a native object in Python the way it is in R.

#__future__ imports must come before any other statement,
#otherwise Python raises a SyntaxError
from __future__ import division

import pandas as pd
import numpy as np
import tweepy
from tweepy import OAuthHandler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
import re
from nltk.corpus import stopwords

pd.set_option('display.max_colwidth', 1500)


#Our data frame function from class
def toDataFrame(tweets):
    
    #Data Frame Object
    DataSet = pd.DataFrame()

    DataSet['userID'] = [tweet.user.id for tweet in tweets]
    DataSet['id'] = [tweet.id for tweet in tweets]
    DataSet['userName'] = [tweet.user.name for tweet in tweets]
    DataSet['tweetText'] = [tweet.text for tweet in tweets]
    DataSet['tweetCreated'] = [tweet.created_at for tweet in tweets]
    DataSet['userLocation'] = [tweet.user.location for tweet in tweets]

    return DataSet

Twitter setup

Pretty standard stuff here. I took out my keys.

#This cell sets up Twitter

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit = True)

#the object to throw our tweets in
results = []

Use Tweepy to gather Mitchell’s (original) tweets

This Tweepy cursor will pull as many of Mitchell’s tweets as Twitter will hand over. Once I had them all, I noticed that a large chunk of Mitchell’s timeline is retweets, so I wrote a quick loop to take those out of our dataset. I want to know how sad Mitchell is, not how sad other people are.

#our tweets
tweets = tweepy.Cursor(api.user_timeline, id="bigblondeox055").items()

#keep only original tweets; drop anything following the "RT @" convention
for tweet in tweets:
    if 'RT @' not in tweet.text:
        results.append(tweet)
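
One caveat: the 'RT @' check only catches retweets that follow the old text convention. Tweepy's Status objects carry a retweeted_status attribute whenever a tweet is a native retweet, so an attribute check would be a more reliable filter. A sketch of that alternative:

#alternative filter sketch: native retweets expose a retweeted_status
#attribute, even when the text doesn't start with "RT @"
results = []
for tweet in tweepy.Cursor(api.user_timeline, id="bigblondeox055").items():
    if not hasattr(tweet, 'retweeted_status'):
        results.append(tweet)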

Save the Tweets as a .CSV for analysis in other applications

The .CSV for this data exercise can be found here.

#print str(round((len(results)/10000)*100,2)) + "% of the last 10,000 tweets were original."
tweet_frame = toDataFrame(results)
tweet_frame.to_csv("~/Dropbox/finalproj/mitch.csv",encoding='utf-8')

Use SKLearn's CountVectorizer to tokenize and count our words

#tweet_frame['newtext'] = map(lambda x: x.decode('latin-1').encode('ascii','ignore'), tweet_frame['tweetText'])

print tweet_frame.shape

tweet_frame['newtext'] = tweet_frame['tweetText'].str.lower()

#lowercase = False because we already lowercased the text above
prelim = CountVectorizer(binary=False, lowercase = False, stop_words = 'english')
prelim_dm = prelim.fit_transform(tweet_frame['newtext'])
print prelim_dm.shape

names = prelim.get_feature_names()

#column sums of the document-term matrix give each token's total count
count = np.sum(prelim_dm.toarray(), axis = 0).tolist()
count_df = pd.DataFrame(count, index = names, columns = ['count'])

#.sort() is deprecated in pandas; sort_values() does the same thing
count_df.sort_values('count', ascending = False).head(10)
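
If CountVectorizer is unfamiliar, here is a toy example (mine, not Mitchell's data) of what fit_transform produces:

#two tiny documents; 'english' stop words drop 'the' and 'at'
docs = ['the dog barked at the dog', 'the cat slept']
cv = CountVectorizer(stop_words = 'english')
dm = cv.fit_transform(docs)
print cv.get_feature_names()  # ['barked', 'cat', 'dog', 'slept']
print dm.toarray()            # [[1 0 2 0], [0 1 0 1]]

Each row of the matrix is a document, each column a token, and each cell the number of times that token appears, which is exactly what we sum over above to get the top-ten words.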

Bring in the AFINN sentiment dictionary with our custom Emoji scores and loop through every word in every Tweet

In order to analyze Mitchell’s tweets, my partner and I had to add emoji to the AFINN dictionary and assign each one a score. The file we created can be downloaded right here.

#each line of the AFINN file is one "word<TAB>score" pair
afinn = {}
with open('/Users/zfleeman/Dropbox/TM/finalproj/AFINN-111_emoji.txt') as afinnfile:
    for line in afinnfile:
        word, score = line.strip().split('\t')
        afinn[word] = int(score)

def afinn_sent_count(inputstring):
    #sum the AFINN score of every recognized word in the string
    sentcount = 0
    for word in inputstring.split():
        if word in afinn:
            sentcount = sentcount + afinn[word]
    return sentcount

def afinn_sent(inputstring):
    #bucket the summed score into a sentiment label
    sentcount = afinn_sent_count(inputstring)
    if sentcount < 0:
        sentiment = 'Negative'
    elif sentcount > 0:
        sentiment = 'Positive'
    else:
        sentiment = 'Neutral'
    return sentiment
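
A quick sanity check on a made-up string (assuming the standard AFINN-111 scores, where 'happy' is +3 and 'bad' is -3):

print afinn_sent_count('happy happy bad')  # 3 + 3 - 3 = 3
print afinn_sent('happy happy bad')        # 'Positive'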
    
tweet_frame['afinn'] = tweet_frame['newtext'].apply(afinn_sent)
tweet_frame['afinn_score'] = tweet_frame['newtext'].apply(afinn_sent_count)
tweet_frame.to_csv("~/Dropbox/TM/finalproj/mitch.csv",encoding='utf-8')
tweet_frame['afinn'].value_counts()

Load the Tweets into R for data manipulation and visualization

There are Python plotting packages like matplotlib, but I'm pretty comfortable aggregating and plotting data with R and its packages, so I decided to stick with that. I used the lubridate package to put the dates into weekly buckets, and even though a plain plot() call in R could have produced the happiness-over-time plot, I liked some of the added features of ggplot2.

library(ggplot2)
library(lubridate)

mitch <- read.csv("mitch.csv")

mitch$tweetCreated <- as.Date(substr(mitch$tweetCreated, 1, 10))
mitch$week <- week(mitch$tweetCreated)
mitch <- subset(mitch, mitch$week > 13)
mitch$week <- as.factor(mitch$week)

# roll the AFINN scores up into weekly sums
mitch <- aggregate(mitch$afinn_score ~ mitch$week, FUN = "sum")
colnames(mitch) <- c("week", "afinn_score")
mitch$week <- paste("Week", mitch$week)

# label week 8 ("Graduation"); R recycles the length-8 vector into the
# 16-row frame, so clear the recycled copy that lands in row 16
mitch$label[8] <- "Graduation"
mitch$label[16] <- NA

ggplot(data = mitch, aes(x = week, y = afinn_score, label = label)) +
  geom_label(nudge_y = -10) +
  geom_line(group = 1) +
  labs(title = "Mitchell Wilfer Happiness Over Time -- 2016",
       y = "AFINN w/ Emoji Score", x = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
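
As an aside, the weekly roll-up didn't have to leave Python. A pandas sketch, assuming the tweet_frame from earlier in this post:

#resample('W') groups the DatetimeIndex into weekly buckets
tweet_frame['tweetCreated'] = pd.to_datetime(tweet_frame['tweetCreated'])
weekly = tweet_frame.set_index('tweetCreated').resample('W')['afinn_score'].sum()
print weekly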