Tuesday, April 5, 2011

Analyzing data from Twitter feeds

On average, 1 billion tweets are posted per week. This gigantic amount of data is a great opportunity to analyze live trends and opinions... if you can filter out the data you are looking for.

I attended a very interesting conference at SXSW that enticed me to try this out. The idea is to pull some data from social media websites (Twitter in this example) and analyze it.

I hacked together a little program that downloads tweets, stores them in a database and runs some basic analysis on the data. It uses twitter4j and sqlite4java. The source code is available upon request.

My original plan was to download all tweets containing the word SXSW that were posted during the SXSW conference. Unfortunately, Twitter limits its search results to 1500 tweets per day, so I was not able to download them all. That's why I decided to narrow my search to tweets posted during the final keynote from the Austin Convention Center, using its GPS coordinates. With these new filters, I was able to retrieve 2600 tweets in 2 days.
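The time and location filters can also be expressed as a local check on tweets that have already been downloaded. The following is a minimal sketch of that idea, not the search query my program actually sent to Twitter: the class name, the method names and the choice of a great-circle (haversine) distance are assumptions made for illustration, and the center point, radius and time window are left as parameters.

```java
import java.time.Instant;

// Hypothetical helper: keeps tweets posted within a time window and within
// a given radius of a center point. All names here are illustrative.
final class KeynoteWindowFilter {
    // Mean Earth radius in kilometers, used by the haversine formula.
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two (latitude, longitude) points in degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True when postedAt lies in the closed interval [start, end].
    static boolean inWindow(Instant postedAt, Instant start, Instant end) {
        return !postedAt.isBefore(start) && !postedAt.isAfter(end);
    }

    // Combines both conditions for a single geotagged tweet.
    static boolean matches(double lat, double lon, Instant postedAt,
                           double centerLat, double centerLon, double radiusKm,
                           Instant start, Instant end) {
        return inWindow(postedAt, start, end)
                && distanceKm(lat, lon, centerLat, centerLon) <= radiusKm;
    }
}
```

Calling `KeynoteWindowFilter.matches(...)` with the venue's coordinates, a small radius and the start and end of the keynote would keep only the tweets of interest. Tweets without a geotag would need to be handled separately.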

First, I counted the number of occurrences of each word. After excluding common punctuation, prepositions, articles and numbers, I looked at the most common words used in these tweets:
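For readers who want to try the counting step themselves, here is a minimal sketch in plain Java. It is not my actual program: the class name, the method names and the very short stop-word list are assumptions made for illustration, and the tokenizer simply splits on anything that is not a letter or a digit.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: counts word occurrences across a list of tweets.
final class WordCounter {
    // Tiny illustrative stop-word list (articles, prepositions, etc.).
    // A real run would need a much longer one.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "at", "in", "of", "on", "to", "and", "is");

    static Map<String, Integer> countWords(List<String> tweets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tweet : tweets) {
            // Splitting on non-letter, non-digit characters drops punctuation.
            for (String token : tweet.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                boolean isNumber = !token.isEmpty() && token.chars().allMatch(Character::isDigit);
                if (token.isEmpty() || isNumber || STOP_WORDS.contains(token)) {
                    continue; // skip empty tokens, pure numbers and stop words
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the n most frequent words, most frequent first (ties sorted by word).
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Note that mixed tokens such as 1for1 are kept, because only tokens made entirely of digits are treated as numbers.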

The most tweeted words are the users' location (Austin, Convention, Center), followed by the name of the keynote speaker, Blake Mycoskie.

Only a few words are tweeted repeatedly. Out of the 4840 distinct words my program counted, 97% were used 20 times or fewer, as you can see below.

The vocabulary diversity chart above demonstrates that the words occurring more than 20 times are representative of the tweeting trend at that time and place.
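The share of rarely used words can be derived directly from the per-word counts. Here is a minimal sketch, assuming the counts are already available as a map from word to number of occurrences; the class and method names are hypothetical and this is not the code that produced my chart.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper: summarizes how often words repeat.
final class VocabularyDiversity {
    // Fraction of distinct words whose count is at most the threshold.
    static double shareAtOrBelow(Map<String, Integer> counts, int threshold) {
        if (counts.isEmpty()) {
            return 0.0;
        }
        long rare = counts.values().stream().filter(c -> c <= threshold).count();
        return (double) rare / counts.size();
    }

    // Histogram: occurrence count -> number of distinct words with that count.
    static SortedMap<Integer, Integer> wordsPerFrequency(Map<String, Integer> counts) {
        SortedMap<Integer, Integer> histogram = new TreeMap<>();
        for (int count : counts.values()) {
            histogram.merge(count, 1, Integer::sum);
        }
        return histogram;
    }
}
```

With a threshold of 20, `shareAtOrBelow` gives the kind of figure quoted above, and `wordsPerFrequency` gives the data points for a vocabulary diversity chart.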

At this point, it was still very hard to determine the trend or opinion specific to the keynote. The program was still processing many tweets that were not related to the content of the keynote itself. So I decided to add another filter to select only the tweets that contain at least one of the following words: Blake, Mycoskie, TOMS or shoes. I also used a dictionary of the 5000 most commonly used words from Princeton University to filter the results.
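The keyword filter can be sketched in a few lines. This is only an illustration under a couple of assumptions: matching is case-insensitive and on whole words, and the class and method names are hypothetical. The dictionary step is left out because it depends on how the word list is applied.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: keeps only tweets that mention the keynote.
final class KeynoteFilter {
    // The four keywords, lower-cased for case-insensitive matching.
    static final Set<String> KEYWORDS = Set.of("blake", "mycoskie", "toms", "shoes");

    // True if at least one whole word of the tweet is a keyword.
    static boolean mentionsKeynote(String tweet) {
        for (String token : tweet.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (KEYWORDS.contains(token)) {
                return true;
            }
        }
        return false;
    }

    static List<String> keepKeynoteTweets(List<String> tweets) {
        return tweets.stream()
                .filter(KeynoteFilter::mentionsKeynote)
                .collect(Collectors.toList());
    }
}
```

Because matching is on whole words, a run-together handle such as @TOMSshoes would not match; substring matching would catch it, at the cost of more false positives.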

With these new filters, the following words stood out:

great     amazing     giving     story     june     1for1

This may be a little simplistic, but as a conclusion I would say that the attendees of the keynote speech:
- enjoyed the speech (great, amazing, story)
- understood the concept of TOMS business and were willing to share it (giving, 1for1)
- were excited to share the coming announcement on June 7th (june).

This post is the result of a few long nights of coding and analyzing. Much more advanced work is publicly available. You may be interested in the following links:


Anonymous said...

Good article! Thanks

coconino said...

I do quite a bit of text and comment analysis with my job and I think your project is very cool!

Matt Buchner said...

I found this interesting article today: http://www.hackerfactor.com/blog/index.php?/archives/441-NobodyKnowsYoureADog.html

It is about guessing the gender of the author based on the vocabulary used.