Friday, June 17, 2016

SQL on Twitter: Analyzing Twitter Data Made Simple

SQL on Twitter

"If I had more time, I would have written shorter letter"
-- Blaise Pascal

There have been lengthy articles on analyzing twitter data by Cloudera here, hereandhere. More from Hortonworks here and here. This article is going to be short. Thanks to features in Couchbase 4.5!

There have been lengthy articles on analyzing twitter data by Cloudera here, hereandhere. More from Hortonworks here and here. This article is going to be short. Thanks to features in Couchbase 4.5!

Step 1: Install Couchbase 4.5 Use Couchbase console create a bucket called twitterandCREATE PRIMARY INDEX on twitter using query workbench or any other tool.


Step 2: Request for your twitter archive. Once you receive it, unzip it. (You can use larger twitter archives as well).


cd <to the unzipped location>/data/tweets
Step 3: $ for i in `ls`; 
        do 
          grep -i ^Grailbird $i > $i.out ; 
        done

Step 4: update your ip, username & password before you run this.

$ for i in `ls *.out`;
  do 
   /opt/couchbase/bin/cbbackupmgr json --host http://172.23.123.107:8091 --username Administrator --password password --bucket default --dataset file:///root/$i --format list --generate-key %id_str%; 
 done

Step 5: There is no step 5.
Login to Couchbase query workbench and start playing! Simply use SQL based N1QL to query and play with the data. Online interactive tutorial makes this even easier.

Here are some observations for my own twitter deck:

In the query tab, simply click on the bucketname twitter, it'll automatically analyze the data to give you the structure of the documents with sample data. Move your cursor over the fields to see the sample data.




2. Get me a sample twitter document.

select * 
from twitter 
limit 1;



3. Give me the count of my tweets.

select count(*) as my_tweet_count 
from twitter;



4. Give me top 5 hashtags and counts in my tweet


select ht.text, count(1) 
from twitter unnest entities.hashtags ht 
group by ht 
order by count(1) desc limit 5;





Yes. I worked for Informix and IBM :-)


5. What days did I tweet most?

select substr(created_at, 0, 10), count(1)
from twitter
group by substr(created_at, 0, 10)
order by count(1) desc
limit 5;





Couchbase 4.5 makes it very easy to ingest JSON so you can get insights.

Try out with your own twitter data or a public JSON archive. Create indices on fields, and arrays. Ask more questions, find more insights!

A Deep Dive Into Couchbase N1QL Query Optimization

[Reposting of the article published with Sitaram Vemulapalli on DZone.  https://dzone.com/articles/a-deep-dive-into-couchbase-n1ql-query-op...