Schema design for tweet analytics

Dominik Hübner Thu, 21 May 2015 05:46:57 -0700

Hey, I am new to HBase and am struggling a bit with how to design my schema.
To dive in, I gathered a dataset from the Twitter sample stream (roughly 40GB 
by now).


I want to answer the following queries:
- What were the trending hashtags/terms for a certain period of time (e.g. an 
hour) over the last X days, with the ability to plot those as a timeline?

Row key: <“trending”><TIMESTAMP_OF_THE_DAY>
Column key: <TIMESTAMP_OF_HOUR>
Value: Set of most popular X tweets

All writes would be close in terms of data locality, but as it's just an 
aggregate with basically 1 write per hour it should be fine. At lookup time I 
will be able to scan through the days and get a timeline of changing trending 
terms for a range of days.
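
To make the key layout above concrete, here is a minimal sketch in plain Python (no HBase client involved). It assumes timestamps are Unix seconds in UTC and are stored as 8-byte big-endian integers so that the byte order of the keys matches chronological order; the function names are hypothetical and only illustrate the idea.

```python
import struct

PREFIX = b"trending"          # fixed row-key prefix from the schema above
SECONDS_PER_DAY = 86400
SECONDS_PER_HOUR = 3600


def trending_row_key(ts: int) -> bytes:
    """Row key: prefix + start-of-day timestamp (big-endian, 8 bytes)."""
    day_start = ts - ts % SECONDS_PER_DAY
    return PREFIX + struct.pack(">q", day_start)


def trending_column(ts: int) -> bytes:
    """Column qualifier: start-of-hour timestamp (big-endian, 8 bytes)."""
    hour_start = ts - ts % SECONDS_PER_HOUR
    return struct.pack(">q", hour_start)


# Keys for consecutive days sort in chronological order, so a scan from
# trending_row_key(start) to trending_row_key(end) covers a range of days.
ts = 1432212417  # 21 May 2015 12:46:57 UTC
today = trending_row_key(ts)
tomorrow = trending_row_key(ts + SECONDS_PER_DAY)
```

Big-endian encoding is what makes the lexicographic byte comparison agree with numeric order for non-negative timestamps.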


- Get a tweet by its identifier

Row key: <TWEET_ID><“tweet”>
Column key: <TWEET_FIELD_NAME>
Value: value of the tweet feature like its text or author

Straightforward, 1 tweet per row for direct lookup


- Number of tweets for all countries
Row key: <“tweets_per_country”>
Column key: <COUNTRY_ID_NAME>
Value: the count

A single row to either get all countries or a particular country from a column.


- Which are the most (top N) similar tweets to a particular tweet
This one might be a bit more tricky. I wrote a MapReduce job to get the top N 
most similar tweets for a particular tweet with a similarity score. How can I 
map this to an HBase schema? My guess would be to keep them in a similar schema 
as the actual tweets:

<tweet_id><0>                              <= actual tweet
<tweet_id><LONG_MAX - SIMILARITY_SCORE>    <= most similar tweets in descending order

But what should I store in those rows? The actual (then duplicated) tweets, or 
just their ids, with a second lookup later?
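
The ordering trick in that key layout can be sketched as follows, again in plain Python without any HBase client. It assumes the similarity score has already been scaled to a non-negative integer smaller than LONG_MAX and that both key parts are stored as 8-byte big-endian values; the helper names are hypothetical. The sketch ignores ties: two similar tweets with the same score would map to the same row key.

```python
import struct

LONG_MAX = 2**63 - 1  # largest signed 64-bit value, as in Java's Long.MAX_VALUE


def tweet_row_key(tweet_id: int) -> bytes:
    """Key of the actual tweet: <tweet_id><0>."""
    return struct.pack(">qq", tweet_id, 0)


def similar_row_key(tweet_id: int, score: int) -> bytes:
    """Key of a similar tweet: <tweet_id><LONG_MAX - score>."""
    if not 0 <= score < LONG_MAX:
        raise ValueError("score must be a non-negative integer below LONG_MAX")
    return struct.pack(">qq", tweet_id, LONG_MAX - score)


# Sorting the keys as bytes mimics the row order a scan would return:
# the tweet itself first, then similar tweets from highest to lowest score.
keys = sorted([
    similar_row_key(42, 10),
    similar_row_key(42, 990),
    tweet_row_key(42),
    similar_row_key(42, 500),
])
```

A prefix scan on the 8 tweet-id bytes would then return the tweet followed by its neighbours in descending score order, whatever is stored in the row values.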




I would really appreciate it if someone could have a look at my ideas about 
schema/lookup and tell me if I am doing something wrong here.


Best
Dominik
