Hey, I am new to HBase and am struggling a bit with how to design my schema. To dive in, I gathered a dataset from the Twitter sample stream (roughly 40 GB by now).
I want to answer the following queries:

- What were the trending hashtags/terms for a certain period of time (e.g. an hour) over the last X days, so that I can plot them as a timeline?

  Row key: <"trending"><TIMESTAMP_OF_THE_DAY>
  Column key: <TIMESTAMP_OF_HOUR>
  Value: set of the most popular X tweets

  All writes would be close together in terms of data locality, but since it's just an aggregate with basically 1 write per hour it should be fine. At lookup time I will be able to scan through the days and get a timeline of the changing trending terms for a range of days.

- Get a tweet by its identifier.

  Row key: <TWEET_ID><"tweet">
  Column key: <TWEET_FIELD_NAME>
  Value: value of the tweet feature, like its text or author

  Straightforward: 1 tweet per row for direct lookup.

- Number of tweets for all countries.

  Row key: <"tweets_per_country">
  Column key: <COUNTRY_ID_NAME>
  Value: the count

  A single row from which I can either get all countries or a particular country via its column.

- Which are the most similar (top N) tweets to a particular tweet?

  This one might be a bit more tricky. I wrote a MapReduce job to get the top N most similar tweets for a particular tweet, each with a similarity score. How can I map this to an HBase schema? My guess would be to keep them in a schema similar to the one for the actual tweets:

  <tweet_id><0>                              <= actual tweet
  <tweet_id><LONG_MAX - SIMILARITY_SCORE>    <= most similar tweets in descending order

  But what should I store in those rows? The actual (then duplicated) tweets, or just their ids, doing a second lookup later?

I would really appreciate it if someone could have a look at my ideas about schema/lookup and tell me if I am doing something wrong here.

Best
Dominik
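
For illustration, the composite row key from the similarity question could be sketched roughly as below. This is only a sketch under assumptions not stated in the post: tweet ids and scores are non-negative integers that fit in 64 bits (a fractional score would first have to be scaled to an integer), both key parts are encoded as fixed-width big-endian values, and the helper names `tweet_row_key`, `similar_row_key` and `scan_range` are hypothetical. It uses plain Python byte strings instead of an HBase client, relying only on the fact that HBase orders rows lexicographically by their key bytes.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def tweet_row_key(tweet_id: int) -> bytes:
    """Key of the actual tweet: <tweet_id><0> (hypothetical helper)."""
    return struct.pack(">QQ", tweet_id, 0)


def similar_row_key(tweet_id: int, score: int) -> bytes:
    """Key of one similar tweet: <tweet_id><LONG_MAX - score>.

    Assumes score is an integer in [0, LONG_MAX]; a higher score gives a
    smaller suffix and therefore sorts earlier.
    """
    return struct.pack(">QQ", tweet_id, LONG_MAX - score)


def scan_range(tweet_id: int) -> tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) keys covering one tweet's rows."""
    return tweet_row_key(tweet_id), tweet_row_key(tweet_id + 1)


# Simulate the sorted order of a table by sorting the raw key bytes.
keys = [
    similar_row_key(42, 500),
    tweet_row_key(42),
    similar_row_key(42, 930),
    similar_row_key(43, 10),
]
ordered = sorted(keys)

# Keep only the rows that a scan over tweet 42 would return.
start, stop = scan_range(42)
rows_for_42 = [k for k in ordered if start <= k < stop]
```

With this layout a scan over the range for one tweet returns the tweet itself first and then its similar tweets from highest to lowest score, so reading the first N + 1 rows of the range would be enough for a top-N lookup. The fixed-width big-endian encoding matters here: variable-length or decimal-string encodings of the same numbers would not sort in numeric order.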
