> Retrieving the latest 1000 tweets (of a given day) is trivial by requesting 
> the streamTweets columnFamily. 
If you normally want to get the most recent items, use a reverse comparator on
the column name;
see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
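
For illustration, here's a pure-Python sketch of the ordering a reversed
comparator gives you server side, sorting by the 60-bit timestamp embedded in a
version-1 TimeUUID (the function name is my own):

```python
import uuid

def newest_first(timeuuids):
    # Sort version-1 (time-based) UUIDs by their embedded 60-bit
    # timestamp, newest first. A reversed comparator does this server
    # side, so a slice of the first N columns returns the most recent
    # items without paging to the end of the row.
    return sorted(timeuuids, key=lambda u: u.time, reverse=True)
```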

> Getting the latest tweets for a given hashtag would mean you have to get the 
> TimeUUIDs from the streamHashTagTweets first, and then do a second get call 
> on the streamTweets with the former TimeUUIDs as the list of columns we like 
> to retrieve (column index).
Your choices here depend on which queries are the most frequent and how much 
disk space you have. 

Your current model makes sense if the stream-by-day query is the most frequent 
one and you want to conserve disk space. If disk space is not an issue you can 
denormalise further and store the tweet JSON in the index rows as well. 

If a single tweet can be replicated into many streamHashTagTweets rows, it may 
make sense to stick with the current design to reduce disk use. 
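
As a back-of-envelope way to compare the two designs (all figures and names 
below are assumptions; this ignores column overhead and compression):

```python
def daily_storage_bytes(tweets_per_day, avg_json_bytes,
                        index_rows_per_tweet, ref_bytes=36):
    # ref_bytes approximates a TimeUUID column name plus a tweetID value.
    base = tweets_per_day * avg_json_bytes  # the streamTweets wide row
    # current design: index rows hold only references
    referenced = base + tweets_per_day * index_rows_per_tweet * ref_bytes
    # fully denormalised: index rows repeat the tweet JSON
    denormalised = base + tweets_per_day * index_rows_per_tweet * avg_json_bytes
    return referenced, denormalised
```

With, say, 1M tweets/day at ~2.5KB each and 3 index rows per tweet, full 
denormalisation roughly quadruples the daily storage.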

> (we want to get up to 1000 tweets). 
If you want to get 1000 of anything from Cassandra, please break the multiget 
up into multiple calls. Each row request becomes a task in the read thread 
pools on RF nodes, so on a smallish cluster one client asking for 1000 rows 
will temporarily block other clients and hurt overall request throughput. 
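
Something along these lines (the batch size is a guess, tune it for your 
cluster):

```python
def chunked(keys, batch_size=100):
    # Split a large multiget key list into smaller batches. Issuing
    # several small multigets keeps one client from filling the read
    # stages on RF nodes with 1000 row tasks at once.
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]
```

Then issue one multiget per batch and concatenate the results client side.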

>  Referencing key values requires another columnFamily for tweets (key: 
> tweetId, columns: 1 column with data).
This will be a more efficient (i.e. faster) read than reading from a wide 
row. 

> Next to that we will request tweets by these secondary indexes quite 
> infrequently, while the tweets by timestamp will be requested heavily.
If the hot path is the streamTweets calls, denormalise into that CF, normalise 
the tweet storage into its own CF, and reference the tweets from the 
streamHashTagTweets. Having a canonical store of the events / tweets / entities 
addressable by their business key gives you more flexibility. 
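
With in-memory dicts standing in for the CFs (all names here are made up), the 
read path for that layout looks like:

```python
tweets = {}          # canonical store: tweetID -> tweet JSON
stream_hashtag = {}  # index: (streamID, day, hashTag) -> tweetIDs, newest first

def tweets_for_hashtag(stream_id, day, tag, limit=1000):
    # Step 1: read the (small) index row to get the tweet IDs.
    ids = stream_hashtag.get((stream_id, day, tag), [])[:limit]
    # Step 2: resolve the IDs against the canonical tweet store.
    return [tweets[tid] for tid in ids if tid in tweets]
```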

> Given we are estimating to store many TBs of tweets, we would prefer setting 
> up machines with spinning disks (2TB per node) to save costs.
If you have spinning disks and 1G networking, the rule of thumb is 300GB to 
500GB per node. See previous discussions on this list about size per node.
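
E.g. a quick node-count estimate under those assumptions (RF 3 and 400GB per 
node are my picks, your numbers will differ):

```python
import math

def nodes_needed(raw_tb, replication_factor=3, per_node_gb=400):
    # raw_tb is the unreplicated data size; replication multiplies
    # the on-disk footprint across the cluster.
    total_gb = raw_tb * 1024 * replication_factor
    return math.ceil(total_gb / per_node_gb)
```

So 10TB of raw tweets at RF 3 is already ~77 nodes at 400GB each, which is 
worth factoring into the cost comparison.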

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 2:00 AM, Coen Stevens <beatle...@gmail.com> wrote:

> Hi all, 
> 
> We are working on a data model for storing tweets for multiple streams (where 
> a stream is defined by a number of keyword filters on the full twitter 
> firehose), and retrieving the tweets by timestamp and hashtag. My question is 
> whether the following data model would a good way for doing that, where I'm 
> creating a column name index for the hashtags.
> 
> ColumnFamily: streamTweets
>      key: streamID + dayTimestamp (creating daily buckets for each stream)
>      columns => name: TimeUUID, value: tweet json (storing all the tweets for 
> this stream in a wide row with a TimeUUID)
> 
> ColumnFamily: streamHashTagTweets
>      key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
>      columns => name: TimeUUID (referencing the TimeUUID value in the 
> streamTweets ColumnFamily), value: tweetID
> 
> Retrieving the latest 1000 tweets (of a given day) is trivial by requesting 
> the streamTweets columnFamily. Getting the latest tweets for a given hashtag 
> would mean you have to get the TimeUUIDs from the streamHashTagTweets first, 
> and then do a second get call on the streamTweets with the former TimeUUIDs 
> as the list of columns we like to retrieve (column index).
> 
> Is referencing column names (TimeUUIDs) a smart thing to do when we have wide 
> rows spanning millions of columns? It seems easier (one reference call) to do 
> this, then it is to reference key values and running a multi-get to get all 
> the rows (we want to get up to 1000 tweets). Referencing key values requires 
> another columnFamily for tweets (key: tweetId, columns: 1 column with data).
> 
> Of course we could instead denormalize the data and store the tweet also in 
> the streamHashTagTweet columns, but we want to do the same thing for other 
> indexes as well (topics, twitter usernames, links, etc), so it quickly adds 
> up in required storage space. Next to that we will request tweets by these 
> secondary indexes quite infrequently, while the tweets by timestamp will be 
> requested heavily.
> 
> Given we are estimating to store many TBs of tweets, we would prefer setting 
> up machines with spinning disks (2TB per node) to save costs.
> 
> We would love to hear your feedback.
> 
> Cheers,
> Coen
