Re: Column index vs Row index vs Denormalizing

2013-04-11 Thread Coen Stevens
Thanks for the feedback! We will go forward with implementing and
deploying the proposed model, and test it out.

Cheers,
Coen


On Thu, Apr 11, 2013 at 12:21 PM, aaron morton aa...@thelastpickle.com wrote:

 Retrieving the latest 1000 tweets (of a given day) is trivial by
 requesting the streamTweets columnFamily.

 If you normally want to get the most recent items, use a reverse comparator
 on the column name;
 see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
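 A minimal sketch of the difference using pycassa (a common Python client
 for this Thrift-era API); the keyspace name, node address, and row key
 here are assumptions:

     import pycassa

     # Assumed keyspace name and node address.
     pool = pycassa.ConnectionPool('tweets_ks', ['localhost:9160'])
     stream_tweets = pycassa.ColumnFamily(pool, 'streamTweets')

     # If the CF is created with comparator 'ReversedType(TimeUUIDType)',
     # a plain slice already comes back newest-first:
     latest = stream_tweets.get('123_2013-04-02', column_count=1000)

     # Without a reversed comparator you can reverse each query instead,
     # but then the read order fights the on-disk order on every call:
     latest = stream_tweets.get('123_2013-04-02', column_count=1000,
                                column_reversed=True)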

 Getting the latest tweets for a given hashtag would mean you have to get
 the TimeUUIDs from the streamHashTagTweets first, and then do a second get
 call on the streamTweets with the former TimeUUIDs as the list of columns
 we'd like to retrieve (column index).
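 A hedged pycassa sketch of that two-step read (the pool settings are
 assumptions; the key formats follow the model in the post below):

     import pycassa

     pool = pycassa.ConnectionPool('tweets_ks', ['localhost:9160'])
     stream_tweets = pycassa.ColumnFamily(pool, 'streamTweets')
     hashtag_index = pycassa.ColumnFamily(pool, 'streamHashTagTweets')

     # Step 1: newest-first TimeUUIDs from the hashtag index row.
     index_cols = hashtag_index.get('123_2013-04-02_cassandra',
                                    column_count=1000, column_reversed=True)

     # Step 2: fetch exactly those columns from the day's wide row.
     tweets = stream_tweets.get('123_2013-04-02', columns=list(index_cols))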

 Your choices here depend on what sort of queries are the most frequent and
 how much disk space you have.

 Your current model makes sense if the stream-by-day read is the most frequent
 query and you want to conserve disk space. If disk space is not an issue,
 you can denormalise further and store the full tweet JSON in the index rows as well.

 If you have potentially many streamHashTagTweets rows where a single tweet
 is replicated it may make sense to stick with the current design to reduce
 disk use.

 (we want to get up to 1000 tweets).

 If you want to get 1000 of anything from Cassandra, please break the multiget
 up into multiple calls. Each row request becomes a task in the thread pools
 on RF nodes. If you have a smallish cluster, one client asking for 1000
 rows will temporarily block other clients and hurt request throughput.
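 One way to do that, sketched in pycassa (the chunk size is a guess to
 tune per cluster, not a recommendation):

     def multiget_in_chunks(cf, keys, chunk_size=100):
         # Each row in a multiget becomes a task in the thread pools on
         # RF nodes; several small batches keep throughput fair for
         # other clients instead of one 1000-row burst.
         results = {}
         for i in range(0, len(keys), chunk_size):
             results.update(cf.multiget(keys[i:i + chunk_size]))
         return results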

  Referencing key values requires another columnFamily for tweets (key:
 tweetId, columns: 1 column with data).

 This will be a more efficient (aka faster) read than reading from a
 wide row.

 Besides that, we will request tweets via these secondary indexes quite
 infrequently, while the tweets by timestamp will be requested heavily.

 If the hot path is the streamTweets calls, denormalise into that CF,
 normalise the tweet storage into its own CF, and reference the tweets from
 the streamHashTagTweets. Having a canonical store of the events / tweets /
 entities, addressable by their business key, can give you more flexibility.
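 A sketch of that write path in pycassa (CF names follow the thread; the
 tweet dict fields 'id_str' and 'hashtags' are hypothetical):

     import json
     import uuid
     import pycassa

     pool = pycassa.ConnectionPool('tweets_ks', ['localhost:9160'])
     tweets_cf = pycassa.ColumnFamily(pool, 'tweets')            # canonical store
     stream_tweets = pycassa.ColumnFamily(pool, 'streamTweets')  # hot read path
     hashtag_index = pycassa.ColumnFamily(pool, 'streamHashTagTweets')

     def store_tweet(stream_id, day, tweet):
         col = uuid.uuid1()          # TimeUUID used as the column name
         blob = json.dumps(tweet)
         # Canonical copy, addressable by its business key (tweet id).
         tweets_cf.insert(tweet['id_str'], {'json': blob})
         # Denormalised copy in the heavily read per-day stream row.
         stream_tweets.insert('%s_%s' % (stream_id, day), {col: blob})
         # Rarely read indexes store only a reference, saving disk.
         for tag in tweet.get('hashtags', []):
             hashtag_index.insert('%s_%s_%s' % (stream_id, day, tag),
                                  {col: tweet['id_str']})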

 Given that we expect to store many TBs of tweets, we would prefer
 setting up machines with spinning disks (2TB per node) to save costs.

 If you have spinning disks and 1G networking, the rule of thumb is 300GB to
 500GB per node. See previous discussions on this list about size per node.
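 As a rough worked example (the 10TB figure is only illustrative): 10TB of
 tweets at RF 3 is 30TB on disk, i.e. about 75 nodes at 400GB each versus
 15 nodes at 2TB each; the smaller nodes cost more up front but keep
 compaction, repair, and bootstrap times manageable.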

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 10/04/2013, at 2:00 AM, Coen Stevens beatle...@gmail.com wrote:

 Hi all,

 We are working on a data model for storing tweets for multiple streams
 (where a stream is defined by a number of keyword filters on the full
 twitter firehose), and retrieving the tweets by timestamp and hashtag. My
 question is whether the following data model would be a good way to do
 that, where I'm creating a column name index for the hashtags.

 ColumnFamily: streamTweets
  key: streamID + dayTimestamp (creating daily buckets for each stream)
  columns = name: TimeUUID, value: tweet json (storing all the tweets
 for this stream in a wide row with a TimeUUID)

 ColumnFamily: streamHashTagTweets
  key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
  columns = name: TimeUUID (referencing the TimeUUID value in the
 streamTweets ColumnFamily), value: tweetID
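 (As a pycassa schema sketch; the keyspace name and node address are
 assumptions:)

     from pycassa.system_manager import (SystemManager, TIME_UUID_TYPE,
                                         UTF8_TYPE)

     sys_mgr = SystemManager('localhost:9160')

     # Wide row per stream per day: TimeUUID column names, JSON values.
     sys_mgr.create_column_family('tweets_ks', 'streamTweets',
                                  key_validation_class=UTF8_TYPE,
                                  comparator_type=TIME_UUID_TYPE,
                                  default_validation_class=UTF8_TYPE)

     # One index row per stream/day/hashtag: TimeUUID -> tweetID.
     sys_mgr.create_column_family('tweets_ks', 'streamHashTagTweets',
                                  key_validation_class=UTF8_TYPE,
                                  comparator_type=TIME_UUID_TYPE,
                                  default_validation_class=UTF8_TYPE)
     sys_mgr.close()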

 Retrieving the latest 1000 tweets (of a given day) is trivial by
 requesting the streamTweets columnFamily. Getting the latest tweets for a
 given hashtag would mean you have to get the TimeUUIDs from the
 streamHashTagTweets first, and then do a second get call on the
 streamTweets with the former TimeUUIDs as the list of columns we'd like to
 retrieve (column index).

 Is referencing column names (TimeUUIDs) a smart thing to do when we have
 wide rows spanning millions of columns? It seems easier (one reference
 call) to do this than it is to reference key values and run a
 multi-get to get all the rows (we want to get up to 1000 tweets).
 Referencing key values requires another columnFamily for tweets (key:
 tweetId, columns: 1 column with data).

 Of course we could instead denormalize the data and store the tweet also
 in the streamHashTagTweet columns, but we want to do the same thing for
 other indexes as well (topics, twitter usernames, links, etc), so it
 quickly adds up in required storage space. Besides that, we will request
 tweets via these secondary indexes quite infrequently, while the tweets by
 timestamp will be requested heavily.

 Given that we expect to store many TBs of tweets, we would prefer
 setting up machines with spinning disks (2TB per node) to save costs.

 We would love to hear your feedback.

 Cheers,
 Coen
