Re: Column index vs Row index vs Denormalizing

2013-04-11 Thread aaron morton
 Retrieving the latest 1000 tweets (of a given day) is trivial by requesting 
 the streamTweets columnFamily. 
If you normally want to get the most recent items, use a reverse comparator on 
the column name; see http://thelastpickle.com/2011/10/03/Reverse-Comparators/
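
For example, with pycassa you could create the CF so that slices come back 
newest-first by default. A rough, untested sketch; the keyspace name 'twitter' 
and the host are placeholders:

from pycassa.system_manager import SystemManager, UTF8_TYPE

sys_mgr = SystemManager('localhost:9160')
# ReversedType stores the TimeUUID column names in descending order, so a
# plain slice returns the most recent columns first, no reversed flag needed.
# The comparator string is handed to Cassandra's type parser as-is.
sys_mgr.create_column_family('twitter', 'streamTweets',
                             comparator_type='ReversedType(TimeUUIDType)',
                             default_validation_class=UTF8_TYPE)
sys_mgr.close()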

 Getting the latest tweets for a given hashtag would mean you have to get the 
 TimeUUIDs from the streamHashTagTweets first, and then do a second get call 
 on the streamTweets with the former TimeUUIDs as the list of columns we'd 
 like to retrieve (column index).
Your choices here depend on which queries are the most frequent and how much 
disk space you have. 

Your current model makes sense if the stream-by-day is the most frequent query 
and you want to conserve disk space. If disk space is not an issue, you can 
denormalise further and store the full tweet JSON in the index rows as well. 

If you have potentially many streamHashTagTweets rows in which a single tweet 
is replicated, it may make sense to stick with the current design to reduce 
disk use. 
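
For reference, the two-step hashtag read under your current model would look 
roughly like this with pycassa (the pool setup and the example keys are 
assumptions based on your mail):

import pycassa

pool = pycassa.ConnectionPool('twitter', ['localhost:9160'])
stream_tweets = pycassa.ColumnFamily(pool, 'streamTweets')
hashtag_index = pycassa.ColumnFamily(pool, 'streamHashTagTweets')

# Step 1: the index row's column names are the TimeUUIDs of the tweets.
index_row = hashtag_index.get('123_2013-04-02_cassandra', column_count=1000)

# Step 2: fetch exactly those columns from the wide per-day stream row.
tweets = stream_tweets.get('123_2013-04-02', columns=list(index_row.keys()))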

 (we want to get up to 1000 tweets). 
If you want to get 1000 of anything from Cassandra, please break the multiget 
up into multiple calls. Each row request becomes a task in the thread pools on 
RF nodes, so on a smallish cluster one client asking for 1000 rows at once 
will temporarily block other clients and hurt request throughput. 
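
A rough sketch of the chunking I mean (the chunk size is a guess, tune it for 
your cluster):

def chunked_multiget(cf, keys, chunk_size=64):
    # Several small multigets instead of one call for 1000 rows keeps any
    # single client from flooding the replica thread pools.
    results = {}
    for i in range(0, len(keys), chunk_size):
        results.update(cf.multiget(keys[i:i + chunk_size]))
    return results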

  Referencing key values requires another columnFamily for tweets (key: 
 tweetId, columns: 1 column with data).
This will be a more efficient (i.e. faster) read than reading from a wide 
row. 

 In addition, we will request tweets by these secondary indexes quite 
 infrequently, while the tweets by timestamp will be requested heavily.
If the hot path is the streamTweets calls, denormalise into that, and 
normalise the tweet storage into its own CF referenced from the 
streamHashTagTweets. Having a canonical store of the events / tweets / 
entities, addressable by their business key, can give you more flexibility. 
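
The write path for that layout could look something like this (again an 
untested pycassa sketch; the CF name 'tweets', the 'json' column name, and 
the tweet dict fields are assumptions):

import json
import uuid

import pycassa
from pycassa.batch import Mutator

pool = pycassa.ConnectionPool('twitter', ['localhost:9160'])
stream_tweets = pycassa.ColumnFamily(pool, 'streamTweets')
tweets_cf = pycassa.ColumnFamily(pool, 'tweets')
hashtag_index = pycassa.ColumnFamily(pool, 'streamHashTagTweets')

def store_tweet(stream_id, day, tweet):
    ts = uuid.uuid1()  # TimeUUID column name
    b = Mutator(pool)
    # Hot path: the full JSON is denormalised into the per-day stream row.
    b.insert(stream_tweets, '%s_%s' % (stream_id, day),
             {ts: json.dumps(tweet)})
    # Canonical store, addressable by the business key (the tweet id).
    b.insert(tweets_cf, str(tweet['id']), {'json': json.dumps(tweet)})
    # Index rows hold only the TimeUUID -> tweet id reference.
    for tag in tweet.get('hashtags', []):
        b.insert(hashtag_index, '%s_%s_%s' % (stream_id, day, tag),
                 {ts: str(tweet['id'])})
    b.send()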

 Given that we estimate we will store many TBs of tweets, we would prefer to 
 set up machines with spinning disks (2 TB per node) to save costs.
If you have spinning disks and 1G networking, the rule of thumb is 300GB to 
500GB per node. See previous discussions about size per node.
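
To make the arithmetic concrete (illustrative numbers only): 20 TB of tweets 
at RF=3 is 60 TB on disk before compaction overhead. At 400 GB per node that 
is a cluster on the order of 150 nodes; at 2 TB per node it is only ~30, but 
compaction, repair, and replacing a dead node all get painfully slow at that 
density.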

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 2:00 AM, Coen Stevens beatle...@gmail.com wrote:

 Hi all, 
 
 We are working on a data model for storing tweets for multiple streams (where 
 a stream is defined by a number of keyword filters on the full twitter 
 firehose), and retrieving the tweets by timestamp and hashtag. My question is 
 whether the following data model would be a good way to do that, where I'm 
 creating a column name index for the hashtags.
 
 ColumnFamily: streamTweets
  key: streamID + dayTimestamp (creating daily buckets for each stream)
  columns = name: TimeUUID, value: tweet json (storing all the tweets for 
 this stream in a wide row with a TimeUUID)
 
 ColumnFamily: streamHashTagTweets
  key: streamID + dayTimestamp + hashTag (e.g. 123_2013-04-02_cassandra)
  columns = name: TimeUUID (referencing the TimeUUID value in the 
 streamTweets ColumnFamily), value: tweetID
 
 Retrieving the latest 1000 tweets (of a given day) is trivial by requesting 
 the streamTweets columnFamily. Getting the latest tweets for a given hashtag 
 would mean you have to get the TimeUUIDs from the streamHashTagTweets first, 
 and then do a second get call on the streamTweets with the former TimeUUIDs 
 as the list of columns we'd like to retrieve (column index).
 
 Is referencing column names (TimeUUIDs) a smart thing to do when we have wide 
 rows spanning millions of columns? It seems easier (one reference call) to do 
 this than it is to reference key values and run a multi-get to get all the 
 rows (we want to get up to 1000 tweets). Referencing key values requires 
 another columnFamily for tweets (key: tweetId, columns: 1 column with data).
 
 Of course we could instead denormalize the data and also store the tweet in 
 the streamHashTagTweets columns, but we want to do the same thing for other 
 indexes as well (topics, twitter usernames, links, etc.), so it quickly adds 
 up in required storage space. In addition, we will request tweets by these 
 secondary indexes quite infrequently, while the tweets by timestamp will be 
 requested heavily.
 
 Given that we estimate we will store many TBs of tweets, we would prefer to 
 set up machines with spinning disks (2 TB per node) to save costs.
 
 We would love to hear your feedback.
 
 Cheers,
 Coen



Re: Column index vs Row index vs Denormalizing

2013-04-11 Thread Coen Stevens
Thanks for the feedback! We will go ahead with implementing and deploying the
proposed model, and testing it out.

Cheers,
Coen

