Re: best practices for time-series data with massive amounts of records

2015-03-07 Thread Eric Stevens
It's probably quite rare for queries against extremely large time series data
to touch the whole data set.  Instead there's almost always a between-X-and-Y-
dates aspect to nearly every real-time query you might run against a table
like this (with the exception of most-recent-N events).

Because of this, time bucketing can be an effective strategy, though until
you understand your data better, it's hard to know how large (or small) to
make your buckets.  Because of *that*, I recommend using the timestamp data
type for your bucketing strategy - this gives you the advantage of being
able to reduce your bucket sizes later while keeping your at-rest data mostly
still quite accessible.

What I mean is that if you change your bucketing strategy from day to hour,
then when you query across the period where the change happened, you can
iterate at the finer granularity (hour) buckets, and you'll still pick up the
coarser granularity (day) buckets automatically for all but the earliest
bucket (which is easy to correct for when you floor your start bucket).  In
the coarser time period, most of those reads are partition key misses, which
are extremely inexpensive in Cassandra.
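
A minimal sketch of what such a timestamp-bucketed table and query might look
like (table and column names here are illustrative, not anyone's actual schema):

CREATE TABLE click_events (
  user_id    text,
  bucket     timestamp,  // event_time floored to the bucket interval (day, hour, ...)
  event_time timeuuid,
  event      text,
  PRIMARY KEY ((user_id, bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

// Reading a window then means iterating the buckets that cover it, e.g. one
// of the hour-sized windows inside an older day-sized bucket:
SELECT * FROM click_events
 WHERE user_id = 'u123'
   AND bucket = '2015-03-06 00:00:00+0000'
   AND event_time >= minTimeuuid('2015-03-06 04:00:00+0000')
   AND event_time <  maxTimeuuid('2015-03-06 08:00:00+0000');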

If you do need most-recent-N queries over broad ranges and you expect to
have some users whose click rate is dramatically less frequent than your
bucket interval (making iterating over buckets inefficient), you can keep a
separate counter table with a PK of ((user_id), bucket) in which you count
new events.  Now you can identify the exact set of buckets you need to read
to satisfy the query no matter what the user's click volume is (so very-low-
volume users have at most N partition keys queried, and higher-volume users
query fewer partition keys).
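
A sketch of such a counter table (illustrative names, assuming the same
user_id / timestamp-bucket scheme as above):

CREATE TABLE events_per_bucket (
  user_id text,
  bucket  timestamp,
  events  counter,
  PRIMARY KEY ((user_id), bucket)
) WITH CLUSTERING ORDER BY (bucket DESC);

// Bump the count alongside every event write:
UPDATE events_per_bucket SET events = events + 1
 WHERE user_id = 'u123' AND bucket = '2015-03-06 00:00:00+0000';

// To serve most-recent-N, read this table newest bucket first, stop once the
// running sum of counts reaches N, and query only those event buckets.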

On Fri, Mar 6, 2015 at 4:06 PM, graham sanderson gra...@vast.com wrote:

 Note that using static column(s) for the “head” value, and trailing TTLed
 values behind is something we’re considering. Note this is especially nice
 if your head state includes say a map which is updated by small deltas
 (individual keys)

 We have not yet studied the effect of static columns on say DTCS


 On Mar 6, 2015, at 4:42 PM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi all,

 Thanks for the responses, this was very helpful.

 I don't know yet what the distribution of clicks and users will be, but I
 expect to see a few users with an enormous amount of interactions and most
 users having very few.  The idea of doing some additional manual
 partitioning, and then maintaining another table that contains the head
 partition for each user makes sense, although it would add additional
 latency when we want to get say the most recent 1000 interactions for a
 given user (which is something that we have to do sometimes for
 applications with tight SLAs).

 FWIW I doubt that any users will have so many interactions that they
 exceed what we could reasonably put in a row, but I wanted to have a
 strategy to deal with this.

 Having a nice design pattern in Cassandra for maintaining a row with the
 N-most-recent interactions would also solve this reasonably well, but I
 don't know of any way to implement that without running batch jobs that
 periodically clean out data (which might be okay).

 Best regards,
 Clint




 On Tue, Mar 3, 2015 at 8:10 AM, mck m...@apache.org wrote:


  Here partition is a random digit from 0 to (N*M)
  where N=nodes in cluster, and M=arbitrary number.


 Hopefully it was obvious, but here (unless you've got hot partitions),
 you don't need N.
 ~mck






Re: best practices for time-series data with massive amounts of records

2015-03-06 Thread Clint Kelly
Hi all,

Thanks for the responses, this was very helpful.

I don't know yet what the distribution of clicks and users will be, but I
expect to see a few users with an enormous amount of interactions and most
users having very few.  The idea of doing some additional manual
partitioning, and then maintaining another table that contains the head
partition for each user makes sense, although it would add additional
latency when we want to get say the most recent 1000 interactions for a
given user (which is something that we have to do sometimes for
applications with tight SLAs).

FWIW I doubt that any users will have so many interactions that they exceed
what we could reasonably put in a row, but I wanted to have a strategy to
deal with this.

Having a nice design pattern in Cassandra for maintaining a row with the
N-most-recent interactions would also solve this reasonably well, but I
don't know of any way to implement that without running batch jobs that
periodically clean out data (which might be okay).

Best regards,
Clint




On Tue, Mar 3, 2015 at 8:10 AM, mck m...@apache.org wrote:


  Here partition is a random digit from 0 to (N*M)
  where N=nodes in cluster, and M=arbitrary number.


 Hopefully it was obvious, but here (unless you've got hot partitions),
 you don't need N.
 ~mck



Re: best practices for time-series data with massive amounts of records

2015-03-06 Thread graham sanderson
Note that using static column(s) for the “head” value, and trailing TTLed 
values behind is something we’re considering. Note this is especially nice if 
your head state includes say a map which is updated by small deltas (individual 
keys)

We have not yet studied the effect of static columns on say DTCS
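
A rough sketch of that idea in CQL, assuming a map-typed head (this is one
possible reading of it, with illustrative names, not a schema from the thread):

CREATE TABLE user_state (
  user_id    text,
  event_time timeuuid,
  head       map<text, text> static,  // current "head" state, stored once per partition
  delta      text,                    // individual change, TTLed away over time
  PRIMARY KEY ((user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

// Append a delta that expires after e.g. 7 days, and patch a single key of the head:
INSERT INTO user_state (user_id, event_time, delta)
  VALUES ('u123', now(), '{"page": "/checkout"}') USING TTL 604800;
UPDATE user_state SET head['last_page'] = '/checkout' WHERE user_id = 'u123';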

 On Mar 6, 2015, at 4:42 PM, Clint Kelly clint.ke...@gmail.com wrote:
 
 Hi all,
 
 Thanks for the responses, this was very helpful.
 
 I don't know yet what the distribution of clicks and users will be, but I 
 expect to see a few users with an enormous amount of interactions and most 
 users having very few.  The idea of doing some additional manual 
 partitioning, and then maintaining another table that contains the head 
 partition for each user makes sense, although it would add additional latency 
 when we want to get say the most recent 1000 interactions for a given user 
 (which is something that we have to do sometimes for applications with tight 
 SLAs).
 
 FWIW I doubt that any users will have so many interactions that they exceed 
 what we could reasonably put in a row, but I wanted to have a strategy to 
 deal with this.
 
 Having a nice design pattern in Cassandra for maintaining a row with the 
 N-most-recent interactions would also solve this reasonably well, but I don't 
 know of any way to implement that without running batch jobs that 
 periodically clean out data (which might be okay).
 
 Best regards,
 Clint
 
 
 
 
 On Tue, Mar 3, 2015 at 8:10 AM, mck m...@apache.org 
 mailto:m...@apache.org wrote:
 
  Here partition is a random digit from 0 to (N*M)
  where N=nodes in cluster, and M=arbitrary number.
 
 
 Hopefully it was obvious, but here (unless you've got hot partitions),
 you don't need N.
 ~mck
 





Re: best practices for time-series data with massive amounts of records

2015-03-03 Thread Jens Rantil
Hi,

I have not done something similar, however I have some comments:

On Mon, Mar 2, 2015 at 8:47 PM, Clint Kelly clint.ke...@gmail.com wrote:

 The downside of this approach is that we can no longer do a simple
 continuous scan to get all of the events for a given user.


Sure, but would you really do that in real time anyway? :) If you have
billions of events that's not going to scale anyway. Also, if you have
10 events per bucket, the latency introduced by batching should be
manageable.


 Some users may log lots and lots of interactions every day, while others
 may interact with our application infrequently,


That's another reason to split them up into buckets: it makes the cluster
partitions more manageable and homogeneous.


 so I'd like a quick way to get the most recent interaction for a given
 user.


For this you could actually have a second table that stores the
last_time_bucket for a user. Upon event write, you could simply do an
update of the last_time_bucket. You could even have an index of all time
buckets per user if you want.
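
A minimal sketch of that second table (illustrative names), assuming the
events table with the (id, date) partition key discussed earlier in the thread:

CREATE TABLE last_time_bucket (
  user_id     text PRIMARY KEY,
  last_bucket text
);

// Refresh the pointer on every event write:
UPDATE last_time_bucket SET last_bucket = '2015-03-06' WHERE user_id = 'u123';

// "Most recent interaction" then becomes two cheap key lookups
// (the events table clusters DESC, so LIMIT 1 is the newest event):
SELECT last_bucket FROM last_time_bucket WHERE user_id = 'u123';
SELECT * FROM events WHERE id = 'u123' AND date = '2015-03-06' LIMIT 1;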


 Has anyone used different approaches for this problem?

 The only thing I can think of is to use the second table schema described
 above, but switch to an order-preserving hashing function, and then
 manually hash the id field.  This is essentially what we would do in
 HBase.


As you might already know, this order-preserving hashing is _not_
considered best practice in the Cassandra world.

Cheers,
Jens


-- 
Jens Rantil
Backend engineer
Tink AB

Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se



Re: best practices for time-series data with massive amounts of records

2015-03-03 Thread Jack Krupansky
I'd recommend using 100K rows and 10 MB as rough guidelines for the maximum
size of a single partition. Sure, Cassandra can technically
handle a lot more than that, but very large partitions can make your life
more difficult. Of course you will have to do a POC to validate the sweet
spot for your particular app, data model, actual data values, hardware, app
access patterns, and app latency requirements. It may be that your actual
numbers should be half or twice my guidance, but they are a starting point.
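
For a rough sense of how the two guidelines interact: at, say, 1 KB per event,
the 10 MB guideline works out to about 10K rows per partition, so it binds well
before the 100K-row guideline does; with very small rows it's the row-count
guideline that binds first.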

Back to your starting point: you really need to characterize the number of
records per user. For example, will you have a large number of users with
few records? IOW, what are the expected distributions of user count and of
records per user? Give some specific numbers. Even if you don't know what
the real numbers will be, you have to at least have a model for the counts
before modeling the partition keys.

-- Jack Krupansky

On Mon, Mar 2, 2015 at 2:47 PM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi all,

 I am designing an application that will capture time series data where we
 expect the number of records per user to potentially be extremely high.  I
 am not sure if we will eclipse the max row size of 2B elements, but I
 assume that we would not want our application to approach that size anyway.

 If we wanted to put all of the interactions in a single row, then I would
 make a data model that looks like:

 CREATE TABLE events (
   id text,
   event_time timestamp,
   event blob,
   PRIMARY KEY (id, event_time))
 WITH CLUSTERING ORDER BY (event_time DESC);

 The best practice for breaking up large rows of time series data is, as I
 understand it, to put part of the time into the partitioning key (
 http://planetcassandra.org/getting-started-with-time-series-data-modeling/
 ):

 CREATE TABLE events (
   id text,
   date text, // Could also use year+month here or year+week or something else
   event_time timestamp,
   event blob,
   PRIMARY KEY ((id, date), event_time))
 WITH CLUSTERING ORDER BY (event_time DESC);

 The downside of this approach is that we can no longer do a simple
 continuous scan to get all of the events for a given user.  Some users may
 log lots and lots of interactions every day, while others may interact with
 our application infrequently, so I'd like a quick way to get the most
 recent interaction for a given user.

 Has anyone used different approaches for this problem?

 The only thing I can think of is to use the second table schema described
 above, but switch to an order-preserving hashing function, and then
 manually hash the id field.  This is essentially what we would do in
 HBase.

 Curious if anyone else has any thoughts.

 Best regards,
 Clint





Re: best practices for time-series data with massive amounts of records

2015-03-03 Thread Yulian Oifa
Hello
You can use a timeuuid as the row key and create a separate CF to be used for
indexing. The indexing CF may either use user_id as its key, or, as a better
approach, partition its rows by timestamp.
In the case of partitioning you can create a compound key in which you store
the user_id and a timestamp base (for example, if you keep 8 of the 13 digits
of the timestamp, a new row is created approximately each day, a bit more, and
the maximum number of rows per user would be around 100K; of course you can
play with the number of rows / time span per row depending on the number of
records you are receiving. I create a new row every 11 days, so that's 35 rows
per year, per user).
In each column you store the timeuuid as the name with an empty value.

This way you keep your data ordered by time. The only disadvantage of this
approach is that you have to stitch your data together when you finish
reading one index row and start on the next (in both asc and desc order).

When reading data you should first get a slice from the index, depending on
your needs, and then do a multi-range read from the original CF based on the
slice received.
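
A rough CQL rendering of this layout as I read it (the original is described
in Thrift terms; all names here are illustrative):

CREATE TABLE events_by_id (
  event_id timeuuid PRIMARY KEY,  // the timeuuid is the row key itself
  event    blob
);

CREATE TABLE event_index (
  user_id     text,
  time_bucket bigint,    // the truncated "timestamp base", one row per user per period
  event_id    timeuuid,  // stored as the column name; the value is empty
  PRIMARY KEY ((user_id, time_bucket), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);

// Read path: slice the index rows covering the requested range, then fetch the
// matching event_ids from events_by_id, stitching results across bucket boundaries.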
Hope it helps
Best regards
Yulian Oifa



On Mon, Mar 2, 2015 at 9:47 PM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi all,

 I am designing an application that will capture time series data where we
 expect the number of records per user to potentially be extremely high.  I
 am not sure if we will eclipse the max row size of 2B elements, but I
 assume that we would not want our application to approach that size anyway.

 If we wanted to put all of the interactions in a single row, then I would
 make a data model that looks like:

 CREATE TABLE events (
   id text,
   event_time timestamp,
   event blob,
   PRIMARY KEY (id, event_time))
 WITH CLUSTERING ORDER BY (event_time DESC);

 The best practice for breaking up large rows of time series data is, as I
 understand it, to put part of the time into the partitioning key (
 http://planetcassandra.org/getting-started-with-time-series-data-modeling/
 ):

 CREATE TABLE events (
   id text,
   date text, // Could also use year+month here or year+week or something else
   event_time timestamp,
   event blob,
   PRIMARY KEY ((id, date), event_time))
 WITH CLUSTERING ORDER BY (event_time DESC);

 The downside of this approach is that we can no longer do a simple
 continuous scan to get all of the events for a given user.  Some users may
 log lots and lots of interactions every day, while others may interact with
 our application infrequently, so I'd like a quick way to get the most
 recent interaction for a given user.

 Has anyone used different approaches for this problem?

 The only thing I can think of is to use the second table schema described
 above, but switch to an order-preserving hashing function, and then
 manually hash the id field.  This is essentially what we would do in
 HBase.

 Curious if anyone else has any thoughts.

 Best regards,
 Clint





Re: best practices for time-series data with massive amounts of records

2015-03-03 Thread mck

 Here partition is a random digit from 0 to (N*M) 
 where N=nodes in cluster, and M=arbitrary number.


Hopefully it was obvious, but here (unless you've got hot partitions),
you don't need N.
~mck


Re: best practices for time-series data with massive amounts of records

2015-03-03 Thread mck
Clint,

 CREATE TABLE events (
   id text,
   date text, // Could also use year+month here or year+week or something else
   event_time timestamp,
   event blob,
   PRIMARY KEY ((id, date), event_time))
 WITH CLUSTERING ORDER BY (event_time DESC);
 
 The downside of this approach is that we can no longer do a simple
 continuous scan to get all of the events for a given user.  Some users
 may log lots and lots of interactions every day, while others may interact
 with our application infrequently, so I'd like a quick way to get the most
 recent interaction for a given user.
 
 Has anyone used different approaches for this problem?


One idea is to provide additional manual partitioning like…

CREATE TABLE events (
  user_id text,
  partition int,
  event_time timeuuid,
  event_json text,
  PRIMARY KEY ((user_id, partition), event_time)
) WITH
  CLUSTERING ORDER BY (event_time DESC) AND
  compaction={'class': 'DateTieredCompactionStrategy'};


Here partition is a random integer from 0 to (N*M) 
where N=nodes in cluster, and M=arbitrary number.

Read performance is going to suffer a little because you need to query
N*M times as many partition keys for each read, but it should be constant
enough that it comes down to increasing the cluster's hardware and scaling
out as need be.

The multi-key reads you can do with a SELECT…IN query, or better yet
with parallel reads (less pressure on the coordinator at the expense of
extra network calls).
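
For example, assuming M=1 on a three-node cluster (so partitions 0 through 2),
a windowed read over all sub-partitions might look like the following; rows
come back grouped by partition, so the client still merges or sorts across the
groups:

SELECT * FROM events
 WHERE user_id = 'u123'
   AND partition IN (0, 1, 2)
   AND event_time >= minTimeuuid('2015-03-05 00:00:00+0000')
   AND event_time <  maxTimeuuid('2015-03-06 00:00:00+0000');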

Starting with M=1, you have the option to increase it over time if the
number of rows in any user's partitions gets too high.
(We do¹ something similar for storing all raw events in our enterprise
platform, but because the data is not user-centric the initial partition
key is minute-by-minute time buckets, and M has remained at 1 the whole
time.)

This approach is better than using an order-preserving partitioner (really,
don't do that).

I would also consider replacing event blob with event text, choosing
JSON instead of any binary serialisation. We've learnt the hard way the
value of data transparency, and I'm guessing the storage cost is small
given C* compression.

Otherwise the advice here is largely repeating what Jens has already
said.

~mck

  ¹ slide 19+20 from
  
https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/


best practices for time-series data with massive amounts of records

2015-03-02 Thread Clint Kelly
Hi all,

I am designing an application that will capture time series data where we
expect the number of records per user to potentially be extremely high.  I
am not sure if we will eclipse the max row size of 2B elements, but I
assume that we would not want our application to approach that size anyway.

If we wanted to put all of the interactions in a single row, then I would
make a data model that looks like:

CREATE TABLE events (
  id text,
  event_time timestamp,
  event blob,
  PRIMARY KEY (id, event_time))
WITH CLUSTERING ORDER BY (event_time DESC);

The best practice for breaking up large rows of time series data is, as I
understand it, to put part of the time into the partitioning key (
http://planetcassandra.org/getting-started-with-time-series-data-modeling/):

CREATE TABLE events (
  id text,
  date text, // Could also use year+month here or year+week or something else
  event_time timestamp,
  event blob,
  PRIMARY KEY ((id, date), event_time))
WITH CLUSTERING ORDER BY (event_time DESC);

The downside of this approach is that we can no longer do a simple
continuous scan to get all of the events for a given user.  Some users may
log lots and lots of interactions every day, while others may interact with
our application infrequently, so I'd like a quick way to get the most
recent interaction for a given user.

Has anyone used different approaches for this problem?

The only thing I can think of is to use the second table schema described
above, but switch to an order-preserving hashing function, and then
manually hash the id field.  This is essentially what we would do in
HBase.

Curious if anyone else has any thoughts.

Best regards,
Clint