Re: Best way to store millisecond-accurate data

2010-05-05 Thread Даниел Симеонов
Hi
In practice, one would want to model their data such that the 'row has
too many columns' scenario is prevented.
   I am curious how to actually prevent this: if the data is sharded with
one-day granularity, nothing stops the client from inserting an enormous
number of new columns (very often it is impossible to foresee how much data
clients will insert), so some mechanism is needed to prevent too many columns
in a row (how many is too many depends on the data), and such runtime
sharding becomes necessary (splitting the one-day granularity into two rows).
I am still wondering whether this runtime sharding is possible in Cassandra.
Best regards, Daniel.
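
A minimal sketch of the client-side splitting being described, in Python,
assuming a hypothetical per-bucket column cap and a plain dict standing in
for the column family (this is not a built-in Cassandra feature):

COLUMNS_PER_BUCKET = 10_000_000   # hypothetical cap; "too many" depends on the data

store = {}          # row_key -> {column_name: value}, stand-in for a column family
bucket_counts = {}  # row_key -> approximate column count

def insert_sample(patient, parameter, day, ts_ms, value):
    shard = 0
    row_key = f"{patient}-{day}-{parameter}-{shard}"
    while bucket_counts.get(row_key, 0) >= COLUMNS_PER_BUCKET:
        shard += 1  # current sub-bucket is full; spill into the next one
        row_key = f"{patient}-{day}-{parameter}-{shard}"
    store.setdefault(row_key, {})[ts_ms] = value
    bucket_counts[row_key] = bucket_counts.get(row_key, 0) + 1

The hard part is the count itself: with many concurrent writers it would
have to be persisted and shared, or derived from timestamp ranges, which is
why this is easier as a batch re-bucketing job than inline on the write path.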

2010/5/4 Miguel Verde miguelitov...@gmail.com

 One would use batch processes (e.g. through Hadoop) or client-side
 aggregation, yes. In theory it would be possible to introduce runtime
 sharding across rows into the Cassandra server side, but it's not part of
 its design.

 In practice, one would want to model their data such that the 'row has too
 many columns' scenario is prevented.

 On May 4, 2010, at 8:06 AM, Даниел Симеонов dsimeo...@gmail.com wrote:

 Hi Miguel,
   I'd like to ask whether it is possible to have runtime sharding of rows in
 Cassandra, i.e. if a row has too many new columns inserted, then create
 another row (let's say, if the original time sharding is one day per row, we
 would then have two rows for that day). Maybe batch processes could do that.
 Best regards, Daniel.

 2010/4/24 Miguel Verde miguelitov...@gmail.com

 TimeUUID's time component is measured in 100-nanosecond intervals. The
 library you use might calculate it with poorer accuracy or precision, but
 from a storage/comparison standpoint in Cassandra millisecond data is easily
 captured by it.

 One typical way of dealing with the data explosion of sampled time series
 data is to bucket/shard rows (e.g. Bob-20100423-bloodpressure) so that you
 put an upper bound on the row length.


 On Apr 23, 2010, at 7:01 PM, Andrew Nguyen andrew-lists-cassan...@ucsfcti.org wrote:

  Hello,

 I am looking to store patient physiologic data in Cassandra - it's being
 collected at rates of 1 to 125 Hz.  I'm thinking of storing the timestamps
 as the column names and the patient/parameter combo as the row key.  For
 example, Bob is in the ICU and is currently having his blood pressure,
 intracranial pressure, and heart rate monitored.  I'd like to collect this
 with the following row keys:

 Bob-bloodpressure
 Bob-intracranialpressure
 Bob-heartrate

 The column names would be timestamps but that's where my questions start:

 I'm not sure what the best data type and CompareWith would be.  From my
 searching, it sounds like the TimeUUID may be suitable but isn't really
 designed for millisecond accuracy.  My other thought is just to store them
 as strings (2010-04-23 10:23:45.016).  While space isn't the foremost
 concern, we will be collecting this data 24/7 so we'll be creating many
 columns over the long-term.

 I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states that the
 entire row must fit in memory.  Does this include the values as well as the
 column names?

 In considering the limits of Cassandra and the best way to model this, we
 would be adding 3.9 billion columns per year (assuming 125 Hz @ 24/7).
  However, I can't really think of a better way to model this...  So, am I
 thinking about this all wrong or am I on the right track?

 Thanks,
 Andrew





Re: Best way to store millisecond-accurate data

2010-04-23 Thread Miguel Verde
TimeUUID's time component is measured in 100-nanosecond intervals. The
library you use might calculate it with poorer accuracy or precision, but
from a storage/comparison standpoint in Cassandra millisecond data is easily
captured by it.
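
To make the 100-nanosecond point concrete, here is a stdlib-only Python
sketch (not any particular client library's generator) that packs a Unix
millisecond timestamp into the 60-bit time field of a version-1 UUID; the
clock sequence and node are zeroed for simplicity:

import uuid

# Offset, in 100 ns ticks, between the UUID v1 epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
GREGORIAN_OFFSET = 0x01B21DD213814000

def timeuuid_from_millis(unix_ms, clock_seq=0, node=0):
    ticks = unix_ms * 10_000 + GREGORIAN_OFFSET  # 1 ms = 10,000 ticks
    return uuid.UUID(fields=(
        ticks & 0xFFFFFFFF,                      # time_low
        (ticks >> 32) & 0xFFFF,                  # time_mid
        ((ticks >> 48) & 0x0FFF) | 0x1000,       # time_hi + version 1
        ((clock_seq >> 8) & 0x3F) | 0x80,        # clock_seq_hi + RFC 4122 variant
        clock_seq & 0xFF,                        # clock_seq_low
        node))

# Millisecond precision survives with room to spare.
u = timeuuid_from_millis(1272066225016)          # 2010-04-23 23:43:45.016 UTC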


One typical way of dealing with the data explosion of sampled time series
data is to bucket/shard rows (e.g. Bob-20100423-bloodpressure) so that you
put an upper bound on the row length.
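
A sketch of that key scheme, assuming the pycassa client and a hypothetical
Vitals column family defined with CompareWith=LongType so that millisecond
column names sort chronologically:

from datetime import datetime, timezone
import pycassa  # assumed Thrift client; any Cassandra client works similarly

pool = pycassa.ConnectionPool('Hospital')        # hypothetical keyspace
vitals = pycassa.ColumnFamily(pool, 'Vitals')    # CompareWith=LongType assumed

def insert_sample(patient, parameter, ts_ms, value):
    # One row per patient/parameter/day bounds row width at
    # 125 Hz * 86,400 s = 10.8M columns per day.
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime('%Y%m%d')
    row_key = f'{patient}-{day}-{parameter}'     # e.g. Bob-20100423-bloodpressure
    vitals.insert(row_key, {ts_ms: str(value)})  # column name = ms timestamp

Reading a time range is then a column slice over the bucket or buckets
covering that range.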


On Apr 23, 2010, at 7:01 PM, Andrew Nguyen andrew-lists-cassan...@ucsfcti.org wrote:



Hello,

I am looking to store patient physiologic data in Cassandra - it's being
collected at rates of 1 to 125 Hz.  I'm thinking of storing the timestamps
as the column names and the patient/parameter combo as the row key.  For
example, Bob is in the ICU and is currently having his blood pressure,
intracranial pressure, and heart rate monitored.  I'd like to collect this
with the following row keys:


Bob-bloodpressure
Bob-intracranialpressure
Bob-heartrate

The column names would be timestamps but that's where my questions start:


I'm not sure what the best data type and CompareWith would be.  From my
searching, it sounds like the TimeUUID may be suitable but isn't really
designed for millisecond accuracy.  My other thought is just to store them
as strings (2010-04-23 10:23:45.016).  While space isn't the foremost
concern, we will be collecting this data 24/7 so we'll be creating many
columns over the long-term.


I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states
that the entire row must fit in memory.  Does this include the values as
well as the column names?


In considering the limits of Cassandra and the best way to model this, we
would be adding 3.9 billion columns per year (assuming 125 Hz @ 24/7).
However, I can't really think of a better way to model this...  So, am I
thinking about this all wrong or am I on the right track?
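
The arithmetic checks out per signal:

# Volume of one continuously sampled 125 Hz signal.
samples_per_day = 125 * 60 * 60 * 24       # 10,800,000 columns per day bucket
samples_per_year = samples_per_day * 365   # 3,942,000,000 ~= 3.9 billion/year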


Thanks,
Andrew


Re: Best way to store millisecond-accurate data

2010-04-23 Thread Erik Holstad
On Fri, Apr 23, 2010 at 5:54 PM, Miguel Verde miguelitov...@gmail.com wrote:

 TimeUUID's time component is measured in 100-nanosecond intervals. The
 library you use might calculate it with poorer accuracy or precision, but
 from a storage/comparison standpoint in Cassandra millisecond data is easily
 captured by it.

 One typical way of dealing with the data explosion of sampled time series
 data is to bucket/shard rows (e.g. Bob-20100423-bloodpressure) so that you
 put an upper bound on the row length.


 On Apr 23, 2010, at 7:01 PM, Andrew Nguyen 
 andrew-lists-cassan...@ucsfcti.org wrote:

  Hello,

 I am looking to store patient physiologic data in Cassandra - it's being
 collected at rates of 1 to 125 Hz.  I'm thinking of storing the timestamps
 as the column names and the patient/parameter combo as the row key.  For
 example, Bob is in the ICU and is currently having his blood pressure,
 intracranial pressure, and heart rate monitored.  I'd like to collect this
 with the following row keys:

 Bob-bloodpressure
 Bob-intracranialpressure
 Bob-heartrate

 The column names would be timestamps but that's where my questions start:

 I'm not sure what the best data type and CompareWith would be.  From my
 searching, it sounds like the TimeUUID may be suitable but isn't really
 designed for millisecond accuracy.  My other thought is just to store them
 as strings (2010-04-23 10:23:45.016).  While space isn't the foremost
 concern, we will be collecting this data 24/7 so we'll be creating many
 columns over the long-term.

 You could just use an 8-byte millisecond timestamp and store that as part of
the key.
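
A sketch of that encoding: eight big-endian bytes compare lexically in
timestamp order, which is what a LongType comparator expects of column names.

import struct, time

ts_ms = int(time.time() * 1000)         # current time in milliseconds
column_name = struct.pack('>q', ts_ms)  # 8-byte big-endian long
# Lexical order of these byte strings == numeric order of the timestamps,
# so slices by time range work directly.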


 I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states
 that the entire row must fit in memory.  Does this include the values as
 well as the column names?

 Yes. An alternative is to store one insert per row; you won't be able to do
backwards slices that way without an extra index, but you can scale much
better.
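
A sketch of what that layout could look like; the key scheme and index row
are hypothetical:

# One row per sample: no row ever grows, but with RandomPartitioner the keys
# scatter across the cluster, so time-range reads need a separate index row.
def sample_row_key(patient, parameter, ts_ms):
    return f'{patient}-{parameter}-{ts_ms}'  # e.g. Bob-heartrate-1272066225016

def index_entry(patient, parameter, ts_ms):
    # Optional index: one row per patient/parameter whose columns point at
    # the per-sample rows, restoring ordered slices at the cost of a hot row.
    return (f'{patient}-{parameter}-index', ts_ms,
            sample_row_key(patient, parameter, ts_ms))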


 In considering the limits of Cassandra and the best way to model this, we
 would be adding 3.9 billion columns per year (assuming 125 Hz @ 24/7).
  However, I can't really think of a better way to model this...  So, am I
 thinking about this all wrong or am I on the right track?

 Thanks,
 Andrew




-- 
Regards Erik