Re: Best way to store millisecond-accurate data
Hi,

In practice one would want to model the data so that the 'row has too many columns' scenario is prevented, but I am curious how to actually prevent it. If the data is sharded at one-day granularity, nothing stops a client from inserting an enormous number of new columns (very often it is not possible to foresee how much data clients will insert). Then some mechanism is needed to keep a row from growing too large ("too large" depends on the data), and such runtime sharding becomes necessary (to split the one-day bucket into two rows). I am still wondering whether this kind of runtime sharding is possible in Cassandra.

Best regards, Daniel.

2010/5/4 Miguel Verde miguelitov...@gmail.com:

One would use batch processes (e.g. through Hadoop) or client-side aggregation, yes. In theory it would be possible to introduce runtime sharding across rows into the Cassandra server side, but it's not part of its design. In practice, one would want to model their data such that the 'row has too many columns' scenario is prevented.

On May 4, 2010, at 8:06 AM, Даниел Симеонов dsimeo...@gmail.com wrote:

Hi Miguel,

I'd like to ask whether it is possible to have runtime sharding of rows in Cassandra, i.e. if a row has too many new columns inserted, then create another row (say, if the original time sharding is one day per row, we would end up with two rows for that day). Maybe batch processes could do that.

Best regards, Daniel.

2010/4/24 Miguel Verde miguelitov...@gmail.com:

TimeUUID's time component is measured in 100-nanosecond intervals. The library you use might calculate it with poorer accuracy or precision, but from a storage/comparison standpoint in Cassandra millisecond data is easily captured by it. One typical way of dealing with the data explosion of sampled time series data is to bucket/shard rows (i.e. Bob-20100423-bloodpressure) so that you put an upper bound on the row length.
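[A minimal sketch of the client-side "runtime sharding" Daniel describes. The class and constant names are hypothetical, and the per-row column counter is kept in-process only; a real deployment would need a shared counter or a deterministic rule (e.g. shard by hour), since Cassandra had no server-side support for this.]

```python
import datetime

# Hypothetical threshold: once a day-bucket row holds this many columns,
# further writes spill into the next sub-shard row. Kept tiny here so the
# spill behavior is visible in the example.
MAX_COLUMNS_PER_ROW = 2

class ShardedWriter:
    """Computes row keys of the form <base>-<YYYYMMDD>-<shard>."""

    def __init__(self):
        self._counts = {}  # (base_key, day, shard) -> columns written so far

    def row_key(self, base_key, when):
        day = when.strftime("%Y%m%d")
        shard = 0
        # Walk forward to the first sub-shard that still has room.
        while self._counts.get((base_key, day, shard), 0) >= MAX_COLUMNS_PER_ROW:
            shard += 1
        self._counts[(base_key, day, shard)] = \
            self._counts.get((base_key, day, shard), 0) + 1
        return "%s-%s-%d" % (base_key, day, shard)

w = ShardedWriter()
t = datetime.datetime(2010, 4, 23, 10, 0, 0)
keys = [w.row_key("Bob-bloodpressure", t) for _ in range(5)]
# With MAX_COLUMNS_PER_ROW = 2: writes 1-2 land in shard 0,
# writes 3-4 in shard 1, write 5 in shard 2.
```

To read a day back, a client would slice shards 0, 1, 2, ... until it hits a missing row.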
On Apr 23, 2010, at 7:01 PM, Andrew Nguyen andrew-lists-cassan...@ucsfcti.org wrote:

Hello,

I am looking to store patient physiologic data in Cassandra - it's being collected at rates of 1 to 125 Hz. I'm thinking of storing the timestamps as the column names and the patient/parameter combo as the row key. For example, Bob is in the ICU and is currently having his blood pressure, intracranial pressure, and heart rate monitored. I'd like to collect this with the following row keys:

Bob-bloodpressure
Bob-intracranialpressure
Bob-heartrate

The column names would be timestamps, but that's where my questions start: I'm not sure what the best data type and CompareWith would be. From my searching, it sounds like TimeUUID may be suitable but isn't really designed for millisecond accuracy. My other thought is just to store them as strings (2010-04-23 10:23:45.016). While space isn't the foremost concern, we will be collecting this data 24/7, so we'll be creating many columns over the long term.

I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states that the entire row must fit in memory. Does this include the values as well as the column names? In considering the limits of Cassandra and the best way to model this, we would be adding 3.9 billion columns per year (assuming 125 Hz @ 24/7). However, I can't really think of a better way to model this...

So, am I thinking about this all wrong or am I on the right track?

Thanks, Andrew
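[A sketch of the string-timestamp variant Andrew mentions. The point is that under a bytes/UTF8 comparator, a zero-padded, fixed-width timestamp string sorts lexically in the same order as chronologically, which is what slice queries need; the helper name is made up for illustration.]

```python
import datetime

def column_name(ts):
    """Fixed-width, millisecond-precision column name, e.g.
    '2010-04-23 10:23:45.016'. Zero padding makes lexical order
    equal chronological order under a UTF8/bytes comparator."""
    return ts.strftime("%Y-%m-%d %H:%M:%S.") + "%03d" % (ts.microsecond // 1000)

row_key = "Bob-bloodpressure"
a = column_name(datetime.datetime(2010, 4, 23, 10, 23, 45, 16000))
b = column_name(datetime.datetime(2010, 4, 23, 10, 23, 45, 120000))
assert a < b  # '016' sorts before '120', matching time order
```

The trade-off versus TimeUUID is size (23 bytes per column name instead of 16) and no built-in uniqueness if two samples share a millisecond.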
Re: Best way to store millisecond-accurate data
TimeUUID's time component is measured in 100-nanosecond intervals. The library you use might calculate it with poorer accuracy or precision, but from a storage/comparison standpoint in Cassandra millisecond data is easily captured by it. One typical way of dealing with the data explosion of sampled time series data is to bucket/shard rows (i.e. Bob-20100423-bloodpressure) so that you put an upper bound on the row length.

On Apr 23, 2010, at 7:01 PM, Andrew Nguyen andrew-lists-cassan...@ucsfcti.org wrote:

Hello,

I am looking to store patient physiologic data in Cassandra - it's being collected at rates of 1 to 125 Hz. I'm thinking of storing the timestamps as the column names and the patient/parameter combo as the row key. For example, Bob is in the ICU and is currently having his blood pressure, intracranial pressure, and heart rate monitored. I'd like to collect this with the following row keys:

Bob-bloodpressure
Bob-intracranialpressure
Bob-heartrate

The column names would be timestamps, but that's where my questions start: I'm not sure what the best data type and CompareWith would be. From my searching, it sounds like TimeUUID may be suitable but isn't really designed for millisecond accuracy. My other thought is just to store them as strings (2010-04-23 10:23:45.016). While space isn't the foremost concern, we will be collecting this data 24/7, so we'll be creating many columns over the long term.

I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states that the entire row must fit in memory. Does this include the values as well as the column names? In considering the limits of Cassandra and the best way to model this, we would be adding 3.9 billion columns per year (assuming 125 Hz @ 24/7). However, I can't really think of a better way to model this...

So, am I thinking about this all wrong or am I on the right track?

Thanks, Andrew
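[To make Miguel's point concrete: a version-1 (time) UUID carries a 60-bit count of 100-nanosecond intervals since the Gregorian epoch (1582-10-15), so a millisecond timestamp maps onto it exactly. A sketch of the conversion; the function names are invented for illustration, and the clock sequence and node are fixed at zero rather than randomized as a real library would do.]

```python
import uuid

# 100-ns ticks between the Gregorian epoch (1582-10-15) and the Unix epoch.
GREGORIAN_OFFSET = 0x01B21DD213814000

def timeuuid_from_millis(unix_ms, clock_seq=0, node=0):
    """Build a version-1 UUID whose timestamp encodes unix_ms exactly."""
    ticks = unix_ms * 10000 + GREGORIAN_OFFSET          # ms -> 100-ns ticks
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000  # set version 1
    clock_seq_hi = ((clock_seq >> 8) & 0x3F) | 0x80      # RFC 4122 variant
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi, clock_seq_low, node))

def millis_from_timeuuid(u):
    """Recover the millisecond timestamp from a version-1 UUID."""
    return (u.time - GREGORIAN_OFFSET) // 10000

u = timeuuid_from_millis(1272023025016)
assert millis_from_timeuuid(u) == 1272023025016  # lossless round trip
```

So the 100-ns resolution of TimeUUID comfortably contains millisecond data; the only caveat is that two samples with the same millisecond need distinct clock-sequence values to stay unique.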
Re: Best way to store millisecond-accurate data
On Fri, Apr 23, 2010 at 5:54 PM, Miguel Verde miguelitov...@gmail.com wrote:

TimeUUID's time component is measured in 100-nanosecond intervals. The library you use might calculate it with poorer accuracy or precision, but from a storage/comparison standpoint in Cassandra millisecond data is easily captured by it. One typical way of dealing with the data explosion of sampled time series data is to bucket/shard rows (i.e. Bob-20100423-bloodpressure) so that you put an upper bound on the row length.

On Apr 23, 2010, at 7:01 PM, Andrew Nguyen andrew-lists-cassan...@ucsfcti.org wrote:

Hello, I am looking to store patient physiologic data in Cassandra - it's being collected at rates of 1 to 125 Hz. I'm thinking of storing the timestamps as the column names and the patient/parameter combo as the row key. For example, Bob is in the ICU and is currently having his blood pressure, intracranial pressure, and heart rate monitored. I'd like to collect this with the following row keys:

Bob-bloodpressure
Bob-intracranialpressure
Bob-heartrate

The column names would be timestamps, but that's where my questions start: I'm not sure what the best data type and CompareWith would be. From my searching, it sounds like TimeUUID may be suitable but isn't really designed for millisecond accuracy. My other thought is just to store them as strings (2010-04-23 10:23:45.016). While space isn't the foremost concern, we will be collecting this data 24/7, so we'll be creating many columns over the long term.

You could just get an 8-byte millisecond timestamp and store that as part of the key.

I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states that the entire row must fit in memory. Does this include the values as well as the column names?

Yes. One option is to store one insert per row; you are not going to be able to do backwards slices this way without an extra index, but you can scale much better.

In considering the limits of Cassandra and the best way to model this, we would be adding 3.9 billion columns per year (assuming 125 Hz @ 24/7). However, I can't really think of a better way to model this... So, am I thinking about this all wrong or am I on the right track? Thanks, Andrew

--
Regards Erik
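[A sketch of Erik's suggestion, under the assumption that the row key ends with the packed timestamp; the helper names are invented. Packing the millisecond timestamp as 8 big-endian bytes keeps byte order consistent with numeric order, which matters if keys are ever range-scanned under an order-preserving partitioner, and one-row-per-sample sidesteps the row-in-memory limit at the cost of losing in-row column slices.]

```python
import struct

def sample_row_key(patient_param, unix_ms):
    """Row key = '<patient>-<parameter>-' + 8 big-endian timestamp bytes,
    giving one small row per sample instead of one huge row per series."""
    return patient_param.encode("ascii") + b"-" + struct.pack(">q", unix_ms)

def millis_from_key(key):
    """Recover the millisecond timestamp from the last 8 bytes of the key."""
    return struct.unpack(">q", key[-8:])[0]

k = sample_row_key("Bob-heartrate", 1272023025016)
assert k.startswith(b"Bob-heartrate-")
assert millis_from_key(k) == 1272023025016
```

As Erik notes, time-range reads then require either an order-preserving partitioner or a separate index row listing the keys for each period.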