I would bucket the time stats as well. If you write all the attributes at the same time and always want to read them together, storing them in something like a JSON blob is a legitimate approach.
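Something along these lines, for example (plain Python just to show the shape of it; the day-sized bucket and the attribute names are only for illustration):

    import json
    import time

    BUCKET_SECONDS = 24 * 60 * 60  # one-day buckets; size this to match your read pattern

    def row_key(user_id, ts):
        # <UserId>:<TimeBucket>, where the bucket is the start of the day (unix epoch)
        bucket = int(ts) - (int(ts) % BUCKET_SECONDS)
        return "%s:%d" % (user_id, bucket)

    def pack_event(ts, attrs):
        # One column per event: name = timestamp, value = JSON blob of all the attributes
        return int(ts), json.dumps(attrs, separators=(",", ":"))

    def unpack_event(col_name, col_value):
        return int(col_name), json.loads(col_value)

    # Everything is written together and read back together as one unit.
    now = time.time()
    key = row_key("user42", now)
    name, value = pack_event(now, {"colname": 1.5, "colname2": "foo"})
    print(key, name, value)
    print(unpack_event(name, value))

The trade-off is that you give up slicing on individual attributes server side - you always pull the whole blob back - but for write-together / read-together data that is usually what you want anyway.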
Other Aaron, can you elaborate on this?

> I'm not using composite row keys (it's just AsciiType) as that can lead
> to hotspots on disk.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/05/2012, at 4:56 AM, Aaron Turner wrote:

> On Thu, May 17, 2012 at 8:55 AM, jason kowalewski
> <jay.kowalew...@gmail.com> wrote:
>> We have been attempting to change our data model to provide more
>> performance in our cluster.
>>
>> Currently there are a couple of ways to model the data, and I was
>> wondering if some people out there could help us out.
>>
>> We are storing time-series data currently keyed by a user id. This
>> current approach is leading to some hot-spotting of nodes, likely due
>> to the key distribution not being representative of the usage pattern.
>> Currently we are using super columns (the super column name is the
>> timestamp), which we intend to dispose of as well with this data model
>> redesign.
>>
>> The first idea we had is that we can shard the data using composite row
>> keys into time buckets:
>>
>> <UserId>:<TimeBucket> : {
>>   <timestamp>:<colname> = <col value1>,
>>   <timestamp>:<colname2> = <col value2>
>>   ... and so on.
>> }
>>
>> We can then use a wide row index for tracking these in the future:
>>
>> <TimeBucket> : {
>>   <userId> = null
>> }
>>
>> This first approach would always have the data be retrieved by the composite
>> row key.
>>
>> Alternatively we could just do wide rows using composite columns:
>>
>> <UserId> : {
>>   <timestamp>:<colname> = <col value1>,
>>   <timestamp>:<colname2> = <col value2>
>>   ... and so on
>> }
>>
>> The second approach would have less granular keys, but it is easier to group
>> historical time series rather than sharding the data into buckets. This second
>> approach will also depend solely on range slices of the columns to retrieve
>> the data.
>>
>> Is there a speed advantage in doing a row point get in the first approach vs.
>> range scans on the columns in the second approach? In the first approach
>> each bucket would have no more than 200 events. In the second approach we
>> would expect the number of columns to be in the thousands to hundreds of
>> thousands... Our reads currently (using super columns) are PAINFULLY slow -
>> the cluster is constantly timing out on many nodes and disk I/O is very high.
>>
>> Also, instead of having each column name as a new composite column, is it
>> better to serialize the multiple values into some format (JSON, binary, etc.)
>> to reduce the amount of disk seeks when paging over this time-series data?
>>
>> Thanks for any ideas out there!
>
> You didn't say what your queries look like, but the way I did it was:
>
> <userid>|<stat_name>|<timebucket> : {
>   <timestamp> = <value>
> }
>
> This provides very efficient reads for a given user/stat combination.
> If I need to get multiple stats per user, I just use more threads on
> the client side. I'm not using composite row keys (it's just
> AsciiType) as that can lead to hotspots on disk. My timestamps are
> also just plain unix epochs, as that takes less space than something
> like TimeUUID.
>
> --
> Aaron Turner
> http://synfin.net/  Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>     -- Benjamin Franklin
> "carpe diem quam minimum credula postero"
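To make the layout concrete, here is roughly what Aaron T's schema looks like from the client side. This is only a sketch in Python against a pycassa-style Thrift client; the keyspace and CF names and the one-day bucket are invented, and I'm assuming a LongType column comparator so the plain epoch column names sort numerically:

    import time
    import pycassa

    BUCKET_SECONDS = 24 * 60 * 60  # one-day buckets, purely illustrative

    def stat_key(user_id, stat_name, ts):
        # <userid>|<stat_name>|<timebucket> as a plain ascii string
        bucket = int(ts) - (int(ts) % BUCKET_SECONDS)
        return "%s|%s|%d" % (user_id, stat_name, bucket)

    pool = pycassa.ConnectionPool("MyKeyspace", ["localhost:9160"])
    stats = pycassa.ColumnFamily(pool, "UserStats")

    # Write: column name = plain unix epoch, column value = the reading.
    now = int(time.time())
    stats.insert(stat_key("user42", "cpu_load", now), {now: "0.75"})

    # Read one user/stat for the last hour: a point get on the bucketed row,
    # then a column slice between the two timestamps.
    key = stat_key("user42", "cpu_load", now)
    cols = stats.get(key, column_start=now - 3600, column_finish=now, column_count=200)
    for ts, value in cols.items():
        print(ts, value)

Multiple stats per user are just more of these keys fetched in parallel (threads or a multiget), which keeps every read a small contiguous slice instead of a scan over one enormous row.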