I would bucket the time stats as well.

If you write all the attributes at the same time and always want to read them 
together, storing them in something like a JSON blob is a legitimate approach. 
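
Something along these lines, just as a sketch (the day-sized bucket, the field 
names and the key format below are made up, and it isn't tied to any particular 
client library):

import json
import time

BUCKET_SECONDS = 60 * 60 * 24   # e.g. one row per user per day

def bucketed_row_key(user_id, ts):
    # Truncate the timestamp to the start of its bucket so each user's
    # data is spread across many rows instead of one ever-growing row.
    bucket = int(ts) - (int(ts) % BUCKET_SECONDS)
    return "%s:%d" % (user_id, bucket)

def event_column(ts, attrs):
    # One column per event: the column name is the event timestamp and
    # the value is all of the attributes serialised together as JSON.
    return str(int(ts)), json.dumps(attrs)

now = time.time()
row_key = bucketed_row_key("user42", now)
col_name, col_value = event_column(now, {"lat": 1.2, "lon": 3.4, "speed": 88})

Reading it back is then a single row read (or a slice of the row) followed by 
json.loads() on each column value.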

Other Aaron, can you elaborate on 
> I'm not using composite row keys (it's just
> AsciiType) as that can lead to hotspots on disk.  

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 18/05/2012, at 4:56 AM, Aaron Turner wrote:

> On Thu, May 17, 2012 at 8:55 AM, jason kowalewski
> <jay.kowalew...@gmail.com> wrote:
>> We have been attempting to change our data model to provide more
>> performance in our cluster.
>> 
>> Currently there are a couple of ways to model the data and I was
>> wondering if some people out there could help us out.
>> 
>> We are storing time-series data currently keyed by a user id. This
>> current approach is leading to some hot-spotting of nodes likely due
>> to the key distribution not being representative of the usage pattern.
>> Currently we are using super columns (the super column name is the
>> timestamp), which we intend to dispose of as well with this data model
>> redesign.
>> 
>> The first idea we had is that we can shard the data into time buckets
>> using composite row keys:
>> 
>> UserId:<TimeBucket> : {
>>  <timestamp>:<colname> = <col value1>,
>>  <timestamp>:<colname2> = <col value2>
>> ... and so on.
>> }
>> 
>> We can then use a wide row index for tracking these in the future:
>> <TimeBucket>: {
>>  <userId> = null
>> }
>> 
>> With this first approach the data would always be retrieved by the
>> composite row key.
>> 
>> Alternatively we could just do wide rows using composite columns:
>> 
>> UserId : {
>>  <timestamp>:<colname> = <col value1>,
>>  <timestamp>:<colname2> = <col value2>
>> 
>> ... and so on
>> }
>> 
>> 
>> The second approach would have less granular keys, but makes it easier to
>> group historical time series data rather than sharding it into buckets. This
>> second approach will also depend solely on range slices of the columns to
>> retrieve the data.
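
To make the difference between the two concrete, the keys look roughly like 
this (the bucket size and names are made up, no particular client assumed):

def approach_one_keys(user_id, ts, bucket_seconds=86400):
    # Row key carries the time bucket; the column name is the timestamp
    # plus the attribute name, so a read is a point get on a small row.
    bucket = int(ts) - int(ts) % bucket_seconds
    return "%s:%d" % (user_id, bucket), "%d:%s" % (int(ts), "colname")

def approach_two_keys(user_id, ts):
    # Row key is just the user id; the column name carries the timestamp,
    # so a read is a column slice over one wide row.
    return user_id, "%d:%s" % (int(ts), "colname")

def approach_one_rows_for_range(user_id, start_ts, end_ts, bucket_seconds=86400):
    # With bucketed rows a time-range query becomes a multi-get over the
    # buckets that the window covers.
    first = int(start_ts) - int(start_ts) % bucket_seconds
    return ["%s:%d" % (user_id, b)
            for b in range(first, int(end_ts) + 1, bucket_seconds)]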
>> 
>> Is there a speed advantage in doing a row point get in the first approach vs
>> range scans on these columns in the second approach? In the first approach
>> each bucket would have no more than 200 events. In the second approach we
>> would expect the number of columns to be in the thousands to hundreds of
>> thousands... Our reads currently (using supercolumns) are PAINFULLY slow -
>> the cluster is constantly timing out on many nodes and disk i/o is very high.
>> 
>> Also, instead of having each column name as a new composite column, is it
>> better to serialize the multiple values into some format (JSON, binary, etc.)
>> to reduce the amount of disk seeks when paging over this time series data?
>> 
>> Thanks for any ideas out there!
> 
> 
> You didn't say what your queries look like, but the way I did it was:
> 
> <userid>|<stat_name>|<timebucket> : {
>  <timestamp> = <value>
> }
> 
> This provides very efficient reads for a given user/stat combination.
> If I need to get multiple stats per user, I just use more threads on
> the client side.  I'm not using composite row keys (it's just
> AsciiType) as that can lead to hotspots on disk.  My timestamps are
> also just plain unix epochs as that takes less space than something
> like TimeUUID.
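
For what it's worth, that layout looks roughly like this as a sketch (the stat 
names and bucket size are invented):

def stat_row_key(user_id, stat_name, ts, bucket_seconds=86400):
    # Plain ascii row key, one row per user / stat / time bucket; the
    # columns inside the row would then be <epoch timestamp> = <value>.
    bucket = int(ts) - int(ts) % bucket_seconds
    return "%s|%s|%d" % (user_id, stat_name, bucket)

# One row per stat for a given user and bucket; reads for several stats get
# fanned out across client threads rather than using a composite row key.
keys = [stat_row_key("user42", s, 1337280000) for s in ("cpu", "mem", "disk_io")]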
> 
> 
> 
> -- 
> Aaron Turner
> http://synfin.net/         Twitter: @synfinatic
> http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & 
> Windows
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
>     -- Benjamin Franklin
> "carpe diem quam minimum credula postero"
