Would it make sense to convert your fat table into a tall table by keeping the source of the metric as part of the row key (maybe as the suffix)? To access all the entries for a particular user, metric and time, you would then do a prefix match on your key. Also, all the keys for a particular user, metric and time sort next to each other, so they will fall in the same or adjacent regions.
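To make the tall-table idea a bit more concrete, here is a rough sketch in plain Python. It is not HBase client code: a sorted list stands in for HBase's lexicographic row ordering, and the "|" separator, the key layout and the names tall_key and prefix_scan are all made up for illustration.

```python
from bisect import bisect_left

SEP = "|"  # hypothetical field separator, not something HBase mandates


def tall_key(user_id, metric, day, source_id):
    """Build a tall-table row key: user, metric, day, then source as the suffix."""
    return SEP.join([user_id, metric, day, source_id])


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix, reading the sorted keys in order.

    This mimics a prefix match over lexicographically sorted row keys:
    seek to the first key >= prefix, then stop at the first non-match.
    """
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


# One small row per (user, metric, day, source) instead of one huge row.
rows = sorted([
    tall_key("user42", "mentions", "20120110", "alice"),
    tall_key("user42", "mentions", "20120110", "bob"),
    tall_key("user42", "mentions", "20120111", "carol"),
    tall_key("user7", "mentions", "20120110", "alice"),
])

# An empty source yields "user42|mentions|20120110|"; the trailing separator
# keeps the prefix from also matching a longer day or user id.
prefix = tall_key("user42", "mentions", "20120110", "")
same_day = prefix_scan(rows, prefix)
```

Here same_day holds only the two rows for user42, mentions, 20120110. A day with a million sources then becomes a million small adjacent rows rather than one 121MB row.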

On Tue, Jan 10, 2012 at 11:41 PM, Stack <[email protected]> wrote:

> On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <[email protected]> wrote:
> > i was scanning through different questions that people asked in this
> > mailing list regarding choosing the right schema so that map reduce jobs
> > can be run appropriately and hot regions avoided due to sequential
> > accesses.
> > somewhere, i got the impression that it is ok for a row to have millions of
> > columns and/or have large volume of data per region. but then my map reduce
> > job to copy rows failed due to row size being too large (121MB). so now i
> > am confused about whats the recommended way. does it mean that default
> > region size and other configuration parameters need to be tweaked?
> >
>
> Yeah, if you request all of the row, its going to try and give it to
> you even if millions of columns. You can ask the scan to give you
> back a bounded number of columns per iteration so you read through the
> big row a piece at a time.
>
> > in my use case, my system is receiving lots of metrics for different users
> > and i need to maintain daily counters for each of them. it is at day
> > granularity and not a typical TSD series. my row key has user id, metric
> > name as prefix and day timestamp as suffix. and i keep incrementing the
> > values. the scale issue happens because i store information about the
> > source of the metric too. e.g. i store the id of the person who mentioned
> > my user in a tweet.. I am storing all that information in different columns
> > of the same row. so the pattern here is variable - you can have a million
> > people tweet about someone and just 2 people tweet about someone else on a
> > given day. is it a bad idea to use columns here? i did it this way because
> > it makes it easy for a different process to run later and aggregate
> > information such as list all people who mentioned my user during a given
> > date range.
> >
>
> All in one column family? Would it make sense to have more than one CF?
>
> St.Ack
>
