On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <[email protected]> wrote:
> I was scanning through different questions that people asked on this
> mailing list about choosing the right schema so that MapReduce jobs can be
> run appropriately and hot regions avoided due to sequential accesses.
> Somewhere, I got the impression that it is OK for a row to have millions of
> columns and/or a large volume of data per region. But then my MapReduce job
> to copy rows failed because the row size was too large (121MB). So now I am
> confused about what the recommended way is. Does it mean that the default
> region size and other configuration parameters need to be tweaked?
>

Yeah, if you request all of the row, it's going to try and give it to you
even if it has millions of columns. You can ask the scan to give you back a
bounded number of columns per iteration, so you read through the big row a
piece at a time.
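For illustration, a minimal sketch of such a batched scan using the old
HTable-based Java client API; the table name "metrics" and the batch size
are placeholders, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedWideRowScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics"); // placeholder table name

    Scan scan = new Scan();
    // Return at most 10,000 columns per call to next(), so a row with
    // millions of columns comes back as a series of partial Results
    // instead of one huge Result.
    scan.setBatch(10000);
    scan.setCaching(1); // keep only one chunk of the row in flight at a time

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result chunk : scanner) {
        // Consecutive chunks share the same row key while a wide row is
        // being paged through.
        String row = Bytes.toString(chunk.getRow());
        for (KeyValue kv : chunk.raw()) {
          // Process one column of the (possibly very wide) row.
          System.out.println(row + " " + Bytes.toString(kv.getQualifier()));
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}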
> In my use case, my system is receiving lots of metrics for different users,
> and I need to maintain daily counters for each of them. It is at day
> granularity and not a typical TSD series. My row key has user id and metric
> name as the prefix and day timestamp as the suffix, and I keep incrementing
> the values. The scale issue happens because I also store information about
> the source of the metric, e.g. the id of the person who mentioned my user
> in a tweet. I am storing all that information in different columns of the
> same row. So the pattern here is variable - you can have a million people
> tweet about someone and just 2 people tweet about someone else on a given
> day. Is it a bad idea to use columns here? I did it this way because it
> makes it easy for a different process to run later and aggregate
> information, such as listing all people who mentioned my user during a
> given date range.
>

All in one column family? Would it make sense to have more than one CF?

St.Ack
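For illustration, a rough sketch of what such a split might look like with
the old HTable-based Java client API: one family for the small daily
counters and another for the wide set of per-mention columns. The table
name "user_metrics", the family names "daily" and "src", and the literal
ids are placeholders; the row key follows the layout described above (user
id and metric name as prefix, day timestamp as suffix):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoFamilySchemaSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // One family for the small, hot daily counters and a separate family
    // for the potentially very wide set of per-mention columns.
    HTableDescriptor desc = new HTableDescriptor("user_metrics");
    desc.addFamily(new HColumnDescriptor("daily"));
    desc.addFamily(new HColumnDescriptor("src"));
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);

    // Row key as described above: user id and metric name as the prefix,
    // day timestamp as the suffix.
    byte[] row = Bytes.toBytes("user123/mentions/20120110");

    HTable table = new HTable(conf, "user_metrics");

    // Bump the daily counter in the "daily" family...
    Increment inc = new Increment(row);
    inc.addColumn(Bytes.toBytes("daily"), Bytes.toBytes("count"), 1L);
    table.increment(inc);

    // ...and record who mentioned the user as its own column in "src",
    // so a later aggregation over mentioners reads only that family.
    Put put = new Put(row);
    put.add(Bytes.toBytes("src"), Bytes.toBytes("tweeter456"),
        new byte[0]); // qualifier is the mentioner's id; empty value
    table.put(put);

    table.close();
  }
}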
