Yes, Vinod, you got it right. I was suggesting having the secondary users also be part of the row key, as a suffix.
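For what it's worth, here is a minimal sketch of that composite-key idea in plain Java. The `|` separator and names like `user42`/`tweeter99` are illustrative only (a real HBase key would be built as bytes, e.g. via `Bytes.toBytes`); the point is that with the secondary user as the suffix, all mentions for a (user, metric, day) prefix sit contiguously and can be read with a start/stop-row scan:

```java
import java.nio.charset.StandardCharsets;

public class RowKeySketch {
    // Compose a tall-table row key: primary user, metric, day, then the
    // secondary user appended as the suffix, joined by a '|' separator.
    static String rowKey(String user, String metric, String day, String secondaryUser) {
        return user + "|" + metric + "|" + day + "|" + secondaryUser;
    }

    // Derive the exclusive stop row for a prefix scan by incrementing the
    // last byte of the prefix, so [prefix, stop) covers exactly the prefix.
    static String stopRowFor(String prefix) {
        byte[] b = prefix.getBytes(StandardCharsets.UTF_8);
        b[b.length - 1]++;
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String prefix = "user42|mentions|20120110|";   // user + metric + day
        String key = rowKey("user42", "mentions", "20120110", "tweeter99");
        System.out.println(key.startsWith(prefix));              // falls inside the prefix range
        System.out.println(key.compareTo(stopRowFor(prefix)) < 0); // sorts before the stop row
    }
}
```

This also matches kisalay's point below about adjacency: every key sharing the prefix sorts together, so a range scan touches one contiguous slice of the table.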
On Wed, Jan 11, 2012 at 1:02 AM, T Vinod Gupta <[email protected]> wrote:

> Thanks St.Ack and Kisalay.
> In my case, I have primary users and people who interact with my primary
> users. Let's call them secondary users.
> Kisalay, you are right, and I already have the primary user, metric name,
> and timestamp in my row key. Did you mean having the secondary user also
> be part of the row key as the suffix? If yes, I might consider that.
> St.Ack - yeah, I have all secondary users in the same CF. Even if I add
> new CFs, most of the data is the secondary-user data, so it will all
> stack up in the new CF.
>
> Thanks
>
> On Tue, Jan 10, 2012 at 11:20 AM, kisalay <[email protected]> wrote:
>
> > Would it make sense to convert your fat table into a tall table by
> > keeping the source of the metric as part of the row key (maybe as the
> > suffix?). For accessing all the metrics associated with a particular
> > user, metric, and time, you would be resorting to a prefix match on
> > your key.
> > Also, all the keys for a particular user, metric, and time will fall
> > in adjacent regions.
> >
> > On Tue, Jan 10, 2012 at 11:41 PM, Stack <[email protected]> wrote:
> >
> > > On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <[email protected]>
> > > wrote:
> > >
> > > > I was scanning through different questions that people asked on
> > > > this mailing list regarding choosing the right schema so that
> > > > MapReduce jobs can be run appropriately and hot regions avoided
> > > > due to sequential accesses.
> > > > Somewhere, I got the impression that it is OK for a row to have
> > > > millions of columns and/or a large volume of data per region. But
> > > > then my MapReduce job to copy rows failed due to the row size
> > > > being too large (121MB). So now I am confused about what the
> > > > recommended way is. Does it mean that the default region size and
> > > > other configuration parameters need to be tweaked?
> > >
> > > Yeah, if you request all of the row, it's going to try and give it
> > > to you even if there are millions of columns. You can ask the scan
> > > to give you back a bounded number of columns per iteration so you
> > > read through the big row a piece at a time.
> > >
> > > > In my use case, my system is receiving lots of metrics for
> > > > different users, and I need to maintain daily counters for each
> > > > of them. It is at day granularity and not a typical TSD series.
> > > > My row key has the user id and metric name as the prefix and the
> > > > day timestamp as the suffix, and I keep incrementing the values.
> > > > The scale issue happens because I store information about the
> > > > source of the metric too, e.g. I store the id of the person who
> > > > mentioned my user in a tweet. I am storing all that information
> > > > in different columns of the same row. So the pattern here is
> > > > variable - you can have a million people tweet about someone and
> > > > just 2 people tweet about someone else on a given day. Is it a
> > > > bad idea to use columns here? I did it this way because it makes
> > > > it easy for a different process to run later and aggregate
> > > > information, such as listing all people who mentioned my user
> > > > during a given date range.
> > >
> > > All in one column family? Would it make sense to have more than
> > > one CF?
> > >
> > > St.Ack
