On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <[email protected]> wrote:
> I was scanning through different questions that people asked on this
> mailing list about choosing the right schema so that MapReduce jobs
> can be run appropriately and hot regions avoided due to sequential
> access.
> Somewhere I got the impression that it is OK for a row to have
> millions of columns and/or a large volume of data per region. But then
> my MapReduce job to copy rows failed because the row size was too
> large (121MB). So now I am confused about what the recommended way is.
> Does it mean that the default region size and other configuration
> parameters need to be tweaked?
>

Yeah, if you request all of the row, it's going to try and give it to
you even if the row has millions of columns.  You can ask the scan to
give you back a bounded number of columns per iteration so that you
read through the big row a piece at a time.
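
Something like the following works (just a sketch against the
HTable/Scan client API; the table name, row key, and batch size here
are made up for illustration):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class WideRowScan {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "metrics");  // hypothetical table name
      Scan scan = new Scan(Bytes.toBytes("user123#mentions#20120110"));  // hypothetical row key
      scan.setBatch(1000);  // return at most 1000 columns per Result from next()
      scan.setCaching(1);   // one batch per RPC while walking a very wide row
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // Each Result is a slice of up to 1000 KeyValues; consecutive Results
          // can belong to the same wide row, so key off getRow() when aggregating.
          System.out.println(Bytes.toString(r.getRow()) + ": " + r.size() + " columns");
        }
      } finally {
        scanner.close();
        table.close();
      }
    }
  }

With setBatch() set, a 121MB row comes back as a series of small
Results instead of one giant one.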

> In my use case, my system is receiving lots of metrics for different
> users, and I need to maintain daily counters for each of them. It is
> at day granularity and not a typical TSD series. My row key has the
> user id and metric name as prefix and the day timestamp as suffix, and
> I keep incrementing the values. The scale issue happens because I also
> store information about the source of the metric, e.g. I store the id
> of the person who mentioned my user in a tweet. I am storing all that
> information in different columns of the same row. So the pattern here
> is variable - a million people might tweet about someone, and just 2
> people tweet about someone else on a given day. Is it a bad idea to
> use columns here? I did it this way because it makes it easy for a
> different process to run later and aggregate information, such as
> listing all people who mentioned my user during a given date range.
>

All in one column family?  Would it make sense to have more than one CF?
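
For instance (just a sketch; the table, family, and column names below
are made up), the daily counters could live in one small family and the
per-mention source ids in another, so a job that only aggregates
counters never has to read through the wide mention data:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class UserMetricsSchema {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();

      // Two families: "c" for the daily counters, "s" for mention sources.
      HBaseAdmin admin = new HBaseAdmin(conf);
      HTableDescriptor desc = new HTableDescriptor("user_metrics");
      desc.addFamily(new HColumnDescriptor("c"));
      desc.addFamily(new HColumnDescriptor("s"));
      admin.createTable(desc);

      // Row key as described: user id + metric name prefix, day timestamp suffix.
      byte[] row = Bytes.toBytes("user123#mentions#20120110");
      HTable table = new HTable(conf, "user_metrics");

      // Bump the daily counter...
      table.incrementColumnValue(row, Bytes.toBytes("c"), Bytes.toBytes("count"), 1L);

      // ...and record who mentioned the user, one column per source id
      // (empty value; the qualifier itself carries the source id).
      Put put = new Put(row);
      put.add(Bytes.toBytes("s"), Bytes.toBytes("tweeter456"), new byte[0]);
      table.put(put);

      table.close();
    }
  }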

St.Ack
