Would it make sense to convert your fat table into a tall table by making
the source of the metric part of the row key (maybe as the suffix)?
To read all the entries for a particular user, metric and time, you would
then do a prefix match on your key.
Also, all the keys for a particular user, metric and time sort next to each
other, so they will fall in the same region or in adjacent regions.
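
To make the idea concrete, here is a rough sketch in plain Python (no HBase
client involved) of how such a tall key could be laid out and why a prefix
match works on it. Everything in it is an assumption for illustration: the
NUL separator, the yyyymmdd day string, the helper names, and the sorted
list standing in for HBase's sorted row keys. Your real key encoding may
well differ.

```python
import bisect

SEP = b"\x00"  # assumed separator; the key fields must never contain it


def row_key(user_id, metric, day, source_id):
    """Tall-table key: user, metric, day as prefix and source as suffix."""
    return SEP.join(p.encode("utf-8") for p in (user_id, metric, day, source_id))


def key_prefix(user_id, metric, day):
    """Prefix shared by every source of one user/metric/day.

    The trailing separator keeps "user42" from also matching "user420".
    """
    return SEP.join(p.encode("utf-8") for p in (user_id, metric, day)) + SEP


def stop_after(prefix):
    """Smallest key greater than every key that starts with prefix.

    Only valid here because the prefix always ends in SEP (0x00), so the
    last byte can be incremented without overflow.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


def scan(sorted_keys, start_row, stop_row):
    """Emulate a [start_row, stop_row) scan over lexicographically sorted keys."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# Hypothetical in-memory stand-in for the counter table: key -> count.
counters = {}


def increment(user_id, metric, day, source_id, amount=1):
    key = row_key(user_id, metric, day, source_id)
    counters[key] = counters.get(key, 0) + amount


increment("user42", "mentions", "20120110", "alice")
increment("user42", "mentions", "20120110", "bob")
increment("user42", "mentions", "20120110", "alice")
increment("user42", "mentions", "20120111", "carol")
increment("user420", "mentions", "20120110", "dave")

keys = sorted(counters)

# Everyone who mentioned user42 on one day: a single prefix scan.
day_prefix = key_prefix("user42", "mentions", "20120110")
one_day = scan(keys, day_prefix, stop_after(day_prefix))
one_day_sources = [k.split(SEP)[3].decode("utf-8") for k in one_day]

# Everyone who mentioned user42 over a date range: one contiguous scan,
# relying on fixed-width yyyymmdd strings sorting in date order.
date_range = scan(
    keys,
    key_prefix("user42", "mentions", "20120110"),
    stop_after(key_prefix("user42", "mentions", "20120111")),
)
range_sources = sorted({k.split(SEP)[3].decode("utf-8") for k in date_range})
```

With this layout each row stays small no matter how many people mention a
user on a given day, and the date-range aggregation becomes one scan over
neighbouring rows instead of a read of one very wide row.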



On Tue, Jan 10, 2012 at 11:41 PM, Stack <[email protected]> wrote:

> On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <[email protected]>
> wrote:
> > i was scanning through different questions that people asked in this
> > mailing list regarding choosing the right schema so that map reduce jobs
> > can be run appropriately and hot regions avoided due to sequential
> > accesses.
> > somewhere, i got the impression that it is ok for a row to have
> > millions of columns and/or have large volume of data per region. but
> > then my map reduce job to copy rows failed due to row size being too
> > large (121MB). so now i am confused about whats the recommended way.
> > does it mean that default region size and other configuration
> > parameters need to be tweaked?
> >
>
> Yeah, if you request all of the row, its going to try and give it to
> you even if millions of columns.  You can ask the scan to give you
> back a bounded number of columns per iteration so you read through the
> big row a piece at a time.
>
> > in my use case, my system is receiving lots of metrics for different
> > users and i need to maintain daily counters for each of them. it is at
> > day granularity and not a typical TSD series. my row key has user id,
> > metric name as prefix and day timestamp as suffix. and i keep
> > incrementing the values. the scale issue happens because i store
> > information about the source of the metric too. e.g. i store the id of
> > the person who mentioned my user in a tweet.. I am storing all that
> > information in different columns of the same row. so the pattern here
> > is variable - you can have a million people tweet about someone and
> > just 2 people tweet about someone else on a given day. is it a bad idea
> > to use columns here? i did it this way because it makes it easy for a
> > different process to run later and aggregate information such as list
> > all people who mentioned my user during a given date range.
> >
>
> All in one column family?  Would it make sense to have more than one CF?
>
> St.Ack
>
