Yes, Vinod, you got it right. I was suggesting having the secondary users also be part of the row key, as a suffix.
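For what it's worth, here is a minimal sketch of that composite-key idea in plain Java. The `|` separator and names like `user42`/`tweeter99` are illustrative only (a real HBase key would be built as bytes, e.g. via `Bytes.toBytes`); the point is that with the secondary user as the suffix, all mentions for a (user, metric, day) prefix sit contiguously and can be read with a start/stop-row scan:

```java
import java.nio.charset.StandardCharsets;

public class RowKeySketch {
    // Compose a tall-table row key: primary user, metric, day, then the
    // secondary user appended as the suffix, joined by a '|' separator.
    static String rowKey(String user, String metric, String day, String secondaryUser) {
        return user + "|" + metric + "|" + day + "|" + secondaryUser;
    }

    // Derive the exclusive stop row for a prefix scan by incrementing the
    // last byte of the prefix, so [prefix, stop) covers exactly the prefix.
    static String stopRowFor(String prefix) {
        byte[] b = prefix.getBytes(StandardCharsets.UTF_8);
        b[b.length - 1]++;
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String prefix = "user42|mentions|20120110|";   // user + metric + day
        String key = rowKey("user42", "mentions", "20120110", "tweeter99");
        System.out.println(key.startsWith(prefix));              // falls inside the prefix range
        System.out.println(key.compareTo(stopRowFor(prefix)) < 0); // sorts before the stop row
    }
}
```

This also matches kisalay's point below about adjacency: every key sharing the prefix sorts together, so a range scan touches one contiguous slice of the table.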
On Wed, Jan 11, 2012 at 1:02 AM, T Vinod Gupta <[email protected]> wrote:

> Thanks St.Ack and Kisalay.
> In my case, I have primary users and people who interact with my primary
> users. Let's call them secondary users.
> Kisalay, you are right, and I already have the primary user, metric name,
> and timestamp in my row key. Did you mean having the secondary user also
> be part of the row key as the suffix? If yes, I might consider that.
> St.Ack - yeah, I have all secondary users in the same CF. Even if I add
> new CFs, most of the data is the secondary-user data, so it will all
> stack up in the new CF.
>
> Thanks
>
> On Tue, Jan 10, 2012 at 11:20 AM, kisalay <[email protected]> wrote:
>
> > Would it make sense to convert your fat table into a tall table by
> > keeping the source of the metric as part of the row key (maybe as the
> > suffix?). For accessing all the metrics associated with a particular
> > user, metric, and time, you would be resorting to a prefix match on
> > your key.
> > Also, all the keys for a particular user, metric, and time will fall
> > in adjacent regions.
> >
> > On Tue, Jan 10, 2012 at 11:41 PM, Stack <[email protected]> wrote:
> >
> > > On Tue, Jan 10, 2012 at 3:17 AM, T Vinod Gupta <[email protected]>
> > > wrote:
> > >
> > > > I was scanning through different questions that people asked on
> > > > this mailing list regarding choosing the right schema so that
> > > > MapReduce jobs can be run appropriately and hot regions avoided
> > > > due to sequential accesses.
> > > > Somewhere, I got the impression that it is OK for a row to have
> > > > millions of columns and/or a large volume of data per region. But
> > > > then my MapReduce job to copy rows failed due to the row size
> > > > being too large (121MB). So now I am confused about what the
> > > > recommended way is. Does it mean that the default region size and
> > > > other configuration parameters need to be tweaked?
> > >
> > > Yeah, if you request all of the row, it's going to try and give it
> > > to you even if there are millions of columns. You can ask the scan
> > > to give you back a bounded number of columns per iteration so you
> > > read through the big row a piece at a time.
> > >
> > > > In my use case, my system is receiving lots of metrics for
> > > > different users, and I need to maintain daily counters for each
> > > > of them. It is at day granularity and not a typical TSD series.
> > > > My row key has the user id and metric name as the prefix and the
> > > > day timestamp as the suffix, and I keep incrementing the values.
> > > > The scale issue happens because I store information about the
> > > > source of the metric too, e.g. I store the id of the person who
> > > > mentioned my user in a tweet. I am storing all that information
> > > > in different columns of the same row. So the pattern here is
> > > > variable - you can have a million people tweet about someone and
> > > > just 2 people tweet about someone else on a given day. Is it a
> > > > bad idea to use columns here? I did it this way because it makes
> > > > it easy for a different process to run later and aggregate
> > > > information, such as listing all people who mentioned my user
> > > > during a given date range.
> > >
> > > All in one column family? Would it make sense to have more than
> > > one CF?
> > >
> > > St.Ack
