If there are (many) other tables besides table A, the data may not be evenly
distributed across the cluster.

See https://issues.apache.org/jira/browse/HBASE-3373

On Sat, Jan 1, 2011 at 2:46 AM, Eric <[email protected]> wrote:

> I have little experience with HBase so far, but my feeling says it should
> not matter how many rows you store, and that it's better to save on CPU
> time and bandwidth. HBase will distribute the data evenly over your cluster
> and should be very good at making rows accessible quickly by key, because
> it's cheap to find out where a (key,value) pair lives in the cluster. So
> you should exploit that.
>
>
> 2010/12/31 Michael Russo <[email protected]>
>
> > Hi all,
> >
> > I have a schema design question that I would like to run by the group.
> >
> > Table "A" has a composite row key containing a user identifier and an
> item
> > identifier.
> >
> > I need to query this table by time, to determine the latest X new or
> > updated rows (on a per-user basis).  (This is a "live" query and not part
> > of a MapReduce workflow.)
> >
> > To support this query, I have designed an additional table (Table "B")
> > with the following schema:
> >
> > - row key: composite key, containing a user identifier (fixed width) and a
> > reverse order timestamp (fixed width); both parts stored in their binary
> > form
> > - a single CF
> > - each column qualifier is the actual item identifier; this is used to
> > form part of the row key for table A
> > - cell values are always empty
> >
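> > For illustration only, a rough sketch of a write under this schema (using
> > the plain Java client; the CF name "f" and the variable names are just
> > placeholders, and this is untested):
> >
> >   import org.apache.hadoop.hbase.HBaseConfiguration;
> >   import org.apache.hadoop.hbase.HConstants;
> >   import org.apache.hadoop.hbase.client.HTable;
> >   import org.apache.hadoop.hbase.client.Put;
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   HTable tableB = new HTable(HBaseConfiguration.create(), "B");
> >
> >   byte[] userPart  = Bytes.toBytes(userId);             // fixed-width user id
> >   long   reverseTs = Long.MAX_VALUE - eventTimeMillis;  // newest sorts first
> >   byte[] rowKey    = Bytes.add(userPart, Bytes.toBytes(reverseTs));
> >
> >   Put put = new Put(rowKey);
> >   // column qualifier = item identifier; cell value left empty
> >   put.add(Bytes.toBytes("f"), itemId, HConstants.EMPTY_BYTE_ARRAY);
> >   tableB.put(put);
> >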
> > There are a few additional options re: table B:
> >
> > 1) use the exact timestamp that the event occurred at for the row key.
> >
> > or,
> >
> > 2) set a "max timespan" to control the maximum time delta that can be
> > stored
> > per row; quantize the timestamp of each row to start/end along these "max
> > timespan" boundaries.
> >
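> > The quantization in option 2) is just integer arithmetic on the epoch
> > time; e.g., with a one-day timespan (again a sketch, times in
> > milliseconds, variables as in the snippet above):
> >
> >   long spanMs   = 24L * 60 * 60 * 1000;                // "max timespan" = 1 day
> >   long bucket   = (eventTimeMillis / spanMs) * spanMs; // round down to boundary
> >   long reverse  = Long.MAX_VALUE - bucket;             // reverse-order bucket
> >   byte[] rowKey = Bytes.add(userPart, Bytes.toBytes(reverse));
> >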
> > The number of rows added or modified in table A will vary on a per-user
> > basis, and will range anywhere from 50 to 5000 modifications per user per
> > day.  The size of the user base is measured in millions, so there will be
> > significant write traffic.
> >
> > With design 1), almost every addition/modification will result in a new
> > row.
> >
> > With design 2), the number of rows is dependent upon the timespan; if set
> > to one day, for example, there will be one row per user per day with ~50
> > to 5000 columns.
> >
> > To satisfy the requisite query, given design 1): perform a scan,
> > specifying a start key that is simply the user's identifier, and limit
> > the number of rows to X.
> >
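> > A rough sketch of that query (a fragment; tableB and userPart as in the
> > earlier snippet, x holding the desired count, java.util and client.Scan
> > imports omitted):
> >
> >   Scan scan = new Scan();
> >   scan.setStartRow(userPart);   // this user's rows come back newest first
> >   ResultScanner scanner = tableB.getScanner(scan);
> >   List<byte[]> latestItems = new ArrayList<byte[]>();
> >   for (Result r : scanner) {
> >     // stop once the scan leaves this user's key range
> >     if (!Arrays.equals(Arrays.copyOf(r.getRow(), userPart.length), userPart))
> >       break;
> >     // every qualifier in the row is an item identifier
> >     latestItems.addAll(r.getFamilyMap(Bytes.toBytes("f")).keySet());
> >     if (latestItems.size() >= x) break;
> >   }
> >   scanner.close();
> >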
> > Given design 2), the query is less straightforward, because we do not
> > know how many columns are contained per row before issuing the scan.
> > However, I can guess roughly how many rows I will need, add a buffer,
> > and limit the scan to this number of rows.  Then, given the response, I
> > can sort the results client-side by timestamp, and limit to X.
> >
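> > For that client-side sort, one option is to key off the cells' HBase
> > timestamps (this assumes the event time is also written as the cell
> > timestamp; if not, it would have to be encoded in the qualifier or the
> > value instead). A sketch:
> >
> >   // gather (timestamp, qualifier) cells from the over-fetched rows
> >   List<KeyValue> cells = new ArrayList<KeyValue>();
> >   for (Result r : scanner) {        // scanner limited to the guessed row count
> >     for (KeyValue kv : r.raw()) cells.add(kv);
> >   }
> >   Collections.sort(cells, new Comparator<KeyValue>() {
> >     public int compare(KeyValue a, KeyValue b) {
> >       return Long.signum(b.getTimestamp() - a.getTimestamp()); // newest first
> >     }
> >   });
> >   for (KeyValue kv : cells.subList(0, Math.min(x, cells.size()))) {
> >     byte[] itemId = kv.getQualifier(); // item id to look up in table A
> >   }
> >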
> > I took a look at the OpenTSDB source code after hearing about it at
> > HadoopWorld, and the implementation is similar to the second design
> > (quantizing event timestamps and storing multiple events per row),
> > although the use case/queries are different.
> >
> > I currently have option 2 mostly implemented, but I wanted to run this by
> > the community to make sure that I was on the right track. Is design 2)
> > the better choice?  The query is not as clean as that in design 1), and
> > it requires extra bandwidth per query and client-side CPU for sorting,
> > but it very significantly reduces the number of required rows (by several
> > orders of magnitude).  Alternatively, is there an even better
> > implementation path that I have not considered?
> >
> > Thanks all for any help and advice.
> >
> > Happy New Year,
> > Michael
> >
>
