If there are (many) other tables besides table A, the data may not be evenly distributed across the cluster.
See https://issues.apache.org/jira/browse/HBASE-3373

On Sat, Jan 1, 2011 at 2:46 AM, Eric <[email protected]> wrote:

> I have little experience with HBase so far, but my feeling says it should
> not matter how many rows you store, and that it's better to save on CPU
> time and bandwidth. HBase will distribute the data evenly over your
> cluster and should be very good at making rows accessible quickly by key,
> because it's cheap to find out where a (key, value) pair lives in the
> cluster. So you should exploit that.
>
>
> 2010/12/31 Michael Russo <[email protected]>
>
> > Hi all,
> >
> > I have a schema design question that I would like to run by the group.
> >
> > Table "A" has a composite row key containing a user identifier and an
> > item identifier.
> >
> > I need to query this table by time, to determine the latest X new or
> > updated rows (on a per-user basis). (This is a "live" query and not
> > part of a MapReduce workflow.)
> >
> > To support this query, I have designed an additional table (Table "B")
> > with the following schema:
> >
> > - row key: a composite key containing a user identifier (fixed width)
> >   and a reverse-order timestamp (fixed width); both parts stored in
> >   their binary form
> > - a single CF
> > - each column qualifier is the actual item identifier; this is used to
> >   form part of the row key for table A
> > - cell values are always empty
> >
> > There are a few additional options re: table B:
> >
> > 1) Use the exact timestamp at which the event occurred for the row key.
> >
> > or,
> >
> > 2) Set a "max timespan" to control the maximum time delta that can be
> >    stored per row; quantize the timestamp of each row to start/end
> >    along these "max timespan" boundaries.
> >
> > The number of rows added or modified in table A will vary on a per-user
> > basis, and will range anywhere from 50 to 5000 modifications per user
> > per day. The size of the user base is measured in millions, so there
> > will be significant write traffic.
> >
> > With design 1), almost every addition/modification will result in a new
> > row.
> >
> > With design 2), the number of rows depends on the timespan; if set to
> > one day, for example, there will be one row per user per day with ~50
> > to 5000 columns.
> >
> > To satisfy the requisite query, given design 1): perform a scan,
> > specifying a start key that is simply the user's identifier, and limit
> > the number of rows to X.
> >
> > Given design 2), the query is less straightforward, because we do not
> > know how many columns are contained per row before issuing the scan.
> > However, I can guess roughly how many rows I will need, add a buffer,
> > and limit the scan to this number of rows. Then, given the response, I
> > can sort the results client-side by timestamp and limit to X.
> >
> > I took a look at the OpenTSDB source code after hearing about it at
> > HadoopWorld, and the implementation is similar to the second design
> > (quantizing event timestamps and storing multiple events per row),
> > although the use case/queries are different.
> >
> > I currently have option 2 mostly implemented, but I wanted to run this
> > by the community to make sure that I was on the right track. Is design
> > 2) the better choice? The query is not as clean as that in design 1),
> > and it requires extra bandwidth per query and client-side CPU for
> > sorting, but it very significantly reduces the number of required rows
> > (by several orders of magnitude). Alternatively, is there an even
> > better implementation path that I have not considered?
> >
> > Thanks all for any help and advice.
> >
> > Happy New Year,
> > Michael
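For readers following the thread, the two key layouts Michael describes can be sketched outside HBase. The sketch below is a minimal illustration, not HBase client code: the function names, the 8-byte widths for user id and timestamp, and the use of millisecond timestamps are all assumptions chosen for the example. It shows why a reverse-order timestamp makes a plain prefix scan return the newest rows first (design 1), and how quantizing to a "max timespan" boundary collapses a day's events onto one row key (design 2).

```python
import struct

# Assumption for illustration: 8-byte unsigned user ids and millisecond
# timestamps. HBase compares row keys as unsigned bytes, so big-endian
# packing preserves numeric order.
LONG_MAX = 2**63 - 1
ONE_DAY_MS = 24 * 60 * 60 * 1000


def exact_key(user_id: int, ts_millis: int) -> bytes:
    """Design 1: fixed-width user id + reverse-order timestamp.

    Subtracting the timestamp from a maximum value makes newer events
    produce byte-smaller keys, so they sort first within a user's prefix.
    """
    return struct.pack(">QQ", user_id, LONG_MAX - ts_millis)


def quantized_key(user_id: int, ts_millis: int, span_ms: int = ONE_DAY_MS) -> bytes:
    """Design 2: round the timestamp down to a "max timespan" boundary,
    so every event in the same span shares one row; the item identifier
    then goes in the column qualifier, OpenTSDB-style."""
    base = ts_millis - (ts_millis % span_ms)
    return struct.pack(">QQ", user_id, LONG_MAX - base)


def latest_x(row_keys, user_id: int, x: int):
    """Simulate the design-1 query: scan from the bare user-id prefix
    and take the first x rows, relying on HBase's byte-order sorting."""
    prefix = struct.pack(">Q", user_id)
    return sorted(k for k in row_keys if k.startswith(prefix))[:x]
```

The same trade-off the thread discusses is visible here: `exact_key` gives one row per event and a trivially bounded scan, while `quantized_key` collapses up to a day of events per user into a single row at the cost of client-side sorting within the row.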
