Great summary! The one other thing I would also note is that your consistency requirements need to be considered. If you want to make atomic list changes you'll need to go wide. On Apr 3, 2012 1:00 PM, "Ian Varley" <[email protected]> wrote:
> Hi Derek, > > If I understand you correctly, you're ultimately trying to store triples > in the form "user, valueid, value", right? E.g., something like: > > "user123, firstname, Paul", > "user234, lastname, Smith" > > (But the usernames are fixed width, and the valueids are fixed width). > > And, your access pattern is along the lines of: "for user X, list the next > 30 values, starting with valueid Y". Is that right? And these values should > be returned sorted by valueid? > > The tl;dr version is that you should probably go with one row per > user+value, and not build a complicated intra-row pagination scheme on your > own unless you're really sure it is needed. > > Your two options mirror a common question people have when designing HBase > schemas: should I go "tall" or "wide"? Your first schema is "tall": each > row represents one value for one user, and so there are many rows in the > table for each user; the row key is user + valueid, and there would be > (presumably) a single column qualifier that means "the value". This is > great if you want to scan over rows in sorted order by row key (thus my > question above, about whether these ids are sorted correctly). You can > start a scan at any user+valueid, read the next 30, and be done. What > you're giving up is the ability to have transactional guarantees around all > the rows for one user, but it doesn't sound like you need that. Doing it > this way is generally recommended (see here< > http://hbase.apache.org/book.html#schema.smackdown>). > > Your second option is "wide": you store a bunch of values in one row, > using different qualifiers (where the qualifier is the valueid). The simple > way to do that would be to just store ALL values for one user in a single > row. I'm guessing you jumped to the "paginated" version because you're > assuming that storing millions of columns in a single row would be bad for > performance, which may or may not be true; as long as you're not trying to > do too much in a single request, or do things like scanning over and > returning all of the cells in the row, it shouldn't be fundamentally worse. > The client has methods that allow you to get specific slices of columns. > > Note that neither case fundamentally uses more disk space than the other; > you're just "shifting" part of the identifying information for a value > either to the left (into the row key, in option one) or to the right (into > the column qualifiers in option 2). Under the covers, every key/value still > stores the whole row key, and column family name. (If this is a bit > confusing, take an hour and watch Lars George's excellent video about > understanding HBase schema design: > http://www.youtube.com/watch?v=_HLoH_PgrLk). > > A manually paginated version has lots more complexities, as you note, like > having to keep track of how many things are in each page, re-shuffling if > new values are inserted, etc. That seems significantly more complex. It > might have some slight speed advantages (or disadvantages!) at extremely > high throughput, and the only way to really know that would be to try it > out. If you don't have time to build it both ways and compare, my advice > would be to start with the simplest option (one row per user+value). Start > simple & iterate! :) > > (Let me know if I've misunderstood your situation.) > > Ian > > On Apr 2, 2012, at 3:10 PM, Derek Wollenstein wrote: > > We're looking at how to store a large amount of (per-user) list data in > hbase, and we were trying to figure out what kind of access pattern made > the most sense. One option is store the majority of the data in a key, so > we could have something like > <FixedWidthUserName><FixedWidthValueId1>:"" (no value) > <FixedWidthUserName><FixedWidthValueId2>:"" (no value) > <FixedWidthUserName><FixedWidthValueId3>:"" (no value) > > The other option we hade was to do this entirely using > > <FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>... > > <FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>... > > where each row would contain multiple values. > So in one case reading the first thirty values would be > scan { STARTROW => 'FixedWidthUsername' LIMIT => 30} > And in the second case it would be > get 'FixedWidthUserName\x00\x00\x00\x00' > > The general usage pattern would be to read only the first 30 values of > these lists, with infrequent access reading deeper into the lists. Some > users would have <= 30 total values in these lists, and some users would > have millions (i.e. power-law distribution) > > The single-value format seems like it would take up more space on hbase, > but would offer some improved retrieval / pagination flexibility. Would > there be any significant performance advantages to be able to paginate via > gets vs paginating with scans? > My initial understanding was that doing a scan should be faster if our > paging size is unknown (and caching is set appropriately), but that gets > should be faster if we'll always need the same page size. I've ended up > hearing different people tell me opposite things about performance. I > assume the page sizes would be relatively consistent, so for most use cases > we could guarantee that we only wanted one page of data in the > fixed-page-length case. I would also assume that we would have infrequent > updates, but may have inserts into the middle of these lists (meaning we'd > need to update all subsequent rows). > > Thanks for help / suggestions / followup questions > > --Derek > >
