You may be missing the point. The primary reason for the salt prefix pattern is to avoid hotspotting when inserting time series data AND at the same time provide a way to perform range scans. http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
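
To make that concrete, here is a rough sketch in Python (not HBase client code) of what I mean. Everything in it is an assumption for illustration: a fixed bucket count, a one-byte prefix derived from a CRC32 of the original key, and a sorted in-memory list standing in for the table. `NUM_BUCKETS`, `salted_key` and `range_scan` are made-up names.

```python
import bisect
import heapq
import zlib

NUM_BUCKETS = 4  # assumed bucket count; kept fixed for the life of the table


def salted_key(key: bytes) -> bytes:
    """Prefix the key with a one-byte bucket derived from the key itself."""
    bucket = zlib.crc32(key) % NUM_BUCKETS
    return bytes([bucket]) + key


def range_scan(sorted_rows, start: bytes, stop: bytes):
    """Emulate one scanner per bucket, then merge back into key order."""
    def scan_bucket(bucket):
        prefix = bytes([bucket])
        lo = bisect.bisect_left(sorted_rows, prefix + start)
        hi = bisect.bisect_left(sorted_rows, prefix + stop)
        # Strip the salt so callers see the original keys.
        return [row[1:] for row in sorted_rows[lo:hi]]

    return list(heapq.merge(*(scan_bucket(b) for b in range(NUM_BUCKETS))))


# A sorted list stands in for the table, since rows are kept in byte order.
timestamps = [b"20140518T%06d" % i for i in range(100)]
table = sorted(salted_key(ts) for ts in timestamps)

# Half-open range [start, stop) over the original, unsalted keys.
result = range_scan(table, b"20140518T000010", b"20140518T000020")
```

Sequential timestamps land in different buckets, and because the prefix is computed from the key rather than drawn at random, a single row can still be located by recomputing `salted_key(key)`.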

> NOTE: Many people worry about hot spotting when they really don’t have to do
> so. Hot spotting that occurs on the initial load of a table is OK. It’s when
> you have a sequential row key that you run into problems with hot spotting
> and regions being only half filled.

The data being inserted will be a constant stream of time-ordered data, so yes, hotspotting will be an issue.

> Adding a random value to give you a bit of randomness now means that you
> can’t do a range scan..

That's not accurate. To perform a range scan you would just need to open N scanners, where N is the number of buckets (random prefixes) used.

> Don’t take the modulo, just truncate to the first byte. Taking the modulo is
> again a dumb idea, but not as dumb as using a salt.

Well, the only reason why I would think using a salt would be beneficial is to limit the number of scanners when performing a range scan. See the comment above. And yes, performing a range scan will be our primary read pattern.

On Sun, May 18, 2014 at 2:36 AM, Michael Segel <[email protected]> wrote:
> I think I should dust off my schema design talk… clearly the talks given by
> some of the vendors don’t really explain things …
> (Hmmm. Strata London?)
>
> See my reply below…. Note I used SHA-1. MD-5 should also give you roughly the
> same results.
>
> On May 18, 2014, at 4:28 AM, Software Dev <[email protected]> wrote:
>
>> I recently came across the pattern of adding a salting prefix to the
>> row keys to prevent hotspotting. Still trying to wrap my head around
>> it and I have a few questions.
>>
>
> If you add a salt, you’re prepending a random number to a row in order to
> avoid hot spotting. It amazes me that Sematext never went back and either
> removed the blog or fixed it and now the bad idea is getting propagated.
> Adding a random value to give you a bit of randomness now means that you
> can’t do a range scan, or fetch the specific row with a single get() so
> you’re going to end up boiling the ocean to get your data. You’re better off
> using hive/spark/shark than hbase.
>
> As James tries to point out, you take the hash of the row so that you can
> easily retrieve the value. But rather than prepend a 160 bit hash, you can
> easily achieve the same thing by just truncating the hash to the first byte
> in order to get enough randomness to avoid hot spotting. Of course, the one
> question you should ask is why don’t you just take the hash as the row key
> and then have a 160 bit row key (40 bytes in length)? Then store the actual
> key as a column in the table.
>
> And then there’s a bigger question… why are you worried about hot spotting?
> Are you adding rows where the row key is sequential? Or are you worried
> about when you first start loading rows, that you are hot spotting, but the
> underlying row key is random enough that once the first set of rows are
> added, HBase splitting regions will be enough?
>
>> - Is there ever a reason to salt to more buckets than there are region
>> servers? The only reason why I think that may be beneficial is to
>> anticipate future growth???
>>
> Doesn’t matter.
> Think about how HBase splits regions.
> Don’t take the modulo, just truncate to the first byte. Taking the modulo is
> again a dumb idea, but not as dumb as using a salt.
>
> Keep in mind that the first byte of the hash is going to be 0-f in a
> character representation. (4 bits of the 160bit key) So you have 16 values
> to start with.
> That should be enough.
>
>> - Is it beneficial to always hash against a known number of buckets
>> (ie never change the size) that way for any individual row key you can
>> always determine the prefix?
>>
> Your question doesn’t make sense.
>
>> - Are there any good use cases of this pattern out in the wild?
>>
> Yup.
> Deduping data sets.
>
>> Thanks
>>
> NOTE: Many people worry about hot spotting when they really don’t have to do
> so. Hot spotting that occurs on the initial load of a table is OK. It’s when
> you have a sequential row key that you run into problems with hot spotting
> and regions being only half filled.
>
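
For comparison, the hash-prefix variant suggested in the quoted message could look roughly like the sketch below. It is only my reading of that suggestion, assuming the prefix is the first hex character of the SHA-1 digest of the row key (so 16 possible values); `hash_prefixed_key` is a made-up name and no HBase API is involved.

```python
import hashlib


def hash_prefixed_key(key: bytes) -> bytes:
    """Prefix the key with the first hex character of its SHA-1 digest."""
    digest_hex = hashlib.sha1(key).hexdigest()
    # One hex character covers 4 bits of the digest: 16 possible prefixes.
    return digest_hex[:1].encode("ascii") + key


# Hypothetical row keys, used only to see which prefixes show up.
keys = [b"row-%d" % i for i in range(1000)]
prefixes = {hash_prefixed_key(k)[:1] for k in keys}
```

As with a salt computed from the key, the prefix can be recomputed for a single-row lookup, but a range scan over the original keys would still need one scanner per prefix value.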
