@Software Dev - it might be feasible to implement a Thrift client that speaks Phoenix JDBC. I believe this is similar to what Hive has done.

Thanks,
James
On Sun, May 18, 2014 at 1:19 PM, Mike Axiak <[email protected]> wrote:
> In our measurements, scanning is improved by performing n range
> scans rather than 1 (since you are effectively striping the
> reads). This is even better when you don't necessarily care about the
> order of every row, but want every row in a given range (then you can
> just take whatever row is available from a buffer in the client).
>
> -Mike
>
> On Sun, May 18, 2014 at 1:07 PM, Michael Segel
> <[email protected]> wrote:
> > No, you're missing the point. It's not a good idea or design.
> >
> > Is your data mutable or static?
> >
> > To your point: every time you want to do a simple get() you have to open
> > up n get() statements. On your range scans you will have to do n range
> > scans, then join and sort the result sets. The fact that each result set is
> > in sort order will help a little, but it's still not that clean.
> >
> > On May 18, 2014, at 4:58 PM, Software Dev <[email protected]>
> > wrote:
> >
> >> You may be missing the point. The primary reason for the salt prefix
> >> pattern is to avoid hotspotting when inserting time series data AND at
> >> the same time provide a way to perform range scans.
> >> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >>
> >>> NOTE: Many people worry about hot spotting when they really don't
> >>> have to do so. Hot spotting that occurs on the initial load of a table is
> >>> OK. It's when you have a sequential row key that you run into problems
> >>> with hot spotting and regions being only half filled.
> >>
> >> The data being inserted will be a constant stream of time-ordered data,
> >> so yes, hotspotting will be an issue.
> >>
> >>> Adding a random value to give you a bit of randomness now means that
> >>> you can't do a range scan.
> >>
> >> That's not accurate.
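[The trade-off being debated above can be illustrated outside of HBase. With a truly random salt the client cannot recompute a row's prefix, so a point read has to fan out into one get() per bucket; with a prefix derived deterministically from the key itself, a single get() suffices. A minimal sketch in plain Python — a dict stands in for the HBase table, and the function names are illustrative, not HBase client API:]

```python
import hashlib

NUM_BUCKETS = 16

def hash_prefix(key: str) -> str:
    """Deterministic one-character prefix: the first hex digit of the key's SHA-1."""
    return hashlib.sha1(key.encode()).hexdigest()[0]

def salted_key(key: str) -> str:
    # e.g. "a-event:20140518" -- the prefix is recomputable from the key alone.
    return hash_prefix(key) + "-" + key

# With a deterministic prefix, a point read is ONE lookup:
table = {salted_key("event:20140518"): "value"}
assert table[salted_key("event:20140518")] == "value"

# With a truly RANDOM salt, the client can't know which bucket was chosen
# at write time, so the same read becomes NUM_BUCKETS speculative lookups:
candidates = [f"{b:x}-event:20140518" for b in range(NUM_BUCKETS)]
```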
> >> To perform a range scan you would just need to
> >> open up N scanners, where N is the number of buckets/random prefixes
> >> used.
> >>
> >>> Don't take the modulo, just truncate to the first byte. Taking the
> >>> modulo is again a dumb idea, but not as dumb as using a salt.
> >>
> >> Well, the only reason I would think using a salt would be
> >> beneficial is to limit the number of scanners when performing a range
> >> scan. See the above comment. And yes, performing a range scan will be our
> >> primary read pattern.
> >>
> >> On Sun, May 18, 2014 at 2:36 AM, Michael Segel
> >> <[email protected]> wrote:
> >>> I think I should dust off my schema design talk… clearly the talks
> >>> given by some of the vendors don't really explain things …
> >>> (Hmmm. Strata London?)
> >>>
> >>> See my reply below…. Note I used SHA-1. MD5 should also give you
> >>> roughly the same results.
> >>>
> >>> On May 18, 2014, at 4:28 AM, Software Dev <[email protected]>
> >>> wrote:
> >>>
> >>>> I recently came across the pattern of adding a salting prefix to the
> >>>> row keys to prevent hotspotting. I'm still trying to wrap my head around
> >>>> it and I have a few questions.
> >>>>
> >>> If you add a salt, you're prepending a random number to a row in order
> >>> to avoid hot spotting. It amazes me that Sematext never went back and
> >>> either removed the blog or fixed it, and now the bad idea is getting
> >>> propagated. Adding a random value to give you a bit of randomness now
> >>> means that you can't do a range scan, or fetch a specific row with a
> >>> single get(), so you're going to end up boiling the ocean to get your data.
> >>> You're better off using Hive/Spark/Shark than HBase.
> >>>
> >>> As James tries to point out, you take the hash of the row key so that you
> >>> can easily retrieve the value. But rather than prepend a 160-bit hash, you
> >>> can achieve the same thing by just truncating the hash to the first
> >>> byte in order to get enough randomness to avoid hot spotting.
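[The N-scanner fan-out described above — one scanner per bucket prefix, merged back into key order — can be sketched with a sorted in-memory list standing in for an HBase table. Names like `scan_bucket` and `range_scan` are illustrative, not HBase API; in a real client each bucket scan would be an HBase Scan with the prefix baked into its start/stop row:]

```python
import hashlib
import heapq

NUM_BUCKETS = 16

def prefix(key: str) -> str:
    # First hex digit of SHA-1 -> one of 16 bucket prefixes.
    return hashlib.sha1(key.encode()).hexdigest()[0]

# Toy "table": (salted key, value) pairs kept in sorted order like HBase rows.
table = sorted((prefix(k) + "-" + k, k) for k in (f"row{i:03d}" for i in range(50)))

def scan_bucket(bucket: str, start: str, stop: str):
    """One logical scanner: rows in [bucket-start, bucket-stop), unsalted on the way out."""
    lo, hi = f"{bucket}-{start}", f"{bucket}-{stop}"
    return [(salted.split("-", 1)[1], v) for salted, v in table if lo <= salted < hi]

def range_scan(start: str, stop: str):
    # Fan out one scanner per bucket; each result set is already sorted,
    # so a heap merge restores global key order cheaply.
    per_bucket = (scan_bucket(f"{b:x}", start, stop) for b in range(NUM_BUCKETS))
    return list(heapq.merge(*per_bucket))

rows = range_scan("row010", "row020")
assert [k for k, _ in rows] == [f"row{i:03d}" for i in range(10, 20)]
```

[This also illustrates Michael's cost argument: the client pays for N scanners plus a merge, even though each per-bucket result arriving pre-sorted keeps the merge cheap.]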
> >>> Of course,
> >>> the one question you should ask is why don't you just take the hash as the
> >>> row key and then have a 160-bit row key (40 bytes as a hex string)? Then store
> >>> the actual key as a column in the table.
> >>>
> >>> And then there's a bigger question… why are you worried about hot
> >>> spotting? Are you adding rows where the row key is sequential? Or are you
> >>> worried that when you first start loading rows you are hot spotting,
> >>> but the underlying row key is random enough that once the first set of rows
> >>> is added, HBase splitting regions will be enough?
> >>>
> >>>> - Is there ever a reason to salt to more buckets than there are region
> >>>> servers? The only reason why I think that may be beneficial is to
> >>>> anticipate future growth???
> >>>>
> >>> Doesn't matter.
> >>> Think about how HBase splits regions.
> >>> Don't take the modulo, just truncate to the first byte. Taking the
> >>> modulo is again a dumb idea, but not as dumb as using a salt.
> >>>
> >>> Keep in mind that the first byte of the hash is going to be 0-f in a
> >>> character representation (4 bits of the 160-bit key), so you have 16 values
> >>> to start with.
> >>> That should be enough.
> >>>
> >>>> - Is it beneficial to always hash against a known number of buckets
> >>>> (i.e. never change the size) so that for any individual row key you can
> >>>> always determine the prefix?
> >>>>
> >>> Your question doesn't make sense.
> >>>
> >>>> - Are there any good use cases of this pattern out in the wild?
> >>>>
> >>> Yup.
> >>> Deduping data sets.
> >>>
> >>>> Thanks
> >>>>
> >>> NOTE: Many people worry about hot spotting when they really don't
> >>> have to do so. Hot spotting that occurs on the initial load of a table is
> >>> OK. It's when you have a sequential row key that you run into problems with
> >>> hot spotting and regions being only half filled.
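[The truncation Michael recommends — keep only the first hex character of the digest — can be checked directly: it gives 16 possible prefixes (4 bits), spreads even fully sequential keys across all of them, and stays a pure function of the key, so no extra bookkeeping is needed to find a row later. A small sketch with hypothetical timestamp-style keys:]

```python
import hashlib
from collections import Counter

def bucket(key: str) -> str:
    # First hex character of the SHA-1 digest: 4 of the 160 bits,
    # so exactly 16 possible bucket prefixes ("0".."f").
    return hashlib.sha1(key.encode()).hexdigest()[0]

# Sequential, hotspot-prone keys end up spread across all 16 prefixes:
counts = Counter(bucket(f"2014-05-18T13:{i:04d}") for i in range(1000))

assert set(counts) <= set("0123456789abcdef")
# The prefix is deterministic, so locating a row later needs no extra state:
assert bucket("2014-05-18T13:0000") == bucket("2014-05-18T13:0000")
```

[With 1000 keys over 16 buckets the spread is close to uniform, which is why Michael argues a single truncated byte is "enough" — and why a modulo over the full digest buys nothing extra.]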
