The downside of hashing is not that it's unpredictable, but that it's non-reversible (which is why you need to append the original key). Reversing should be fine; just make sure that you perform a byte-order reversal so that you get a uniform distribution.
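To make the difference concrete, here is a rough Python sketch of both key schemes. It assumes the id fits in an unsigned 64-bit integer and uses MD5 as the 16-byte hash (the thread mentions a 16-byte prefix but never names the hash function). The function names are made up for illustration and are not part of any HBase API.

```python
import hashlib
import struct


def reversed_key(seq_id: int) -> bytes:
    """Byte-reverse a 64-bit id so the fastest-changing byte leads the key."""
    return struct.pack(">Q", seq_id)[::-1]


def original_id(row_key: bytes) -> int:
    """Recover the id from a reversed key; the reversal is its own inverse."""
    return struct.unpack(">Q", row_key[::-1])[0]


def hashed_key(seq_id: int) -> bytes:
    """16-byte MD5 prefix plus the original key (assumed hash, see lead-in).

    The hash cannot be inverted, so the original key has to be appended.
    """
    raw = struct.pack(">Q", seq_id)
    return hashlib.md5(raw).digest() + raw


for seq_id in range(510911, 510915):
    # Show the reversed key in hex and the total length of the hashed key.
    print(seq_id, reversed_key(seq_id).hex(), len(hashed_key(seq_id)))
```

With byte reversal, consecutive ids start with consecutive low-order bytes, so writes cycle through 256 leading byte values without growing the key. The hashed variant costs 16 extra bytes per row key.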

On 11/22/11 7:47 PM, "Mark" <[email protected]> wrote:

>Ok so this would be "short scans"?
>
>In my use case this would be unnecessary so I think Im going to run with
>the reversed id technique. I'm actually surprised I've never heard of
>anyone using this over the non predictable hashing.
>
>On 11/22/11 5:35 PM, Sam Seigal wrote:
>> If you are prefixing your keys with predictable hashes, you can do
>> range scans - i.e. create a scanner for each prefix and then merge
>> results at the client. With unpredictable hashes and key reversals,
>> this might not be entirely possible.
>>
>> I remember someone on the mailing list mentioning that Mozilla Socorro
>> uses a similar technique. I haven't had a chance to look at their code
>> yet, but that is something you might want to look at.
>>
>> On Tue, Nov 22, 2011 at 5:11 PM, Mark <[email protected]> wrote:
>>> What to you mean by "short scans"?
>>>
>>> I understand that scans will not be possible with this method but
>>> neither would they be if I hashed them so it seems like I'm in the
>>> same boat anyway.
>>>
>>> On 11/22/11 5:00 PM, Amandeep Khurana wrote:
>>>> Mark
>>>>
>>>> Key designs depend on expected access patterns and use cases. From a
>>>> theoretical stand point, what you are saying will work to distribute
>>>> writes but if you want to access a small range, you'll need to fan out
>>>> your reads and can't leverage short scans.
>>>>
>>>> Amandeep
>>>>
>>>> On Nov 22, 2011, at 4:55 PM, Mark <[email protected]> wrote:
>>>>
>>>>> I just thought of something.
>>>>>
>>>>> In cases where the id is sequential couldn't one simply reverse the
>>>>> id to get more of a uniform distribution?
>>>>>
>>>>> 510911 => 119015
>>>>> 510912 => 219015
>>>>> 510913 => 319015
>>>>> 510914 => 419015
>>>>>
>>>>> That seems like a reasonable alternative that doesn't require
>>>>> prefixing each row key with an extra 16 bytes. Am I wrong in thinking
>>>>> this could work?
>>>>>
>>>>>
>>>>> On 11/22/11 12:46 PM, Nicolas Spiegelberg wrote:
>>>>>> If you increase the region size to 2GB, then all regions (current
>>>>>> and new) will avoid a split until their aggregate StoreFile size
>>>>>> reaches that limit. Reorganizing the regions for a uniform growth
>>>>>> pattern is really a schema design problem. There is the capability
>>>>>> to merge two adjacent regions if you know that your data growth
>>>>>> pattern is non-uniform. StumbleUpon & other companies have more
>>>>>> experience with those utilities than I do.
>>>>>>
>>>>>> Note: With the introduction of HFileV2 in 0.92, you'll definitely
>>>>>> want to lean towards increasing the region size. HFile scalability
>>>>>> code is more mature/stable than the region splitting code. Plus,
>>>>>> automatic region splitting is harder to optimize & debug when
>>>>>> failures occur.
>>>>>>
>>>>>> On 11/22/11 12:20 PM, "Srikanth P. Shreenivas"
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Nicolas for the clarification. I had a follow-up query.
>>>>>>>
>>>>>>> What will happen if we increased the region size, say from current
>>>>>>> value of 256 MB to a new value of 2GB?
>>>>>>> Will existing regions continue to use only 256 MB space?
>>>>>>>
>>>>>>> Is there a way to reorganize the regions so that each regions
>>>>>>> grows to 2GB size?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Srikanth
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Nicolas Spiegelberg [mailto:[email protected]]
>>>>>>> Sent: Tuesday, November 22, 2011 10:59 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: Re: Region Splits
>>>>>>>
>>>>>>> No. The purpose of major compactions is to merge & dedupe within a
>>>>>>> region boundary.
>>>>>>> Compactions will not alter region boundaries, except in the case
>>>>>>> of splits where a compaction is necessary to filter out any Rows
>>>>>>> from the parent region that are no longer applicable to the
>>>>>>> daughter region.
>>>>>>>
>>>>>>> On 11/22/11 9:04 AM, "Srikanth P. Shreenivas"
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Will major compactions take care of merging "older" regions or
>>>>>>>> adding more key/values to them as number of regions grow?
>>>>>>>>
>>>>>>>> Regard,
>>>>>>>> Srikanth
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Amandeep Khurana [mailto:[email protected]]
>>>>>>>> Sent: Monday, November 21, 2011 7:25 AM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: Region Splits
>>>>>>>>
>>>>>>>> Mark,
>>>>>>>>
>>>>>>>> Yes, your understanding is correct. If your keys are sequential
>>>>>>>> (timestamps etc), you will always be writing to the end of the
>>>>>>>> table and "older" regions will not get any writes. This is one of
>>>>>>>> the arguments against using sequential keys.
>>>>>>>>
>>>>>>>> -ak
>>>>>>>>
>>>>>>>> On Sun, Nov 20, 2011 at 11:33 AM, Mark <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Say we have a use case that has sequential row keys and we have
>>>>>>>>> rows 0-100. Let's assume that 100 rows = the split size. Now
>>>>>>>>> when there is a split it will split at the halfway mark so there
>>>>>>>>> will be two regions as follows:
>>>>>>>>>
>>>>>>>>> Region1 [START-49]
>>>>>>>>> Region2 [50-END]
>>>>>>>>>
>>>>>>>>> So now at this point all inserts will be writing to Region2 only
>>>>>>>>> correct?
>>>>>>>>> Now at some point Region2 will need to split and it will look
>>>>>>>>> like the following before the split:
>>>>>>>>>
>>>>>>>>> Region1 [START-49]
>>>>>>>>> Region2 [50-150]
>>>>>>>>>
>>>>>>>>> After the split it will look like:
>>>>>>>>>
>>>>>>>>> Region1 [START-49]
>>>>>>>>> Region2 [50-100]
>>>>>>>>> Region3 [150-END]
>>>>>>>>>
>>>>>>>>> And this pattern will continue correct? My question is when
>>>>>>>>> there is a use case that has sequential keys how would any of
>>>>>>>>> the older regions every receive anymore writes? It seems like
>>>>>>>>> they would always be stuck at MaxRegionSize/2. Can someone
>>>>>>>>> please confirm or clarify this issue?
>>>>>>>>>
>>>>>>>>> Thanks
