Hey,

I don't understand the 'random scan' question... if you want to scan
a random key, just scan! For example:

byte[] random_key = generateRandomKeyUsingRandomNumberGenerator();
Scan s = new Scan(random_key);

But you must mean something else... perhaps you could enlighten me?

-ryan

On Sun, Jan 30, 2011 at 10:06 PM, Lars George <[email protected]> wrote:
> Hi Pete,
>
> Look into the Mozilla Socorro project
> (http://code.google.com/p/socorro/) for how to "salt" the keys to get
> better load balancing across sequential keys. The principle is to add
> a salt, in this case a number reflecting the number of servers
> available (some multiple of that to allow for growth) and then prefix
> the sequential key with it so that writes are spread across all
> servers. For example "<salt>-<yyyyMMddhhmm>". When reading you need to
> open N scanners, where N is the number of distinct salt values, and scan
> each subset with them, eventually combining the results in client
> code. Assuming you want to scan all values in January and you have
> salt values 0, 1, 2, 3, 4, and 5, you have a scanner for "0-201101010000"
> to "0-201102010000", then another for "1-201101010000" to
> "1-201102010000" and so on. Then do the scans (multithreaded for
> example) and combine the results client side. The Socorro code shows
> one way to implement this.
>
> Lars
>
>
> On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <[email protected]> wrote:
>> Sorry, my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'.
>> My start row is '<date>|<id>| ' (with a space, ascii 32) and end row is
>> '<date>|<id>|~' (tilde character) and this has worked for my data set.
>> Unfortunately the key is not distributed very well. That is why I was
>> wondering how you do a scan (using start and end row) with a random row key.
>>
>> Thanks
>>
>> -Pete
>>
>> PS. I use <date>|<id> since the id is variable length and this was my first
>> attempt. I now have a month's worth of data and for my next phase I will
>> probably reverse the <date> <id> order since it will work either way.
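
For anyone following along, here is a minimal sketch of the salting scheme Lars describes: it builds one start/stop row pair per salt value for the "<salt>-<yyyyMMddhhmm>" layout. The class and method names (SaltedKeys, saltFor, rowKey, scanRanges) are hypothetical, and choosing the salt from a hash of a record id is an assumption; the thread only says the number of salts should track the number of servers. It uses plain strings and no HBase calls, so each pair would still need to be handed to its own Scan.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the "<salt>-<yyyyMMddhhmm>" key layout; all names are hypothetical. */
final class SaltedKeys {
    private SaltedKeys() {}

    /** Assumption: the salt is a stable hash of some record id, in [0, buckets). */
    static int saltFor(String recordId, int buckets) {
        return Math.floorMod(recordId.hashCode(), buckets);
    }

    /** Builds the row key that would be written to the table. */
    static String rowKey(int salt, String timestamp) {
        return salt + "-" + timestamp;
    }

    /** Returns one {startRow, stopRow} pair per salt value for a time range. */
    static List<String[]> scanRanges(int buckets, String start, String stop) {
        List<String[]> ranges = new ArrayList<>();
        for (int salt = 0; salt < buckets; salt++) {
            ranges.add(new String[] { rowKey(salt, start), rowKey(salt, stop) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // All of January with six salt values, as in the example above.
        for (String[] range : scanRanges(6, "201101010000", "201102010000")) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

Each pair feeds one scanner, and the results are then merged client side, as Lars notes.
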
>>
>>
>> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> So variable length keys and lexicographical sorting make it a little
>>> tricky to do Scans and get exactly what you want. This has a lot to
>>> do with the ascii table too, and the numerical values. Let's consult
>>> (http://www.asciitable.com/) while we work this example through:
>>>
>>> Take a separation character of | as your code uses. This is decimal
>>> 124, placing it way above both the lower and upper case letters AND
>>> numbers, that is good.
>>>
>>> Now you have something like this:
>>>
>>> 1234|a_string
>>> 1234|other_string
>>>
>>> now we want to find all rows "belonging to" 1234, so we do a start row
>>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>>> might work, oh wait, here is another key:
>>>
>>> 12345|foo
>>>
>>> ok so '5' < '|' so it should sort like so:
>>> 1234|a_string
>>> 1234|other_string
>>> 12345|foo
>>>
>>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>>> still "larger" than '12345|foo' so that row would be incorrectly
>>> included in the scan results assuming we only want '1234' related
>>> rows.
>>>
>>> Ok, well maybe a better solution is to pick a lower ascii? Well
>>> outside of the control characters, space is the lowest character at
>>> 32, 33 is '!' so perhaps ! would be a better choice. So you could
>>> choose an end double quote as in '1234"' to define your 'stop row'.
>>> Now you would be prohibited from using any character smaller than '33'
>>> in your strings, which is kind of a non-ideal solution.
>>>
>>> This is all pretty clumsy, and doesn't work great in these variable
>>> length separated strings.
>>>
>>> The ultimate solution is to use the PrefixFilter, which is configured as
>>> such:
>>> byte[] start_row = Bytes.toBytes("1234|");
>>> Scan s = new Scan(start_row);
>>> s.setFilter(new PrefixFilter(start_row));
>>> // do scan.
>>>
>>> that way no matter what sortability your separator is, you will get
>>> the answer you want every time.
>>>
>>>
>>>
>>> Another way to do compound keys is to go pure-binary. For example I
>>> want a key that is 2 integers, so I can do this:
>>> int part1 = ... ;
>>> int part2 = ... ;
>>> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>>>
>>> Now you can also search for all rows starting with 'target' like such:
>>> int target = ... ;
>>> // start key is 'target', stop key is 'target+1'
>>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>>
>>> And you get exactly what you want, nothing more or less (all rows
>>> starting with 'target').
>>>
>>> The lexicographic comparison is very tricky sometimes. One quick tip
>>> is that if your numbers (longs, ints) are big endian encoded (all the
>>> utilities in Bytes.java do so), then the lexicographic sorting is
>>> equal to the numeric sorting. Otherwise if you do strings you end up
>>> with:
>>> 1
>>> 11
>>> 2
>>> 3
>>>
>>> and things are 'out of order'... if that is important, you can pad it
>>> with 0s - don't forget to use the proper amount, which is 10 digits for
>>> ints, and 19 for longs. Or consider using binary encoding as above.
>>>
>>> -ryan
>>>
>>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <[email protected]> wrote:
>>>>
>>>> Hi Pete,
>>>>
>>>> You're right. If you use random keys, you will never know the start /
>>>> end keys for scan. What you really want to do is to design a key that
>>>> will distribute well for writes but also has a certain locality for
>>>> scans.
>>>>
>>>> You probably have the ideal key already (ID|Date).
>>>> If you don't make the entire key random but just the ID part, you could
>>>> get a good distribution at write time because writes for different IDs
>>>> will be distributed across the regions, and you could also get good scan
>>>> performance when you scan between certain dates for a specific ID
>>>> because rows for the ID will be stored together in one region.
>>>>
>>>> Thanks,
>>>> Tatsuya
>>>>
>>>>
>>>> 2011/1/29 Peter Haidinyak <[email protected]>:
>>>>>
>>>>> I know they are always sorted but if they are, how do you know which row
>>>>> key belongs to which data? Currently I use a row key of ID|Date so I always
>>>>> know what the startrow and endrow should be. I know I'm missing something
>>>>> really fundamental here. :-(
>>>>>
>>>>> Thanks
>>>>>
>>>>> -Pete
>>>>>
>>>>> -----Original Message-----
>>>>> From: tsuna [mailto:[email protected]]
>>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>>> To: [email protected]
>>>>> Subject: Re: Row Keys
>>>>>
>>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <[email protected]> wrote:
>>>>>>
>>>>>> This is going to seem like a dumb question but it is recommended
>>>>>> that you use a random key to spread the insert/read load among your
>>>>>> region servers. My question is if I am using a scan with startrow and
>>>>>> endrow how does that work with random row keys?
>>>>>
>>>>> The keys are always sorted. So if you generate random keys, you'll
>>>>> get your data back in a random order.
>>>>> What is recommended depends on the specific problem you're trying to
>>>>> solve. But generally, one of the strengths of HBase is that the rows
>>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>>> locality).
>>>>>
>>>>> --
>>>>> Benoit "tsuna" Sigoure
>>>>> Software Engineer @ www.StumbleUpon.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> 河野 達也
>>>> Tatsuya Kawano (Mr.)
>>>> Tokyo, Japan
>>>>
>>>> twitter: http://twitter.com/tatsuya6502
>>>>
>>
>>
>
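
Since the keys are always sorted byte by byte, Ryan's tip about number encodings can be checked without a cluster. The sketch below compares plain decimal strings, zero-padded strings, and big-endian bytes under an unsigned byte-wise comparison. The KeyOrder class and its helpers are hypothetical and use java.nio rather than the HBase Bytes utilities; the match between byte order and numeric order is only claimed here for non-negative values.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Sketch comparing string, zero-padded, and big-endian encodings of ints. */
final class KeyOrder {
    private KeyOrder() {}

    /** Unsigned byte-by-byte comparison; on a shared prefix the shorter key sorts first. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Plain decimal string, e.g. 11 becomes "11". */
    static byte[] asString(int v) {
        return Integer.toString(v).getBytes(StandardCharsets.US_ASCII);
    }

    /** Decimal string zero-padded to 10 digits (non-negative values only). */
    static byte[] asPadded(int v) {
        return String.format(Locale.ROOT, "%010d", v).getBytes(StandardCharsets.US_ASCII);
    }

    /** 4-byte big-endian encoding; ByteBuffer is big-endian by default. */
    static byte[] asBigEndian(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    /** Compound key of two ints, part1 first, both big-endian. */
    static byte[] compound(int part1, int part2) {
        return ByteBuffer.allocate(8).putInt(part1).putInt(part2).array();
    }

    public static void main(String[] args) {
        // Plain strings: "11" sorts before "2".
        System.out.println(compare(asString(11), asString(2)) < 0);
        // Zero-padded strings: 2 sorts before 11.
        System.out.println(compare(asPadded(2), asPadded(11)) < 0);
        // Big-endian bytes: 2 sorts before 11.
        System.out.println(compare(asBigEndian(2), asBigEndian(11)) < 0);
        // A compound key starting with 7 falls in the range [7, 8).
        byte[] key = compound(7, 42);
        System.out.println(compare(asBigEndian(7), key) <= 0
                && compare(key, asBigEndian(8)) < 0);
    }
}
```

The last check mirrors the "start key is 'target', stop key is 'target+1'" scan from Ryan's message.
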
