I want to do a scan of a subset of the data using startrow and endrow. If the keys are random I can't set a startrow/endrow, as far as I know. If I reverse the order of <date>|<id> for the row key I will get a better distribution. Unfortunately a large set of the data comes from just two ids.

-Pete

On Sun, 30 Jan 2011 22:10:07 -0800, Ryan Rawson <[email protected]> wrote:

Hey,

I don't understand the 'random scan' question... if you want to scan a
random key, just scan! For example:

byte [] random_key = generateRandomKeyUsingRandomNumberGenerator();
Scan s = new Scan(random_key);

But you must mean something else... perhaps you could illuminate me?

-ryan

On Sun, Jan 30, 2011 at 10:06 PM, Lars George <[email protected]> wrote:
Hi Pete,

Look into the Mozilla Socorro project
(http://code.google.com/p/socorro/) for how to "salt" the keys to get
better load balancing across sequential keys. The principle is to add
a salt, in this case a number reflecting the number of servers
available (some multiple of that to allow for growth) and then prefix
the sequential key with it so that writes are spread across all
servers. For example "<salt>-<yyyyMMddhhmm>". When reading you need to
open N scanners where N is the number of distinct salt values and scan
each subset with them while eventually combining the result in client
code. Assuming you want to scan all values in January and you have
salts 0, 1, 2, 3, 4, and 5, you have one scanner for "0-201101010000" to
"0-201102010000", then another for "1-201101010000" to
"1-201102010000" and so on. Then do the scans (multithreaded for
example) and combine the results client side. The Socorro code shows
one way to implement this.
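
A minimal sketch of the salting scheme described above, in plain Java with no HBase dependency. The names (SaltedKeys, saltFor, scanRanges) are hypothetical, and picking the salt by hashing the key is an assumption for illustration; the Socorro code may well do it differently.

```java
import java.util.ArrayList;
import java.util.List;

final class SaltedKeys {
    /** Picks a salt bucket for a key. Hashing the key is an assumption; any stable rule works. */
    static int saltFor(String sequentialKey, int buckets) {
        return Math.floorMod(sequentialKey.hashCode(), buckets);
    }

    /** Builds the stored row key in the form "<salt>-<sequentialKey>". */
    static String saltedKey(String sequentialKey, int buckets) {
        return saltFor(sequentialKey, buckets) + "-" + sequentialKey;
    }

    /** One {startRow, stopRow} pair per salt value; each pair would back one scanner. */
    static List<String[]> scanRanges(String start, String stop, int buckets) {
        List<String[]> ranges = new ArrayList<>();
        for (int salt = 0; salt < buckets; salt++) {
            ranges.add(new String[] { salt + "-" + start, salt + "-" + stop });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // All of January with six salt values, as in the example above.
        for (String[] range : scanRanges("201101010000", "201102010000", 6)) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

With more than ten salt values the salt should be written with a fixed width (for example two digits), otherwise "1-" and "10-" prefixes interleave in the sort order.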

Lars


On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <[email protected]> wrote:
Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
'<date>|<id>|~' (tilde character) and this has worked for my data set.
Unfortunately the key is not distributed very well. That is why I was
wondering how you do a scan (using start and end row) with a random row key.

Thanks

-Pete

PS. I use <date>|<id> since the id is variable length and this was my first attempt. I now have a month's worth of data and for my next phase I will
probably reverse the <date> <id> order since it will work either way.


On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <[email protected]> wrote:

Hey,

So variable length keys and lexicographical sorting make it a little
tricky to do Scans and get exactly what you want.  This has a lot to
do with the ascii table too, and the numerical values.  Let's consult
(http://www.asciitable.com/) while we work this example through:

Take a separation character of | as your code uses.  This is decimal
124, placing it way above both the lower and upper case letters AND
numbers, that is good.

Now you have something like this:

1234|a_string
1234|other_string

now we want to find all rows "belonging to" 1234, so we do a start row
of '1234|', but what for the end key? Well, let's try... '1234}', that
might work, oh wait, here is another key:

12345|foo

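ok so '5' < '|' so it should sort like so:
12345|foo
1234|a_string
1234|other_string

hmm well how does our scan range compare? '12345|foo' sorts before the
start row '1234|', so it is not included, and since '}' (125) comes
right after '|' (124) every row between '1234|' and '1234}' starts
with '1234|'. So this stop row works, but only because it was built by
bumping the separator by one; a start row of '1234' with a stop row of
'1235' would incorrectly include '12345|foo'.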
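
The ordering of these keys is easy to check in plain Java. This is only a sketch: a TreeSet of strings stands in for the sorted table (an assumption; for ASCII-only keys, String ordering matches byte ordering), and StopRowDemo and scan are hypothetical names. With the three keys above, '12345|foo' sorts first, the range from '1234|' to '1234}' returns only the two '1234|' rows, and the range from '1234' to '1235' returns all three.

```java
import java.util.SortedSet;
import java.util.TreeSet;

final class StopRowDemo {
    /** Returns keys in [startRow, stopRow): inclusive start, exclusive stop, like a scan. */
    static SortedSet<String> scan(SortedSet<String> rows, String startRow, String stopRow) {
        return rows.subSet(startRow, stopRow);
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>();
        rows.add("1234|a_string");
        rows.add("1234|other_string");
        rows.add("12345|foo");

        // Sorted order of the stored keys.
        System.out.println(rows);
        // Stop row built by bumping the separator '|' to '}'.
        System.out.println(scan(rows, "1234|", "1234}"));
        // Start row without the separator also picks up '12345|foo'.
        System.out.println(scan(rows, "1234", "1235"));
    }
}
```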

Ok, well maybe a better solution is to pick a lower ascii separator?
Outside of the control characters, space is the lowest character at
32, and 33 is '!', so perhaps ! would be a better choice.  Then you could
use a double quote (34) as in '1234"' to define your 'stop row'.
Now you would be prohibited from using any character smaller than '33'
in your strings, which is kind of a non ideal solution.

This is all pretty clumsy, and doesn't work great in these variable
length separated strings.

The ultimate solution is to use the PrefixFilter, which is configured as
such:
byte[] start_row = Bytes.toBytes("1234|");
Scan s = new Scan(start_row);
s.setFilter(new PrefixFilter(start_row));
// do scan.

that way no matter what sortability your separator is, you will get
the answer you want every time.
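
If you would rather bound the scan with an explicit stop row as well, a stop row can be derived from the prefix itself. The sketch below is plain Java and is not the PrefixFilter implementation; PrefixRange and stopRowForPrefix are hypothetical names. It drops trailing 0xFF bytes and increments the last remaining byte, and returns an empty array (which HBase treats as "no stop row") when the prefix is all 0xFF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixRange {
    /**
     * Returns the smallest row key that sorts after every key starting with
     * the prefix, or an empty array if no such key exists (prefix is all 0xFF).
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so drop them.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] stop = stopRowForPrefix("1234|".getBytes(StandardCharsets.US_ASCII));
        // '|' (124) becomes '}' (125), so this prints 1234}
        System.out.println(new String(stop, StandardCharsets.US_ASCII));
    }
}
```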



Another way to do compound keys is to go pure-binary.  For example I
want a key that is 2 integers, so I can do this:
int part1 = ... ;
int part2 = ... ;
byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));

Now you can also search for all rows starting with 'target' like such:
int target = ... ;
// start key is 'target', stop key is 'target+1'
Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));

And you get exactly what you want, nothing more or less (all rows
starting with 'target').
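
A self-contained way to see why this works, without HBase on the classpath. This is a sketch under assumptions: ByteBuffer (big-endian by default) stands in for Bytes.toBytes/Bytes.add, the compare method mimics unsigned byte-by-byte row ordering, and BinaryKeys is a hypothetical name.

```java
import java.nio.ByteBuffer;

final class BinaryKeys {
    /** Big-endian 8-byte key built from two ints. */
    static byte[] rowKey(int part1, int part2) {
        return ByteBuffer.allocate(8).putInt(part1).putInt(part2).array();
    }

    /** Big-endian 4-byte encoding of one int, usable as a start or stop row. */
    static byte[] intBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if key falls in [startRow, stopRow). */
    static boolean inRange(byte[] key, byte[] startRow, byte[] stopRow) {
        return compare(key, startRow) >= 0 && compare(key, stopRow) < 0;
    }

    public static void main(String[] args) {
        int target = 1234;
        byte[] start = intBytes(target);
        byte[] stop = intBytes(target + 1);
        System.out.println(inRange(rowKey(1234, 7), start, stop));  // key for the target
        System.out.println(inRange(rowKey(12345, 7), start, stop)); // key for another first part
    }
}
```

Two caveats with the 'target+1' trick: it overflows when target is Integer.MAX_VALUE, and negative ints have the high bit set, so in unsigned byte order they sort after all the positive ones.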

The lexicographic comparison is very tricky sometimes. One quick tip
is that if your numbers (longs, ints) are big endian encoded (all the
utilities in Bytes.java do so), then the lexicographic sorting is
equal to the numeric sorting, at least for non-negative values
(negative numbers have the high bit set and sort after the positive
ones).  Otherwise if you do strings you end up
with:
1
11
2
3

and things are 'out of order'... if that is important, you can pad it
with 0s - don't forget to use the proper amount, which is 10 digits for
ints, and 19 for longs.  Or consider using binary encoding as above.
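
The padding can be sketched like this in plain Java. PaddedKeys, padInt and padLong are hypothetical names, and the sketch only handles non-negative values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PaddedKeys {
    /** Left-pads a non-negative int to 10 digits, the width of Integer.MAX_VALUE. */
    static String padInt(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    /** Left-pads a non-negative long to 19 digits, the width of Long.MAX_VALUE. */
    static String padLong(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%019d", value);
    }

    public static void main(String[] args) {
        // Unpadded strings sort lexicographically, not numerically.
        List<String> plain = new ArrayList<>(Arrays.asList("3", "2", "11", "1"));
        Collections.sort(plain);
        System.out.println(plain);

        // Padded strings sort in numeric order.
        List<String> padded = new ArrayList<>();
        for (int n : new int[] { 3, 2, 11, 1 }) {
            padded.add(padInt(n));
        }
        Collections.sort(padded);
        System.out.println(padded);
    }
}
```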

-ryan

On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <[email protected]>
wrote:

Hi Pete,

You're right. If you use random keys, you will never know the start /
end keys for scan. What you really want to do is to design a key that
will distribute well for writes but also has some locality for
scans.

You probably have the ideal key already (ID|Date). If you don't make
the entire key random but just the ID part, you could get a good
distribution at write time because writes for different IDs will be
distributed across the regions, and you could also get good scan
performance when you scan between certain dates for a specific ID
because rows for that ID will be stored together in one region.
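
One possible reading of "random ID part", sketched in plain Java: prefix each key with a few hex characters of a hash of the ID, so different IDs spread out while all dates for one ID stay contiguous. Everything here is an assumption for illustration: the use of MD5, the 8-character prefix, the key layout, and the names HashedIdKeys, idPrefix, rowKey and scanRange.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class HashedIdKeys {
    /** First 4 bytes of MD5(id) as 8 hex characters (hash and width are illustrative choices). */
    static String idPrefix(String id) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                hex.append(String.format(Locale.ROOT, "%02x", digest[i] & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Row key of the form "<hash prefix>|<id>|<date>", with date as YYYY-MM-DD. */
    static String rowKey(String id, String date) {
        return idPrefix(id) + "|" + id + "|" + date;
    }

    /** {startRow, stopRow} covering one ID from fromDate (inclusive) to toDate (exclusive). */
    static String[] scanRange(String id, String fromDate, String toDate) {
        return new String[] { rowKey(id, fromDate), rowKey(id, toDate) };
    }

    public static void main(String[] args) {
        String[] range = scanRange("user42", "2011-01-01", "2011-02-01");
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```

Because the prefix is computed from the ID, a reader can rebuild the start and stop rows for any ID without a lookup, which is what keeps startrow/endrow scans possible.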

Thanks,
Tatsuya


2011/1/29 Peter Haidinyak <[email protected]>:

I know they are always sorted, but if they are random how do you know which row keys belong to which data? Currently I use a row key of ID|Date so I always know what the startrow and endrow should be. I know I'm missing something
really fundamental here. :-(

Thanks

-Pete

-----Original Message-----
From: tsuna [mailto:[email protected]]
Sent: Friday, January 28, 2011 12:14 PM
To: [email protected]
Subject: Re: Row Keys

On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <[email protected]>
wrote:

This is going to seem like a dumb question but it is recommended that you use a random key to spread the insert/read load among your region servers. My question is if I am using a scan with startrow and endrow how
does that work with random row keys?

The keys are always sorted.  So if you generate random keys, you'll
get your data back in a random order.
What is recommended depends on the specific problem you're trying to
solve. But generally, one of the strengths of HBase is that the rows
are sorted, so sequential scanning is efficient (thanks to data
locality).

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com




--
河野 達也
Tatsuya Kawano (Mr.)
Tokyo, Japan

twitter: http://twitter.com/tatsuya6502




