Hi Lars,
Are you talking about http://code.google.com/p/socorro/ ?
I can find Python scripts, but no JRuby ones...
Aside from the hash function I could reuse, are you saying that range queries
are possible even with hashed keys (randomly distributed)?
(If it is possible with the script, it will also be possible from the HBase
Java client.)
Even with your explanation, I can't figure out how compound keys
(hashedkey+key) can be range-queried.
Tks,
- Eric
On 16/03/2011 11:38, Lars George wrote:
Hi Eric,
Mozilla Socorro uses an approach where they bucket key ranges using
leading hashes to distribute them across servers. When you want to do
scans you need to create N scans, where N is the number of hash
prefixes, and then do a next() on each scanner, putting all KVs into one
sorted list (use the KeyComparator for example) while stripping the
prefix hash first. You can then access the rows in sorted order, where
the first element in the list is the one with the lowest key. Once you
have taken off the first element (the lowest KV key), you call next() on
the scanner it came from and insert the result into the list, which
re-sorts it. You keep taking from the top and therefore always see the
entire range in order, even when the same scanner returns several of the
next logical rows in a row.
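A minimal sketch of the merge described above, written in Python rather than against the HBase Java client. The scanners are stand-ins: each is assumed to be an iterator of (row_key, value) pairs already sorted within its bucket, and the function name merged_scan and the one-byte prefix length are hypothetical. A heap plays the role of the sorted list.

```python
import heapq


def merged_scan(scanners, prefix_len=1):
    """Yield (key, value) pairs from N bucketed scanners in un-prefixed key order.

    Assumption: every scanner yields (row_key, value) sorted by row_key, and
    row_key starts with a fixed-length bucket prefix of prefix_len bytes.
    """
    heap = []
    for idx, scanner in enumerate(scanners):
        it = iter(scanner)
        first = next(it, None)  # the initial next() on each scanner
        if first is not None:
            row_key, value = first
            # idx is unique per scanner, so ties never compare values/iterators
            heap.append((row_key[prefix_len:], idx, value, it))
    heapq.heapify(heap)

    while heap:
        key, idx, value, it = heapq.heappop(heap)  # lowest stripped key
        yield key, value
        nxt = next(it, None)  # advance only the scanner we just consumed
        if nxt is not None:
            row_key, value = nxt
            heapq.heappush(heap, (row_key[prefix_len:], idx, value, it))


if __name__ == "__main__":
    bucket0 = [(b"0apple", 1), (b"0banana", 2), (b"0fig", 5)]
    bucket1 = [(b"1cherry", 3), (b"1date", 4)]
    for key, value in merged_scan([bucket0, bucket1]):
        print(key, value)
```

Only one entry per scanner sits in the heap at any time, so memory stays proportional to N regardless of how many rows the range contains.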
The shell is written in JRuby, so any function you can call there could
be used to build the prefix, and you could compute it on the fly.
This will not help with merging the bucketed key ranges; you need to
do that in code with the approach above. But since this is JRuby,
you could write that code in Ruby and add it to your local shell, giving
you what you need.
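To make the prefix idea concrete, here is a rough sketch in Python (not JRuby, and not Socorro's actual scheme) of computing a bucket prefix from a readable key and expanding one logical range into one range per bucket. The names bucket_prefix, salted_key and scan_ranges, the bucket count of 16, and the choice of MD5 are all assumptions made for illustration.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical bucket count; one hex character covers 0..15


def bucket_prefix(key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Return a one-character hex prefix derived from a hash of the key."""
    if not 1 <= num_buckets <= 16:
        raise ValueError("this sketch supports 1..16 buckets only")
    # MD5 is used only because it is in the standard library; any stable
    # hash works as long as writers and readers agree on it.
    bucket = hashlib.md5(key).digest()[0] % num_buckets
    return format(bucket, "x").encode("ascii")


def salted_key(key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Row key as stored: bucket prefix followed by the readable key."""
    return bucket_prefix(key, num_buckets) + key


def scan_ranges(start: bytes, stop: bytes, num_buckets: int = NUM_BUCKETS):
    """Expand one logical [start, stop) range into one range per bucket."""
    prefixes = (format(b, "x").encode("ascii") for b in range(num_buckets))
    return [(p + start, p + stop) for p in prefixes]


if __name__ == "__main__":
    print(salted_key(b"r1"))             # key to use for a single get
    print(scan_ranges(b"r1", b"r5")[:2])  # first two of the N scan ranges
```

A single get needs only salted_key, because the prefix is recomputed from the readable key. A range query needs all N ranges from scan_ranges, one scanner each, with the results merged in code.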
Lars
On Wed, Mar 16, 2011 at 9:01 AM, Eric Charles wrote:
Oops, forget my first question about range queries (if keys are hashed, they
cannot be queried based on a range...)
I'm still curious about a hash function in the shell (2.) and advice
on MD5/Jenkins/SHA1 (3.)
Tks,
Eric
On 16/03/2011 09:52, Eric Charles wrote:
Hi,
To help avoid hotspots, I'm planning to use hashed keys in some tables.
1. I wonder if this strategy is advised for the range query (from/to key)
use case, because the rows will be randomly distributed across different
regions. Will it cause a performance loss?
2. Is it possible to query from the HBase shell with something like "get 't1',
@hash('r1')", to let the shell compute the hash for you from the readable
key?
3. There are MD5 and Jenkins classes in the hbase.util package. Which would you
advise? What about SHA1?
Tks,
- Eric
PS: I searched the archive but didn't find the answers.