Hello list,

I'm fairly new to HBase, so I'm posting this email as a request for
comments.
Very briefly: I do a lot of text processing with MapReduce, so it's very
useful for me to convert strings to longs, which makes my computations
faster.

My corpus keeps growing, and I want this String->Long mapping to be
persistent and dynamic (I want to add new mappings when I find new words).
At the moment I'm tackling the problem this way (pseudo-code):

longvalue = convert(word) # gets from hbase
if longvalue == -1:
    longvalue = insert(word) # puts in hbase

longvalue now contains the mapped value. This approach requires a
global counter that stores the latest mapped long and is incremented at
every insert. I can easily do this in two ways: with a special row in
HBase, "_counter", that I increment through incrementColumnValue, or by
creating a sequential, non-ephemeral znode in ZooKeeper and using its
version as my counter. The first one is of course faster. So the
solution would be:

insert(word):
    longvalue = hbase.incrementColumnValue("_counter", "v")
    hbase.put(word, longvalue)
    return longvalue
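
To make the two snippets above concrete, here is a minimal sketch in
Python. It is only an illustration: FakeTable is a hypothetical in-memory
stand-in for the HBase table, and increment_counter() merely mimics an
atomic increment on the "_counter" row; none of these names are HBase API.

```python
import threading

class FakeTable:
    """Hypothetical in-memory stand-in for the HBase table."""

    def __init__(self):
        self._rows = {}                      # word -> long
        self._counter = 0                    # plays the "_counter" row
        self._counter_lock = threading.Lock()

    def increment_counter(self):
        # Mimics an atomic increment of the "_counter" row.
        with self._counter_lock:
            self._counter += 1
            return self._counter

    def get(self, word):
        # Returns -1 when there is no mapping, as convert() does above.
        return self._rows.get(word, -1)

    def put(self, word, value):
        self._rows[word] = value

def convert(table, word):
    return table.get(word)

def insert(table, word):
    longvalue = table.increment_counter()
    table.put(word, longvalue)
    return longvalue

def lookup_or_insert(table, word):
    longvalue = convert(table, word)
    if longvalue == -1:
        longvalue = insert(table, word)
    return longvalue

table = FakeTable()
ids = [lookup_or_insert(table, w) for w in ["hbase", "zookeeper", "hbase"]]
print(ids)  # [1, 2, 1]
```

With a single client this behaves as intended: the second lookup of
"hbase" returns the value assigned the first time.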

The problem is that between the time I realize there's no mapping for my
word and the time I insert the new longvalue, somebody else might have
done the same, so I end up with a corrupted dictionary: the same word
has been handed two different longs.
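
The race can be replayed deterministically. The sketch below is only an
illustration in Python: a plain dict stands in for the HBase table, and
"client A" and "client B" are two hypothetical tasks whose steps are
interleaved by hand rather than by real concurrency.

```python
table = {}      # word -> long, stands in for the HBase table
counter = 0     # stands in for the "_counter" row

def convert(word):
    return table.get(word, -1)

def insert(word):
    global counter
    counter += 1
    table[word] = counter
    return counter

# Step 1: both clients look up the same unseen word and both get -1.
seen_by_a = convert("hadoop")
seen_by_b = convert("hadoop")

# Step 2: both clients therefore insert, each taking a fresh counter value.
id_a = insert("hadoop") if seen_by_a == -1 else seen_by_a
id_b = insert("hadoop") if seen_by_b == -1 else seen_by_b

print(id_a, id_b, table["hadoop"])  # 1 2 2
```

Client A goes on using 1 for the word, while the table now says 2, so
data already written by A no longer agrees with the dictionary.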

One possible solution would be to acquire a lock on the "_counter" row,
recheck for the presence of the mapping and then insert my new value:

safe_insert(word):
    lock("_counter")
    longvalue = convert(word)
    if longvalue == -1: #nobody inserted the mapping in the meantime
        longvalue = insert(word)
    unlock("_counter")
    return longvalue
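
As a rough sketch of the same idea in Python: a threading.Lock plays the
role of the lock on the "_counter" row, and a dict stands in for the
table. Both are assumptions made for illustration only, not HBase calls.
The lookup() wrapper is also hypothetical; it shows that the lock is
only taken when the first unlocked read misses.

```python
import threading

table = {}                        # word -> long, stands in for the HBase table
counter = 0                       # stands in for the "_counter" row
global_lock = threading.Lock()    # plays the role of the lock on "_counter"

def convert(word):
    return table.get(word, -1)

def insert(word):
    global counter
    counter += 1
    table[word] = counter
    return counter

def safe_insert(word):
    # "with" releases the lock even if convert() or insert() raises.
    with global_lock:
        longvalue = convert(word)
        if longvalue == -1:       # nobody inserted the mapping in the meantime
            longvalue = insert(word)
        return longvalue

def lookup(word):
    longvalue = convert(word)     # fast path, no lock
    if longvalue == -1:
        longvalue = safe_insert(word)
    return longvalue

results = []

def worker():
    results.append(lookup("hadoop"))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(set(results), counter)  # {1} 1
```

All eight threads agree on the same value and the counter is consumed
only once, at the price of serializing every insert behind one lock.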

This way the counter row, with its lock, would behave as a global lock.
This would solve my problem but would create a bottleneck (although
over time my inserts should become very rare as the dictionary grows).
An alternative would be to use per-word locks in ZooKeeper:

ZKsafe_insert(word):
    ZKlock("/words/"+ word)
    longvalue = convert(word)
    if longvalue == -1: #nobody inserted the mapping in the meantime
        longvalue = insert(word)
    ZKunlock("/words/"+word)
    return longvalue
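
The per-word variant can be sketched the same way. This is an in-process
illustration in Python only: a dict of threading.Lock objects stands in
for the lock znodes under /words/, a separate lock models the atomic
incrementColumnValue, and no ZooKeeper or HBase client is involved. All
names here are hypothetical.

```python
import threading

table = {}                          # word -> long, stands in for the HBase table
counter = 0                         # stands in for the "_counter" row
counter_lock = threading.Lock()     # models the atomic incrementColumnValue
registry_lock = threading.Lock()    # protects the dict of per-word locks
word_locks = {}                     # word -> Lock, stands in for /words/<word>

def convert(word):
    return table.get(word, -1)

def insert(word):
    global counter
    with counter_lock:
        counter += 1
        longvalue = counter
    table[word] = longvalue
    return longvalue

def zk_lock_for(word):
    # Returns the one lock shared by every caller asking for this word.
    with registry_lock:
        return word_locks.setdefault(word, threading.Lock())

def zk_safe_insert(word):
    with zk_lock_for(word):
        longvalue = convert(word)
        if longvalue == -1:         # nobody inserted the mapping in the meantime
            longvalue = insert(word)
        return longvalue

results = []

def worker(word):
    results.append((word, zk_safe_insert(word)))

words = ["hbase", "hadoop", "zookeeper"] * 5
threads = [threading.Thread(target=worker, args=(w,)) for w in words]
for t in threads:
    t.start()
for t in threads:
    t.join()

ids_per_word = {}
for word, longvalue in results:
    ids_per_word.setdefault(word, set()).add(longvalue)
print(ids_per_word)
```

Inserts of different words no longer wait on each other; only callers
racing on the same word are serialized.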

This of course would give me finer-grained locks and better
scalability, but I'd rely on a system with higher latency (ZK).

Does anybody have a better solution with HBase? I guess using
hbase_transactional would also be a possibility, but again, what about
speed and the current issues with the package (like recovering in the
face of HRegion failure)?


Thank you,

Claudio

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
[email protected] http://www.tis.bz.it


