checkAndPut interprets a null expected-value argument as a check for
non-existence.  That is, if you set the expected value to null, the put
will only succeed if the value does not exist.
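
A minimal sketch of the idea, in the same Python-ish style as the
pseudo-code quoted below. It assumes an in-memory stand-in for the table:
FakeTable, next_id, check_and_put and get_or_create are made-up names for
illustration, not the HBase client API. The lock only mimics the atomicity
that the region server would provide for the increment and the
check-and-put.

```python
import threading


class FakeTable:
    """In-memory stand-in for the dictionary table (hypothetical, not HBase)."""

    def __init__(self):
        self._cells = {}      # word -> long value
        self._counter = 0     # plays the role of the "_counter" row
        self._lock = threading.Lock()

    def get(self, row):
        with self._lock:
            return self._cells.get(row)

    def next_id(self):
        # Atomic increment, standing in for incrementColumnValue.
        with self._lock:
            self._counter += 1
            return self._counter

    def check_and_put(self, row, expected, value):
        # Atomic compare-and-set. expected=None means
        # "write only if the cell does not exist yet".
        with self._lock:
            if self._cells.get(row) != expected:
                return False
            self._cells[row] = value
            return True


def get_or_create(table, word):
    existing = table.get(word)
    if existing is not None:
        return existing
    candidate = table.next_id()
    if table.check_and_put(word, None, candidate):
        return candidate
    # Lost the race: somebody else mapped the word first, so use their value.
    return table.get(word)
```

With this shape no explicit lock is needed around the check and the
insert. The cost is that a writer who loses the race has already consumed
a counter value, so the assigned longs can have gaps.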

Would that help?

-ryan

On Tue, Nov 30, 2010 at 6:07 AM, Claudio Martella
<[email protected]> wrote:
> Hi Dave,
>
> thanks for your idea. I also considered this possibility. Although the
> possibility of a collision is very small, what scares me is the fact
> that I don't think the corruption can be corrected.
> I can for sure detect it afterwards in O(NlogN) time by scanning the
> table, but correcting my long-based corpus is impossible. Once the
> database is converted, the information is lost.
>
>
> On 11/30/10 1:43 AM, Buttler, David wrote:
>> A while back I had a strange idea to bypass this problem: create a 64-bit
>> hash code for the word.  Your word space should be significantly smaller
>> than 64 bits, so a good hash algorithm (the top 64 bits of sha1, say) should
>> make collisions extremely rare.  And you can always check your
>> dictionary later for collisions if this feels wrong.
>> This should be a good deal simpler than trying to keep around an
>> order-dependent integer mapping for your dictionary.  And it is somewhat
>> recoverable if you ever lose your dictionary for some reason.
>>
>> Dave
>>
>> -----Original Message-----
>> From: Claudio Martella [mailto:[email protected]]
>> Sent: Monday, November 29, 2010 7:13 AM
>> To: [email protected]
>> Subject: incremental counters and a global String->Long Dictionary
>>
>> Hello list,
>>
>> I'm kind of new to HBase, so I'll post this email with a request for
>> comment.
>> Very briefly, I do a lot of text processing with mapreduce, so it's very
>> useful for me to convert strings to longs so I can make my computations
>> faster.
>>
>> My corpus keeps on growing and I want this String->Long mapping to be
>> persistent and dynamic (I want to add new mappings when I find new words).
>> At the moment I'm tackling the problem this way (pseudo-code):
>>
>> longvalue = convert(word) # gets from hbase
>> if longvalue == -1:
>>     longvalue = insert(word) # puts in hbase
>>
>> longvalue now contains the new mapped value. This approach requires a
>> global counter that saves the latest mapped long and increments at every
>> insert. I can easily do this in two ways: a special row in HBase,
>> "_counter", that I increment through incrementColumnValue, or a sequential
>> non-ephemeral znode in ZooKeeper whose version I use as my counter. The
>> first one is of course faster. So the solution would be:
>>
>> insert(word):
>>     longvalue = hbase.incrementColumnValue("_counter", "v")
>>     hbase.put(word, longvalue)
>>     return longvalue
>>
>> The problem is that between the time I realize there's no mapping for my
>> word and the time I insert the new longvalue, somebody else might have
>> done the same for me, so I have a corrupted dictionary.
>>
>> One possible solution would be to acquire a lock on the "_counter" row,
>> recheck for the presence of the mapping and then insert my new value:
>>
>> safe_insert(word):
>>     lock("_counter")
>>     longvalue = convert(word)
>>     if longvalue == -1: #nobody inserted the mapping in the meantime
>>         longvalue = insert(word)
>>     unlock("_counter")
>>     return longvalue
>>
>> This way the counter row, with its lock, would behave as a global lock.
>> This would solve my problem but would create a bottleneck (although
>> over time my inserts tend to get very rare as the dictionary grows). A
>> solution to this problem would be to have per-word locks in ZooKeeper.
>>
>> ZKsafe_insert(word):
>>     ZKlock("/words/"+ word)
>>     longvalue = convert(word)
>>     if longvalue == -1: #nobody inserted the mapping in the meantime
>>         longvalue = insert(word)
>>     ZKunlock("/words/"+word)
>>     return longvalue
>>
>> This of course would allow me to have more fine-grained locks and better
>> scalability, but I'd rely on a system with higher latency (ZK).
>>
>> Does anybody have a better solution with HBase? I guess using
>> hbase_transational would also be a possibility, but again, what about
>> speed and the actual issues with the package (like recovering in the
>> face of hregion failure)?
>>
>>
>> Thank you,
>>
>> Claudio
>>
>
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> [email protected] http://www.tis.bz.it
>
>
