I think it already is, Claudio:

http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[], byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)
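
For readers of the archive, here is a minimal sketch of how the null-check behaviour described further down in the thread could replace the lock in safe_insert. It does not talk to HBase: FakeTable, check_and_put, increment_column_value and get_or_insert are hypothetical in-memory stand-ins written for this illustration, and only mimic the semantics of checkAndPut and incrementColumnValue as they are described in this thread.

```python
import threading


class FakeTable:
    """Hypothetical in-memory stand-in for an HBase table (illustration only)."""

    def __init__(self):
        self._cells = {}      # word -> long value
        self._counters = {}   # counter row -> current count
        self._lock = threading.Lock()  # models the atomicity of each call

    def get(self, row):
        with self._lock:
            return self._cells.get(row)

    def increment_column_value(self, row, amount=1):
        # Atomically bump a counter and return the new value.
        with self._lock:
            self._counters[row] = self._counters.get(row, 0) + amount
            return self._counters[row]

    def check_and_put(self, row, expected, value):
        # Write only if the current value equals `expected`.
        # expected=None means "only if the cell does not exist yet".
        with self._lock:
            if self._cells.get(row) != expected:
                return False
            self._cells[row] = value
            return True


def get_or_insert(table, word):
    """Return the long mapped to `word`, creating the mapping if needed."""
    existing = table.get(word)
    if existing is not None:
        return existing
    candidate = table.increment_column_value("_counter")
    if table.check_and_put(word, None, candidate):
        return candidate
    # Somebody else won the race: use their value. The candidate id is
    # simply skipped, so the ids may contain gaps.
    return table.get(word)
```

The trade-off in this sketch is that a lost race wastes one counter value, so the ids stay unique but are not guaranteed to be dense.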

St.Ack

On Thu, Dec 2, 2010 at 7:42 AM, Claudio Martella
<[email protected]> wrote:
> Hi Ryan,
>
> yes that would help for sure. Shouldn't this feature be documented?
>
> Thanks
>
>
> On 12/1/10 4:03 AM, Ryan Rawson wrote:
>> checkAndPut interprets a null expected-value argument as a check for
>> non-existence.  That is, if you set the expected value to null, the put
>> will only succeed if the value does not exist.
>>
>> Would that help?
>>
>> -ryan
>>
>> On Tue, Nov 30, 2010 at 6:07 AM, Claudio Martella
>> <[email protected]> wrote:
>>> Hi Dave,
>>>
>>> thanks for your idea. I also considered this possibility. Although the
>>> probability of a collision is very small, what scares me is that I
>>> don't think the corruption can be corrected.
>>> I can certainly detect it afterwards in O(N log N) time by scanning the
>>> table, but correcting my long-based corpus is impossible. Once the
>>> database is converted, the information is lost.
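
To put a rough number on "very small", the standard birthday-bound approximation can be evaluated directly. This is only a sketch: it assumes an ideal, uniformly distributed 64-bit hash, and the corpus size used in the note below is an assumed figure, not one taken from this thread.

```python
import math


def collision_probability(num_words: int, hash_bits: int = 64) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    pairs_over_space = num_words * (num_words - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs_over_space)
```

For an assumed dictionary of ten million distinct words, collision_probability(10**7) comes out at a few in a million, so a collision is unlikely but not impossible.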
>>>
>>>
>>> On 11/30/10 1:43 AM, Buttler, David wrote:
>>>> A while back I had a strange idea to bypass this problem: create a 64-bit
>>>> hash code for the word.  Your word space should be significantly smaller
>>>> than 2^64 words, so a good hash algorithm (the top 64 bits of SHA-1, say)
>>>> should make collisions extremely rare.  And you can always check your
>>>> dictionary later for collisions if this feels wrong.
>>>> This should be a good deal simpler than trying to keep around an
>>>> order-dependent integer mapping for your dictionary.  And it is somewhat
>>>> recoverable if you ever lose your dictionary for some reason.
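
A minimal sketch of the hashing idea above, using only the Python standard library. The names word_to_long and find_collisions are made up for this illustration, and folding the result into the signed range of a Java long is an assumption about how the value would be stored.

```python
import hashlib


def word_to_long(word: str) -> int:
    """Map a word to a signed 64-bit value taken from the top 64 bits of SHA-1."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    unsigned = int.from_bytes(digest[:8], "big")
    # Fold into the signed range of a Java long (assumed storage format).
    return unsigned - (1 << 64) if unsigned >= (1 << 63) else unsigned


def find_collisions(words):
    """Return pairs of distinct words that hash to the same 64-bit value."""
    seen = {}
    collisions = []
    for word in set(words):
        key = word_to_long(word)
        if key in seen:
            collisions.append((seen[key], word))
        else:
            seen[key] = word
    return collisions
```

Because the mapping depends only on the word itself, no shared counter is needed, and the dictionary can be rebuilt from the corpus if it is ever lost.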
>>>>
>>>> Dave
>>>>
>>>> -----Original Message-----
>>>> From: Claudio Martella [mailto:[email protected]]
>>>> Sent: Monday, November 29, 2010 7:13 AM
>>>> To: [email protected]
>>>> Subject: incremental counters and a global String->Long Dictionary
>>>>
>>>> Hello list,
>>>>
>>>> I'm fairly new to HBase, so I'm posting this email as a request for
>>>> comments.
>>>> Very briefly, I do a lot of text processing with MapReduce, so it's very
>>>> useful for me to convert strings to longs to make my computations
>>>> faster.
>>>>
>>>> My corpus keeps growing, and I want this String->Long mapping to be
>>>> persistent and dynamic (I want to add new mappings when I find new
>>>> words).
>>>> At the moment I'm tackling the problem this way (pseudo-code):
>>>>
>>>> longvalue = convert(word) # gets from hbase
>>>> if longvalue == -1:
>>>>     longvalue = insert(word) # puts in hbase
>>>>
>>>> longvalue now contains the mapped value. This approach requires a
>>>> global counter that stores the latest mapped long and is incremented at
>>>> every insert. I can easily do this in two ways: a special row in HBase,
>>>> "_counter", that I increment through incrementColumnValue, or a
>>>> sequential non-ephemeral znode in ZooKeeper whose version I use as my
>>>> counter. The first one is of course faster. So the solution would be:
>>>>
>>>> insert(word):
>>>>     longvalue = hbase.incrementColumnValue("_counter", "v")
>>>>     hbase.put(word, longvalue)
>>>>     return longvalue
>>>>
>>>> The problem is that between the time I realize there's no mapping for my
>>>> word and the time I insert the new longvalue, somebody else might have
>>>> inserted a mapping for the same word, leaving me with a corrupted
>>>> dictionary.
>>>>
>>>> One possible solution would be to acquire a lock on the "_counter" row,
>>>> recheck for the presence of the mapping and then insert my new value:
>>>>
>>>> safe_insert(word):
>>>>     lock("_counter")
>>>>     longvalue = convert(word)
>>>>     if longvalue == -1: #nobody inserted the mapping in the meantime
>>>>         longvalue = insert(word)
>>>>     unlock("_counter")
>>>>     return longvalue
>>>>
>>>> This way the counter row, with its lock, would behave as a global lock.
>>>> This would solve my problem but would create a bottleneck (although
>>>> over time my inserts tend to become very rare as the dictionary grows). A
>>>> solution to this bottleneck would be per-word locks in ZooKeeper.
>>>>
>>>> ZKsafe_insert(word):
>>>>     ZKlock("/words/"+ word)
>>>>     longvalue = convert(word)
>>>>     if longvalue == -1: #nobody inserted the mapping in the meantime
>>>>         longvalue = insert(word)
>>>>     ZKunlock("/words/"+word)
>>>>     return longvalue
>>>>
>>>> This would of course give me finer-grained locks and better
>>>> scalability, but I'd rely on a system with higher latency (ZK).
>>>>
>>>> Does anybody have a better solution with HBase? I guess using
>>>> hbase_transactional would also be a possibility, but again, what about
>>>> speed and the current issues with the package (like recovering in the
>>>> face of HRegion failure)?
>>>>
>>>>
>>>> Thank you,
>>>>
>>>> Claudio
>>>>
>>>
>>> --
>>> Claudio Martella
>>> Digital Technologies
>>> Unit Research & Development - Analyst
>>>
>>> TIS innovation park
>>> Via Siemens 19 | Siemensstr. 19
>>> 39100 Bolzano | 39100 Bozen
>>> Tel. +39 0471 068 123
>>> Fax  +39 0471 068 129
>>> [email protected] http://www.tis.bz.it
>>>
>>> Short information regarding use of personal data. According to Section 13 
>>> of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that 
>>> we process your personal data in order to fulfil contractual and fiscal 
>>> obligations and also to send you information regarding our services and 
>>> events. Your personal data are processed with and without electronic means 
>>> and by respecting data subjects' rights, fundamental freedoms and dignity, 
>>> particularly with regard to confidentiality, personal identity and the 
>>> right to personal data protection. At any time and without formalities you 
>>> can write an e-mail to [email protected] in order to object the processing 
>>> of your personal data for the purpose of sending advertising materials and 
>>> also to exercise the right to access personal data and other rights 
>>> referred to in Section 7 of Decree 196/2003. The data controller is TIS 
>>> Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find 
>>> the complete information on the web site www.tis.bz.it.
>>>
>>>
>>>
