I think it already is, Claudio: http://hbase.apache.org/docs/r0.89.20100924/apidocs/org/apache/hadoop/hbase/client/HTable.html#checkAndPut(byte[], byte[], byte[], byte[], org.apache.hadoop.hbase.client.Put)
St.Ack

On Thu, Dec 2, 2010 at 7:42 AM, Claudio Martella <[email protected]> wrote:
> Hi Ryan,
>
> yes, that would help for sure. Shouldn't this feature be documented?
>
> Thanks
>
>
> On 12/1/10 4:03 AM, Ryan Rawson wrote:
>> CheckAndPut interprets a 'null' value argument as a check for
>> existence. That is, if you set the expected value to null, it will only
>> succeed if the value does not exist.
>>
>> Would that help?
>>
>> -ryan
>>
>> On Tue, Nov 30, 2010 at 6:07 AM, Claudio Martella
>> <[email protected]> wrote:
>>> Hi Dave,
>>>
>>> thanks for your idea. I also considered this possibility. Although the
>>> possibility of a collision is very small, what scares me is the fact
>>> that I don't think the corruption can be corrected.
>>> I can for sure detect it afterwards in O(NlogN) time by scanning the
>>> table, but correcting my long-based corpus is impossible. Once the
>>> database is converted, the information is lost.
>>>
>>>
>>> On 11/30/10 1:43 AM, Buttler, David wrote:
>>>> A while back I had a strange idea to bypass this problem: create a 64-bit
>>>> hash code for the word. Your word space should be significantly smaller
>>>> than 64 bits, so a good hash algorithm (the top 64 bits of sha1, say)
>>>> should make collisions extremely rare. And you can always check your
>>>> dictionary later for collisions if this feels wrong.
>>>> This should be a good deal simpler than trying to keep around an
>>>> order-dependent integer mapping for your dictionary. And it is somewhat
>>>> recoverable if you ever lose your dictionary for some reason.
>>>>
>>>> Dave
>>>>
>>>> -----Original Message-----
>>>> From: Claudio Martella [mailto:[email protected]]
>>>> Sent: Monday, November 29, 2010 7:13 AM
>>>> To: [email protected]
>>>> Subject: incremental counters and a global String->Long Dictionary
>>>>
>>>> Hello list,
>>>>
>>>> I'm kind of new to HBase, so I'll post this email with a request for
>>>> comment.
>>>> Very briefly, I do a lot of text processing with mapreduce, so it's very
>>>> useful for me to convert strings to longs, so I can make my computations
>>>> faster.
>>>>
>>>> My corpus keeps on growing and I want this String->Long mapping to be
>>>> persistent and dynamic (I want to add new mappings when I find new
>>>> words).
>>>> At the moment I'm tackling the problem this way (pseudo-code):
>>>>
>>>> longvalue = convert(word) # gets from hbase
>>>> if longvalue == -1:
>>>>     longvalue = insert(word) # puts in hbase
>>>>
>>>> longvalue now contains the new mapped value. This approach requires a
>>>> global counter that saves the latest mapped long and increments at every
>>>> insert. I can easily do this in two ways: a special row in hbase,
>>>> "_counter", that I increment through incrementColumnValue, or a sequential
>>>> non-ephemeral znode in zookeeper whose version I use as my counter. The
>>>> first one is of course faster. So the solution would be:
>>>>
>>>> insert(word):
>>>>     longvalue = hbase.incrementColumnValue("_counter", "v")
>>>>     hbase.put(word, longvalue)
>>>>     return longvalue
>>>>
>>>> The problem is that between the time I realize there's no mapping for my
>>>> word and the time I insert the new longvalue, somebody else might have
>>>> inserted a mapping for the same word, so I end up with a corrupted
>>>> dictionary.
>>>>
>>>> One possible solution would be to acquire a lock on the "_counter" row,
>>>> recheck for the presence of the mapping and then insert my new value:
>>>>
>>>> safe_insert(word):
>>>>     lock("_counter")
>>>>     longvalue = convert(word)
>>>>     if longvalue == -1: # nobody inserted the mapping in the meantime
>>>>         longvalue = insert(word)
>>>>     unlock("_counter")
>>>>     return longvalue
>>>>
>>>> This way the counter row, with its lock, would behave as a global lock.
>>>> This would solve my problems but would create a bottleneck (although
>>>> with time my inserts tend to get very rare as the dictionary grows). A
>>>> solution to this problem would be to have locks on zookeeper based on
>>>> words.
>>>>
>>>> ZKsafe_insert(word):
>>>>     ZKlock("/words/" + word)
>>>>     longvalue = convert(word)
>>>>     if longvalue == -1: # nobody inserted the mapping in the meantime
>>>>         longvalue = insert(word)
>>>>     ZKunlock("/words/" + word)
>>>>     return longvalue
>>>>
>>>> This of course would allow me to have more fine-grained locks and better
>>>> scalability, but I'd rely on a system with higher latency (ZK).
>>>>
>>>> Does anybody have a better solution with hbase? I guess using
>>>> hbase_transactional would also be a possibility, but again, what about
>>>> speed and the actual issues with the package (like recovering in the
>>>> face of hregion failure).
>>>>
>>>>
>>>> Thank you,
>>>>
>>>> Claudio
>>>>
>>>
>>> --
>>> Claudio Martella
>>> Digital Technologies
>>> Unit Research & Development - Analyst
>>>
>>> TIS innovation park
>>> Via Siemens 19 | Siemensstr. 19
>>> 39100 Bolzano | 39100 Bozen
>>> Tel. +39 0471 068 123
>>> Fax +39 0471 068 129
>>> [email protected] http://www.tis.bz.it
>>>
>>> Short information regarding use of personal data. According to Section 13
>>> of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that
>>> we process your personal data in order to fulfil contractual and fiscal
>>> obligations and also to send you information regarding our services and
>>> events. Your personal data are processed with and without electronic means
>>> and by respecting data subjects' rights, fundamental freedoms and dignity,
>>> particularly with regard to confidentiality, personal identity and the
>>> right to personal data protection. At any time and without formalities you
>>> can write an e-mail to [email protected] in order to object the processing
>>> of your personal data for the purpose of sending advertising materials and
>>> also to exercise the right to access personal data and other rights
>>> referred to in Section 7 of Decree 196/2003. The data controller is TIS
>>> Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find
>>> the complete information on the web site www.tis.bz.it.
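
For anyone reading this thread later, here is a rough sketch of how the null-check behaviour Ryan describes could replace the lock in safe_insert. It is plain in-memory Python, not HBase client code: FakeTable, check_and_put and increment_column_value are hypothetical stand-ins that only mimic the semantics discussed above (an atomic counter increment, and a put that succeeds only if the expected value matches, with None meaning "no value exists yet"). One consequence of this scheme, as sketched, is that a writer who loses the race leaves its counter value unused, so the ids may have gaps.

```python
import threading


class FakeTable:
    """Hypothetical in-memory stand-in for an HBase table (not the HBase API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}   # row key -> stored long
        self._counter = 0  # plays the role of the "_counter" row

    def get(self, row):
        with self._lock:
            return self._cells.get(row)

    def increment_column_value(self):
        # Atomic increment, mimicking incrementColumnValue on "_counter".
        with self._lock:
            self._counter += 1
            return self._counter

    def check_and_put(self, row, expected, value):
        # Atomic compare-and-set. expected=None means "row must not exist",
        # mirroring the null semantics described for checkAndPut.
        with self._lock:
            if self._cells.get(row) != expected:
                return False
            self._cells[row] = value
            return True


def safe_insert(table, word):
    """Return the long mapped to word, creating the mapping if needed."""
    while True:
        existing = table.get(word)
        if existing is not None:
            return existing
        candidate = table.increment_column_value()
        if table.check_and_put(word, None, candidate):
            return candidate
        # Lost the race: another writer stored a mapping first.
        # Loop again to read the winner's value; candidate stays unused.
```

Calling safe_insert(table, "hello") twice on the same FakeTable returns the same long both times, and concurrent callers for the same word all end up with the single value that won the check.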

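
Dave's hashing idea can be sketched in a few lines as well. This is only an illustration under stated assumptions: the function names word_to_long and find_collisions are made up for this example, words are assumed to be encoded as UTF-8 before hashing, and the result is read as a signed integer so that it would fit a Java long. The collision check corresponds to the after-the-fact scan Claudio mentions; it detects collisions but does not repair a corpus that has already been converted.

```python
import hashlib


def word_to_long(word):
    """Map a word to a signed 64-bit integer taken from the top 8 bytes of SHA-1."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    # signed=True keeps the value in the range of a Java long (an assumption
    # about how the ids would be stored).
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def find_collisions(words):
    """Return groups of distinct words that share the same 64-bit hash."""
    by_hash = {}
    for word in set(words):
        by_hash.setdefault(word_to_long(word), set()).add(word)
    return [group for group in by_hash.values() if len(group) > 1]
```

Because the mapping is a pure function of the word, it needs no counter and no coordination between writers, which is the simplification Dave points out.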