checkAndPut interprets a null expected-value argument as a check for non-existence. That is, if you set the expected value to null, the put will only succeed if the value does not exist.
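To make the idea concrete, here is a minimal sketch of how the dictionary insert could lean on that behaviour instead of a lock. It is an assumption-laden illustration, not HBase client code: InMemoryTable, its method names, and get_or_create_id are hypothetical stand-ins that only mimic an atomic increment and an atomic "put if absent" check.

```python
import threading


class InMemoryTable:
    """Hypothetical stand-in for an HBase table; NOT the real client API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cells = {}    # row key -> stored long
        self._counter = 0   # plays the role of the "_counter" row

    def get(self, row):
        with self._lock:
            return self._cells.get(row)

    def increment_column_value(self):
        # Atomic increment, standing in for incrementColumnValue on "_counter".
        with self._lock:
            self._counter += 1
            return self._counter

    def check_and_put(self, row, expected, value):
        # Atomic compare-and-set; expected=None means "only if the cell is absent".
        with self._lock:
            if self._cells.get(row) != expected:
                return False
            self._cells[row] = value
            return True


def get_or_create_id(table, word):
    """Return the long mapped to word, creating the mapping if needed."""
    existing = table.get(word)
    if existing is not None:
        return existing
    candidate = table.increment_column_value()
    if table.check_and_put(word, None, candidate):
        return candidate
    # Lost the race: someone else inserted first, so use their value.
    return table.get(word)
```

Note that a writer who loses the race has already consumed a counter value, so the ids may have gaps. That should be harmless if the ids only need to be unique rather than dense.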
Would that help?

-ryan

On Tue, Nov 30, 2010 at 6:07 AM, Claudio Martella <[email protected]> wrote:
> Hi Dave,
>
> thanks for your idea. I also considered this possibility. Although the
> possibility of a collision is very small, what scares me is the fact
> that i don't think the corruption can be corrected.
> I can for sure detect it afterwards in O(NlogN) time by scanning the
> table, but correcting my long-based corpus is impossible. Once the
> database is converted, the information is lost.
>
>
> On 11/30/10 1:43 AM, Buttler, David wrote:
>> A while back I had a strange idea to bypass this problem: create a 64-bit
>> hash code for the word. Your word space should be significantly smaller
>> than 64 bits, so a good hash algorithm (the top 64 bits of sha1 say) should
>> make collisions extremely rare. And you can always check your
>> dictionary later for collisions if this feels wrong.
>> This should be a good deal simpler than trying to keep around an order
>> dependent integer mapping for your dictionary. And, it is somewhat
>> recoverable if you ever lose your dictionary for some reason.
>>
>> Dave
>>
>> -----Original Message-----
>> From: Claudio Martella [mailto:[email protected]]
>> Sent: Monday, November 29, 2010 7:13 AM
>> To: [email protected]
>> Subject: incremental counters and a global String->Long Dictionary
>>
>> Hello list,
>>
>> I'm kind of new to HBase, so I'll post this email with a request for
>> comment.
>> Very briefly, I do a lot of text processing with mapreduce, so it's very
>> useful for me to convert strings to longs, so i can make my computations
>> faster.
>>
>> My corpus keeps on growing and I want this String->Long mapping to be
>> persistent and dynamic (i want to add new mappings when i find new words).
>> At the moment i'm tackling the problem this way (pseudo-code):
>>
>> longvalue = convert(word)       # gets from hbase
>> if longvalue == -1:
>>     longvalue = insert(word)    # puts in hbase
>>
>> longvalue now contains the new mapped value. This approach requires a
>> global counter that saves the latest mapped long and increments at every
>> insert. I can easily do this in two ways: a special row in hbase, "_counter",
>> that I increment through incrementColumnValue, or a sequential
>> non-ephemeral znode in zookeeper whose version I use as my counter. The
>> first one is of course faster. So the solution would be:
>>
>> insert(word):
>>     longvalue = hbase.incrementColumnValue("_counter", "v")
>>     hbase.put(word, longvalue)
>>     return longvalue
>>
>> The problem is that between the time i realize there's no mapping for my
>> word and the time i insert the new longvalue, somebody else might have
>> done the same for me, so I have a corrupted dictionary.
>>
>> One possible solution would be to acquire a lock on the "_counter" row,
>> recheck for the presence of the mapping and then insert my new value:
>>
>> safe_insert(word):
>>     lock("_counter")
>>     longvalue = convert(word)
>>     if longvalue == -1:    # nobody inserted the mapping in the meantime
>>         longvalue = insert(word)
>>     unlock("_counter")
>>     return longvalue
>>
>> This way the counter row, with its lock, would behave as a global lock.
>> This would solve my problems but would create a bottleneck (although
>> with time my inserts tend to get very rare as the dictionary grows). A
>> solution to this problem would be to have locks on zookeeper based on words.
>>
>> ZKsafe_insert(word):
>>     ZKlock("/words/" + word)
>>     longvalue = convert(word)
>>     if longvalue == -1:    # nobody inserted the mapping in the meantime
>>         longvalue = insert(word)
>>     ZKunlock("/words/" + word)
>>     return longvalue
>>
>> This of course would allow me to have more fine-grained locks and better
>> scalability, but I'd rely on a system with higher latency (ZK).
>>
>> Does anybody have a better solution with hbase? I guess using
>> hbase_transational would also be a possibility, but again, what about
>> speed and the actual issues with the package (like recovering in the
>> face of hregion failure).
>>
>>
>> Thank you,
>>
>> Claudio
>>
>
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax +39 0471 068 129
> [email protected] http://www.tis.bz.it
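For comparison, Dave's hashing suggestion quoted above could look roughly like the sketch below. It assumes UTF-8 encoding of the word and a signed big-endian interpretation of the first 8 digest bytes (to fit a Java long); neither detail is specified in the thread, and the function names word_to_long and find_collisions are made up for illustration.

```python
import hashlib
import struct


def word_to_long(word):
    """Map a word to a signed 64-bit value from the top 64 bits of its SHA-1."""
    digest = hashlib.sha1(word.encode("utf-8")).digest()
    # The first 8 bytes are the top 64 bits; ">q" reads them as a signed
    # big-endian 64-bit integer (assumed here to match a Java long).
    return struct.unpack(">q", digest[:8])[0]


def find_collisions(words):
    """Return pairs of distinct words that hash to the same 64-bit value."""
    seen = {}          # hash value -> first word seen with that value
    collisions = []
    for word in set(words):
        h = word_to_long(word)
        if h in seen:
            collisions.append((seen[h], word))
        else:
            seen[h] = word
    return collisions
```

The mapping needs no counter and no coordination between writers, which is its appeal. The cost is the one Claudio points out: a collision can be detected with a scan like find_collisions, but not undone once the corpus has been converted.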
