Re: hash uniqueKey generation?

2010-11-16 Thread Dennis Gearon
I think the deduplication signature field will work as a multiValued field. So you can do copyField to it from all of the source fields. Dan Lynn wrote: Hi, I just finished reading on the wiki about deduplication and the solr.UUIDField type. What I'd like to do

Re: hash uniqueKey generation?

2010-11-16 Thread Dan Lynn
Thanks for the feedback, guys! On 11/15/2010 10:14 AM, Dan Lynn wrote: Hi, I just finished reading on the wiki about deduplication and the solr.UUIDField type. What I'd like to do is generate an ID for a document by hashing a subset of its fields. One route I thought would be to do this

Re: hash uniqueKey generation?

2010-11-16 Thread Yonik Seeley
On Tue, Nov 16, 2010 at 5:31 AM, Dennis Gearon gear...@sbcglobal.net wrote: hashing is not 100% guaranteed to produce unique values. But if you go to enough bits with a good hash function, you can get the odds lower than the odds of something else changing the value like cosmic rays flipping a

Re: hash uniqueKey generation?

2010-11-16 Thread Dennis Gearon
To Life, otherwise we all die. - Original Message From: Yonik Seeley yo...@lucidimagination.com To: solr-user@lucene.apache.org Sent: Tue, November 16, 2010 1:46:43 PM Subject: Re: hash uniqueKey generation? On Tue, Nov 16, 2010 at 5:31 AM, Dennis Gearon gear...@sbcglobal.net wrote: hashing

Re: hash uniqueKey generation?

2010-11-16 Thread Yonik Seeley
On Tue, Nov 16, 2010 at 9:05 PM, Dennis Gearon gear...@sbcglobal.net wrote: Read up on Wikipedia, but I believe that no Hash Function is much good above 50% of the address space it generates. 50% is way too high - collisions will happen before that. But given that something like MD5 has 128
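The point that collisions show up long before 50% of the address space is used can be illustrated with the standard birthday approximation. This is a minimal sketch, not something posted in the thread: the function name `collision_probability` is hypothetical, and it assumes an ideal hash whose outputs are uniformly distributed over 2^bits values.

```python
import math

def collision_probability(n_docs: int, bits: int = 128) -> float:
    """Approximate chance that at least two of n_docs hashes collide.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2**bits)),
    which assumes an ideal, uniformly distributed hash (an assumption,
    not a property proven for MD5 or any specific function).
    """
    exponent = n_docs * (n_docs - 1) / (2 * 2**bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)

if __name__ == "__main__":
    # One billion documents in a 128-bit space: vanishingly small odds.
    print(collision_probability(10**9, bits=128))
    # Around 2**64 documents (the square root of the space) the odds
    # are already substantial, far below 50% occupancy.
    print(collision_probability(2**64, bits=128))
```

Under these assumptions the probability becomes significant near the square root of the hash space, roughly 2^64 values for a 128-bit hash, which is still an enormous number of documents.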

Re: hash uniqueKey generation?

2010-11-16 Thread Lance Norskog
Nobody has ever reported seeing a collision 'in the wild' with MD5. It is broken, but that takes an algorithm. As to cosmic rays: it's a real problem. A recent Google paper reported that some RAM chips will have 1 bit error per gigabit per century, while others have that much per hour. I've

hash uniqueKey generation?

2010-11-15 Thread Dan Lynn
Hi, I just finished reading on the wiki about deduplication and the solr.UUIDField type. What I'd like to do is generate an ID for a document by hashing a subset of its fields. One route I thought would be to do this ahead of time to CSV data, but I would think sticking something into the
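For the "ahead of time" route mentioned in the question, one way to derive an ID from a subset of fields before indexing might look like the sketch below. It is only an illustration under stated assumptions: the function name `make_doc_id`, the field names, and the choice of MD5 with a unit-separator delimiter are all hypothetical and are not taken from the thread or from Solr itself.

```python
import hashlib

def make_doc_id(doc: dict, fields: list, sep: str = "\x1f") -> str:
    """Build a deterministic ID by hashing selected fields of a document.

    Each field is written as name=value and joined with a separator so
    that ("ab", "c") and ("a", "bc") do not produce the same input.
    Missing fields are treated as empty strings (an assumption).
    """
    parts = [f"{name}={doc.get(name, '')}" for name in fields]
    payload = sep.join(parts).encode("utf-8")
    # MD5 is used here only as a 128-bit fingerprint, not for security.
    return hashlib.md5(payload).hexdigest()

if __name__ == "__main__":
    # Hypothetical row, e.g. one parsed from a CSV file.
    row = {"title": "Solr in Action", "author": "Someone", "price": "10"}
    print(make_doc_id(row, ["title", "author"]))
```

Fields left out of the list (here `price`) do not affect the ID, so two rows that differ only in those fields map to the same key.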

Re: hash uniqueKey generation?

2010-11-15 Thread Lance Norskog
I think the deduplication signature field will work as a multiValued field. So you can do copyField to it from all of the source fields. Dan Lynn wrote: Hi, I just finished reading on the wiki about deduplication and the solr.UUIDField type. What I'd like to do is generate an ID for a

Re: hash uniqueKey generation?

2010-11-15 Thread Chris Hostetter
: I just finished reading on the wiki about deduplication and the solr.UUIDField type. What I'd like to do is generate an ID for a document by hashing a subset of its fields. One route I thought would be to do this ahead of time to CSV data, but I would think sticking something into the