[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638009#action_12638009 ]

Yonik Seeley commented on SOLR-799:
-----------------------------------

Some thoughts...

- How should different "types" be handled (for example, when we support binary 
fields)?  Different base64 encoders might use different line lengths or 
different line endings (CR/LF).  Perhaps it's good enough to say that the 
string form must be identical, and leave it at that for now?  The alternative 
would be to base signatures on the Lucene Document about to be indexed.
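To make the concern concrete, here's a quick illustration using only the standard JDK encoders (nothing Solr-specific): two common base64 variants produce different string forms for the same bytes, so a string-based signature would see identical binary content as two different documents.

{code}
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Same 100 bytes, two standard encoders: the "basic" encoder emits one
// unbroken line, while the MIME encoder wraps lines at 76 chars with CRLF.
public class Base64Forms {
    public static void main(String[] args) {
        byte[] data = new byte[100];
        String basic = Base64.getEncoder().encodeToString(data);
        String mime = new String(Base64.getMimeEncoder().encode(data),
                                 StandardCharsets.UTF_8);
        System.out.println(basic.equals(mime));  // false: CR/LF differences only
    }
}
{code}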

- It would be nice to be able to calculate a signature for a document without 
having to concatenate all the fields together.
Perhaps change calculate(String content) to something like 
calculate(Iterable<CharSequence> content)?
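A sketch of what the Iterable variant might look like (the class name is hypothetical and MD5 is chosen arbitrarily for illustration; this is not the actual Solr Signature API):

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: calculate a signature from the field values
// directly, with no concatenation into one big String.
class IterableMd5Signature {
    public String calculate(Iterable<? extends CharSequence> content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (CharSequence field : content) {
                md5.update(field.toString().getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);  // MD5 is required in all JDKs
        }
    }
}
{code}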

An alternative option would be incremental hashing...
{code}
Signature sig = ourSignatureCreator.create();
sig.add(f1);
sig.add(f2);
sig.add(f3);
String s = sig.getSignature();
{code}

Looking at how TextProfileSignature works, I'd lean toward incremental hashing 
to avoid building yet another big string.  Having a hashing object also opens 
up the possibility of easily adding other method signatures for more efficient 
hashing.
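A minimal sketch of that hashing object, assuming MD5 and mirroring the method names in the pseudocode above (none of this is an existing Solr API):

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Incremental signature builder: each add() feeds one field value into
// the running digest, so no concatenated string is ever built.
class IncrementalSignature {
    private final MessageDigest digest;

    private IncrementalSignature(MessageDigest d) {
        this.digest = d;
    }

    static IncrementalSignature create() {
        try {
            return new IncrementalSignature(MessageDigest.getInstance("MD5"));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);  // MD5 is required in all JDKs
        }
    }

    void add(CharSequence field) {
        digest.update(field.toString().getBytes(StandardCharsets.UTF_8));
    }

    String getSignature() {
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
{code}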

- It appears that if you put fields in a different order, the signature will 
change.

- It appears that documents with different field names but the same content 
will have the same signature.
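One way both issues could be addressed (a sketch only, not the patch's code): include the field name in the hashed bytes, and sort the name/value pairs into a canonical order before hashing.

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hashing "name=value" pairs makes field names matter; sorting the pairs
// first makes the input field order irrelevant.
class OrderInsensitiveSignature {
    public static String calculate(Map<String, String> fields) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            pairs.add(e.getKey() + "=" + e.getValue());
        }
        Collections.sort(pairs);  // canonical order, independent of input order
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String p : pairs) {
                md5.update(p.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) 0);  // separator keeps pair boundaries unambiguous
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
{code}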

- I don't understand the dedup logic in DUH2... it seems like we want to 
delete by id and by sig.  Unfortunately there is no 
IndexWriter.updateDocument(Term[] terms, Document doc), so we'll have to do a 
separate non-atomic delete on the sig for now, right?

- There's probably no need for a separate test solrconfig-deduplicate.xml if 
all it adds is an update processor.  Tests could just explicitly specify the 
update handler on updates.


> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Lets put it into solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
