[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

Hoss Man (JIRA) Mon, 13 Oct 2008 22:04:37 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639304#action_12639304
 ]


Hoss Man commented on SOLR-799:
-------------------------------

If we assume for a minute that users who want to prevent or overwrite 
duplicates using a signature should always use the signature field as their 
uniqueKey, then doesn't use case #1 simplify to just running a 
SignatureUpdateProcessor followed by another processor that forces 
"allowDups=false,overwritePending=false,overwriteCommitted=false"?

Conceptually that seems right ... but at the moment DIH2 doesn't seem to care 
about allowDups at all (it only looks at overwriteCommitted and 
overwritePending to decide if it's allowing duplicates) and i'm not sure off 
the top of my head how to make it work.  But assuming we need to muck with DIH2 
internals in some way to make signatures (and aborting if the signature already 
exists) work, implementing the changes so they apply to that combination of 
existing options seems like the cleanest approach: the functional changes to 
DIH2 become generally useful to anyone who doesn't want to overwrite existing 
docs with the same id, regardless of whether they are computing a signature.
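
To make the "signature as uniqueKey, never overwrite" idea concrete, here is a 
minimal sketch in plain Java.  It does not use any Solr or Lucene classes: the 
class name, the in-memory map standing in for the index, the addIfAbsent 
method, and the choice of MD5 over a fixed list of fields are all assumptions 
made for illustration, not the behavior of the attached patch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the signature doubles as the uniqueKey, and an add is
// rejected (not overwritten) when that key is already present.
class SignatureDedupSketch {
    // Toy stand-in for the index: signature (uniqueKey) -> document fields.
    private final Map<String, Map<String, String>> index = new LinkedHashMap<>();

    // Hash the chosen fields into a hex string (MD5 is an assumption here).
    static String signature(Map<String, String> doc, String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.getOrDefault(field, "");
                md.update(value.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so "ab"+"c" differs from "a"+"bc"
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the doc was added, false if a doc with the same
    // signature already existed (the existing doc is left untouched).
    boolean addIfAbsent(Map<String, String> doc, String... signatureFields) {
        String key = signature(doc, signatureFields);
        return index.putIfAbsent(key, doc) == null;
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        SignatureDedupSketch idx = new SignatureDedupSketch();
        Map<String, String> first = Map.of("title", "hello", "body", "world");
        Map<String, String> dup = Map.of("title", "hello", "body", "world", "src", "b");
        System.out.println("first added: " + idx.addIfAbsent(first, "title", "body"));
        System.out.println("dup added:   " + idx.addIfAbsent(dup, "title", "body"));
        System.out.println("docs in index: " + idx.size());
    }
}
```

Nothing in the add path knows about signatures beyond the key it is handed, so 
the same "reject if the key exists" behavior would apply to any uniqueKey.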

The only hangup is whether we're okay with the initial assumption: that users 
who want duplicate detection by signature are willing to use the signature as 
the uniqueKey.  If not, then perhaps the cleanest way to support that would be 
to add more generalized "unique field" support ... a list of field names in the 
schema.xml and a (hopefully) simple writer.deleteDocuments(Term[]) call in 
DIH2 should do the trick, right?  ... this could also be potentially useful to 
people for other purposes besides signatures, but i haven't thought through all 
the permutations so i'm sure there would be funky corner cases.
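
A rough sketch of the generalized "unique field" idea, again in plain Java 
with no Lucene dependency.  The class and method names are hypothetical, and 
the list of docs is only a toy stand-in for the index; the removeIf call plays 
the role that a writer.deleteDocuments(Term[]) call would play, removing every 
existing doc that matches the incoming doc on any one of the unique fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "unique field" support: before adding a doc, delete
// every existing doc that shares a value with it in ANY configured field.
class UniqueFieldsSketch {
    private final List<String> uniqueFields;                        // e.g. from schema.xml
    private final List<Map<String, String>> docs = new ArrayList<>(); // toy index

    UniqueFieldsSketch(List<String> uniqueFields) {
        this.uniqueFields = uniqueFields;
    }

    // Adds the doc and returns how many existing docs were deleted first.
    int add(Map<String, String> doc) {
        int before = docs.size();
        docs.removeIf(existing -> sharesUniqueValue(existing, doc));
        int deleted = before - docs.size();
        docs.add(doc);
        return deleted;
    }

    // True when the two docs have the same non-null value in any unique field.
    private boolean sharesUniqueValue(Map<String, String> existing,
                                      Map<String, String> incoming) {
        for (String field : uniqueFields) {
            String value = incoming.get(field);
            if (value != null && value.equals(existing.get(field))) {
                return true;
            }
        }
        return false;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UniqueFieldsSketch idx = new UniqueFieldsSketch(List.of("id", "signature"));
        System.out.println(idx.add(Map.of("id", "1", "signature", "x")));
        System.out.println(idx.add(Map.of("id", "2", "signature", "y")));
        // Matches doc 1 by signature and doc 2 by id, so both are deleted.
        System.out.println(idx.add(Map.of("id", "2", "signature", "x")));
        System.out.println("docs in index: " + idx.size());
    }
}
```

The last add in main is one of the corner cases: a single incoming doc can 
match different existing docs on different fields and replace all of them.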



> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
