[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642245#action_12642245 ]

[EMAIL PROTECTED] edited comment on SOLR-799 at 10/23/08 12:38 PM:
-------------------------------------------------------------

I find the pluggable replace/prevent/append policy idea appealing, but I have 
not yet found a great way to plug it into the UpdateHandler. Any approach other 
than subclassing DirectUpdateHandler2 appears to lead to tying an IndexWriter 
to the UpdateHandler. There is a connection now (UpdateHandler has a method to 
create a main IndexWriter), but tying them further seems wrong without a 
stronger reason. That point is arguable, but in the end, subclassing results in 
simpler code in any case. The trade-off is that you now have a 
PreventDupesDirectUpdateHandler that extends DirectUpdateHandler2, and it would 
have to be used in combination with the SignatureUpdateProcessor if you want to 
prevent dupes from entering the index. Other use cases (other than overwriting) 
would require yet another UpdateHandler. Less than ideal either way 
(subclassing vs. a pluggable interface/class).
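To illustrate the subclassing shape (plain Java stand-ins, not the real Solr classes; PreventDupesUpdateHandlerSketch and the method names here are hypothetical), the policy would live in an overridden add path:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-ins for DirectUpdateHandler2 and a
// dupe-preventing subclass; real Solr types are omitted so it stands alone.
class UpdateHandlerSketch {
    protected final Set<String> index = new HashSet<>();

    public void addDoc(String id, String signature) {
        index.add(id);
    }

    public int docCount() {
        return index.size();
    }
}

class PreventDupesUpdateHandlerSketch extends UpdateHandlerSketch {
    // Signatures accepted so far; a real implementation would also have to
    // consult the index and the pending-commit set.
    private final Set<String> seenSignatures = new HashSet<>();

    @Override
    public void addDoc(String id, String signature) {
        // "Prevent" policy: drop the add when the signature was seen before.
        if (!seenSignatures.add(signature)) {
            return; // duplicate -- skipped
        }
        super.addDoc(id, signature);
    }
}
```

The point of the sketch is that the policy only needs the add path, which is why subclassing stays simple compared to threading a policy object through the UpdateHandler interface.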

Both approaches lead to less than ideal solutions beyond that as well. Because 
many docs that have been added to Solr might not yet be visible to an 
IndexReader, you have to keep a pending-commit set of docs to check against. 
That set should be resilient against a sequence like AddDoc, DeleteByQuery, 
AddDoc, Commit. Because of delete by query, you'd essentially have to keep a 
mini index around to search against to accomplish this. The other options are 
to either auto-commit (sans a user commit) before a delete, or just say we 
don't support that use case with that UpdateHandler. None of it is very pretty.
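To make the delete-by-query problem concrete, here is a minimal self-contained model (not Solr code; class and method names are invented). Without a searchable mini index we cannot evaluate an arbitrary query against uncommitted docs, so the only cheap safe move on a delete-by-query is to forget the whole pending set, which then lets a re-added duplicate through:

```java
import java.util.HashSet;
import java.util.Set;

// Minimal model of tracking duplicates among docs added since the last commit.
// Without a searchable mini index, a delete-by-query forces us to forget
// everything pending, because we cannot tell which pending docs it matched.
class PendingDupeTracker {
    private final Set<String> pendingSignatures = new HashSet<>();

    /** Returns true if the add should proceed (no pending duplicate). */
    public boolean addDoc(String signature) {
        return pendingSignatures.add(signature);
    }

    public void deleteByQuery(String query) {
        // We cannot run 'query' against uncommitted docs, so the conservative
        // move is to clear the set -- losing dupe protection until commit.
        pendingSignatures.clear();
    }

    public void commit() {
        // After commit the docs are visible to an IndexReader, so the
        // pending set is no longer needed.
        pendingSignatures.clear();
    }
}
```

Running the AddDoc, DeleteByQuery, AddDoc sequence against this model shows exactly where the protection gap opens up.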

Another option is to do things with an UpdateProcessor. This is really the 
most elegant solution, but it requires putting big, coarse syncs around the 
more precise syncs in DirectUpdateHandler2. That may not be a huge deal; I am 
not sure. The previous two options let you keep syncs similar to what is 
already there. Beyond that, the UpdateProcessor approach still has the 
delete-by-query issues.
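A toy model of why the processor route needs the coarse lock (again plain Java, not the real classes): the duplicate check and the handler's add must be atomic as a pair, so one big sync has to wrap both, sitting outside whatever finer-grained locking the handler already does internally:

```java
import java.util.HashSet;
import java.util.Set;

// Model of an UpdateProcessor doing dupe prevention in front of a handler.
// The check and the add must happen under one lock, or two threads adding
// the same signature can interleave between the check and the add.
class DedupProcessorSketch {
    private final Set<String> signatures = new HashSet<>();
    private final Object bigLock = new Object();

    /** Returns true if the doc was passed through to the handler. */
    public boolean processAdd(String signature, Runnable handlerAdd) {
        synchronized (bigLock) { // coarse sync around the handler's finer syncs
            if (!signatures.add(signature)) {
                return false; // duplicate dropped before reaching the handler
            }
            handlerAdd.run();
            return true;
        }
    }
}
```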

Maybe we just do overwrite-the-dupe for now? It has none of these issues. I am 
open to whatever path you guys want. The other use cases do have their place; 
we will just have to compromise some to get there. Or maybe there are other 
suggestions?

Another point that was brought up is whether or not to delete any docs that 
match the update doc's uniqueField id term but not its similarity/update term. 
At the moment, IMO, we shouldn't. You are choosing to use the updateTerm to do 
updates rather than the unique term. This allows you to have duplicate 
signatures but distinct uniqueField ids for other operations (say, delete). 
Also, if you already have a unique field that you're using, it may be 
desirable to do dupe detection with a different field. There is always the 
option of setting the signature field to the uniqueField term. In the end, 
it's your call; I'll add it if we want it.
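For reference, the "set the signature field to the uniqueField term" option would look roughly like this in solrconfig.xml. The processor class and parameter names below are a sketch based on the SignatureUpdateProcessor in the SOLR-799 patch; treat the exact names as approximate:

```xml
<updateRequestProcessorChain name="dedupe">
  <!-- Hypothetical config sketch: write the computed hash into the uniqueKey
       field itself, so dupe overwrite and id overwrite coincide. -->
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <str name="fields">name,features,cat</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```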

As far as search-time dupe collapsing goes, I could see a search component 
that takes the page range to collapse (start, end) and does dupe elimination 
on that range at query time. It wouldn't be very fast, and I'm not sure how 
useful page-at-a-time collapsing is, but it would be fairly easy to do. I'm 
not sure it fits into this issue, but it could certainly share some of its 
classes.
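The page-range collapsing could be sketched like this (plain Java model, hypothetical names): given the ranked result list, keep only the first occurrence of each signature within the requested [start, end) window. That is cheap per page, but it cannot fix up hit counts or dedupe across page boundaries, which is the limitation mentioned above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Model of query-time dupe elimination over one page of results.
// docs is the full ranked list of {docId, signature} pairs; only the
// [start, end) slice is deduped, which is why it's page-at-a-time only.
class PageCollapser {
    static List<String> collapsePage(List<String[]> docs, int start, int end) {
        Set<String> seen = new HashSet<>();
        List<String> page = new ArrayList<>();
        for (int i = start; i < Math.min(end, docs.size()); i++) {
            String docId = docs.get(i)[0];
            String signature = docs.get(i)[1];
            if (seen.add(signature)) { // first occurrence in this window wins
                page.add(docId);
            }
        }
        return page;
    }
}
```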



> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Lets put it into solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
