[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642245#action_12642245 ]
[EMAIL PROTECTED] edited comment on SOLR-799 at 10/23/08 12:38 PM:
-------------------------------------------------------------

I find the pluggable replace/prevent/append policy idea appealing, but I have not yet found a great way to plug it into the UpdateHandler. Any approach other than sub-classing DirectUpdateHandler2 appears to lead to tying an IndexWriter to UpdateHandler. There is a connection now (UpdateHandler has a method to create a main IndexWriter), but further tying seems wrong without a stronger reason. That point is arguable, but in the end, sub-classing results in simpler code in any case. The trade-off is that now you have a PreventDupesDirectUpdateHandler that extends DirectUpdateHandler2. This would have to be used in combination with the SignatureUpdateProcessor if you want to prevent dupes from entering the index. Other use cases (other than overwriting) would require yet another UpdateHandler. Less than ideal in both cases (subclassing, pluggable interface/class). I've appended rough sketches of the options discussed here at the end of this comment.

Both approaches lead to less than ideal solutions beyond that as well. Because many docs that have been added to Solr might not yet be visible to an IndexReader, you have to keep a pending-commit set of docs to check against. This set has to be resilient against the sequence AddDoc, DeleteByQuery, AddDoc, Commit. Because of delete by query, you'd essentially have to keep a mini index around to search against. The other options are to either force an auto-commit before every delete (even when the user hasn't committed), or just say we don't support that use case with this UpdateHandler. None of it is very pretty.

Another option is to do things with an UpdateProcessor. This is really the most elegant solution, but it requires putting big, coarse syncs around the more precise syncs in DirectUpdateHandler2. That may not be a huge deal, I am not sure. The previous two options let you keep syncs similar to what is already there. Beyond that, the UpdateProcessor approach still has the delete-by-query issues.

Maybe we just do overwrite-dupe for now? It has none of these issues. I am open to whatever path you guys want. The other use cases do have their place - we will just have to compromise some to get there. Or maybe there are other suggestions?

Another point that was brought up is whether or not to delete any docs that match the update doc's uniqueField id term, but not its similarity/update term. At the moment, IMO, we shouldn't. You are choosing to use the updateTerm to do updates rather than the unique term. This allows you to have duplicate signatures while still keeping uniqueField ids for other operations (say, delete). Also, if you already have a unique field that you're using, it may be desirable to do dupe detection with a different field. There is always the option of setting the signature field to the uniqueField term. In the end, it's your call - I'll add it if we want it.

As far as search-time dupe collapsing, I could see a search component that takes the page numbers to collapse (start, end) and does dupe elimination on that range at query time. It wouldn't be very fast, and I'm not sure how useful page-at-a-time collapsing is, but it would be fairly easy to do. Not sure that it fits into this issue, but it could certainly share some of its classes.
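Roughly, the sub-classing option would look like the sketch below. The class, the "signature" field name, and the skip-on-match behavior are all made up for illustration - it's the shape of the idea, not a patch - and it deliberately shows the hole: getFirstMatch only sees committed docs.

{code:java}
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.DirectUpdateHandler2;
import org.apache.solr.util.RefCounted;

// Hypothetical sketch: skip an add when a doc with the same signature is
// already visible to a searcher. Assumes the SignatureUpdateProcessor has
// already stored the hash in a "signature" field (field name is an assumption).
public class PreventDupesDirectUpdateHandler extends DirectUpdateHandler2 {

  public PreventDupesDirectUpdateHandler(SolrCore core) throws IOException {
    super(core);
  }

  @Override
  public int addDoc(AddUpdateCommand cmd) throws IOException {
    String sig = cmd.doc.get("signature");
    if (sig != null) {
      RefCounted<SolrIndexSearcher> holder = core.getSearcher();
      try {
        // Only sees committed docs - this is exactly the pending-commit
        // problem described above.
        if (holder.get().getFirstMatch(new Term("signature", sig)) != -1) {
          return 0; // a dupe is already in the index, skip this add
        }
      } finally {
        holder.decref();
      }
    }
    return super.addDoc(cmd);
  }
}
{code}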
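To spell out why that pending set has to survive AddDoc, DeleteByQuery, AddDoc, Commit - and why delete by query pushes you toward a mini index - here is a toy illustration (all names hypothetical):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Why a plain pending-signature set breaks on AddDoc, DeleteByQuery, AddDoc,
// Commit. This is just the reasoning from the comment, not a proposed fix.
public class PendingSetProblem {
  public static void main(String[] args) {
    Set<String> pendingSigs = new HashSet<String>();
    String sigOfDocA = "cafebabe";

    pendingSigs.add(sigOfDocA); // AddDoc A: remember its signature

    // DeleteByQuery "field:x": this may have deleted A, but an arbitrary
    // query can't be evaluated against a HashSet, so we can't tell whether
    // sigOfDocA should be dropped. Only a searchable "mini index" of the
    // pending docs could answer that correctly.

    boolean dupe = pendingSigs.contains(sigOfDocA); // AddDoc A again
    System.out.println("rejected as dupe: " + dupe); // true - possibly wrong!

    // Commit: only after a commit does a reopened IndexReader reflect the
    // delete, at which point the pending set could safely be cleared.
  }
}
{code}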
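The UpdateProcessor route would look something like this. The processor, the lock, and the "signature" field are hypothetical; the point is that correctness needs one coarse lock around the whole check-then-add, sitting on top of DirectUpdateHandler2's own finer-grained syncs:

{code:java}
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.util.RefCounted;

// Hypothetical sketch of dupe prevention as an update processor. The coarse
// static lock is the "big sync around the more precise syncs" trade-off.
public class DedupeUpdateProcessor extends UpdateRequestProcessor {

  private static final Object DEDUPE_LOCK = new Object();

  private final SolrQueryRequest req;

  public DedupeUpdateProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
    super(next);
    this.req = req;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument d = cmd.getSolrInputDocument();
    Object sigVal = (d == null) ? null : d.getFieldValue("signature");
    String sig = (sigVal == null) ? null : sigVal.toString();

    synchronized (DEDUPE_LOCK) {
      if (sig != null) {
        RefCounted<SolrIndexSearcher> holder = req.getCore().getSearcher();
        try {
          // Still blind to uncommitted adds, and still broken by
          // delete by query between commits.
          if (holder.get().getFirstMatch(new Term("signature", sig)) != -1) {
            return; // drop the duplicate
          }
        } finally {
          holder.decref();
        }
      }
      super.processAdd(cmd); // falls through to the rest of the chain
    }
  }
}
{code}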
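For contrast, here is why overwrite-dupe has none of these issues: Lucene's IndexWriter.updateDocument(Term, Document) is an atomic delete-then-add, and the delete also applies to docs still buffered in the writer, so there is no pending-commit bookkeeping at all. A minimal sketch (the "signature" field name is an assumption):

{code:java}
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Overwrite any existing doc(s) carrying the same signature, committed or not.
// No searcher is consulted, so the uncommitted-dupe problem never comes up.
class OverwriteDupeSketch {
  static void addOrOverwrite(IndexWriter writer, Document doc, String signature)
      throws IOException {
    writer.updateDocument(new Term("signature", signature), doc);
  }
}
{code}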
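And for the search-time idea, a bare-bones sketch of page-at-a-time collapsing as a SearchComponent. Everything here (the component name, the "signature" field, and how the surviving docs get stitched back into the response) is hypothetical glue; the stored-field read per hit is also part of why it wouldn't be very fast:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

// Hypothetical sketch: collapse dupes within the current (start, end) page
// by keeping only the first doc seen for each signature value.
public class PageCollapseComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException { }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    if (rb.getResults() == null) return; // must run after QueryComponent
    SolrIndexSearcher searcher = rb.req.getSearcher();
    DocList page = rb.getResults().docList;

    Set<String> seen = new HashSet<String>();
    List<Integer> keep = new ArrayList<Integer>();
    for (DocIterator it = page.iterator(); it.hasNext(); ) {
      int docid = it.nextDoc();
      String sig = searcher.doc(docid).get("signature"); // stored-field read: slow
      if (sig == null || seen.add(sig)) {
        keep.add(docid); // first occurrence on the page wins
      }
    }
    // ... rebuilding the DocList / paging info from 'keep' is omitted ...
  }

  @Override
  public String getDescription() { return "page-level dupe collapsing (sketch)"; }
  @Override
  public String getSourceId() { return "$Id$"; }
  @Override
  public String getSource() { return "$URL$"; }
  @Override
  public String getVersion() { return "$Revision$"; }
}
{code}

It would have to be registered after the QueryComponent so the page of results exists when it runs; fixing up numFound and the paging after collapsing is the part I've waved away, and part of why I'm unsure how useful page-at-a-time collapsing really is.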
> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking as well as field collapsing. Let's put it into Solr.
> http://wiki.apache.org/solr/Deduplication

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.