[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638048#action_12638048 ]

Mark Miller commented on SOLR-799:
----------------------------------

bq.    I agree that it is wise to separate the detection of duplication from 
the handling of found duplicates

bq. Though in some implementations (like #2, which may be the default), 
detecting that duplicate and handling it are truly coupled... forcing a 
decoupling would not be a good thing in that case.

Still looking at this. I was hoping to avoid any of the old 'if solr crashes 
you can have 2 docs with the same id in the index' type stuff. Guess I won't 
easily get away with that <g>. Hopefully we can make it so the default 
implementation can still be just as efficient and atomic.

bq. How should different "types" be handled (for example when we support binary 
fields). For example, different base64 encoders might use different line 
lengths or different line endings (CR/LF). Perhaps it's good enough to say that 
the string form must be identical, and leave it at that for now? The 
alternative would be signatures based on the Lucene Document about to be 
indexed.

Yeah, it may be best to worry about it when we support binary fields... though 
it would be nice to look forward. I think returning a byte[] rather than a 
String will future-proof the sig implementations a bit along those lines 
(though it doesn't address your point)... still mulling - this shouldn't trip 
up fuzzy hashing implementations too much, and it raises the question of how 
exact MD5Signature should be...
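
Roughly the shape I'm picturing - names illustrative, not necessarily what 
the patch ends up with:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical shape: a byte[] return keeps the signature contract
// encoding-agnostic, so binary-field content never has to round-trip
// through a String form (line endings, base64 variants, etc.).
abstract class Signature {
  abstract byte[] calculate(String content);
}

class MD5Signature extends Signature {
  @Override
  byte[] calculate(String content) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      return md5.digest(content.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is guaranteed by the JDK
    }
  }
}
{code}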

bq.     * It appears that if you put fields in a different order, the 
signature will change.
bq.     * It appears that documents with different field names but the same 
content will have the same signature.

Two good points I have addressed.
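
For illustration, one way to get both properties - order independence plus 
sensitivity to field names - is to hash sorted (name, value) pairs rather 
than a raw concatenation. A rough sketch, not the patch code itself:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

class FieldAwareSignature {
  // Sorting by field name makes the result independent of the order the
  // fields arrive in; hashing the name alongside the value keeps documents
  // with identical values but different field names from colliding.
  static byte[] calculate(Map<String, String> fields) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
      md5.update(e.getKey().getBytes(StandardCharsets.UTF_8));   // field name
      md5.update((byte) 0);                                      // separator
      md5.update(e.getValue().getBytes(StandardCharsets.UTF_8)); // field value
      md5.update((byte) 0);
    }
    return md5.digest();
  }
}
{code}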

bq. It would be nice to be able to calculate a signature for a document w/o 
having to catenate all the fields together. Perhaps change calculate(String 
content) to something like calculate(Iterable<CharSequence> content)?

I like the idea of incremental calculation as well.
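
Something along these lines, say - just a sketch of the shape, not a 
committed API:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of an incremental signature: each field is streamed into the
// digest as it is seen, so no concatenated copy of the document is built.
class IncrementalMD5Signature {
  private final MessageDigest md5;

  IncrementalMD5Signature() throws Exception {
    md5 = MessageDigest.getInstance("MD5");
  }

  void add(CharSequence content) {
    md5.update(content.toString().getBytes(StandardCharsets.UTF_8));
  }

  byte[] getSignature() {
    return md5.digest(); // digest() also resets the instance for reuse
  }
}
{code}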

bq. I don't understand the dedup logic in DUH2... it seems like we want to 
delete by id and by sig... unfortunately there is no
IndexWriter.updateDocument(Term[] terms, Document doc) so we'll have to do a 
separate non-atomic delete on the sig for now, right?

Another one I was hoping to get away with. My current strategy was to say that 
setting an update term means that updating by id is overridden and *only* the 
update Term is used - effectively, the update Term (signature) becomes the 
update id - and you can control whether the id factors into that update 
signature or not. Didn't get past the goalie I suppose <g> I guess I'll give 
up on a clean atomic impl and perhaps investigate update(Term[], doc) for the 
future. I wanted to deal with both signature and id, but figured it's best to 
start with the most efficient bare bones and work out from there.
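
For reference, the non-atomic sequence would look roughly like this (field 
names made up for illustration):

{code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class DedupUpdate {
  // With no IndexWriter.updateDocument(Term[], Document), the
  // delete-by-signature has to run as its own step; only the final
  // updateDocument call (delete-by-id plus add) is atomic.
  static void update(IndexWriter writer, Document doc, String id, String sig)
      throws Exception {
    writer.deleteDocuments(new Term("sig", sig));   // separate, non-atomic step
    writer.updateDocument(new Term("id", id), doc); // atomic replace by id
  }
}
{code}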

bq. There's probably no need for a separate test solrconfig-deduplicate.xml if 
all it adds is an update processor. Tests could just explicitly specify the 
update handler on updates.

It's mainly for me at the moment (testing config settings loading and 
whatnot); I'll be sure to pull it once the patch is done.

Thanks for all of the feedback.


> Add support for hash based exact/near duplicate document handling
> -----------------------------------------------------------------
>
>                 Key: SOLR-799
>                 URL: https://issues.apache.org/jira/browse/SOLR-799
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: SOLR-799.patch
>
>
> Hash based duplicate document detection is efficient and allows for blocking 
> as well as field collapsing. Let's put it into Solr. 
> http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
