RE: How to query for similar documents before indexing

Matthieu Labour Mon, 10 May 2010 14:29:57 -0700

Markus
Thank you for your response
It would be great if the index had an option to prevent duplicates from 
entering it. But will that be a silent action? Or will the add 
method report that indexing failed because it detected a duplicate?
Has it already been committed to 1.4?
Cheers
matt

--- On Mon, 5/10/10, Markus Jelsma <markus.jel...@buyways.nl> wrote:

From: Markus Jelsma <markus.jel...@buyways.nl>
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 4:11 PM

Hi,

Deduplication [1] is what you're looking for. It can utilize different analyzers 
that will add one or more signatures or hashes to your document, depending on 
exact or partial matches on configurable fields. Based on that, it should be 
able to prevent new documents from entering the index. 

The first part works very well, but I have some issues with removing those 
documents, which I also need to check with the community tomorrow when I'm 
back at work ;-)

[1]: http://wiki.apache.org/solr/Deduplication
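
To make the signature idea above concrete, here is a minimal sketch in Python. 
It is not Solr's implementation: the names signature, DedupIndex and add are 
hypothetical, the choice of SHA-256 and the whitespace/case normalisation are 
assumptions made for illustration, and it only covers exact matches on the 
configured fields. It also shows one possible answer to the "silent or not" 
question: here add() reports whether the document was accepted.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields of a document into a hex digest.

    Illustrative only: the normalisation below (lower-casing, collapsing
    whitespace) is an assumption, not what Solr does.
    """
    h = hashlib.sha256()
    for name in fields:
        value = " ".join(str(doc.get(name, "")).lower().split())
        # Separate field name and value so adjacent fields cannot run together.
        h.update(name.encode("utf-8"))
        h.update(b"\x00")
        h.update(value.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


class DedupIndex:
    """Toy in-memory index that rejects documents with a known signature."""

    def __init__(self, fields):
        self.fields = fields
        self.docs = {}  # signature -> document

    def add(self, doc):
        sig = signature(doc, self.fields)
        if sig in self.docs:
            return False  # duplicate: tell the caller instead of dropping silently
        self.docs[sig] = doc
        return True
```

With fields set to ["title", "content"], adding the same title and content 
twice returns True the first time and False the second, while a document that 
differs in either field is accepted.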

Cheers,

-----Original message-----
From: Matthieu Labour <matthieu_lab...@yahoo.com>
Sent: Mon 10-05-2010 22:41
To: solr-user@lucene.apache.org; 
Subject: How to query for similar documents before indexing

Hi

I want to implement the following logic:

Before I index a new document into the index, I want to check if there are 
already documents in the index with similar content to the content of the 
document about to be inserted. If the request returns 1 or more documents, then 
I don't want to insert the document.

What is the best way to achieve the above functionality ?
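
For illustration, the check-before-insert logic described above can be sketched 
in plain Python, outside of Solr. Everything here is an assumption made for the 
example: the helper names (shingles, jaccard, add_if_new), the use of word 
trigrams with Jaccard similarity as the notion of "similar content", and the 
0.9 threshold. The linear scan over all stored documents would not scale to a 
real index; it only shows the shape of the decision.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams in text."""
    words = text.lower().split()
    if len(words) < size:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def add_if_new(index, doc, threshold=0.9):
    """Append doc to index unless a stored document is at least
    `threshold` similar in content. Returns True if doc was added."""
    new = shingles(doc["content"])
    for existing in index:
        if jaccard(new, shingles(existing["content"])) >= threshold:
            return False  # one or more similar documents found: skip insert
    index.append(doc)
    return True
```

Here index is just a Python list of dicts with a "content" key; lowering the 
threshold makes the check reject looser matches.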

I read about Fuzzy searches in logic. But can I really build a request such as 
mydoc.title:wordexample~ AND mydoc.content:( all the content words)~0.9 ?

Thank you for your help
