Re: How to query for similar documents before indexing

Ken Krugler Mon, 10 May 2010 16:40:04 -0700

Hi all (especially Yonik),

At the http://wiki.apache.org/solr/Deduplication page, it mentions "duplicate field collapsing" and later "Allow for both duplicate collapsing in search results..."

But I don't see any mention of how deduplication happens during search time. Normally this requires that the field be stored (not just indexed), and for efficiency it might need to be in a FieldCache. I'm wondering about both the status of this support and thoughts on the potential impact on index/memory size.


Thanks,

-- Ken


On May 10, 2010, at 3:07pm, Markus Jelsma wrote:

Hi Matthieu,
At the top of the wiki page you can see it's in 1.4 already. As far as I know, the API doesn't return information on found duplicates in its response header; the wiki isn't clear on that subject. I, at least, never saw any response other than an error or the usual status code and QTime.
Perhaps it would be a nice feature. On the other hand, as long as such a feature isn't there, you can also have a manual process that finds duplicates based on that signature and gather that information yourself.
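
A manual pass of that kind could look roughly like the sketch below. It assumes the documents have already been fetched as dictionaries, and the field names `signature` and `id` are placeholders for whatever your schema uses; nothing here calls Solr itself.

```python
from collections import defaultdict


def find_duplicates(docs, signature_field="signature", id_field="id"):
    """Map each signature shared by two or more documents to their ids."""
    groups = defaultdict(list)
    for doc in docs:
        sig = doc.get(signature_field)
        # Documents without a signature cannot be compared, so skip them.
        if sig is not None:
            groups[sig].append(doc[id_field])
    # Keep only the signatures that occur more than once.
    return {sig: ids for sig, ids in groups.items() if len(ids) > 1}
```

Calling `find_duplicates(docs)` returns a dictionary keyed by signature, so each value is one group of document ids that share that signature.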
Cheers,

-----Original message-----
From: Matthieu Labour <matthieu_lab...@yahoo.com>
Sent: Mon 10-05-2010 23:30
To: solr-user@lucene.apache.org;
Subject: RE: How to query for similar documents before indexing

Markus
Thank you for your response
That would be great if the index has the option to prevent duplicates from entering the index. But is it going to be a silent action? Or will the add method return that it failed indexing because it detected a duplicate?
Is it committed to 1.4 already?
Cheers
matt


--- On Mon, 5/10/10, Markus Jelsma <markus.jel...@buyways.nl> wrote:

From: Markus Jelsma <markus.jel...@buyways.nl>
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 4:11 PM

Hi,
Deduplication [1] is what you're looking for. It can utilize different analyzers that will add one or more signatures or hashes to your document depending on exact or partial matches for configurable fields. Based on that, it should be able to prevent new documents from entering the index.
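
To make the signature idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not Solr's implementation: the MD5 hash, the light normalization, the field names, and the in-memory `DedupIndex` class are all assumptions made for the example.

```python
import hashlib


def signature(doc, fields):
    """Return a hex digest computed over the chosen fields of a document."""
    h = hashlib.md5()
    for name in fields:
        # Lowercase and collapse whitespace so trivial differences are ignored.
        value = " ".join(str(doc.get(name, "")).lower().split())
        h.update(name.encode("utf-8"))
        h.update(b"\x00")
        h.update(value.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


class DedupIndex:
    """Toy in-memory stand-in for an index that rejects duplicate signatures."""

    def __init__(self, fields):
        self.fields = fields
        self.docs = {}  # signature -> document

    def add(self, doc):
        """Store doc and return True, or return False if its signature exists."""
        sig = signature(doc, self.fields)
        if sig in self.docs:
            return False
        self.docs[sig] = doc
        return True
```

With `index = DedupIndex(["title", "content"])`, a second `index.add(...)` of a document whose title and content differ only in case or spacing returns `False` and leaves the stored document untouched. Only exact matches after normalization are caught this way; partial matches would need a fuzzier signature.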
The first part works very well, but I have some issues with removing those documents, which I also need to check with the community tomorrow back at work ;-)
[1]: http://wiki.apache.org/solr/Deduplication

Cheers,



-----Original message-----
From: Matthieu Labour <matthieu_lab...@yahoo.com>
Sent: Mon 10-05-2010 22:41
To: solr-user@lucene.apache.org;
Subject: How to query for similar documents before indexing

Hi

I want to implement the following logic:
Before I index a new document into the index, I want to check if there are already documents in the index with similar content to the content of the document about to be inserted. If the request returns 1 or more documents, then I don't want to insert the document.
What is the best way to achieve the above functionality?
I read about Fuzzy searches in logic. But can I really build a request such as mydoc.title:wordexample~ AND mydoc.content:(all the content words)~0.9 ?
Thank you for your help
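
As far as I know, the `~` fuzzy operator in the Lucene query syntax attaches to a single term rather than to a parenthesized group, so a request like the one above would have to put the operator on every word. The sketch below builds such a query string; the helper names `escape` and `fuzzy_clause` are hypothetical, and the code only assembles text without sending anything to Solr.

```python
import re

# Characters and operators with special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape(term):
    """Backslash-escape query syntax characters inside a single term."""
    return _SPECIAL.sub(r"\\\1", term)


def fuzzy_clause(field, text, similarity=None, operator="AND"):
    """Build field:(t1~ OP t2~ ...) with the fuzzy operator on each term."""
    suffix = "~" if similarity is None else "~%s" % similarity
    terms = [escape(t) + suffix for t in text.split()]
    return "%s:(%s)" % (field, (" %s " % operator).join(terms))


def similar_doc_query(title, content, similarity=0.9):
    """Combine a fuzzy title clause and a fuzzy content clause with AND."""
    return "%s AND %s" % (
        fuzzy_clause("mydoc.title", title),
        fuzzy_clause("mydoc.content", content, similarity),
    )
```

For example, `similar_doc_query("wordexample", "all the content words")` yields `mydoc.title:(wordexample~) AND mydoc.content:(all~0.9 AND the~0.9 AND content~0.9 AND words~0.9)`. A query like this grows with the length of the document, which is one reason a signature-based check may be the more practical route.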


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
