RE: How to query for similar documents before indexing

2010-05-11 Thread Matthieu Labour
Hi Markus

Thank you for your answer

Here is a use case where I think it would be nice to know there is a dup before 
I insert it. 

Let's say I create a summary out of the document and I only index the summary 
and store the document itself on a separate device (S3, Cassandra etc ...). 
Then I would need addDocument on the summary to fail when it detects a 
duplicate, so that I don't need to store the document.
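
A minimal sketch of this use case in Python, assuming the client computes its own signature of the summary and checks it before writing anything. The `SummaryIndex` class, its in-memory set, and the `storage` dict are hypothetical stand-ins for the real index and the external store (S3, Cassandra, ...); nothing here is a Solr API.

```python
import hashlib


class SummaryIndex:
    """Hypothetical in-memory stand-in for the index of summaries."""

    def __init__(self):
        self._signatures = set()

    def add_summary(self, summary: str) -> bool:
        """Index the summary; return False if an identical one is already present."""
        # Normalise whitespace and case so trivial differences do not hide a duplicate.
        normalised = " ".join(summary.lower().split())
        signature = hashlib.md5(normalised.encode("utf-8")).hexdigest()
        if signature in self._signatures:
            return False
        self._signatures.add(signature)
        return True


def ingest(index, storage, doc_id, document, summary):
    """Store the full document only when its summary was not a duplicate."""
    if not index.add_summary(summary):
        return False
    storage[doc_id] = document  # hypothetical external store, here a plain dict
    return True
```

The point of the sketch is the ordering: the duplicate check has to report its result to the caller before the expensive write to the external store happens.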
  
When you write:
"On the other hand, you can also have a manual process that finds
duplicates based on that signature and gather that information yourself
as long as such a feature isn't there."

Can you explain in more detail what you have in mind?

Thank you for your help!

matt

--- On Mon, 5/10/10, Markus Jelsma markus.jel...@buyways.nl wrote:

From: Markus Jelsma markus.jel...@buyways.nl
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 5:07 PM

Hi Matthieu,

On the top of the wiki page you can see it's in 1.4 already. As far as I know, 
the API doesn't return information on found duplicates in its response header; 
the wiki isn't clear on that subject. I, at least, never saw any other response 
than an error or the usual status code and QTime.

Perhaps it would be a nice feature. On the other hand, you can also have a 
manual process that finds duplicates based on that signature and gather that 
information yourself as long as such a feature isn't there.

Cheers,

Re: How to query for similar documents before indexing

2010-05-11 Thread Markus Jelsma
If you set overwriteDupes = false, exact or near-duplicate documents will 
not be deleted. The signature field is still set, however, so you can later 
query for duplicates yourself from an external program and do whatever you 
want with them.
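
One way such an external program could work, sketched in Python. It assumes the documents have already been fetched from the index as dictionaries carrying an `id` and a `signature` field; both field names depend on your schema and are assumptions here.

```python
from collections import defaultdict


def find_duplicate_groups(docs, signature_field="signature", id_field="id"):
    """Return {signature: [ids]} for every signature shared by two or more documents."""
    groups = defaultdict(list)
    for doc in docs:
        signature = doc.get(signature_field)
        if signature is not None:  # skip documents indexed without a signature
            groups[signature].append(doc[id_field])
    return {sig: ids for sig, ids in groups.items() if len(ids) > 1}
```

Each returned group lists the ids that share one signature, so the caller can decide which copy to keep and which to delete.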


On Tuesday 11 May 2010 15:41:33 Matthieu Labour wrote:
 Hi Markus
 
 Thank you for your answer
 
 Here is a use case where I think it would be nice to know there is a dup
  before I insert it.
 
 Let's say I create a summary out of the document and I only index the
  summary and store the document itself on a separate device (S3, Cassandra
  etc ...). Then I would need addDocument on the summary to fail when it
  detects a duplicate, so that I don't need to store the document.
 
 When you write:
 "On the other hand, you can also have a manual process that finds
 duplicates based on that signature and gather that information yourself
 as long as such a feature isn't there."
 
 Can you explain in more detail what you have in mind?
 
 Thank you for your help!
 
 matt
 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



RE: How to query for similar documents before indexing

2010-05-10 Thread Markus Jelsma
Hi,

Deduplication [1] is what you're looking for. It can utilize different analyzers 
that will add one or more signatures or hashes to your document depending on 
exact or partial matches for configurable fields. Based on that, it should be 
able to prevent new documents from entering the index. 
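
To illustrate the difference between an exact and a partial (fuzzy) signature, here is a rough Python sketch. It is not Solr's implementation: `exact_signature` simply hashes the chosen fields, and `fuzzy_signature` is a simplified token-profile hash in which the `min_token_len` and `top_n` parameters are made up for this example.

```python
import hashlib
import re
from collections import Counter


def exact_signature(doc, fields):
    """Hash the raw values of the configured fields; any change alters the hash."""
    joined = "\x00".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def fuzzy_signature(doc, fields, min_token_len=3, top_n=10):
    """Hash only the most frequent longer tokens, so small edits often keep the hash."""
    text = " ".join(str(doc.get(f, "")) for f in fields).lower()
    tokens = [t for t in re.findall(r"[a-z0-9]+", text) if len(t) >= min_token_len]
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically, for a stable profile.
    profile = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    profile_text = " ".join(token for token, _ in profile)
    return hashlib.md5(profile_text.encode("utf-8")).hexdigest()
```

Two documents that differ only in punctuation get different exact signatures but the same fuzzy one, which is the behaviour you want when near-duplicates should be caught as well.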

 

The first part works very well, but I have some issues with removing those 
documents, which I also need to check with the community tomorrow when I am 
back at work ;-)

[1]: http://wiki.apache.org/solr/Deduplication

Cheers,

-Original message-
From: Matthieu Labour matthieu_lab...@yahoo.com
Sent: Mon 10-05-2010 22:41
To: solr-user@lucene.apache.org; 
Subject: How to query for similar documents before indexing

Hi

I want to implement the following logic:

Before I index a new document into the index, I want to check if there are 
already documents in the index with similar content to the content of the 
document about to be inserted. If the request returns 1 or more documents, then 
I don't want to insert the document.

What is the best way to achieve the above functionality?

I read about fuzzy searches in logic. But can I really build a request such as 
mydoc.title:wordexample~ AND mydoc.content:( all the content words)~0.9?

Thank you for your help

RE: How to query for similar documents before indexing

2010-05-10 Thread Matthieu Labour
Markus
Thank you for your response
It would be great if the index had an option to prevent duplicates from 
entering the index. But is it going to be a silent action? Or will the add 
method report that indexing failed because it detected a duplicate?
Is it committed to 1.4 already?
Cheers
matt


--- On Mon, 5/10/10, Markus Jelsma markus.jel...@buyways.nl wrote:

From: Markus Jelsma markus.jel...@buyways.nl
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 4:11 PM

Hi,

Deduplication [1] is what you're looking for. It can utilize different analyzers 
that will add one or more signatures or hashes to your document depending on 
exact or partial matches for configurable fields. Based on that, it should be 
able to prevent new documents from entering the index. 

The first part works very well, but I have some issues with removing those 
documents, which I also need to check with the community tomorrow when I am 
back at work ;-)

[1]: http://wiki.apache.org/solr/Deduplication

Cheers,

RE: How to query for similar documents before indexing

2010-05-10 Thread Markus Jelsma
Hi Matthieu,

On the top of the wiki page you can see it's in 1.4 already. As far as I know, 
the API doesn't return information on found duplicates in its response header; 
the wiki isn't clear on that subject. I, at least, never saw any other response 
than an error or the usual status code and QTime.

 

Perhaps it would be a nice feature. On the other hand, you can also have a 
manual process that finds duplicates based on that signature and gather that 
information yourself as long as such a feature isn't there.
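
Until the update response reports duplicates, a client can ask the index itself before adding. The Python sketch below only builds the query string and reads the hit count from a JSON response body; the base URL, the `signature` field name, and the shape of the sample response used with it are assumptions for illustration.

```python
import json
from urllib.parse import urlencode


def signature_lookup_url(base_url, signature, field="signature"):
    """Build a select URL that counts documents carrying the given signature."""
    params = {"q": f'{field}:"{signature}"', "rows": 0, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def is_duplicate(response_body):
    """Return True when the JSON select response reports at least one hit."""
    return json.loads(response_body)["response"]["numFound"] > 0
```

Note that a lookup followed by an add is not atomic: two clients adding the same document at the same moment can both see zero hits, and documents that have been added but not yet committed will not show up in the lookup.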

 

 

Cheers,


 
-Original message-
From: Matthieu Labour matthieu_lab...@yahoo.com
Sent: Mon 10-05-2010 23:30
To: solr-user@lucene.apache.org; 
Subject: RE: How to query for similar documents before indexing

Markus
Thank you for your response
It would be great if the index had an option to prevent duplicates from 
entering the index. But is it going to be a silent action? Or will the add 
method report that indexing failed because it detected a duplicate?
Is it committed to 1.4 already?
Cheers
matt

Re: How to query for similar documents before indexing

2010-05-10 Thread Ken Krugler

Hi all (especially Yonik),

The http://wiki.apache.org/solr/Deduplication page mentions "duplicate
field collapsing" and later "Allow for both duplicate collapsing in
search results...".

But I don't see any mention of how deduplication happens at search
time. Normally this requires that the field be stored (not just
indexed), and for efficiency it might need to be in a FieldCache. I'm
wondering about both the status of this support and thoughts on the
potential impact on index/memory size.


Thanks,

-- Ken


On May 10, 2010, at 3:07pm, Markus Jelsma wrote:


Hi Matthieu,

On the top of the wiki page you can see it's in 1.4 already. As far
as I know, the API doesn't return information on found duplicates in
its response header; the wiki isn't clear on that subject. I, at
least, never saw any other response than an error or the usual
status code and QTime.

Perhaps it would be a nice feature. On the other hand, you can also
have a manual process that finds duplicates based on that signature
and gather that information yourself as long as such a feature isn't
there.

Cheers,

Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: How to query for similar documents before indexing

2010-05-10 Thread Mark Miller
There is no official support for dedupe at search time. You can take a 
look at the field collapse patch in JIRA, though. We were thinking 
ahead when we added the ability to tag dupes during indexing for field 
collapsing at search time, but the search-side support is not there yet.
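
Until search-side support exists, a client could collapse duplicates itself after the query returns. A minimal Python sketch, assuming each hit is a dictionary with a stored `signature` field (the field name is an assumption) and that hits arrive in ranked order:

```python
def collapse_by_signature(hits, signature_field="signature"):
    """Keep the first (highest-ranked) hit per signature; pass unsigned hits through."""
    seen = set()
    collapsed = []
    for hit in hits:
        signature = hit.get(signature_field)
        if signature is not None:
            if signature in seen:
                continue  # a better-ranked duplicate was already kept
            seen.add(signature)
        collapsed.append(hit)
    return collapsed
```

This only collapses within the page of results that was fetched, so paging and hit counts will be off. It is a stopgap, not a substitute for field collapsing in the engine.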


On 5/10/10 7:39 PM, Ken Krugler wrote:

Hi all (especially Yonik),

The http://wiki.apache.org/solr/Deduplication page mentions "duplicate
field collapsing" and later "Allow for both duplicate collapsing in
search results...".

But I don't see any mention of how deduplication happens at search
time. Normally this requires that the field be stored (not just
indexed), and for efficiency it might need to be in a FieldCache. I'm
wondering about both the status of this support and thoughts on the
potential impact on index/memory size.

Thanks,

-- Ken


--
- Mark

http://www.lucidimagination.com