Re: Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Jack Krupansky
Sounds a lot like multi-tenancy, where you don't want the document
frequencies of one tenant to influence the query relevancy scores for other
tenants.

No ready solution.

Although I have thought of a simplified document scoring that uses just tf and
leaves out df/idf. It's not as good as a tf*idf or BM25 score, but it avoids the
pollution problem.
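
A minimal sketch of that tf-only idea, assuming a Lucene version where
ClassicSimilarity is available; the class name and Solr wiring below are
invented for illustration, not an existing feature:

    import org.apache.lucene.search.similarities.ClassicSimilarity;

    // Hypothetical tf-only similarity: keep the usual tf curve but
    // neutralise idf, so one tenant's document frequencies cannot
    // influence another tenant's scores.
    public class TfOnlySimilarity extends ClassicSimilarity {
        @Override
        public float idf(long docFreq, long docCount) {
            return 1.0f; // ignore document frequency entirely
        }
    }

Such a class would be wired into Solr via a <similarity
class="com.example.TfOnlySimilarity"/> element in the schema; again, just a
sketch.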

I haven't heard of anybody in the Lucene space discussing a way to
categorize documents such that df is relative to a specified document
category and then have the query specify a document category. I suppose that
indexing and query of some hypothetical similarity schema could both
specify any number of document categories. But that's speculation on my
part.

-- Jack Krupansky

On Mon, Feb 15, 2016 at 6:42 PM, Chris Morley  wrote:

> Hey Solr people:
>
>  Suppose that we did not want to break up our document set into separate
> indexes, but had certain cases where many versions of a document were not
> relevant for certain searches.
>
>  I guess this could be thought of as an "authorization" class of problem,
> however it is not that for us.  We have a few other fields that determine
> relevancy to the current query, based on what page the query is coming
> from.  It's kind of like authorization, but not really.
>
>  Anyway, I think the answer for how you would do it for authorization would
> solve it for our case too.
>
>  So I guess suppose you had 99 users and 100 documents and Document 1
> everybody could see it the same, but for the 99 documents, there was a
> slightly different document, and it was unique for each of 99 users, but
> not "very" unique.  Suppose for instance that the only thing different in
> the text of the 99 different documents was that it was watermarked with the
> user's name.  Aren't you spamming your tf/idf at that point?  Is there a way
> around this?  Is there a way to say, hey, group these 99 documents together
> and only count 1 of them for tf/idf purposes?
>
>  When doing queries, each user would only ever see 2 documents: Document 1,
> plus whichever other document they specifically owned.
>
>  If there are web pages or book chapters I can read or re-read that address
> this class of problem, those references would be great.
>
>
>  -Chris.
>
>
>
>


Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Chris Morley
Hey Solr people:
  
 Suppose that we did not want to break up our document set into separate 
indexes, but had certain cases where many versions of a document were not 
relevant for certain searches.
  
 I guess this could be thought of as an "authorization" class of problem, 
however it is not that for us.  We have a few other fields that determine 
relevancy to the current query, based on what page the query is coming 
from.  It's kind of like authorization, but not really.
  
 Anyway, I think the answer for how you would do it for authorization would 
solve it for our case too.
  
 So I guess suppose you had 99 users and 100 documents and Document 1 
everybody could see it the same, but for the 99 documents, there was a 
slightly different document, and it was unique for each of 99 users, but 
not "very" unique.  Suppose for instance that the only thing different in 
the text of the 99 different documents was that it was watermarked with the 
user's name.  Aren't you spamming your tf/idf at that point?  Is there a way 
around this?  Is there a way to say, hey, group these 99 documents together 
and only count 1 of them for tf/idf purposes?
  
 When doing queries, each user would only ever see 2 documents: Document 1,
plus whichever other document they specifically owned.
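
For the visibility part of that example, a hypothetical Solr filter query
(field names and user id invented for illustration) might look like:

    q=the user's search terms
    fq=owner_id:user42 OR shared:true

That restricts each user to Document 1 plus their own copy, but it does not
change how df is counted across the 99 watermarked copies, which is the
tf/idf question above.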
  
 If there are web pages or book chapters I can read or re-read that address 
this class of problem, those references would be great.
  
  
 -Chris.
  
  



Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
The duplication detection mechanism in Nutch is quite primitive. I
think it uses an MD5 signature generated from the content of a field.
The generation algorithm is described here:
http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.

The problem with this approach is that the MD5 hash is very sensitive: a
one-letter difference will generate a completely different hash. You
probably have to roll your own near-duplicate detection algorithm.
My advice is to have a look at the existing literature on near-duplicate
detection techniques and then implement one of them. I know Google has
some papers that describe a technique called minhash. I read the paper
and found it very interesting. I'm not sure if you can implement the
algorithm because they have patented it. That said, there is plenty of
literature on near-dup detection, so you should be able to get one for
free!
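
A rough illustration of the minhash idea (plain Java; the shingle input,
hash mixing and signature length are arbitrary choices for the sketch, not
the patented algorithm or any Nutch/Lucene API):

    import java.util.*;

    public class MinHashSketch {
        // One signature slot per hash function; near-duplicate documents
        // agree in most slots, so slot agreement estimates Jaccard similarity.
        static long[] signature(Set<String> shingles, int numHashes) {
            long[] sig = new long[numHashes];
            Arrays.fill(sig, Long.MAX_VALUE);
            for (String s : shingles) {
                long h = s.hashCode();
                for (int i = 0; i < numHashes; i++) {
                    // cheap stand-in for the i-th hash function: mix with a seed
                    long mixed = h ^ (0x9E3779B97F4A7C15L * (i + 1));
                    mixed ^= (mixed >>> 31);
                    sig[i] = Math.min(sig[i], mixed);
                }
            }
            return sig;
        }

        // Fraction of matching slots approximates the Jaccard similarity
        // of the two shingle sets.
        static double similarity(long[] a, long[] b) {
            int same = 0;
            for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
            return (double) same / a.length;
        }
    }

Two near-duplicate emails would agree in most signature slots; exact
duplicates agree in all of them.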

On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:
 Otis,

 Thanks for your response.

 I just gave a quick look to the Nutch Forum and find that there is an
 implementation to obtain de-duplicate documents/pages but none for Near
 Duplicates documents. Can you guide me a little further as to where exactly
 under Nutch I should be concentrating, regarding near duplicate documents?

 Regards,
 Rishabh

 On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED]
 wrote:


  To whomever started this thread: look at Nutch.  I believe something
  related to this already exists in Nutch for near-duplicate detection.
 
  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
  - Original Message 
  From: Mike Klaas [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Sent: Sunday, November 18, 2007 11:08:38 PM
  Subject: Re: Near Duplicate Documents
 
  On 18-Nov-07, at 8:17 AM, Eswar K wrote:
 
   Is there any idea implementing that feature in the up coming
   releases?
 
  Not currently.  Feel free to contribute something if you find a good
  solution g.
 
  -Mike
 
 
   On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:
  
   On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
   We have a scenario, where we want to find out documents which are
   similar in
   content. To elaborate a little more on what we mean here, lets
   take an
   example.
  
   The example of this email chain in which we are interacting on,
   can be
   best
   used for illustrating the concept of near dupes (We are not getting
   confused
   with threads, they are two different things.). Each email in this
   thread
   is
   treated as a document by the system. A reply to the original mail
   also
   includes the original mail in which case it becomes a near
   duplicate of
   the
   orginal mail (depending on the percentage of similarity).
   Similarly it
   goes
   on. The near dupes need not be limited to emails.
  
   I think this is what's known as shingling.  See
   http://en.wikipedia.org/wiki/W-shingling
   Lucene (and therefore Solr) does not implement shingling.  The
   MoreLikeThis query might be close enough, however.
  
   -Stuart
  
 
 
 
 
 




-- 
Regards,

Cuong Hoang


Re: Near Duplicate Documents

2007-11-21 Thread Rishabh Joshi
Thanks for the info Cuong!

Regards,
Rishabh

On Nov 21, 2007 1:59 PM, climbingrose [EMAIL PROTECTED] wrote:

 The duplication detection mechanism in Nutch is quite primitive. I
 think it uses a MD5 signature generated from the content of a field.
 The generation algorithm is described here:

 http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html
 .

 The problem with this approach is MD5 hash is very sensitive: one
 letter difference will generate completely different hash. You
 probably have to roll your own near duplication detection algorithm.
 My advice is have a look at existing literature on near duplication
 detection techniques and then implement one of them. I know Google has
 some papers that describe a technique called minhash. I read the paper
 and found it's very interesting. I'm not sure if you can implement the
 algorithm because they have patented it. That said, there are plenty
 literature on near dup detection so you should be able to get one for
 free!

 On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:
  Otis,
 
  Thanks for your response.
 
  I just gave a quick look to the Nutch Forum and find that there is an
  implementation to obtain de-duplicate documents/pages but none for Near
  Duplicates documents. Can you guide me a little further as to where
 exactly
  under Nutch I should be concentrating, regarding near duplicate
 documents?
 
  Regards,
  Rishabh
 
  On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED]
  wrote:
 
 
   To whomever started this thread: look at Nutch.  I believe something
   related to this already exists in Nutch for near-duplicate detection.
  
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
   - Original Message 
   From: Mike Klaas [EMAIL PROTECTED]
   To: solr-user@lucene.apache.org
   Sent: Sunday, November 18, 2007 11:08:38 PM
   Subject: Re: Near Duplicate Documents
  
   On 18-Nov-07, at 8:17 AM, Eswar K wrote:
  
Is there any idea implementing that feature in the up coming
releases?
  
   Not currently.  Feel free to contribute something if you find a good
   solution g.
  
   -Mike
  
  
On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED]
 wrote:
   
On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
We have a scenario, where we want to find out documents which are
similar in
content. To elaborate a little more on what we mean here, lets
take an
example.
   
The example of this email chain in which we are interacting on,
can be
best
used for illustrating the concept of near dupes (We are not
 getting
confused
with threads, they are two different things.). Each email in this
thread
is
treated as a document by the system. A reply to the original mail
also
includes the original mail in which case it becomes a near
duplicate of
the
orginal mail (depending on the percentage of similarity).
Similarly it
goes
on. The near dupes need not be limited to emails.
   
I think this is what's known as shingling.  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
MoreLikeThis query might be close enough, however.
   
-Stuart
   
  
  
  
  
  
 



 --
 Regards,

 Cuong Hoang



Re: Near Duplicate Documents

2007-11-21 Thread Mike Klaas

On 21-Nov-07, at 12:29 AM, climbingrose wrote:


The problem with this approach is MD5 hash is very sensitive: one
letter difference will generate completely different hash. You
probably have to roll your own near duplication detection algorithm.
My advice is have a look at existing literature on near duplication
detection techniques and then implement one of them. I know Google has
some papers that describe a technique called minhash. I read the paper
and found it's very interesting. I'm not sure if you can implement the
algorithm because they have patented it. That said, there are plenty
literature on near dup detection so you should be able to get one for
free!


To help your googling: the main algorithm used for this is called  
'shingling' or 'shingle printing'.


-Mike


Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
Hi Ken,

It's correct that uncommon words most likely do not show up in the
signature. However, I was trying to say that if two documents have 99%
common tokens and differ in one token with frequency > quantised
frequency, the two resulting hashes are completely different. If you
want true near-dup detection, what you would like to have is two
hashes that differ only in 1-2 bytes. That way, the signatures will
truly reflect the content of the documents they represent. However, with
this approach, you need a bit more work to cluster near-dup documents.
Basically, once you have a hash function as described above,
finding similar documents comes down to a Hamming distance problem: two
docs are near dups if their hashes differ in k positions (with k
small, perhaps < 3).
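
The Hamming-distance check itself is cheap once such hashes exist; a sketch
assuming 64-bit signatures and counting differing bit positions (the mail
talks about byte positions, but the idea is the same):

    // Two signatures are near duplicates when they differ in at most k positions.
    static boolean nearDuplicate(long sigA, long sigB, int k) {
        return Long.bitCount(sigA ^ sigB) <= k;
    }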


On Nov 22, 2007 2:35 AM, Ken Krugler [EMAIL PROTECTED] wrote:
 The duplication detection mechanism in Nutch is quite primitive. I
 think it uses a MD5 signature generated from the content of a field.
 The generation algorithm is described here:
 http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html.
 
 The problem with this approach is MD5 hash is very sensitive: one
 letter difference will generate completely different hash.

 I'm confused by your answer, assuming it's based on the page
 referenced by the URL you provided.

 The approach by TextProfileSignature would only generate a different
 MD5 hash with a single letter change if that change resulted in a
 change in the quantized frequency for that word. And if it's an
 uncommon word, then it wouldn't even show up in the signature.

 -- Ken


 You
 probably have to roll your own near duplication detection algorithm.
 My advice is have a look at existing literature on near duplication
 detection techniques and then implement one of them. I know Google has
 some papers that describe a technique called minhash. I read the paper
 and found it's very interesting. I'm not sure if you can implement the
 algorithm because they have patented it. That said, there are plenty
 literature on near dup detection so you should be able to get one for
 free!
 
 On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:
   Otis,
 
   Thanks for your response.
 
I just gave a quick look to the Nutch Forum and find that there is an
   implementation to obtain de-duplicate documents/pages but none for Near
   Duplicates documents. Can you guide me a little further as to where 
  exactly
under Nutch I should be concentrating, regarding near duplicate 
  documents?
   
Regards,
   Rishabh
 
   On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED]
   wrote:
 
 
To whomever started this thread: look at Nutch.  I believe something
related to this already exists in Nutch for near-duplicate detection.
   
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
   
- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents
   
On 18-Nov-07, at 8:17 AM, Eswar K wrote:
   
 Is there any idea implementing that feature in the up coming
 releases?
   
 Not currently.  Feel free to contribute something if you find a good
solution g.

-Mike
   
   
 On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:

 On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
 We have a scenario, where we want to find out documents which are
 similar in
 content. To elaborate a little more on what we mean here, lets
 take an
 example.

 The example of this email chain in which we are interacting on,
 can be
 best
 used for illustrating the concept of near dupes (We are not getting
 confused
 with threads, they are two different things.). Each email in this
 thread
 is
 treated as a document by the system. A reply to the original mail
 also
 includes the original mail in which case it becomes a near
 duplicate of
 the
 orginal mail (depending on the percentage of similarity).
 Similarly it
 goes
 on. The near dupes need not be limited to emails.

 I think this is what's known as shingling.  See
 http://en.wikipedia.org/wiki/W-shingling
 Lucene (and therefore Solr) does not implement shingling.  The
 MoreLikeThis query might be close enough, however.

  -Stuart

 --
 Ken Krugler
 Krugle, Inc.
 +1 530-210-6378
 If you can't find it, you can't fix it




-- 
Regards,

Cuong Hoang


Re: Near Duplicate Documents

2007-11-20 Thread Otis Gospodnetic
To whomever started this thread: look at Nutch.  I believe something related to 
this already exists in Nutch for near-duplicate detection.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents

On 18-Nov-07, at 8:17 AM, Eswar K wrote:

 Is there any idea implementing that feature in the up coming
 releases?

Not currently.  Feel free to contribute something if you find a good  
solution g.

-Mike


 On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:

 On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
 We have a scenario, where we want to find out documents which are
 similar in
 content. To elaborate a little more on what we mean here, lets  
 take an
 example.

 The example of this email chain in which we are interacting on,  
 can be
 best
 used for illustrating the concept of near dupes (We are not getting
 confused
 with threads, they are two different things.). Each email in this  
 thread
 is
 treated as a document by the system. A reply to the original mail  
 also
 includes the original mail in which case it becomes a near  
 duplicate of
 the
 orginal mail (depending on the percentage of similarity).   
 Similarly it
 goes
 on. The near dupes need not be limited to emails.

 I think this is what's known as shingling.  See
 http://en.wikipedia.org/wiki/W-shingling
 Lucene (and therefore Solr) does not implement shingling.  The
 MoreLikeThis query might be close enough, however.

 -Stuart







Re: Near Duplicate Documents

2007-11-20 Thread Rishabh Joshi
Otis,

Thanks for your response.

I just gave a quick look at the Nutch forum and found that there is an
implementation for de-duplicating documents/pages, but none for near-duplicate
documents. Can you guide me a little further as to where exactly
under Nutch I should be concentrating, regarding near-duplicate documents?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 To whomever started this thread: look at Nutch.  I believe something
 related to this already exists in Nutch for near-duplicate detection.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Sunday, November 18, 2007 11:08:38 PM
 Subject: Re: Near Duplicate Documents

 On 18-Nov-07, at 8:17 AM, Eswar K wrote:

  Is there any idea implementing that feature in the up coming
  releases?

 Not currently.  Feel free to contribute something if you find a good
 solution g.

 -Mike


  On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:
 
  On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
  We have a scenario, where we want to find out documents which are
  similar in
  content. To elaborate a little more on what we mean here, lets
  take an
  example.
 
  The example of this email chain in which we are interacting on,
  can be
  best
  used for illustrating the concept of near dupes (We are not getting
  confused
  with threads, they are two different things.). Each email in this
  thread
  is
  treated as a document by the system. A reply to the original mail
  also
  includes the original mail in which case it becomes a near
  duplicate of
  the
  orginal mail (depending on the percentage of similarity).
  Similarly it
  goes
  on. The near dupes need not be limited to emails.
 
  I think this is what's known as shingling.  See
  http://en.wikipedia.org/wiki/W-shingling
  Lucene (and therefore Solr) does not implement shingling.  The
  MoreLikeThis query might be close enough, however.
 
  -Stuart
 







Re: Near Duplicate Documents

2007-11-18 Thread rishabh9

Can anyone help me?

Rishabh


rishabh9 wrote:
 
 Hi,
 
 I am evaluating Solr 1.2 for my project and wanted to know if it can
 return near duplicate documents (near dups) and how do i go about it? I am
 not sure, but is MoreLikeThisHandler the implementation for near dups?
 
 Rishabh
 
 

-- 
View this message in context: 
http://www.nabble.com/Near-Duplicate-Documents-tf4820111.html#a13819048
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Near Duplicate Documents

2007-11-18 Thread Stuart Sierra
On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
 We have a scenario, where we want to find out documents which are similar in
 content. To elaborate a little more on what we mean here, lets take an
 example.

 The example of this email chain in which we are interacting on, can be best
 used for illustrating the concept of near dupes (We are not getting confused
 with threads, they are two different things.). Each email in this thread is
 treated as a document by the system. A reply to the original mail also
 includes the original mail in which case it becomes a near duplicate of the
 orginal mail (depending on the percentage of similarity).  Similarly it goes
 on. The near dupes need not be limited to emails.

I think this is what's known as shingling.  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
MoreLikeThis query might be close enough, however.

-Stuart
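
A bare-bones illustration of w-shingling as described on that page (plain
Java; the shingle width w and whitespace tokenisation are arbitrary choices
for the sketch):

    import java.util.*;

    public class Shingler {
        // Break a token list into overlapping w-grams ("shingles").
        static Set<String> shingles(List<String> tokens, int w) {
            Set<String> out = new HashSet<>();
            for (int i = 0; i + w <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + w)));
            }
            return out;
        }

        // Jaccard similarity of two shingle sets: intersection size over union size.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
        }
    }

Documents whose shingle sets have high Jaccard similarity (say above 0.9)
would be treated as near duplicates.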


Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
We have a scenario where we want to find documents which are similar in
content. To elaborate a little more on what we mean here, let's take an
example.

The email chain in which we are interacting can be best
used for illustrating the concept of near dupes (we are not getting confused
with threads; they are two different things). Each email in this thread is
treated as a document by the system. A reply to the original mail also
includes the original mail, in which case it becomes a near duplicate of the
original mail (depending on the percentage of similarity).  Similarly it goes
on. The near dupes need not be limited to emails.

If we want to have such a capability using Solr, can we use
MoreLikeThisHandler, or is there any other appropriate handler in Solr which
we can use? What is the best way of achieving such functionality?

Regards,
Eswar

On Nov 18, 2007 9:06 PM, Ryan McKinley [EMAIL PROTECTED] wrote:

 I'm not sure I understand your question...

 A near duplicate document could mean a LOT of things depending on the
 context.

 perhaps you just need fuzzy searching?
 http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches

 or proximity searches?

 http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches


 MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is
 used to search for other similar documents based on the results of
 another query.

 ryan


 rishabh9 wrote:
  Can anyone help me?
 
  Rishabh
 
 
  rishabh9 wrote:
  Hi,
 
  I am evaluating Solr 1.2 for my project and wanted to know if it can
  return near duplicate documents (near dups) and how do i go about it? I
 am
  not sure, but is MoreLikeThisHandler the implementation for near
 dups?
 
  Rishabh
 
 
 




Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
Is there any plan to implement that feature in the upcoming releases?

Regards,
Eswar
On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:

 On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
  We have a scenario, where we want to find out documents which are
 similar in
  content. To elaborate a little more on what we mean here, lets take an
  example.
 
  The example of this email chain in which we are interacting on, can be
 best
  used for illustrating the concept of near dupes (We are not getting
 confused
  with threads, they are two different things.). Each email in this thread
 is
  treated as a document by the system. A reply to the original mail also
  includes the original mail in which case it becomes a near duplicate of
 the
  orginal mail (depending on the percentage of similarity).  Similarly it
 goes
  on. The near dupes need not be limited to emails.

 I think this is what's known as shingling.  See
 http://en.wikipedia.org/wiki/W-shingling
 Lucene (and therefore Solr) does not implement shingling.  The
 MoreLikeThis query might be close enough, however.

 -Stuart



Re: Near Duplicate Documents

2007-11-18 Thread Ryan McKinley

Eswar K wrote:

We have a scenario, where we want to find out documents which are similar in
content. To elaborate a little more on what we mean here, lets take an
example.

The example of this email chain in which we are interacting on, can be best
used for illustrating the concept of near dupes (We are not getting confused
with threads, they are two different things.). Each email in this thread is
treated as a document by the system. A reply to the original mail also
includes the original mail in which case it becomes a near duplicate of the
orginal mail (depending on the percentage of similarity).  Similarly it goes
on. The near dupes need not be limited to emails.

If we want to have such capability using Solr, can we use
MoreLikeThisHandler or is there any other appropriate handler in Solr which
we can use? What is the best way for achieving such a functionality?



mess around with the MoreLikeThisHandler, see if it gives you what you 
are looking for.


Check:
http://wiki.apache.org/solr/MoreLikeThis

For your example, you would want to make sure that the 'type' field 
(email) is in the mlt.fl param.  Perhaps: mlt.fl=type,content
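
An illustrative MoreLikeThisHandler request along those lines (the host,
handler path and document id are made up; mlt.fl, mlt.mintf and mlt.mindf
are the handler's standard parameters):

    http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=type,content&mlt.mintf=1&mlt.mindf=1&rows=10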


Re: Near Duplicate Documents

2007-11-18 Thread Mike Klaas

On 18-Nov-07, at 8:17 AM, Eswar K wrote:


Is there any idea implementing that feature in the up coming releases?


Not currently.  Feel free to contribute something if you find a good
solution <g>.


-Mike



On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:


On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:

We have a scenario, where we want to find out documents which are

similar in
content. To elaborate a little more on what we mean here, lets  
take an

example.

The example of this email chain in which we are interacting on,  
can be

best

used for illustrating the concept of near dupes (We are not getting

confused
with threads, they are two different things.). Each email in this  
thread

is
treated as a document by the system. A reply to the original mail  
also
includes the original mail in which case it becomes a near  
duplicate of

the
orginal mail (depending on the percentage of similarity).   
Similarly it

goes

on. The near dupes need not be limited to emails.


I think this is what's known as shingling.  See
http://en.wikipedia.org/wiki/W-shingling
Lucene (and therefore Solr) does not implement shingling.  The
MoreLikeThis query might be close enough, however.

-Stuart





Near Duplicate Documents

2007-11-16 Thread Rishabh Joshi
Hi,

I am evaluating Solr 1.2 for my project and wanted to know if it can
return near duplicate documents (near dups) and how do I go about it? I am
not sure, but is MoreLikeThisHandler the implementation for near dups?

Rishabh