Re: Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?
Sounds a lot like multi-tenancy, where you don't want the document frequencies of one tenant to influence the query relevancy scores for other tenants. No ready solution. Although, I have thought of a simplified document scoring using just tf and leaving out df/idf. Not as good as a tf*idf or BM25 score, but it avoids the pollution problem. I haven't heard of anybody in the Lucene space discussing a way to categorize documents such that df is relative to a specified document category, with the query then specifying a document category. I suppose that indexing and querying under some hypothetical similarity schema could both specify any number of document categories. But that's speculation on my part.

-- Jack Krupansky

On Mon, Feb 15, 2016 at 6:42 PM, Chris Morley wrote:
> Hey Solr people:
> [snip]
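Jack's tf-only idea can be sketched in a few lines. This is an illustrative Python sketch, not Lucene's actual Similarity API; the sqrt damping and idf formula merely mimic the classic Lucene defaults. The point is that the tf-only score never looks at the corpus, so other tenants' documents cannot move it, while the tf*idf score shifts as near-duplicate documents inflate df:

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw term frequency, sublinearly damped (sqrt), as classic Lucene does."""
    return math.sqrt(Counter(doc_tokens)[term])

def idf(term, corpus):
    """Classic Lucene-style idf: 1 + ln(N / (df + 1))."""
    df = sum(1 for doc in corpus if term in doc)
    return 1.0 + math.log(len(corpus) / (df + 1.0))

def score_tf_only(query_terms, doc_tokens):
    # Jack's suggestion: drop the idf factor entirely, so the corpus
    # (other tenants' documents) cannot influence this document's score.
    return sum(tf(t, doc_tokens) for t in query_terms)

def score_tfidf(query_terms, doc_tokens, corpus):
    # Conventional score: pollution of df by duplicates changes the result.
    return sum(tf(t, doc_tokens) * idf(t, corpus) for t in query_terms)
```

Flooding the corpus with copies of an "apple" document lowers the tf*idf score of every other "apple" match, while the tf-only score stays fixed.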
Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?
Hey Solr people:

Suppose that we did not want to break up our document set into separate indexes, but had certain cases where many versions of a document were not relevant for certain searches.

I guess this could be thought of as an "authorization" class of problem; however, it is not that for us. We have a few other fields that determine relevancy to the current query, based on what page the query is coming from. It's kind of like authorization, but not really. Anyway, I think the answer for how you would do it for authorization would solve it for our case too.

So suppose you had 99 users and 100 documents. Document 1 everybody could see the same, but each of the other 99 documents was slightly different and unique to one of the 99 users, though not "very" unique. Suppose, for instance, that the only thing different in the text of the 99 documents was that each was watermarked with the user's name. Aren't you spamming your tf/idf at that point? Is there a way around this? Is there a way to say: hey, group these 99 documents together and only count 1 of them for tf/idf purposes?

When doing queries, each user would only ever see 2 documents: Document 1, plus whichever other document they specifically owned.

If there are web pages or book chapters I can read or re-read that address this class of problem, those references would be great.

-Chris.
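Chris's "group these 99 documents together and count them once" idea can be sketched outside of Solr. Note the `group` field here is hypothetical; nothing in Lucene/Solr computes df this way out of the box, which is exactly the gap being discussed:

```python
def doc_freq(term, docs):
    """Naive document frequency: every stored document counts."""
    return sum(1 for d in docs if term in d["tokens"])

def grouped_doc_freq(term, docs):
    """df where all near-duplicate versions sharing a group key count once."""
    return len({d["group"] for d in docs if term in d["tokens"]})
```

With 99 watermarked copies of one logical document, a shared term like "contract" gets df = 99 under naive counting but df = 1 under grouped counting, which is the behavior Chris is asking for.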
Re: Near Duplicate Documents
The duplication detection mechanism in Nutch is quite primitive. I think it uses an MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. The problem with this approach is that the MD5 hash is very sensitive: a one-letter difference will generate a completely different hash.

You probably have to roll your own near-duplicate detection algorithm. My advice is to have a look at the existing literature on near-duplicate detection techniques and then implement one of them. I know Google has some papers that describe a technique called minhash. I read the paper and found it very interesting. I'm not sure if you can implement the algorithm, because they have patented it. That said, there is plenty of literature on near-dup detection, so you should be able to get one for free!

On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:
> [snip]

--
Regards,
Cuong Hoang
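For the curious, the minhash technique mentioned above can be sketched in a few lines of Python. This is an illustrative sketch of the general idea (estimating Jaccard similarity between shingle sets from the fraction of matching per-hash minima), not Google's patented implementation:

```python
import hashlib
import random

def shingles(text, k=3):
    """The set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    """One minimum per salted hash function; the fraction of matching
    positions between two signatures estimates their Jaccard similarity."""
    rng = random.Random(seed)
    salts = [str(rng.random()) for _ in range(num_hashes)]
    return [
        min(hashlib.md5((salt + s).encode()).hexdigest() for s in shingle_set)
        for salt in salts
    ]

def estimated_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two texts differing in one word produce signatures that agree in most positions, while unrelated texts almost never agree; more hash functions tighten the estimate.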
Re: Near Duplicate Documents
Thanks for the info Cuong!

Regards,
Rishabh

On Nov 21, 2007 1:59 PM, climbingrose [EMAIL PROTECTED] wrote:
> [snip]
Re: Near Duplicate Documents
On 21-Nov-07, at 12:29 AM, climbingrose wrote:
> [snip]

To help your googling: the main algorithm used for this is called 'shingling' or 'shingle printing'.

-Mike
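A bare-bones sketch of the shingling Mike refers to: break each document into contiguous w-word shingles and compare the sets by Jaccard resemblance. Illustrative only; neither Lucene nor Solr shipped this at the time:

```python
def w_shingles(text, w=4):
    """The set of contiguous w-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Resemblance of two shingle sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Appending one word to a ten-word sentence leaves 7 of the 8 shingles shared, so the resemblance stays high; documents are flagged as near dups when it exceeds a chosen threshold.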
Re: Near Duplicate Documents
Hi Ken,

It's correct that uncommon words most likely don't show up in the signature. However, I was trying to say that if two documents have 99% of their tokens in common and differ in one token whose quantised frequency changes, the two resulting hashes are completely different. If you want true near-dup detection, what you would like to have is two hashes that differ in only 1-2 bytes. That way, the signatures will truly reflect the content of the documents they represent. However, with this approach you need a bit more work to cluster near-dup documents. Basically, once you have a hash function as described above, finding similar documents comes down to a Hamming distance problem: two docs are near dups if their hashes differ in at most k positions (with k small, maybe 3).

On Nov 22, 2007 2:35 AM, Ken Krugler [EMAIL PROTECTED] wrote:
> [snip]
> I'm confused by your answer, assuming it's based on the page referenced
> by the URL you provided. The approach in TextProfileSignature would only
> generate a different MD5 hash from a single-letter change if that change
> resulted in a change in the quantized frequency for that word. And if
> it's an uncommon word, then it wouldn't even show up in the signature.
>
> -- Ken
> Ken Krugler, Krugle, Inc., +1 530-210-6378
> "If you can't find it, you can't fix it"

--
Regards,
Cuong Hoang
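The "hashes that differ in only a few bits" scheme described above is essentially what is now usually called simhash. A minimal sketch of the idea (illustrative, not the TextProfileSignature algorithm): each token votes on every bit of the fingerprint, so near-identical token streams yield fingerprints within a small Hamming distance:

```python
import hashlib

def simhash(tokens, bits=64):
    """Similarity-preserving fingerprint: each token occurrence votes +1/-1
    on each bit position, so repeated tokens carry more weight and small
    edits flip few (if any) bits."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def near_dup(a, b, k=3):
    """Two docs are near dups if their fingerprints differ in <= k bits."""
    return hamming(a, b) <= k
```

Unlike an MD5 over the whole content, changing one token in a long document barely moves the fingerprint, while an unrelated document lands roughly half the bits away.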
Re: Near Duplicate Documents
To whomever started this thread: look at Nutch. I believe something related to this already exists in Nutch for near-duplicate detection.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents
[snip]
Re: Near Duplicate Documents
Otis,

Thanks for your response. I just gave a quick look at the Nutch Forum and found that there is an implementation to de-duplicate documents/pages, but none for near-duplicate documents. Can you guide me a little further as to where exactly under Nutch I should be concentrating, regarding near-duplicate documents?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
> [snip]
Re: Near Duplicate Documents
Can anyone help me?

Rishabh

rishabh9 wrote:
> [snip]

--
View this message in context: http://www.nabble.com/Near-Duplicate-Documents-tf4820111.html#a13819048
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Near Duplicate Documents
On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
> [snip]

I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling

Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however.

-Stuart
Re: Near Duplicate Documents
We have a scenario where we want to find documents which are similar in content. To elaborate a little on what we mean, let's take an example. This very email chain we are interacting in can best be used to illustrate the concept of near dupes (we are not confusing this with threads; they are two different things). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, in which case it becomes a near duplicate of the original mail (depending on the percentage of similarity). And so it goes on. The near dupes need not be limited to emails.

If we want to have such a capability using Solr, can we use MoreLikeThisHandler, or is there any other appropriate handler in Solr we can use? What is the best way of achieving such functionality?

Regards,
Eswar

On Nov 18, 2007 9:06 PM, Ryan McKinley [EMAIL PROTECTED] wrote:
> I'm not sure I understand your question... A "near duplicate document"
> could mean a LOT of things depending on the context.
>
> Perhaps you just need fuzzy searches?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches
> Or proximity searches?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches
>
> MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is
> used to search for other similar documents based on the results of
> another query.
>
> ryan
>
> rishabh9 wrote:
> > [snip]
Re: Near Duplicate Documents
Is there any idea of implementing that feature in the upcoming releases?

Regards,
Eswar

On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:
> [snip]
Re: Near Duplicate Documents
Eswar K wrote:
> [snip]

Mess around with the MoreLikeThisHandler and see if it gives you what you are looking for. Check: http://wiki.apache.org/solr/MoreLikeThis

For your example, you would want to make sure that the 'type' field (email) is in the mlt.fl param. Perhaps: mlt.fl=type,content
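For reference, a MoreLikeThis request using the standard parameters from the wiki page above might look like the following. The host, core path, and document id are made up for illustration; `mlt.fl=type,content` follows the suggestion above:

```text
http://localhost:8983/solr/select?q=id:1234&mlt=true&mlt.fl=type,content&mlt.mintf=1&mlt.mindf=1&fl=id,score
```

Here `mlt.mintf` and `mlt.mindf` set the minimum term and document frequencies for a term to be considered "interesting"; loosening them helps on small indexes where few terms pass the defaults.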
Re: Near Duplicate Documents
On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> Is there any idea of implementing that feature in the upcoming releases?

Not currently. Feel free to contribute something if you find a good solution <g>.

-Mike

On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:
> [snip]
Near Duplicate Documents
Hi,

I am evaluating Solr 1.2 for my project and wanted to know if it can return near-duplicate documents (near dups) and how I would go about it. I am not sure, but is MoreLikeThisHandler the implementation for near dups?

Rishabh