Re: Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?
Sounds a lot like multi-tenancy, where you don't want the document frequencies of one tenant to influence the query relevancy scores for other tenants. No ready solution. Although, I have thought of a simplified document scoring using just tf and leaving out df/idf. Not as good as a tf*idf or BM25 score, but it avoids the pollution problem. I haven't heard of anybody in the Lucene space discussing a way to categorize documents such that df is relative to a specified document category, with the query then specifying a document category. I suppose that indexing and querying under some hypothetical similarity schema could both specify any number of document categories. But that's speculation on my part.

-- Jack Krupansky

On Mon, Feb 15, 2016 at 6:42 PM, Chris Morley wrote:
> Hey Solr people:
> [snip]
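Jack's tf-only idea can be sketched in a few lines. This is an illustrative Python sketch, not Lucene's actual Similarity API; the sqrt damping and idf formula merely mimic the classic Lucene defaults. The point is that the tf-only score never looks at the corpus, so other tenants' documents cannot move it, while the tf*idf score shifts as near-duplicate documents inflate df:

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw term frequency, sublinearly damped (sqrt), as classic Lucene does."""
    return math.sqrt(Counter(doc_tokens)[term])

def idf(term, corpus):
    """Classic Lucene-style idf: 1 + ln(N / (df + 1))."""
    df = sum(1 for doc in corpus if term in doc)
    return 1.0 + math.log(len(corpus) / (df + 1.0))

def score_tf_only(query_terms, doc_tokens):
    # Jack's suggestion: drop the idf factor entirely, so the corpus
    # (other tenants' documents) cannot influence this document's score.
    return sum(tf(t, doc_tokens) for t in query_terms)

def score_tfidf(query_terms, doc_tokens, corpus):
    # Conventional score: pollution of df by duplicates changes the result.
    return sum(tf(t, doc_tokens) * idf(t, corpus) for t in query_terms)
```

Flooding the corpus with copies of an "apple" document lowers the tf*idf score of every other "apple" match, while the tf-only score stays fixed.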
Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?
Hey Solr people:

Suppose that we did not want to break up our document set into separate indexes, but had certain cases where many versions of a document were not relevant for certain searches.

I guess this could be thought of as an "authorization" class of problem; however, it is not that for us. We have a few other fields that determine relevancy to the current query, based on what page the query is coming from. It's kind of like authorization, but not really. Anyway, I think the answer for how you would do it for authorization would solve it for our case too.

So suppose you had 99 users and 100 documents. Document 1 everybody could see the same, but each of the other 99 documents was slightly different and unique to one of the 99 users, though not "very" unique. Suppose, for instance, that the only thing different in the text of the 99 documents was that each was watermarked with the user's name. Aren't you spamming your tf/idf at that point? Is there a way around this? Is there a way to say: hey, group these 99 documents together and only count 1 of them for tf/idf purposes?

When doing queries, each user would only ever see 2 documents: Document 1, plus whichever other document they specifically owned.

If there are web pages or book chapters I can read or re-read that address this class of problem, those references would be great.

-Chris.
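Chris's "group these 99 documents together and count them once" idea can be sketched outside of Solr. Note the `group` field here is hypothetical; nothing in Lucene/Solr computes df this way out of the box, which is exactly the gap being discussed:

```python
def doc_freq(term, docs):
    """Naive document frequency: every stored document counts."""
    return sum(1 for d in docs if term in d["tokens"])

def grouped_doc_freq(term, docs):
    """df where all near-duplicate versions sharing a group key count once."""
    return len({d["group"] for d in docs if term in d["tokens"]})
```

With 99 watermarked copies of one logical document, a shared term like "contract" gets df = 99 under naive counting but df = 1 under grouped counting, which is the behavior Chris is asking for.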
Re: Near Duplicate Documents
The duplication detection mechanism in Nutch is quite primitive. I think it uses an MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. The problem with this approach is that the MD5 hash is very sensitive: a one-letter difference will generate a completely different hash.

You probably have to roll your own near-duplicate detection algorithm. My advice is to have a look at the existing literature on near-duplicate detection techniques and then implement one of them. I know Google has some papers that describe a technique called minhash. I read the paper and found it very interesting. I'm not sure if you can implement the algorithm, because they have patented it. That said, there is plenty of literature on near-dup detection, so you should be able to get one for free!

On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote:
> [snip]

--
Regards,
Cuong Hoang
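For the curious, the minhash technique mentioned above can be sketched in a few lines of Python. This is an illustrative sketch of the general idea (estimating Jaccard similarity between shingle sets from the fraction of matching per-hash minima), not Google's patented implementation:

```python
import hashlib
import random

def shingles(text, k=3):
    """The set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    """One minimum per salted hash function; the fraction of matching
    positions between two signatures estimates their Jaccard similarity."""
    rng = random.Random(seed)
    salts = [str(rng.random()) for _ in range(num_hashes)]
    return [
        min(hashlib.md5((salt + s).encode()).hexdigest() for s in shingle_set)
        for salt in salts
    ]

def estimated_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two texts differing in one word produce signatures that agree in most positions, while unrelated texts almost never agree; more hash functions tighten the estimate.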
Re: Near Duplicate Documents
Thanks for the info Cuong!

Regards,
Rishabh

On Nov 21, 2007 1:59 PM, climbingrose [EMAIL PROTECTED] wrote:
> [snip]
Re: Near Duplicate Documents
On 21-Nov-07, at 12:29 AM, climbingrose wrote:
> [snip]

To help your googling: the main algorithm used for this is called 'shingling' or 'shingle printing'.

-Mike
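A bare-bones sketch of the shingling Mike refers to: break each document into contiguous w-word shingles and compare the sets by Jaccard resemblance. Illustrative only; neither Lucene nor Solr shipped this at the time:

```python
def w_shingles(text, w=4):
    """The set of contiguous w-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Resemblance of two shingle sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Appending one word to a ten-word sentence leaves 7 of the 8 shingles shared, so the resemblance stays high; documents are flagged as near dups when it exceeds a chosen threshold.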
Re: Near Duplicate Documents
Hi Ken,

It's correct that uncommon words most likely don't show up in the signature. However, I was trying to say that if two documents have 99% of their tokens in common and differ in one token whose quantised frequency changes, the two resulting hashes are completely different. If you want true near-dup detection, what you would like to have is two hashes that differ in only 1-2 bytes. That way, the signatures will truly reflect the content of the documents they represent. However, with this approach you need a bit more work to cluster near-dup documents. Basically, once you have a hash function as described above, finding similar documents comes down to a Hamming distance problem: two docs are near dups if their hashes differ in at most k positions (with k small, maybe 3).

On Nov 22, 2007 2:35 AM, Ken Krugler [EMAIL PROTECTED] wrote:
> [snip]
> I'm confused by your answer, assuming it's based on the page referenced
> by the URL you provided. The approach in TextProfileSignature would only
> generate a different MD5 hash from a single-letter change if that change
> resulted in a change in the quantized frequency for that word. And if
> it's an uncommon word, then it wouldn't even show up in the signature.
>
> -- Ken
> Ken Krugler, Krugle, Inc., +1 530-210-6378
> "If you can't find it, you can't fix it"

--
Regards,
Cuong Hoang
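The "hashes that differ in only a few bits" scheme described above is essentially what is now usually called simhash. A minimal sketch of the idea (illustrative, not the TextProfileSignature algorithm): each token votes on every bit of the fingerprint, so near-identical token streams yield fingerprints within a small Hamming distance:

```python
import hashlib

def simhash(tokens, bits=64):
    """Similarity-preserving fingerprint: each token occurrence votes +1/-1
    on each bit position, so repeated tokens carry more weight and small
    edits flip few (if any) bits."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def near_dup(a, b, k=3):
    """Two docs are near dups if their fingerprints differ in <= k bits."""
    return hamming(a, b) <= k
```

Unlike an MD5 over the whole content, changing one token in a long document barely moves the fingerprint, while an unrelated document lands roughly half the bits away.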
Re: Near Duplicate Documents
To whomever started this thread: look at Nutch. I believe something related to this already exists in Nutch for near-duplicate detection.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Mike Klaas [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, November 18, 2007 11:08:38 PM
Subject: Re: Near Duplicate Documents
[snip]
Re: Near Duplicate Documents
Otis,

Thanks for your response. I just gave a quick look at the Nutch Forum and found that there is an implementation to de-duplicate documents/pages, but none for near-duplicate documents. Can you guide me a little further as to where exactly under Nutch I should be concentrating, regarding near-duplicate documents?

Regards,
Rishabh

On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
> [snip]
Re: Near Duplicate Documents
Can anyone help me?

Rishabh

rishabh9 wrote:
> [snip]

--
View this message in context: http://www.nabble.com/Near-Duplicate-Documents-tf4820111.html#a13819048
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Near Duplicate Documents
On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote:
> [snip]

I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling

Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however.

-Stuart
Re: Near Duplicate Documents
We have a scenario where we want to find documents which are similar in content. To elaborate a little on what we mean, let's take an example. This very email chain we are interacting in can best be used to illustrate the concept of near dupes (we are not confusing this with threads; they are two different things). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, in which case it becomes a near duplicate of the original mail (depending on the percentage of similarity). And so it goes on. The near dupes need not be limited to emails.

If we want to have such a capability using Solr, can we use MoreLikeThisHandler, or is there any other appropriate handler in Solr we can use? What is the best way of achieving such functionality?

Regards,
Eswar

On Nov 18, 2007 9:06 PM, Ryan McKinley [EMAIL PROTECTED] wrote:
> I'm not sure I understand your question... A "near duplicate document"
> could mean a LOT of things depending on the context.
>
> Perhaps you just need fuzzy searches?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Fuzzy%20Searches
> Or proximity searches?
> http://lucene.apache.org/java/docs/queryparsersyntax.html#Proximity%20Searches
>
> MoreLikeThisHandler (added in 1.3-dev) may be able to help, but it is
> used to search for other similar documents based on the results of
> another query.
>
> ryan
>
> rishabh9 wrote:
> > [snip]
Re: Near Duplicate Documents
Is there any idea of implementing that feature in the upcoming releases?

Regards,
Eswar

On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:
> [snip]
Re: Near Duplicate Documents
Eswar K wrote:
> [snip]

Mess around with the MoreLikeThisHandler and see if it gives you what you are looking for. Check: http://wiki.apache.org/solr/MoreLikeThis

For your example, you would want to make sure that the 'type' field (email) is in the mlt.fl param. Perhaps: mlt.fl=type,content
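For reference, a MoreLikeThis request using the standard parameters from the wiki page above might look like the following. The host, core path, and document id are made up for illustration; `mlt.fl=type,content` follows the suggestion above:

```text
http://localhost:8983/solr/select?q=id:1234&mlt=true&mlt.fl=type,content&mlt.mintf=1&mlt.mindf=1&fl=id,score
```

Here `mlt.mintf` and `mlt.mindf` set the minimum term and document frequencies for a term to be considered "interesting"; loosening them helps on small indexes where few terms pass the defaults.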
Re: Near Duplicate Documents
On 18-Nov-07, at 8:17 AM, Eswar K wrote:
> Is there any idea of implementing that feature in the upcoming releases?

Not currently. Feel free to contribute something if you find a good solution <g>.

-Mike

On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote:
> [snip]
Near Duplicate Documents
Hi,

I am evaluating Solr 1.2 for my project and wanted to know if it can return near-duplicate documents (near dups) and how I would go about it. I am not sure, but is MoreLikeThisHandler the implementation for near dups?

Rishabh