Re: Removing similar documents from search results

2005-03-21 Thread Miles Barr
On Sun, 2005-03-20 at 00:49 -0800, Chris Hostetter wrote: > Actually, your "Split across several pages" comment implies that you want > a system which can tell that page 1 of a multipage article should be > grouped with page 2 -- which may be radically different content. Most > multipage documents

Re: Removing similar documents from search results

2005-03-20 Thread Chris Hostetter
: At the moment I need something quite simple. To identify a page that
: appears in many forms, e.g.:
:
: - Normal version
: - Split across several pages
: - Print version
: - From a different section (different styling and navigation elements)
:
: Basically identical content, presented in differe

Re: Removing similar documents from search results

2005-03-15 Thread David Spencer
Miles Barr wrote: On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote: I think what they do at Google is a fancy heuristic -- as David Spencer mentioned, suburls of a given page, identical snippets, or titles... My idea was more towards providing a 'realistic overview' of subjects in pages. So

Re: Removing similar documents from search results

2005-03-15 Thread sergiu gordea
Chris Lamprecht wrote: It's a nice idea, and makes sense. I think that it can be broken if boosting is used and the search is performed on multiple fields, especially unstored ones. In this case the distance between very similar documents might be increased. I think that also the duplications sho

Re: Removing similar documents from search results

2005-03-15 Thread Chris Lamprecht
Miles, I'm assuming that you want to detect documents that are "almost" exactly the same (since if they were identical, you could just do a straight string compare or md5 compare, etc). If you're storing term vectors in your index, you could compare the term vectors for the search results, and if
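
The comparison described in this message can be sketched outside Lucene roughly as follows. This is only an illustration under stated assumptions, not Lucene's API: the regex tokenizer, the helper names `term_vector`, `cosine_similarity`, and `drop_near_duplicates`, and the 0.9 threshold are all hypothetical choices made for the example. In a real index the term frequencies would come from the stored term vectors rather than from re-tokenizing the text.

```python
import hashlib
import math
import re
from collections import Counter


def term_vector(text):
    """Build a term-frequency vector from lowercased word tokens (assumed tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(freq * b[term] for term, freq in a.items() if term in b)
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(results, threshold=0.9):
    """Walk results in rank order and skip any that duplicate a kept result.

    Exact copies are caught by an MD5 digest match; near copies by a
    cosine similarity at or above the (assumed) threshold.
    """
    kept = []  # (text, digest, vector) for every result kept so far
    for text in results:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        vector = term_vector(text)
        duplicate = any(
            digest == d or cosine_similarity(vector, v) >= threshold
            for _, d, v in kept
        )
        if not duplicate:
            kept.append((text, digest, vector))
    return [text for text, _, _ in kept]
```

Comparing each result against every kept result is quadratic in the number of hits, so a sketch like this would only suit the first page or two of results, not a whole result set.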

Re: Removing similar documents from search results

2005-03-15 Thread Miles Barr
On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote: > I think what they do at Google is a fancy heuristic -- as David Spencer > mentioned, suburls of a given page, identical snippets, or titles... My > idea was more towards providing a 'realistic overview' of subjects in > pages. So you could

Re: Removing similar documents from search results

2005-03-15 Thread Miles Barr
On Mon, 2005-03-14 at 10:24 -0800, David Spencer wrote: > Yes, in theory the "similarity" package in the sandbox can help. > The code generates a query for a source document to find documents that > are similar to it - the MoreLikeThis class uses the heuristic that 2 > docs are similar if they sh

Re: Removing similar documents from search results

2005-03-14 Thread Dawid Weiss
I think what they do at Google is a fancy heuristic -- as David Spencer mentioned, suburls of a given page, identical snippets, or titles... My idea was more towards providing a 'realistic overview' of subjects in pages. So you could pick, say, the first document from each cluster and show them
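
The idea of showing one document per cluster can be sketched as below. This assumes some clustering step has already assigned a label to every hit; the function name `first_per_cluster` and the `(doc_id, cluster_id)` pair layout are hypothetical and chosen only for illustration.

```python
def first_per_cluster(ranked_results):
    """Return the highest-ranked document from each cluster.

    ranked_results: iterable of (doc_id, cluster_id) pairs in rank order.
    The cluster labels are assumed to come from a separate clustering step.
    """
    seen = set()
    representatives = []
    for doc_id, cluster_id in ranked_results:
        if cluster_id not in seen:
            seen.add(cluster_id)
            representatives.append(doc_id)
    return representatives
```

Because the input is walked in rank order, the representative of each cluster is always its best-scoring member, and the relative order of the surviving hits is unchanged.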

Re: Removing similar documents from search results

2005-03-14 Thread David Spencer
Otis Gospodnetic wrote: The problem with 2c is that scores are currently relative, and not absolute. I am hoping Chuck's patch makes it into the source, as making scores absolute would be helpful in situations like this one. Good point. If the orig MoreLikeThis query allows the source doc to be re

Re: Removing similar documents from search results

2005-03-14 Thread Otis Gospodnetic
The problem with 2c is that scores are currently relative, and not absolute. I am hoping Chuck's patch makes it into the source, as making scores absolute would be helpful in situations like this one. Otis --- David Spencer <[EMAIL PROTECTED]> wrote: > Miles Barr wrote: > > > Has anyone tried

Re: Removing similar documents from search results

2005-03-14 Thread David Spencer
Miles Barr wrote: Has anyone tried to remove similar documents from their search results? It looks like Google does some on-the-fly filtering of the results, hiding pages which it thinks are too similar, i.e. when you see: "In order to show you the most relevant results, we have omitted some entrie

Re: Removing similar documents from search results

2005-03-14 Thread Miles Barr
Hi Dawid, On Mon, 2005-03-14 at 18:55 +0100, Dawid Weiss wrote: > I can imagine if you apply clustering to search results anyway then the > information about clusters can help you determine 'similar' results and > reorder the output list. That's an interesting idea. How easy is it to 'tighten'

Re: Removing similar documents from search results

2005-03-14 Thread Dawid Weiss
Hi Miles :) I can imagine if you apply clustering to search results anyway then the information about clusters can help you determine 'similar' results and reorder the output list. Just a thought. D. Miles Barr wrote: Has anyone tried to remove similar documents from their search results? It loo

Removing similar documents from search results

2005-03-14 Thread Miles Barr
Has anyone tried to remove similar documents from their search results? It looks like Google does some on-the-fly filtering of the results, hiding pages which it thinks are too similar, i.e. when you see: "In order to show you the most relevant results, we have omitted some entries very similar to