On Sun, 2005-03-20 at 00:49 -0800, Chris Hostetter wrote:
> Actually, your "Split across several pages" comment implies that you want
> a system which can tell that page 1 of a multipage article should be
> grouped with page 2 -- which may be radically different content. Most
> multipage documents
: At the moment I need something quite simple. To identify a page that
: appears in many forms, e.g.:
:
: - Normal version
: - Split across several pages
: - Print version
: - From a different section (different styling and navigation elements)
:
: Basically identical content, presented in differe
Miles Barr wrote:
On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote:
I think what they do at Google is a fancy heuristic -- as David Spencer
mentioned, suburls of a given page, identical snippets, or titles... My
idea was more towards providing a 'realistic overview' of subjects in
pages. So
Chris Lamprecht wrote:
It's a nice idea, and makes sense. I think that it can be broken if
boosting is used and the search is performed on multiple fields, especially
unstored ones. In this case the distance between very similar documents
might be increased.
I think that also the duplications sho
Miles,
I'm assuming that you want to detect documents that are "almost"
exactly the same (since if they were identical, you could just do a
straight string compare or md5 compare, etc).
If you're storing term vectors in your index, you could compare the
term vectors for the search results, and if
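The snippet is cut off here, so what follows is only one possible reading of the term-vector idea, not necessarily what Chris went on to propose. It is a minimal Python sketch rather than Lucene code: `term_vector`, `cosine`, `drop_near_duplicates` and the `threshold` value are all hypothetical names and numbers chosen for illustration, and the whitespace tokenizer merely stands in for whatever term vectors the index actually stores.

```python
import math
from collections import Counter


def term_vector(text):
    """Rough stand-in for a stored term vector: lowercase term -> frequency."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(freq * b[term] for term, freq in a.items() if term in b)
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(hits, threshold=0.9):
    """Keep a hit only if it is not too similar to any hit already kept.

    `hits` is a list of (doc_id, text) pairs in rank order. `threshold` is
    an arbitrary illustrative cutoff, not a recommended value.
    """
    kept = []  # (doc_id, term vector) for every hit kept so far
    for doc_id, text in hits:
        vec = term_vector(text)
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((doc_id, vec))
    return [doc_id for doc_id, _ in kept]
```

Because each hit is compared against every hit already kept, the cost grows quadratically with the number of results, so a filter like this would probably only be applied to the top page or two of results.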
On Mon, 2005-03-14 at 20:48 +0100, Dawid Weiss wrote:
> I think what they do at Google is a fancy heuristic -- as David Spencer
> mentioned, suburls of a given page, identical snippets, or titles... My
> idea was more towards providing a 'realistic overview' of subjects in
> pages. So you could
On Mon, 2005-03-14 at 10:24 -0800, David Spencer wrote:
> Yes, in theory the "similarity" package in the sandbox can help.
> The code generates a query for a source document to find documents that
> are similar to it - the MoreLikeThis class uses the heuristic that 2
> docs are similar if they sh
I think what they do at Google is a fancy heuristic -- as David Spencer
mentioned, suburls of a given page, identical snippets, or titles... My
idea was more towards providing a 'realistic overview' of subjects in
pages. So you could pick, say, the first document from each cluster and
show them
Otis Gospodnetic wrote:
The problem with 2c is that scores are currently relative, and not
absolute. I am hoping Chuck's patch makes it into the source, as
making scores absolute would be helpful in situations like this one.
Good point.
If the orig MoreLikeThis query allows the source doc to be re
The problem with 2c is that scores are currently relative, and not
absolute. I am hoping Chuck's patch makes it into the source, as
making scores absolute would be helpful in situations like this one.
Otis
--- David Spencer <[EMAIL PROTECTED]> wrote:
> Miles Barr wrote:
>
> > Has anyone tried
Miles Barr wrote:
Has anyone tried to remove similar documents from their search results?
It looks like Google does some on the fly filtering of the results,
hiding pages which it thinks are too similar, i.e. when you see:
"In order to show you the most relevant results, we have omitted some
entrie
Hi Dawid,
On Mon, 2005-03-14 at 18:55 +0100, Dawid Weiss wrote:
> I can imagine if you apply clustering to search results anyway then the
> information about clusters can help you determine 'similar' results and
> reorder the output list.
That's an interesting idea. How easy is it to 'tighten'
Hi Miles :)
I can imagine if you apply clustering to search results anyway then the
information about clusters can help you determine 'similar' results and
reorder the output list.
Just a thought.
D.
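One way to use cluster information on a ranked result list, sketched in Python under the assumption that some clustering step has already assigned a cluster id to every hit. The clustering itself is not shown, and `first_per_cluster` and the `(doc_id, cluster_id)` input shape are hypothetical, chosen only for this illustration.

```python
def first_per_cluster(ranked_hits):
    """Return the highest-ranked document of each cluster, in rank order.

    `ranked_hits` is a list of (doc_id, cluster_id) pairs, best hit first.
    """
    seen = set()  # cluster ids that already have a representative
    representatives = []
    for doc_id, cluster_id in ranked_hits:
        if cluster_id not in seen:
            seen.add(cluster_id)
            representatives.append(doc_id)
    return representatives
```

The remaining members of each cluster could then be appended after the representatives, or hidden behind a "similar results" link, depending on how aggressive the filtering should be.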
Miles Barr wrote:
Has anyone tried to remove similar documents from their search results?
It loo
Has anyone tried to remove similar documents from their search results?
It looks like Google does some on the fly filtering of the results,
hiding pages which it thinks are too similar, i.e. when you see:
"In order to show you the most relevant results, we have omitted some
entries very similar to