Dawid Weiss wrote:
It seems to me that there are two separate problems:
1) content parsing to avoid site structure - influences the index and
rankings
2) content parsing for KWIC snippet generation - influences the user
perception of the engine.
I'd agree that (2) is quite important for
Hmm... I'm not convinced. How would you generate the best snippet from a
relevant, but ignored chunk?
Good point... I guess you simply wouldn't generate anything at all (show
the title?). I guess structural text should not be relevant enough to
actually cause a hit on top of the search
I'd agree that (2) is quite important for the end user; Richard's
continuous text heuristic may actually work for that. I'd extend the
meaning of continuous block to ignore inline tags such as SPAN, I, B, TT
etc, so only certain tags would actually break the content into chunks.
Snippets then
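A minimal sketch of the chunking idea above, assuming Python's standard html.parser rather than Nutch's own parse-html code. The names ChunkCollector and split_into_chunks are hypothetical, and the inline tag set goes beyond the SPAN, I, B, TT named in the message (the extra tags are an assumption). Inline tags leave the current chunk open; any other tag closes it.

```python
from html.parser import HTMLParser

# SPAN, I, B, TT come from the message; the remaining tags are an assumption.
INLINE_TAGS = {"span", "i", "b", "tt", "em", "strong", "a", "u", "font"}
# Tags whose text is never page content (assumption, not from the thread).
SKIPPED_TAGS = {"script", "style"}


class ChunkCollector(HTMLParser):
    """Collects text chunks; only non-inline tags break a chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Join buffered text, normalise whitespace, keep non-empty chunks.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.chunks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def split_into_chunks(html):
    """Return the list of continuous text chunks found in an HTML string."""
    parser = ChunkCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last tag
    return parser.chunks


page = (
    '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
    "<p>Nutch is an <b>open source</b> search engine built on <i>Lucene</i>.</p>"
)
for chunk in split_into_chunks(page):
    print(chunk)
```

With this split, the two menu entries come out as separate one-word chunks, while the paragraph stays in one piece despite the B and I tags inside it.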
It seems to me that there are two separate problems:
1) content parsing to avoid site structure - influences the index and
rankings
2) content parsing for KWIC snippet generation - influences the user
perception of the engine.
I'd agree that (2) is quite important for the end user;
I too have noticed menu text appearing in the search results.
-Original Message-
From: jamie [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 4:39 AM
To: nutch-dev@lucene.apache.org
Subject: quality of search text
Hi everyone,
I don't know if we're doing something wrong
Richard Braman wrote:
I too have noticed menu text appearing in the search results.
The proper place to fix it would be in parse-html, perhaps in
DOMContentUtils.
However, be warned that this is definitely NOT trivial - i.e. pages don't
say "this is menu, this is body text", you
please
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 1:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text
Richard Braman wrote:
I too have noticed menu text appearing in the search results.
The proper place
Richard Braman wrote:
Here is a potential algorithm:
Look first to Meta Description, if none exists
Look for continuous block of text, ignore content that doesn't contain a
continuous block of text. If a given HTML tag only contains a few words
of text, it is not content, but rather a part of
.
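A rough sketch of the selection rule described above, assuming the text has already been split into one chunk per HTML tag. The function name select_content and the min_words threshold of 8 are hypothetical; the message only says "a few words" and gives no number.

```python
from typing import List, Optional


def select_content(
    meta_description: Optional[str],
    chunks: List[str],
    min_words: int = 8,  # assumed cut-off for "a few words"; not from the thread
) -> List[str]:
    """Pick the text to treat as page content.

    1. If a non-empty meta description exists, use it.
    2. Otherwise keep only chunks long enough to count as continuous text.
    """
    if meta_description and meta_description.strip():
        return [meta_description.strip()]
    return [chunk for chunk in chunks if len(chunk.split()) >= min_words]


chunks = [
    "Home",
    "About",
    "Nutch is an open source search engine built on Lucene.",
]
print(select_content(None, chunks))
print(select_content("  An open source search engine.  ", chunks))
```

A fixed word count is the simplest possible test; short but real content (a one-line page, a headline) would be dropped by it, so the threshold would need tuning on real pages.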
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, March 10, 2006 2:51 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text
Richard Braman wrote:
Here is a potential algorithm:
Look first to Meta Description, if none exists
Look for continuous
I think algorithm #1 is what Google uses.
Google ignores content that does not change from page to page, as well
as content that isn't part of a block of text.
Are you sure?
Take a look at these search results: