Re: quality of search text

2006-03-12 Thread Andrzej Bialecki
Dawid Weiss wrote: It seems to me that there are two separate problems: 1) content parsing to avoid site structure - influences the index and rankings 2) content parsing for KWIC snippet generation - influences the user perception of the engine. I'd agree that (2) is quite important for

Re: quality of search text

2006-03-12 Thread Dawid Weiss
Hmm... I'm not convinced. How would you generate the best snippet from a relevant, but ignored chunk? Good point... I guess you simply wouldn't generate anything at all (show the title?). I guess structure text should not be relevant enough to actually cause a hit on top of the search

Re: quality of search text

2006-03-12 Thread Howie Wang
I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of continuous block to ignore inline tags such as SPAN, I, B, TT etc, so only certain tags would actually break the content into chunks. Snippets then
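
A rough sketch of the "continuous block" idea above, written in Python with the standard-library html.parser rather than in Nutch's Java. The names INLINE_TAGS, ChunkExtractor and extract_chunks are made up for illustration and are not part of parse-html or DOMContentUtils. Only the four tags named in the message are treated as inline; what "etc" should cover is left open here.

```python
from html.parser import HTMLParser

# Inline tags named in the thread; they do NOT break a block of text.
# Assumption: every other tag is treated as a chunk boundary.
INLINE_TAGS = {"span", "i", "b", "tt"}


class ChunkExtractor(HTMLParser):
    """Collects text into chunks, splitting only at non-inline tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._buf = []

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty chunks.
        text = " ".join("".join(self._buf).split())
        if text:
            self.chunks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_chunks(html):
    """Return the list of continuous text chunks found in an HTML string."""
    parser = ChunkExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

With this, a menu built from short list items comes out as several tiny chunks, while a paragraph containing a B element stays in one piece. The sketch does not skip SCRIPT or STYLE content, which a real parser would need to do.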

Re: quality of search text

2006-03-11 Thread Dawid Weiss
It seems to me that there are two separate problems: 1) content parsing to avoid site structure - influences the index and rankings 2) content parsing for KWIC snippet generation - influences the user perception of the engine. I'd agree that (2) is quite important for the end user;

RE: quality of search text

2006-03-10 Thread Richard Braman
I too have noticed menu text appearing in the search results. -----Original Message----- From: jamie [mailto:[EMAIL PROTECTED] Sent: Friday, March 10, 2006 4:39 AM To: nutch-dev@lucene.apache.org Subject: quality of search text hi everyone, I don't know if we're doing something wrong

Re: quality of search text

2006-03-10 Thread Andrzej Bialecki
Richard Braman wrote: I too have noticed menu text appearing in the search results. The proper place to fix it would be in parse-html, perhaps in DOMContentUtils. However, be warned that this is definitely NOT trivial - i.e. pages don't say "this is menu, this is body text", you

RE: quality of search text

2006-03-10 Thread Richard Braman
please -----Original Message----- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Friday, March 10, 2006 1:57 PM To: nutch-dev@lucene.apache.org Subject: Re: quality of search text Richard Braman wrote: I too have noticed menu text appearing in the search results. The proper place

Re: quality of search text

2006-03-10 Thread Andrzej Bialecki
Richard Braman wrote: Here is a potential algorithm: Look first to Meta Description; if none exists, look for a continuous block of text, and ignore content that doesn't contain a continuous block of text. If a given html tag only contains a few words of text, it is not content, but rather a part of
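
The selection step of the algorithm quoted above could look roughly like this in Python. It is only a sketch: the function name choose_snippet_source, the min_words threshold of 8 and the choice of the longest qualifying block are assumptions made for illustration, since the message does not say how many words count as "a few" or which block to prefer.

```python
def choose_snippet_source(meta_description, chunks, min_words=8):
    """Pick the text a snippet would be built from, or None if nothing fits.

    meta_description: content of the page's meta description, or None.
    chunks: continuous text blocks already extracted from the page.
    min_words: assumed cutoff; blocks with fewer words are treated as
               site structure (menus, labels) rather than content.
    """
    # Step 1: prefer the meta description when the page provides one.
    if meta_description and meta_description.strip():
        return meta_description.strip()

    # Step 2: otherwise keep only blocks long enough to count as content.
    candidates = [c for c in chunks if len(c.split()) >= min_words]
    if not candidates:
        return None

    # Assumption: the longest block (by word count) is the best source.
    return max(candidates, key=lambda c: len(c.split()))
```

Returning None leaves the caller free to decide what to show when a page has only short fragments.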

RE: quality of search text

2006-03-10 Thread Richard Braman
. -----Original Message----- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Friday, March 10, 2006 2:51 PM To: nutch-dev@lucene.apache.org Subject: Re: quality of search text Richard Braman wrote: Here is a potential algorithm: Look first to Meta Description; if none exists, look for continuous

Re: quality of search text

2006-03-10 Thread Jérôme Charron
I think algorithm #1 is what Google uses. Google ignores content that does not change from page to page, as well as content that isn't part of a block of text. Are you sure? Take a look at these search results: