Re: quality of search text
Dawid Weiss wrote:
> It seems to me that there are two separate problems:
> 1) content parsing to avoid site structure - influences the index and rankings
> 2) content parsing for KWIC snippet generation - influences the user perception of the engine.
> I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT etc., so only certain tags would actually break the content into chunks. Snippets would then be generated from these chunks alone, ignoring the rest of the content. If this heuristic is applied only at snippet-generation time, then Andrzej's concern about missing content is no longer relevant.

Hmm... I'm not convinced. How would you generate the best snippet from a relevant but ignored chunk? But I agree that for some (perhaps large) percentage of sites this heuristic could work well, and it's simple enough to be easily implemented.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
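For concreteness, here is a minimal sketch of the continuous-block chunking described above, written in Java against the w3c DOM that parse-html produces. The class name, the inline-tag set, and the decision to treat A as inline are illustrative assumptions, not anything that exists in Nutch:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.w3c.dom.Node;

    /** Splits a DOM into text chunks, treating inline tags as transparent. */
    public class ContinuousTextChunker {

      // Inline tags that do NOT break a continuous block of text.
      private static final Set<String> INLINE_TAGS = new HashSet<String>(
          Arrays.asList("span", "i", "b", "em", "strong", "tt", "u", "font", "a"));

      public List<String> chunk(Node root) {
        List<String> chunks = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        walk(root, current, chunks);
        flush(current, chunks);
        return chunks;
      }

      private void walk(Node node, StringBuilder current, List<String> chunks) {
        if (node.getNodeType() == Node.TEXT_NODE) {
          current.append(node.getNodeValue()).append(' ');
          return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) return;
        boolean inline = INLINE_TAGS.contains(node.getNodeName().toLowerCase());
        if (!inline) flush(current, chunks);  // block-level tag ends the chunk
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
          walk(c, current, chunks);
        }
        if (!inline) flush(current, chunks);
      }

      private void flush(StringBuilder current, List<String> chunks) {
        String text = current.toString().replaceAll("\\s+", " ").trim();
        if (text.length() > 0) chunks.add(text);
        current.setLength(0);
      }
    }

Note that treating A as inline still isolates menu items, because the LI or TD around each link is block-level and ends the chunk; a minimum-length filter can then drop those fragments.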
Re: quality of search text
> Hmm... I'm not convinced. How would you generate the best snippet from a relevant but ignored chunk?

Good point... I guess you simply wouldn't generate anything at all (show the title?). Structural text should not be relevant enough to cause a hit at the top of the search results by itself; there should be some other continuous block of text, more relevant to the query, that caused the hit. It's a bit like assigning priority to longer chunks of text over shorter ones - I don't know if my intuition is clear...

D.
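The "longer chunks win" intuition could be as simple as ordering the candidate chunks by word count before picking a snippet source. A tiny sketch, with made-up names:

    import java.util.Comparator;
    import java.util.List;

    /** Orders candidate chunks so longer, more content-like blocks win. */
    public class ChunkPriority {

      /** Sorts chunks that match the query by descending word count. */
      public static void prioritize(List<String> matchingChunks) {
        matchingChunks.sort(
            Comparator.comparingInt(ChunkPriority::wordCount).reversed());
      }

      private static int wordCount(String chunk) {
        String t = chunk.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
      }
    }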
Re: quality of search text
>> I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT etc., so only certain tags would actually break the content into chunks. Snippets would then be generated from these chunks alone, ignoring the rest of the content. If this heuristic is applied only at snippet-generation time, then Andrzej's concern about missing content is no longer relevant.
>
> Hmm... I'm not convinced. How would you generate the best snippet from a relevant but ignored chunk?

Maybe eventually this could be the start of using tags to boost certain sections of the page, as Google probably does. Normal text blocks would have a boost of 1.0, while text within B or H* might be boosted by 1.5. Text within suspected navigation could be de-boosted to 0.25 or so. That might be a more appropriate way of handling the relevance of navigation text: it should have some relevance, just not as much as content. The summary text could then ignore the de-boosted sections to improve readability, unless the content doesn't have a better match - you basically construct a snippet giving preference according to the boost value of each section of text. This all sounds like a lot of work though :)

Howie
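A rough sketch of that boost idea, assuming sections have already been labeled and weighted as described; the scoring rule (query hits times boost) and all names are illustrative guesses, not anything Google or Nutch is known to do:

    import java.util.Comparator;
    import java.util.List;

    /** A section of page text with a relevance boost. */
    public class WeightedSection {
      final String text;
      final float boost;  // e.g. 1.5 for B/H*, 1.0 for body text, 0.25 for nav

      WeightedSection(String text, float boost) {
        this.text = text;
        this.boost = boost;
      }

      /**
       * Picks the snippet source: the best-scoring section that actually
       * matches the query. De-boosted navigation text only wins when no
       * better section matches.
       */
      static WeightedSection pickSnippetSource(List<WeightedSection> sections,
                                               String queryTerm) {
        return sections.stream()
            .filter(s -> countHits(s.text, queryTerm) > 0)
            .max(Comparator.comparingDouble(
                (WeightedSection s) -> countHits(s.text, queryTerm) * s.boost))
            .orElse(null);
      }

      private static int countHits(String text, String term) {
        String t = term.toLowerCase();
        if (t.isEmpty()) return 0;
        String lower = text.toLowerCase();
        int hits = 0, idx = 0;
        while ((idx = lower.indexOf(t, idx)) >= 0) {
          hits++;
          idx += t.length();
        }
        return hits;
      }
    }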
Re: quality of search text
It seems to me that there are two separate problems:
1) content parsing to avoid site structure - influences the index and rankings
2) content parsing for KWIC snippet generation - influences the user perception of the engine.

I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT etc., so only certain tags would actually break the content into chunks. Snippets would then be generated from these chunks alone, ignoring the rest of the content. If this heuristic is applied only at snippet-generation time, then Andrzej's concern about missing content is no longer relevant. Of course, I realize it is tricky in the current architecture, because different filters would be used for KWICs and for indexing...

D.
RE: quality of search text
I too have noticed menu text appearing in the search results.

-----Original Message-----
From: jamie [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 10, 2006 4:39 AM
To: nutch-dev@lucene.apache.org
Subject: quality of search text

Hi everyone,

I don't know if we're doing something wrong, but the quality of the text in the Nutch search results is appalling. To give you an example, the text output for http://www.gamingalmanac.com/ is the following:

... Gaming Industry Research Publications, Worldwide Gaming Almanacs, Bear Stearns Gaming Almanac, Gaming Revenue and Statistics PRODUCT OVERVIEW COMPLETE ANALYST PACKAGE NORTH AMERICAN ALMANAC INDIAN GAMING INDUSTRY REPORT NEVADA GAMING ALMANAC GLOBAL GAMING ALMANAC GLOBAL GAMBLING REPORT MARKET RESEARCH HANDBOOK MICROSOFT MAP POINT Save up to 45% with a Gaming Analyst Package! The Gaming Almanac Family of Products Find every fact, figure, and trend you need on the gaming industry. With current property profiles and statistics, historical and forward-looking financial data, local, regional, and worldwide gaming market summaries, and key player profiles, the Gaming Almanac products from Casino City Press offer information essential to every gaming executive, supplier, and analyst. Titles Include: Casino City ...

whereas Google outputs:

Gaming Industry Research Publications, Worldwide Gaming Almanacs ... The Gaming Almanac products from Casino City Press serve as excellent reference tools for anyone interested in the worldwide and domestic gaming markets. gamingalmanac.com/

Is there any easy way to fix this? The Nutch search results appear to include text from the website menus, etc., which affects the usability of the search results. Where in Nutch would I go about fixing this?

Thanks,
Jamie
Re: quality of search text
Richard Braman wrote:
> I too have noticed menu text appearing in the search results.

The proper place to fix it would be in parse-html, perhaps in DOMContentUtils. However, be warned that this is definitely NOT trivial - i.e. pages don't say "this is a menu, this is body text"; you have to figure it out, and it's hard to come up with a method that works for any layout. You may hardcode something that works well for your target group of hosts, with pre-determined page layouts.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
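To illustrate the kind of pass one could bolt on around the text extraction in DOMContentUtils: prune DOM subtrees whose class or id attributes hint at navigation, before extracting text. The hint list is a pure heuristic and this is a sketch, not actual Nutch code - as noted above, no such labels are guaranteed to exist:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.w3c.dom.Element;
    import org.w3c.dom.Node;

    /** Removes DOM subtrees that look like navigation, before indexing. */
    public class NavPruner {

      // Heuristic class/id substrings that often mark navigation.
      private static final Set<String> NAV_HINTS = new HashSet<String>(
          Arrays.asList("nav", "menu", "sidebar", "footer", "breadcrumb"));

      public static void prune(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
          Node next = child.getNextSibling();  // grab before possible removal
          if (looksLikeNav(child)) {
            node.removeChild(child);
          } else {
            prune(child);
          }
          child = next;
        }
      }

      private static boolean looksLikeNav(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) return false;
        Element e = (Element) node;
        String hint = (e.getAttribute("class") + " " + e.getAttribute("id"))
            .toLowerCase();
        for (String h : NAV_HINTS) {
          if (hint.contains(h)) return true;
        }
        return false;
      }
    }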
RE: quality of search text
> it doesn't say in pages this is menu, this is body text

Agreed, it doesn't say that.

> this is definitely NOT trivial

This isn't trivial, but it is rather important.

> it's hard to come up with a method that works for any layout.

Here is a potential algorithm: look first to the Meta Description; if none exists, look for a continuous block of text, and ignore content that doesn't contain a continuous block of text. If a given HTML tag only contains a few words of text, it is not content, but rather part of the nav structure of the page. (See the sketch below.)

Here is yet another algorithm. When fetching pages from a particular web, analyze the structure of the page and try to determine what content stays similar from page to page within the same web. That would usually be menus, headers, footers, etc. Granted, the menus may change slightly from page to page, which is why the algorithm would be pattern-based instead of literal. Once you determine what is navigation and what is content, you would only parse and index the content.

I think algorithm #1 is what Google uses: Google ignores content that does not change from page to page, as well as content that isn't part of a block of text.

Comments please.
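A minimal sketch of algorithm #1 above, assuming the page has already been split into continuous text chunks; the MIN_WORDS threshold and all names are made up and would need tuning:

    import java.util.ArrayList;
    import java.util.List;

    /** Algorithm #1: meta description first, else continuous blocks,
     *  dropping short fragments as likely navigation. */
    public class Algorithm1 {

      private static final int MIN_WORDS = 10;  // arbitrary, needs tuning

      /** metaDescription may be null when the page has none. */
      public static List<String> selectContent(String metaDescription,
                                               List<String> chunks) {
        List<String> content = new ArrayList<String>();
        if (metaDescription != null && metaDescription.trim().length() > 0) {
          content.add(metaDescription.trim());
          return content;
        }
        for (String chunk : chunks) {
          // A tag holding only a few words is nav structure, not content.
          if (chunk.trim().split("\\s+").length >= MIN_WORDS) {
            content.add(chunk);
          }
        }
        return content;
      }
    }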
Re: quality of search text
Richard Braman wrote:
> Here is a potential algorithm: look first to the Meta Description; if none exists, look for a continuous block of text, and ignore content that doesn't contain a continuous block of text. If a given HTML tag only contains a few words of text, it is not content, but rather part of the nav structure of the page.

You may potentially miss a lot of content this way; nowadays many pages freely mix in markup in the main content area...

> Here is yet another algorithm. When fetching pages from a particular web, analyze the structure of the page and try to determine what content stays similar from page to page within the same web. That would usually be menus, headers, footers, etc.

This requires collecting pages in advance to train the structure recognizer, and preparing profiles for groups of pages with a common layout.

> Granted, the menus may change slightly from page to page, which is why the algorithm would be pattern-based instead of literal. Once you determine what is navigation and what is content, you would only parse and index the content.
>
> I think algorithm #1 is what Google uses: Google ignores content that does not change from page to page, as well as content that isn't part of a block of text.
>
> Comments please.

The best way to evaluate this would be to ...erhm... evaluate these algorithms on a set of reference pages. Would you like to implement one or both algorithms and test them?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
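One simple way to run such an evaluation: token precision/recall of the extracted text against a hand-labeled reference version of each page. A sketch - the metric choice is an assumption, not an established benchmark:

    import java.util.HashSet;
    import java.util.Set;

    /** Scores extracted text against a hand-labeled reference text. */
    public class ExtractionEval {

      /** Returns { precision, recall } over lower-cased word tokens. */
      public static double[] precisionRecall(String extracted, String reference) {
        Set<String> got = tokens(extracted);
        Set<String> want = tokens(reference);
        Set<String> hit = new HashSet<String>(got);
        hit.retainAll(want);
        double precision = got.isEmpty() ? 0.0 : (double) hit.size() / got.size();
        double recall = want.isEmpty() ? 0.0 : (double) hit.size() / want.size();
        return new double[] { precision, recall };
      }

      private static Set<String> tokens(String text) {
        Set<String> out = new HashSet<String>();
        for (String t : text.toLowerCase().split("\\W+")) {
          if (t.length() > 0) out.add(t);
        }
        return out;
      }
    }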
RE: quality of search text
> nowadays many pages freely mix in markup in the main content area...

Yes, but if that content was nested in a larger block of content, then it would be included. I will probably end up implementing some of these algorithms, but I would like some good feedback before I go out on a limb.
Re: quality of search text
> I think algorithm #1 is what Google uses: Google ignores content that does not change from page to page, as well as content that isn't part of a block of text.

Are you sure? Take a look at these search results:
http://www.google.com/search?hl=enhs=otTlr=c2coff=1safe=offclient=firefox-arls=org.mozilla:en-US:officialpwst=1q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by Google and displayed in summaries.

But if you can contribute an HtmlParseFilter with the ability to remove menus and navigation, it will be a real improvement. A first step, which I developed in a previous project many years ago, is to remove pages that contain textual content only in links: it avoids indexing frames or iframes that contain only navigation text... (A sketch of this link-text test follows below.)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
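A sketch of that link-text test: walk the DOM, compare the amount of text inside A elements to the total, and skip the page when the ratio is high. The threshold and names are illustrative assumptions:

    import org.w3c.dom.Node;

    /** Flags pages whose textual content lives almost entirely inside
     *  links - typically nav-only frames or iframes. */
    public class LinkTextRatio {

      private long totalChars = 0;
      private long linkChars = 0;

      /** True if at least `threshold` of the page's text is link text. */
      public static boolean isMostlyLinkText(Node root, double threshold) {
        LinkTextRatio r = new LinkTextRatio();
        r.walk(root, false);
        return r.totalChars > 0
            && (double) r.linkChars / r.totalChars >= threshold;
      }

      private void walk(Node node, boolean insideLink) {
        if (node.getNodeType() == Node.TEXT_NODE) {
          int len = node.getNodeValue().trim().length();
          totalChars += len;
          if (insideLink) linkChars += len;
          return;
        }
        boolean link = insideLink
            || (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName()));
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
          walk(c, link);
        }
      }
    }

A caller would then do something like: if (LinkTextRatio.isMostlyLinkText(doc, 0.95)) { skip the page }.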