Re: quality of search text
Dawid Weiss wrote:
> It seems to me that there are two separate problems:
> 1) content parsing to avoid site structure - influences the index and rankings
> 2) content parsing for KWIC snippet generation - influences the user perception of the engine.
> I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT etc., so only certain tags would actually break the content into chunks. Snippets would then be generated from these chunks alone, ignoring the rest of the content. If this heuristic is applied only at snippet-generation time, then Andrzej's concern about missing content is no longer relevant.

Hmm... I'm not convinced. How would you generate the best snippet from a relevant but ignored chunk? But I agree that for some (perhaps large) percentage of sites this heuristic could work well, and it's simple enough to be easily implemented.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
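For concreteness, here is a minimal sketch of the continuous-block chunking described above, written in Java against the w3c DOM that parse-html produces. The class name, the inline-tag set, and the decision to treat A as inline are illustrative assumptions, not anything that exists in Nutch:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.w3c.dom.Node;

    /** Splits a DOM into text chunks, treating inline tags as transparent. */
    public class ContinuousTextChunker {

      // Inline tags that do NOT break a continuous block of text.
      private static final Set<String> INLINE_TAGS = new HashSet<String>(
          Arrays.asList("span", "i", "b", "em", "strong", "tt", "u", "font", "a"));

      public List<String> chunk(Node root) {
        List<String> chunks = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        walk(root, current, chunks);
        flush(current, chunks);
        return chunks;
      }

      private void walk(Node node, StringBuilder current, List<String> chunks) {
        if (node.getNodeType() == Node.TEXT_NODE) {
          current.append(node.getNodeValue()).append(' ');
          return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) return;
        boolean inline = INLINE_TAGS.contains(node.getNodeName().toLowerCase());
        if (!inline) flush(current, chunks);  // block-level tag ends the chunk
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
          walk(c, current, chunks);
        }
        if (!inline) flush(current, chunks);
      }

      private void flush(StringBuilder current, List<String> chunks) {
        String text = current.toString().replaceAll("\\s+", " ").trim();
        if (text.length() > 0) chunks.add(text);
        current.setLength(0);
      }
    }

Note that treating A as inline still isolates menu items, because the LI or TD around each link is block-level and ends the chunk; a minimum-length filter can then drop those fragments.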
Re: quality of search text
> Hmm... I'm not convinced. How would you generate the best snippet from a relevant but ignored chunk?

Good point... I guess you simply wouldn't generate anything at all (show the title?). Structural text should not be relevant enough to cause a hit at the top of the search results by itself; there should be some other continuous block of text, more relevant to the query, that caused the hit. It's a bit like assigning priority to longer chunks of text over shorter ones - I don't know if my intuition is clear...

D.
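The "longer chunks win" intuition could be as simple as ordering the candidate chunks by word count before picking a snippet source. A tiny sketch, with made-up names:

    import java.util.Comparator;
    import java.util.List;

    /** Orders candidate chunks so longer, more content-like blocks win. */
    public class ChunkPriority {

      /** Sorts chunks that match the query by descending word count. */
      public static void prioritize(List<String> matchingChunks) {
        matchingChunks.sort(
            Comparator.comparingInt(ChunkPriority::wordCount).reversed());
      }

      private static int wordCount(String chunk) {
        String t = chunk.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
      }
    }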
Re: quality of search text
>> I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT etc., so only certain tags would actually break the content into chunks. Snippets would then be generated from these chunks alone, ignoring the rest of the content. If this heuristic is applied only at snippet-generation time, then Andrzej's concern about missing content is no longer relevant.
>
> Hmm... I'm not convinced. How would you generate the best snippet from a relevant but ignored chunk?

Maybe eventually this could be the start of using tags to boost certain sections of the page, as Google probably does. Normal text blocks would have a boost of 1.0, while text within B or H* might be boosted by 1.5. Text within suspected navigation could be de-boosted to 0.25 or so. That might be a more appropriate way of handling the relevance of navigation text: it should have some relevance, just not as much as content. The summary text could then ignore the de-boosted sections to improve readability, unless the content doesn't have a better match - you basically construct a snippet giving preference according to the boost value of each section of text. This all sounds like a lot of work though :)

Howie
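A rough sketch of that boost idea, assuming sections have already been labeled and weighted as described; the scoring rule (query hits times boost) and all names are illustrative guesses, not anything Google or Nutch is known to do:

    import java.util.Comparator;
    import java.util.List;

    /** A section of page text with a relevance boost. */
    public class WeightedSection {
      final String text;
      final float boost;  // e.g. 1.5 for B/H*, 1.0 for body text, 0.25 for nav

      WeightedSection(String text, float boost) {
        this.text = text;
        this.boost = boost;
      }

      /**
       * Picks the snippet source: the best-scoring section that actually
       * matches the query. De-boosted navigation text only wins when no
       * better section matches.
       */
      static WeightedSection pickSnippetSource(List<WeightedSection> sections,
                                               String queryTerm) {
        return sections.stream()
            .filter(s -> countHits(s.text, queryTerm) > 0)
            .max(Comparator.comparingDouble(
                (WeightedSection s) -> countHits(s.text, queryTerm) * s.boost))
            .orElse(null);
      }

      private static int countHits(String text, String term) {
        String t = term.toLowerCase();
        if (t.isEmpty()) return 0;
        String lower = text.toLowerCase();
        int hits = 0, idx = 0;
        while ((idx = lower.indexOf(t, idx)) >= 0) {
          hits++;
          idx += t.length();
        }
        return hits;
      }
    }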
Re: quality of search text
It seems to me that there are two separate problems:
1) content parsing to avoid site structure - influences the index and rankings
2) content parsing for KWIC snippet generation - influences the user perception of the engine.

I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT etc., so only certain tags would actually break the content into chunks. Snippets would then be generated from these chunks alone, ignoring the rest of the content. If this heuristic is applied only at snippet-generation time, then Andrzej's concern about missing content is no longer relevant. Of course, I realize it is tricky in the current architecture, because different filters would be used for KWICs and for indexing...

D.
RE: quality of search text
I too have noticed menu text appearing in the search results.

-----Original Message-----
From: jamie [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 10, 2006 4:39 AM
To: nutch-dev@lucene.apache.org
Subject: quality of search text

Hi everyone,

I don't know if we're doing something wrong, but the quality of the text in the Nutch search results is appalling. To give you an example, the text output for http://www.gamingalmanac.com/ is the following:

... Gaming Industry Research Publications, Worldwide Gaming Almanacs, Bear Stearns Gaming Almanac, Gaming Revenue and Statistics PRODUCT OVERVIEW COMPLETE ANALYST PACKAGE NORTH AMERICAN ALMANAC INDIAN GAMING INDUSTRY REPORT NEVADA GAMING ALMANAC GLOBAL GAMING ALMANAC GLOBAL GAMBLING REPORT MARKET RESEARCH HANDBOOK MICROSOFT MAP POINT Save up to 45% with a Gaming Analyst Package! The Gaming Almanac Family of Products Find every fact, figure, and trend you need on the gaming industry. With current property profiles and statistics, historical and forward-looking financial data, local, regional, and worldwide gaming market summaries, and key player profiles, the Gaming Almanac products from Casino City Press offer information essential to every gaming executive, supplier, and analyst. Titles Include: Casino City ...

whereas Google outputs:

Gaming Industry Research Publications, Worldwide Gaming Almanacs ... The Gaming Almanac products from Casino City Press serve as excellent reference tools for anyone interested in the worldwide and domestic gaming markets. gamingalmanac.com/

Is there any easy way to fix this? The Nutch search results appear to include text from the website menus, etc., which affects the usability of the search results. Where in Nutch would I go about fixing this?

Thanks,
Jamie
Re: quality of search text
Richard Braman wrote:
> I too have noticed menu text appearing in the search results.

The proper place to fix it would be in parse-html, perhaps in DOMContentUtils. However, be warned that this is definitely NOT trivial - i.e. pages don't say "this is a menu, this is body text"; you have to figure it out, and it's hard to come up with a method that works for any layout. You may hardcode something that works well for your target group of hosts, with pre-determined page layouts.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
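To illustrate the kind of pass one could bolt on around the text extraction in DOMContentUtils: prune DOM subtrees whose class or id attributes hint at navigation, before extracting text. The hint list is a pure heuristic and this is a sketch, not actual Nutch code - as noted above, no such labels are guaranteed to exist:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.w3c.dom.Element;
    import org.w3c.dom.Node;

    /** Removes DOM subtrees that look like navigation, before indexing. */
    public class NavPruner {

      // Heuristic class/id substrings that often mark navigation.
      private static final Set<String> NAV_HINTS = new HashSet<String>(
          Arrays.asList("nav", "menu", "sidebar", "footer", "breadcrumb"));

      public static void prune(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
          Node next = child.getNextSibling();  // grab before possible removal
          if (looksLikeNav(child)) {
            node.removeChild(child);
          } else {
            prune(child);
          }
          child = next;
        }
      }

      private static boolean looksLikeNav(Node node) {
        if (node.getNodeType() != Node.ELEMENT_NODE) return false;
        Element e = (Element) node;
        String hint = (e.getAttribute("class") + " " + e.getAttribute("id"))
            .toLowerCase();
        for (String h : NAV_HINTS) {
          if (hint.contains(h)) return true;
        }
        return false;
      }
    }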
RE: quality of search text
> it doesn't say in pages this is menu, this is body text

Agreed, it doesn't say that.

> this is definitely NOT trivial

This isn't trivial, but it is rather important.

> it's hard to come up with a method that works for any layout.

Here is a potential algorithm: look first to the Meta Description; if none exists, look for a continuous block of text, and ignore content that doesn't contain a continuous block of text. If a given HTML tag only contains a few words of text, it is not content, but rather part of the nav structure of the page. (See the sketch below.)

Here is yet another algorithm. When fetching pages from a particular web, analyze the structure of the page and try to determine what content stays similar from page to page within the same web. That would usually be menus, headers, footers, etc. Granted, the menus may change slightly from page to page, which is why the algorithm would be pattern-based instead of literal. Once you determine what is navigation and what is content, you would only parse and index the content.

I think algorithm #1 is what Google uses: Google ignores content that does not change from page to page, as well as content that isn't part of a block of text.

Comments please.
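A minimal sketch of algorithm #1 above, assuming the page has already been split into continuous text chunks; the MIN_WORDS threshold and all names are made up and would need tuning:

    import java.util.ArrayList;
    import java.util.List;

    /** Algorithm #1: meta description first, else continuous blocks,
     *  dropping short fragments as likely navigation. */
    public class Algorithm1 {

      private static final int MIN_WORDS = 10;  // arbitrary, needs tuning

      /** metaDescription may be null when the page has none. */
      public static List<String> selectContent(String metaDescription,
                                               List<String> chunks) {
        List<String> content = new ArrayList<String>();
        if (metaDescription != null && metaDescription.trim().length() > 0) {
          content.add(metaDescription.trim());
          return content;
        }
        for (String chunk : chunks) {
          // A tag holding only a few words is nav structure, not content.
          if (chunk.trim().split("\\s+").length >= MIN_WORDS) {
            content.add(chunk);
          }
        }
        return content;
      }
    }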
Re: quality of search text
Richard Braman wrote:
> Here is a potential algorithm: look first to the Meta Description; if none exists, look for a continuous block of text, and ignore content that doesn't contain a continuous block of text. If a given HTML tag only contains a few words of text, it is not content, but rather part of the nav structure of the page.

You may potentially miss a lot of content this way; nowadays many pages freely mix in markup in the main content area...

> Here is yet another algorithm. When fetching pages from a particular web, analyze the structure of the page and try to determine what content stays similar from page to page within the same web. That would usually be menus, headers, footers, etc.

This requires collecting pages in advance to train the structure recognizer, and preparing profiles for groups of pages with a common layout.

> Granted, the menus may change slightly from page to page, which is why the algorithm would be pattern-based instead of literal. Once you determine what is navigation and what is content, you would only parse and index the content.
>
> I think algorithm #1 is what Google uses: Google ignores content that does not change from page to page, as well as content that isn't part of a block of text.
>
> Comments please.

The best way to evaluate this would be to ...erhm... evaluate these algorithms on a set of reference pages. Would you like to implement one or both algorithms and test them?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
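One simple way to run such an evaluation: token precision/recall of the extracted text against a hand-labeled reference version of each page. A sketch - the metric choice is an assumption, not an established benchmark:

    import java.util.HashSet;
    import java.util.Set;

    /** Scores extracted text against a hand-labeled reference text. */
    public class ExtractionEval {

      /** Returns { precision, recall } over lower-cased word tokens. */
      public static double[] precisionRecall(String extracted, String reference) {
        Set<String> got = tokens(extracted);
        Set<String> want = tokens(reference);
        Set<String> hit = new HashSet<String>(got);
        hit.retainAll(want);
        double precision = got.isEmpty() ? 0.0 : (double) hit.size() / got.size();
        double recall = want.isEmpty() ? 0.0 : (double) hit.size() / want.size();
        return new double[] { precision, recall };
      }

      private static Set<String> tokens(String text) {
        Set<String> out = new HashSet<String>();
        for (String t : text.toLowerCase().split("\\W+")) {
          if (t.length() > 0) out.add(t);
        }
        return out;
      }
    }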
RE: quality of search text
> nowadays many pages freely mix in markup in the main content area...

Yes, but if that content was nested in a larger block of content, then it would be included. I will probably end up implementing some of these algorithms, but I would like some good feedback before I go out on a limb.
Re: quality of search text
> I think algorithm #1 is what Google uses: Google ignores content that does not change from page to page, as well as content that isn't part of a block of text.

Are you sure? Take a look at these search results:
http://www.google.com/search?hl=enhs=otTlr=c2coff=1safe=offclient=firefox-arls=org.mozilla:en-US:officialpwst=1q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by Google and displayed in summaries.

But if you can contribute an HtmlParseFilter with the ability to remove menus and navigation, it will be a real improvement. A first step, which I developed in a previous project many years ago, is to remove pages that contain textual content only in links: it avoids indexing frames or iframes that contain only navigation text... (A sketch of this link-text test follows below.)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
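A sketch of that link-text test: walk the DOM, compare the amount of text inside A elements to the total, and skip the page when the ratio is high. The threshold and names are illustrative assumptions:

    import org.w3c.dom.Node;

    /** Flags pages whose textual content lives almost entirely inside
     *  links - typically nav-only frames or iframes. */
    public class LinkTextRatio {

      private long totalChars = 0;
      private long linkChars = 0;

      /** True if at least `threshold` of the page's text is link text. */
      public static boolean isMostlyLinkText(Node root, double threshold) {
        LinkTextRatio r = new LinkTextRatio();
        r.walk(root, false);
        return r.totalChars > 0
            && (double) r.linkChars / r.totalChars >= threshold;
      }

      private void walk(Node node, boolean insideLink) {
        if (node.getNodeType() == Node.TEXT_NODE) {
          int len = node.getNodeValue().trim().length();
          totalChars += len;
          if (insideLink) linkChars += len;
          return;
        }
        boolean link = insideLink
            || (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName()));
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
          walk(c, link);
        }
      }
    }

A caller would then do something like: if (LinkTextRatio.isMostlyLinkText(doc, 0.95)) { skip the page }.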