RE: How to ignore search results that don't have related keywords in main body?

2009-10-11 Thread MilleBii
Just use the creativecommon library to manipulate the DOM in your new parser.

Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread winz
Venkateshprasanna wrote: Hi, You can very well think of doing that if you know that you will crawl and index only a selected set of web pages that follow the same design. Otherwise, it would turn out to be a never-ending process - i.e., finding out the sections, frames, divs, spans,

Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki
winz wrote: Venkateshprasanna wrote: Hi, You can very well think of doing that if you know that you will crawl and index only a selected set of web pages that follow the same design. Otherwise, it would turn out to be a never-ending process - i.e., finding out the sections, frames, divs,

RE: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread BELLINI ADAM
Hi guys, it's just what I'm talking about in my post 'indexing just certain content'... you can read it, maybe it could help you... I was asking how to get rid of the garbage sections in a document and to parse only the important data... so I guess you will create your own parser and

Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki
BELLINI ADAM wrote: Hi guys, it's just what I'm talking about in my post 'indexing just certain content'... you can read it, maybe it could help you... I was asking how to get rid of the garbage sections in a document and to parse only the important data... so I guess you will create your own

RE: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread BELLINI ADAM
As I explained in my post, the sections I don't want to index are headers, top menus, right menus, left menus: this is what I mean by garbage. <div id='header'> bla bla </div> <div id='top_menu'> bla bla </div> <div id='left_menu'> bla bla </div> <div id='right_menu'> bla bla </div> each
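
A minimal sketch of the idea in this post, in plain Python rather than as a Nutch parser plugin: collect the text of a page while skipping every div whose id marks it as "garbage". The four ids come from the example markup above; the class and function names (MainTextExtractor, extract_main_text) are hypothetical, and the sketch assumes the unwanted sections are always div elements carrying one of those ids.

```python
from html.parser import HTMLParser

# Ids of the sections to ignore, taken from the example markup in the post.
SKIP_IDS = {"header", "top_menu", "left_menu", "right_menu"}


class MainTextExtractor(HTMLParser):
    """Collect page text, ignoring everything inside <div> elements
    whose id is listed in SKIP_IDS (hypothetical helper, not a Nutch API)."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped <div>
        self.chunks = []     # text fragments found outside skipped sections

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.skip_depth:
            # Nested <div> inside a skipped section: track it so the
            # matching </div> does not end the skip too early.
            self.skip_depth += 1
        elif dict(attrs).get("id") in SKIP_IDS:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the text of `html` with the SKIP_IDS sections removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<div id='header'>bla bla</div>"
        "<div id='left_menu'><div>bla bla</div></div>"
        "<div id='content'>the important data</div>"
    )
    print(extract_main_text(page))
```

The list of ids is specific to one site design, so it would need to be maintained per site.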

Re: How to ignore search results that don't have related keywords in main body?

2009-03-23 Thread Venkateshprasanna
Hi, You can very well think of doing that if you know that you will crawl and index only a selected set of web pages that follow the same design. Otherwise, it would turn out to be a never-ending process - i.e., finding out the sections, frames, divs, spans, CSS classes and the like - from

How to ignore search results that don't have related keywords in main body?

2009-03-22 Thread dealmaker
Most webpages have sections like navigation, header, left column for related links, footer, etc. How can I prevent Nutch from returning search results that contain keywords only in the non-main body of the page? For example, keywords can appear in the navigation bar or footer, but they may not appear in
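
To make the question concrete, here is a rough sketch in plain Python of the filtering being asked for. It does not use Nutch; the function names and the "main_text" field are hypothetical. It assumes the main-body text of each page has already been separated from navigation, header and footer text, and that a result should be kept only when every query keyword appears in that main-body text.

```python
import re


def keywords_in_main_body(keywords, main_text):
    """Return True if every keyword occurs as a whole word in main_text.

    Matching is case-insensitive. Requiring *all* keywords is an
    assumption; a real search engine may rank partial matches instead.
    """
    words = set(re.findall(r"\w+", main_text.lower()))
    return all(keyword.lower() in words for keyword in keywords)


def filter_results(keywords, results):
    """Keep only results whose main-body text contains the keywords.

    Each result is a dict with a hypothetical "main_text" field holding
    the page text with navigation, header and footer already removed.
    """
    return [r for r in results if keywords_in_main_body(keywords, r["main_text"])]


if __name__ == "__main__":
    results = [
        {"url": "a", "main_text": "Apache Nutch crawler tutorial"},
        {"url": "b", "main_text": "Cooking recipes"},  # keywords only in its menu
    ]
    for hit in filter_results(["nutch", "crawler"], results):
        print(hit["url"])
```

Filtering at index time, by indexing only the main-body text, would avoid this per-query check.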