Just use the creativecommon library to manipulate the DOM in your new parser.
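A minimal sketch of what "manipulate the DOM" can look like, assuming the parser hands you a W3C-style DOM tree. Python's standard-library xml.dom.minidom stands in for the real parser's DOM here. The sample page and the id 'header' are made up, and nothing below is Nutch or creativecommon code.

```python
from xml.dom import minidom


def remove_elements_by_id(node, unwanted_ids):
    """Detach every element whose id attribute is in unwanted_ids."""
    # Copy the child list first: removing nodes while iterating the live
    # childNodes list would skip siblings.
    for child in list(node.childNodes):
        if (child.nodeType == child.ELEMENT_NODE
                and child.getAttribute("id") in unwanted_ids):
            node.removeChild(child)          # drops the whole subtree
        else:
            remove_elements_by_id(child, unwanted_ids)


def visible_text(node):
    """Concatenate the text nodes still attached under node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(visible_text(child) for child in node.childNodes)


# Invented, well-formed sample page (assumption: one unwanted div, id 'header').
page = "<html><body><div id='header'>Site name</div><p>Actual article.</p></body></html>"
doc = minidom.parseString(page)
remove_elements_by_id(doc.documentElement, {"header"})
print(visible_text(doc.documentElement))  # prints: Actual article.
```

minidom only accepts well-formed XML, so this works on the tidy sample above; real-world HTML needs a tolerant parser in front of it.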
Venkateshprasanna wrote:
Hi,
You can very well think of doing that if you know that you would crawl and
index only a selected set of web pages, which follow the same design.
Otherwise, it would turn out to be a never-ending process - i.e., finding
out the sections, frames, divs, spans, CSS classes and the like - from
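Venkateshprasanna's caveat, sketched as data: assuming you keep one hand-written rule per site design, the table is all the crawler knows, and every site outside it is indexed with its menus intact. Hosts, ids and URLs are invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical hand-maintained table: one entry per site design that
# somebody has inspected. Hosts and ids are assumptions, not real sites.
SITE_RULES = {
    "intranet.example.com": {"header", "top_menu", "left_menu", "right_menu"},
    "docs.example.com": {"navbar", "footer"},
}


def ids_to_strip(url):
    """Return the ids to drop for this URL, or an empty set for unknown sites."""
    host = urlparse(url).hostname or ""
    return SITE_RULES.get(host, set())


def coverage(urls):
    """Split crawled URLs into those that have a rule and those that do not."""
    known = [u for u in urls if ids_to_strip(u)]
    unknown = [u for u in urls if not ids_to_strip(u)]
    return known, unknown


# Invented crawl list: two sites with a rule, one without.
crawl = [
    "http://intranet.example.com/news/1.html",
    "http://docs.example.com/guide.html",
    "http://blog.example.org/post/42",
]
known, unknown = coverage(crawl)
print(len(known), "covered;", len(unknown), "would be indexed with menus intact")
# prints: 2 covered; 1 would be indexed with menus intact
```

On a closed set of sites the table stays small. On an open crawl the "unknown" list is where the never-ending part comes from.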
winz wrote:
Hi guys,
it's just what I'm talking about in my post 'indexing just certain content'...
you can read it, maybe it could help you...
I was asking how to get rid of the garbage sections in a document and to parse
only the important data... so I guess you will create your own parser and
BELLINI ADAM wrote:
As I explained in my post, the sections I don't want to index are headers,
top menus, right menus, left menus:
this is what I mean by garbage.
<div id='header'> bla bla </div>
<div id='top_menu'> bla bla </div>
<div id='left_menu'> bla bla </div>
<div id='right_menu'> bla bla </div>
each
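A sketch of skipping those four sections while extracting text, assuming the garbage is always a div carrying one of the ids listed above. It uses Python's standard-library html.parser, which copes with unclosed tags. The sample page and the 'content' id are invented for illustration.

```python
from html.parser import HTMLParser

# The four ids named in the post above.
GARBAGE_IDS = {"header", "top_menu", "left_menu", "right_menu"}


class MainTextExtractor(HTMLParser):
    """Collect text, skipping everything inside a <div> with a garbage id."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while we are inside a garbage <div>
        self.chunks = []      # text pieces that survive the filter

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.skip_depth:
            self.skip_depth += 1        # nested <div> inside a garbage one
        elif dict(attrs).get("id") in GARBAGE_IDS:
            self.skip_depth = 1         # entering a garbage <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1        # leaving one level of garbage

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


# Invented sample page; note the unclosed <p> and the bare <br>.
page = """
<div id='header'>My Site <div class='logo'>logo</div></div>
<div id='top_menu'>Home | Products | Contact</div>
<div id='left_menu'>Related links</div>
<div id='content'><p>The text worth indexing.<br>Second line.</div>
<div id='right_menu'>Ads</div>
"""
extractor = MainTextExtractor()
extractor.feed(page)
extractor.close()
print(" ".join(extractor.chunks))  # prints: The text worth indexing. Second line.
```

It counts only div nesting, so a stray or missing closing div inside a garbage section would throw the depth off. Treat it as a sketch, not a hardened filter.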
Most webpages have sections like navigation, header, left column for related
links, footer, etc. How can I prevent Nutch from returning a page in the search
results when the keywords occur only in the non-main part of the page? E.g.,
keywords can appear in the navigation bar or footer, but they may not appear in
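The problem in miniature, assuming each page has already been split into boilerplate text and main text (that split is the hard part the rest of the thread is about). This is a toy inverted index, not Nutch's indexer; the pages and words are made up.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


# Invented sample data: page name -> (boilerplate text, main text).
pages = {
    "about.html":   ("Home Products Contact", "We build bicycles."),
    "contact.html": ("Home Products Contact", "Contact us by phone or mail."),
}

# Index 1: everything on the page, menus included.
whole_page = build_index({n: nav + " " + main for n, (nav, main) in pages.items()})
# Index 2: main text only.
main_only = build_index({n: main for n, (nav, main) in pages.items()})

print(sorted(whole_page["contact"]))  # prints: ['about.html', 'contact.html']
print(sorted(main_only["contact"]))   # prints: ['contact.html']
```

With the menu text indexed, a search for "contact" returns every page that carries the menu. With main text only, it returns just the page that is actually about contacting.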