Suppose I want Nutch to fetch URLs and

a) follow links in HTML documents *only*
b) pass the HTML and all other documents it finds to some external tool as is, i.e. unparsed.

Is it sufficient to activate only the parse-html plugin out of all the parse-* plugins, or is even that unnecessary?

(Is there a more detailed description of what the individual stages of Nutch do, beyond the tutorial?)

Thanks,
Harald.
