Why does nutch need to parse documents --- clarification needed

Suppose I want nutch to fetch URLs and

a) follow links in HTML documents *only*

b) pass HTML and all other documents found to some external tool as is, i.e. unparsed.

Is it correct that it is sufficient to activate only the parse-html plugin out of all the parse-* plugins, or is even this not necessary?

(Is there a more detailed description of what the individual stages of nutch do beyond the tutorial?)


Thanks,
Harald.
