Hi Harald,

> it is sufficient to only activate the parse-html plugin

Yes. If parse-tika is active as well, other document types (PDFs, etc.) are also searched for links.
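For reference, which parse plugins are active is controlled by the plugin.includes property in nutch-site.xml. A sketch of a value that keeps parse-html but drops parse-tika; the other plugins listed are illustrative and depend on your Nutch version and setup:

```xml
<!-- nutch-site.xml: include parse-html but NOT parse-tika, so only HTML
     documents are parsed for outlinks. The surrounding plugin list is an
     example; adjust it to match your configuration. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```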
> or is even this not necessary

You need to parse HTML; it's impossible to extract links without parsing it. Think of relative links (the base URL), <!-- comments -->, <![CDATA[...]]> sections, and other subtleties which will trip up other approaches to link extraction (e.g., regular expressions).

> b) provide HTML and all other documents found as such to some external tool
>    as is, i.e. unparsed.

Make sure that the raw content is stored (in the segments or in the WebTable), cf. the property fetcher.store.content.

> (Is there a more detailed description of what the individual stages of nutch
> do beyond the tutorial?)

Still a good introduction: Andrzej Białecki's chapter on Nutch in "Hadoop: The Definitive Guide" by Tom White.

Sebastian

On 07/01/2014 03:12 PM, Harald Kirsch wrote:
> Suppose I want nutch to fetch URLs and
>
> a) follow links in HTML documents *only*
> b) provide HTML and all other documents found as such to some external tool
>    as is, i.e. unparsed.
>
> Is it correct that it is sufficient to only activate the parse-html plugin
> from all the parse-* plugins, or is even this not necessary?
>
> (Is there a more detailed description of what the individual stages of nutch
> do beyond the tutorial?)
>
> Thanks,
> Harald.
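For the raw-content point above, a sketch of the corresponding nutch-site.xml entry (property name as in Nutch's configuration; check nutch-default.xml for the default in your version):

```xml
<!-- nutch-site.xml: keep the raw fetched bytes so an external tool can
     read the unparsed documents from the segments later. -->
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
</property>
```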
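The point about regex-based link extraction above can be illustrated with a small, self-contained Python sketch (not Nutch code; the HTML sample is made up). A naive href regex picks up a link hidden inside a comment and cannot resolve a relative link, while a real HTML parser handles both:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Sample page: one link is commented out, one is relative and needs <base>.
html = """
<html><head><base href="http://example.com/docs/"></head>
<body>
<!-- <a href="http://example.com/commented-out">dead link</a> -->
<a href="page.html">relative link</a>
</body></html>
"""

# Naive regex extraction: also matches the commented-out link and leaves
# the relative URL unresolved.
naive = re.findall(r'href="([^"]+)"', html)

class LinkExtractor(HTMLParser):
    """Collects <a href> targets, resolving them against the <base> URL.
    Comment contents never reach handle_starttag, so the dead link is skipped."""
    def __init__(self):
        super().__init__()
        self.base = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and "href" in attrs:
            self.base = attrs["href"]
        elif tag == "a" and "href" in attrs:
            self.links.append(urljoin(self.base, attrs["href"]))

parser = LinkExtractor()
parser.feed(html)
print(naive)         # includes the commented-out URL and bare "page.html"
print(parser.links)  # only the real, fully resolved link
```

Nutch's parse-html plugin does this job (and more) for you; the sketch only shows why a regex shortcut is unreliable.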

