Hi Harald,

> it is sufficient to only activate the parse-html plugin
Yes. If parse-tika is also active, other document types
(PDFs, etc.) are searched for links as well.
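
A sketch of the relevant nutch-site.xml override: keep parse-html and drop parse-tika from plugin.includes. The other plugin names in the value are illustrative defaults; match them to the ones in your installation's nutch-default.xml.

```xml
<!-- nutch-site.xml: parse only HTML so that only HTML documents are
     searched for links. Plugin names besides parse-html are examples;
     copy the rest from your nutch-default.xml and just remove parse-tika. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```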

> or is even this not necessary
You do need to parse the HTML: it's impossible to extract links
reliably without parsing. Think of relative links (base URL),
<!-- comments -->, <![CDATA[...]]>, and other subtleties that will
trip up other approaches to link extraction (e.g., regular expressions).
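
To illustrate the point (this is not Nutch code, just a minimal sketch): a real HTML parser resolves relative links against <base href> and never sees links inside comments, whereas a naive regular expression would get both wrong.

```python
# Sketch: why parsing beats regexes for link extraction.
# Honors <base href> and ignores links inside <!-- comments -->.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url   # fallback base: the page's own URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and "href" in attrs:
            self.base = attrs["href"]        # <base> overrides the page URL
        elif tag == "a" and "href" in attrs:
            self.links.append(urljoin(self.base, attrs["href"]))
    # links inside comments reach handle_comment, not handle_starttag,
    # so they are never extracted

html = ('<html><head><base href="http://example.com/docs/"></head>'
        '<body><a href="page.html">ok</a>'
        '<!-- <a href="dead.html">commented out</a> --></body></html>')
p = LinkExtractor("http://example.com/")
p.feed(html)
print(p.links)   # ['http://example.com/docs/page.html']
```

A regex like href="([^"]*)" would have returned both page.html (unresolved) and the commented-out dead.html.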

> b) provide HTML and all other documents found as such to some external tool 
> as is, i.e. unparsed.
Make sure that the raw content is stored (in segments or WebTable), cf. 
property fetcher.store.content.
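
A minimal nutch-site.xml fragment for that (fetcher.store.content is a standard Nutch property; check your version's nutch-default.xml for its default value):

```xml
<!-- nutch-site.xml: keep the raw fetched bytes so documents can be
     handed to an external tool unparsed. -->
<property>
  <name>fetcher.store.content</name>
  <value>true</value>
</property>
```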

> (Is there a more detailed description of what the individual stages of nutch 
> do beyond the tutorial?)
Still a good introduction: Andrzej Białecki's chapter in "Hadoop: The
Definitive Guide" by Tom White.

Sebastian

On 07/01/2014 03:12 PM, Harald Kirsch wrote:
> Suppose I want nutch to fetch URLs and
> 
> a) follow links in HTML documents *only*
> b) provide HTML and all other documents found as such to some external tool 
> as is, i.e. unparsed.
> 
> Is it correct that it is sufficient to only activate the parse-html plugin 
> from all the parse-*
> plugins or is even this not necessary?
> 
> (Is there a more detailed description of what the individual stages of nutch 
> do beyond the tutorial?)
> 
> Thanks,
> Harald.
> 
