Hi,

Coming back to this question: I now have essentially the following parse-plugins.xml:

        <mimeType name="text/html">
                <plugin id="parse-html" />
        </mimeType>

No other MIME types shall be parsed for links. Those documents shall be sent as-is, i.e. as binary blobs, to the index stage. (To preempt outcries: this is a custom index stage that knows how to deal with binary blobs.)

Now where and how will the binary blob be made available within the NutchDocument sent to my indexer?

For parsed content I see text coming along in the content field, but

a) where does the binary blob appear in NutchDocument and
b) how does it get there?

Regards,
Harald.

On 03.07.2014 22:30, Sebastian Nagel wrote:
Hi Harald,

> it is sufficient to only activate the parse-html plugin
Yes. If parse-tika is active, other document types
(PDFs, etc.) are also searched for links.
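
As a rough sketch, the plugin.includes property in conf/nutch-site.xml could then look like the fragment below. Only the parse-html entry reflects the point above; the other plugin names are an assumption modeled on a typical default list, so keep whatever your installation already lists and just drop parse-tika.

```xml
<property>
  <name>plugin.includes</name>
  <!-- parse-html only; parse-tika is left out so that PDFs etc. are not parsed for links.
       The remaining plugin names are illustrative assumptions, not a recommendation. -->
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```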

> or is even this not necessary
You need to parse HTMLs. It's impossible to extract links without
parsing HTML. Think of relative links (base URL), <!-- comments -->,
<![CDATA[...]]>, and other subtleties which will harm other
approaches for link extraction (e.g., regular expressions).

> b) provide HTML and all other documents found as such to some external tool as
> is, i.e. unparsed.
Make sure that the raw content is stored (in segments or WebTable), cf. the
property fetcher.store.content.
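
For illustration, a minimal override might look like the fragment below. This is a sketch: the property name is the one mentioned above, and putting it in conf/nutch-site.xml assumes the usual convention for local overrides.

```xml
<property>
  <name>fetcher.store.content</name>
  <!-- keep the raw fetched bytes so they remain available after fetching -->
  <value>true</value>
</property>
```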

> (Is there a more detailed description of what the individual stages of nutch do
> beyond the tutorial?)
Still a good introduction: Andrzej Białecki's chapter in "Hadoop: The Definitive
Guide" by Tom White.

Sebastian

On 07/01/2014 03:12 PM, Harald Kirsch wrote:
Suppose I want nutch to fetch URLs and

a) follow links in HTML documents *only*
b) provide HTML and all other documents found as such to some external tool as
   is, i.e. unparsed.

Is it correct that it is sufficient to only activate the parse-html plugin from
all the parse-* plugins or is even this not necessary?

(Is there a more detailed description of what the individual stages of nutch do
beyond the tutorial?)

Thanks,
Harald.


