Re: Why does nutch need to parse documents --- clarification needed

Harald Kirsch Thu, 24 Jul 2014 00:02:26 -0700

Hi Sebastian,

currently I am stuck with 1.8. I take your comment about IndexerMapReduce to mean that only this class sees the binary content, so I am out of luck with 1.8.

Except: I could think of writing a "parser" which does nothing but encode the binary content into a String and add it as some field which will then be found in the NutchDocument. The org.apache.nutch.parse.Parser interface looks reasonably simple to implement. The ParseData would contain no outlinks, Parse.getText() would return the empty string and parseMeta would contain the single field with the encoded binary content.
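
A minimal sketch of just the encoding step described above, assuming Base64 as the encoding and java.util.Base64 (Java 8 or later, which may not match the JVM a Nutch 1.8 installation runs on). The class BinaryFieldCodec and its methods are hypothetical names for illustration, not part of Nutch; wiring the result into a Parser implementation and parseMeta is left out.

```java
import java.util.Base64;

// Hypothetical helper, not a Nutch class: turns raw bytes into a String
// that can be carried in a text-only metadata field, and back again.
final class BinaryFieldCodec {

    // Encode raw document bytes as a Base64 String (ASCII only).
    static String encode(byte[] rawContent) {
        return Base64.getEncoder().encodeToString(rawContent);
    }

    // Decode the String back into the original bytes on the indexer side.
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // Arbitrary sample bytes, including values that are not valid text.
        byte[] raw = new byte[] {0, (byte) 0xFF, 0x25, 0x50, 0x44, 0x46};
        String encoded = encode(raw);
        System.out.println(encoded);
        System.out.println(decode(encoded).length);
    }
}
```

Base64 inflates the payload by roughly a third, so this is only reasonable if the custom index stage can afford the extra size and decodes the field itself.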


Any thoughts whether this would work?

Harald.

On 23.07.2014 18:01, Sebastian Nagel wrote:

Hi Harald,

have a look at NUTCH-1785 <https://issues.apache.org/jira/browse/NUTCH-1785>:
it's about the same problem.

a) where does the binary blob appear in NutchDocument and

Just add a NutchField. The value can be any type, but the indexer must
be able to handle it.

b) how does it get there?

In Nutch 1.x, adding raw/binary content can only be done within
IndexerMapReduce. Indexing filters do not have the binary content at hand.
In 2.x this is different: an indexing filter can request any field/column
to be added. I didn't try, but it should be possible to request the raw
content (the column has the same name).

Sebastian


2014-07-23 16:29 GMT+02:00 Harald Kirsch <[email protected]>:

Hi,

coming back to this question. Now I have basically the following
parse-plugins.xml:

         <mimeType name="text/html">
                 <plugin id="parse-html" />
         </mimeType>

All other mime-types shall not be parsed for links. The documents shall be
sent as-is, i.e. as binary blobs, to the index stage. (To preempt outcries:
this is a custom index stage that knows how to deal with binary blobs.)

Now where and how will the binary blob be made available within the
NutchDocument sent to my indexer?

For parsed content I see text coming along in the content field, but

a) where does the binary blob appear in NutchDocument and
b) how does it get there?

Regards,
Harald.


On 03.07.2014 22:30, Sebastian Nagel wrote:

Hi Harald,

  it is sufficient to only activate the parse-html plugin

Yes. If parse-tika is active, other document types (PDFs, etc.)
are also searched for links.

  or is even this not necessary

You need to parse HTMLs. It's impossible to extract links without
parsing HTML. Think of relative links (base URL), <!-- comments -->,
<![CDATA[...]]>, and other subtleties which will harm other
approaches for link extraction (eg, regular expressions).
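
To illustrate the relative-link point, here is a small sketch using only java.net.URI from the JDK. It is not how Nutch's parse-html resolves links; the class name RelativeLinkDemo and the example.com URLs are made up for illustration. The same href yields different absolute URLs depending on the base URL, which is information a plain regular expression over the markup does not have.

```java
import java.net.URI;

// Hypothetical demo class, not part of Nutch.
final class RelativeLinkDemo {

    // Resolve an href found in a page against the base URL that applies to it.
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/a/b/page.html";

        // Relative to the page's own URL.
        System.out.println(resolve(page, "doc.pdf"));
        System.out.println(resolve(page, "../c/doc.pdf"));

        // Relative to a different base, e.g. one declared via <base href="...">.
        System.out.println(resolve("http://example.com/other/", "doc.pdf"));
    }
}
```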

  b) provide HTML and all other documents found as such to some external
  tool as is, i.e. unparsed.

Make sure that the raw content is stored (in segments or WebTable), cf.
property fetcher.store.content.

  (Is there a more detailed description of what the individual stages of
  nutch do beyond the tutorial?)

Still a good introduction: Andrzej Białecki's chapter in "Hadoop: The
Definitive Guide" by Tom White.

Sebastian

On 07/01/2014 03:12 PM, Harald Kirsch wrote:

Suppose I want nutch to fetch URLs and

a) follow links in HTML documents *only*
b) provide HTML and all other documents found as such to some external
tool as is, i.e. unparsed.

Is it correct that it is sufficient to activate only the parse-html
plugin out of all the parse-* plugins, or is even this not necessary?

(Is there a more detailed description of what the individual stages of
nutch do beyond the tutorial?)

Thanks,
Harald.
