Hi Sebastian,
currently I am stuck with 1.8. I understand your comment about
IndexerMapReduce such that only this class will see the binary content,
meaning I am out of luck with 1.8.
Except: I could think of writing a "parser" which does nothing but
encode the binary content into a String and add it as some field which
will then be found in the NutchDocument. The
org.apache.nutch.parse.Parser interface looks reasonably simple to
implement. The ParseData would contain no outlinks, Parse.getText() would
return the empty string and parseMeta would contain the single field
with the encoded binary content.
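A minimal sketch of just the encoding step, in plain Java without any
Nutch classes. Base64 is my assumption for the encoding, and the class
name BinaryFieldCodec, the field name "binaryContent" and the Map standing
in for parseMeta are all hypothetical. java.util.Base64 needs Java 8; on
an older JVM something like commons-codec would have to take its place.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: turns raw bytes into a String
// that can travel in a text-only metadata field, and back again.
final class BinaryFieldCodec {
    // Hypothetical field name; use whatever the custom indexer expects.
    static final String FIELD = "binaryContent";

    // Encode arbitrary bytes as a Base64 String.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Reverse step, to be run on the indexer side.
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    // Stand-in for parseMeta: a plain map holding the single encoded field.
    static Map<String, String> toParseMeta(byte[] raw) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put(FIELD, encode(raw));
        return meta;
    }

    public static void main(String[] args) {
        // "%PDF" followed by two bytes that are not valid text.
        byte[] raw = {0x25, 0x50, 0x44, 0x46, 0x00, (byte) 0xFF};
        Map<String, String> meta = toParseMeta(raw);
        System.out.println(meta.get(FIELD));
        System.out.println(Arrays.equals(raw, decode(meta.get(FIELD))));
    }
}
```

The indexer would call decode() on the field value to get the original
bytes back. Note that Base64 grows the content by about one third.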
Any thoughts whether this would work?
Harald.
On 23.07.2014 18:01, Sebastian Nagel wrote:
Hi Harald,
have a look at NUTCH-1785 <https://issues.apache.org/jira/browse/NUTCH-1785>:
it's about the same problem.
a) where does the binary blob appear in NutchDocument and
Just add a NutchField. The value can be any type, but the indexer must
be able to handle it.
b) how does it get there?
In Nutch 1.x, adding raw/binary content can only be done within
IndexerMapReduce. Indexing filters do not have the binary content at
hand. In 2.x this is different: an indexing filter can request any
field/column to be added. I didn't try it, but it should be possible to
request the raw content (the column has the same name).
Sebastian
2014-07-23 16:29 GMT+02:00 Harald Kirsch <[email protected]>:
Hi,
coming back to this question. Now I have basically the following
parse-plugins.xml:
<mimeType name="text/html">
<plugin id="parse-html" />
</mimeType>
All other mime-types shall not be parsed for links. The documents shall be
sent as-is, i.e. as binary blobs, to the index stage. (To preempt outcries:
this is a custom index stage that knows how to deal with binary blobs.)
Now where and how will the binary blob be made available within the
NutchDocument sent to my indexer?
For parsed content I see text coming along in the content field, but
a) where does the binary blob appear in NutchDocument and
b) how does it get there?
Regards,
Harald.
On 03.07.2014 22:30, Sebastian Nagel wrote:
Hi Harald,
it is sufficient to only activate the parse-html plugin
Yes. If parse-tika is active, other document types (PDFs, etc.) are also
searched for links.
or is even this not necessary
You need to parse HTMLs. It's impossible to extract links without
parsing HTML. Think of relative links (base URL), <!-- comments -->,
<![CDATA[...]]>, and other subtleties which will harm other
approaches for link extraction (eg, regular expressions).
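To illustrate two of these pitfalls, here is a small self-contained Java
sketch. It is not Nutch code: the class NaiveLinkExtraction, the regular
expression and the example URLs are made up for the illustration. The
naive pattern also reports a link that is commented out, and a relative
href is only usable after it has been resolved against the right base URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
final class NaiveLinkExtraction {
    // Naive pattern: any double-quoted href value, wherever it occurs.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Returns every href value the pattern matches, including those
    // inside HTML comments, because a regex knows nothing about markup.
    static List<String> naiveHrefs(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Resolves a possibly relative href against the given base URL.
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String html = "<!-- <a href=\"old.html\">old</a> -->"
                    + "<a href=\"../c.html\">c</a>";
        String base = "http://example.com/a/b/index.html";
        for (String href : naiveHrefs(html)) {
            System.out.println(href + " -> " + resolve(base, href));
        }
    }
}
```

The sketch takes the page URL as the base; a real HTML parser also has to
honour a base URL declared inside the document, and it skips the
commented-out link in the first place.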
b) provide HTML and all other documents found as such to some external
tool as is, i.e. unparsed.
Make sure that the raw content is stored (in segments or WebTable), cf.
property fetcher.store.content.
(Is there a more detailed description of what the individual stages of
nutch do beyond the tutorial?)
Still a good introduction: Andrzej Białecki's chapter in "Hadoop: The
Definitive Guide" by Tom White.
Sebastian
On 07/01/2014 03:12 PM, Harald Kirsch wrote:
Suppose I want nutch to fetch URLs and
a) follow links in HTML documents *only*
b) provide HTML and all other documents found as such to some external
tool as is, i.e. unparsed.
Is it correct that it is sufficient to only activate the parse-html
plugin from all the parse-*
plugins or is even this not necessary?
(Is there a more detailed description of what the individual stages of
nutch do beyond the tutorial?)
Thanks,
Harald.