Hi Kieran,

See the command-line options

        -addBinaryContent
           index raw/binary content in field `binaryContent`
        -base64
           use Base64 encoding for binary content

of the Nutch index job [1]. Note that the content may indeed be
binary, e.g. for PDF documents, but also for HTML pages that use an
encoding other than UTF-8.
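
For example, an invocation could look like this (the crawl directory
paths are placeholders for your own layout):

    bin/nutch index crawl/crawldb -linkdb crawl/linkdb \
        crawl/segments/20210528123456 -addBinaryContent -base64

With both options set, the raw content ends up Base64-encoded in the
`binaryContent` field of each indexed document.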

Best,
Sebastian

[1] https://wiki.apache.org/confluence/pages/viewpage.action?pageId=122916842


On 5/28/21 5:28 PM, Kieran Munday wrote:
Hi users@,

I am new to Nutch (v1.17) and my current project requires indexing the
HTML of crawled pages. It also requires fields that can be derived from
the raw HTML, such as image count and charset.

I have looked on StackOverflow for how to achieve this, and from my
understanding most people recommend processing the segments to extract
the HTML and modifying the documents post-crawl. This doesn't fit my
use case, as I need to calculate these fields at crawl time, before the
documents are indexed into Elasticsearch.

The other recommendations I have seen mention creating a plugin to
override the parse-html plugin. However, I have found rather limited
documentation on how to do this correctly and am not sure how to return
from the plugin in a way that the fields propagate into the
NutchDocument, which is then processed in the indexer's write method.
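
From what I can piece together so far, the pattern seems to be a
parse-time filter that stores values in the parse metadata, plus an
indexing filter that copies them into the NutchDocument. Below is a
rough sketch of what I mean; the class names and the "imageCount" key
are just placeholders of mine, so please correct me if these interfaces
are not meant to be used this way:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;
    import org.w3c.dom.Node;

    // Parse-time half: count <img> nodes in the DOM and stash the
    // result in the parse metadata so it survives until indexing.
    public class ImageCountParseFilter implements HtmlParseFilter {
      private Configuration conf;

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
          HTMLMetaTags metaTags, DocumentFragment doc) {
        Parse parse = parseResult.get(content.getUrl());
        parse.getData().getParseMeta().set("imageCount",
            String.valueOf(countImgNodes(doc)));
        return parseResult;
      }

      // Recursively walk the DOM fragment and count img elements.
      private int countImgNodes(Node node) {
        int count = "img".equalsIgnoreCase(node.getNodeName()) ? 1 : 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
          count += countImgNodes(c);
        }
        return count;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }

    // Index-time half: copy the value from the parse metadata into
    // the NutchDocument, which the indexer's write method receives.
    public class ImageCountIndexingFilter implements IndexingFilter {
      private Configuration conf;

      @Override
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) {
        String imageCount = parse.getData().getParseMeta().get("imageCount");
        if (imageCount != null) {
          doc.add("imageCount", imageCount);
        }
        return doc;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
    }

Assuming both filters are registered in the plugin's plugin.xml and
enabled via plugin.includes, is this the intended way for a field to
travel from parse time to index time?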

Do any of you have any advice or links to documentation that explains how
to modify what gets set in the NutchDocument?

Thank you in advance

