Hi Kieran,
see the command-line options
  -addBinaryContent
      index raw/binary content in field `binaryContent`
  -base64
      use Base64 encoding for binary content
of the Nutch index job [1]. Note that the content may indeed be
binary, e.g. for PDF documents, but also
Hi users@,
I am new to Nutch (v1.17) and my current project requires indexing the raw
HTML of crawled pages. It also requires fields that can be derived from
the raw HTML, such as image count and charset.
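
For illustration only, here is a minimal sketch of how fields like these might be derived outside Nutch once the raw page is available as a Base64 string, for example from a `binaryContent` field indexed with `-addBinaryContent -base64`. The names `ImageCounter` and `derive_fields` are hypothetical and not part of Nutch, and the charset detection only looks at `<meta>` tags, which is an assumption about where the charset is declared.

```python
import base64
import re
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Counts <img> tags and records the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.image_count = 0
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.image_count += 1
        elif tag == "meta" and self.charset is None:
            if attrs.get("charset"):
                # HTML5 style: <meta charset="utf-8">
                self.charset = attrs["charset"].lower()
            elif (attrs.get("http-equiv") or "").lower() == "content-type":
                # Older style: <meta http-equiv="Content-Type" content="...; charset=...">
                match = re.search(r"charset=([\w-]+)", attrs.get("content") or "", re.I)
                if match:
                    self.charset = match.group(1).lower()


def derive_fields(binary_content_b64):
    """Hypothetical helper: decode a Base64 page and derive image count and charset."""
    raw = base64.b64decode(binary_content_b64)
    # latin-1 maps every byte to a character, so decoding cannot fail;
    # this is good enough for scanning ASCII tag and attribute names.
    parser = ImageCounter()
    parser.feed(raw.decode("latin-1"))
    parser.close()
    return {"image_count": parser.image_count, "charset": parser.charset}
```

Calling `derive_fields` on the Base64 value of one document returns a small dict with `image_count` and `charset`; `charset` stays `None` when no `<meta>` declaration is found.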
I have looked on StackOverflow for how to achieve this and most people from
my