date:20210528

Re: Adding html field to NutchDocument

2021-05-28 Thread Sebastian Nagel

Hi Kieran, see the following command-line options of the Nutch index job [1]:
  -addBinaryContent  index raw/binary content in field `binaryContent`
  -base64            use Base64 encoding for binary content
Note that the content may indeed be binary, e.g. for PDF documents but also
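
For illustration, a consumer of the index could turn a Base64-encoded `binaryContent` value back into text roughly as follows. This is a minimal sketch, not part of Nutch: the class and method names are hypothetical, it assumes the field was written with both -addBinaryContent and -base64, and it assumes the caller already knows the charset (UTF-8 here). For truly binary content such as PDF, only the decoded bytes are meaningful.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not a Nutch class.
final class BinaryContentSketch {

    // Decode the Base64 string stored in the index back into raw bytes.
    static byte[] decode(String base64Value) {
        return Base64.getDecoder().decode(base64Value);
    }

    // Interpret the decoded bytes as text in the given charset.
    // Assumption: the content is textual (e.g. HTML) and the charset is known.
    static String decodeToText(String base64Value, Charset charset) {
        return new String(decode(base64Value), charset);
    }

    public static void main(String[] args) {
        // Simulate an indexed value by encoding a small HTML page.
        String html = "<html><body>hi</body></html>";
        String stored = Base64.getEncoder()
                .encodeToString(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeToText(stored, StandardCharsets.UTF_8));
    }
}
```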

Adding html field to NutchDocument

2021-05-28 Thread Kieran Munday

Hi users@, I am new to Nutch (v1.17), and my current project requires indexing the HTML of crawled pages. It also requires fields that can be derived from the raw HTML, such as image count and charset. I have looked on Stack Overflow for how to achieve this and most people from my
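
To make the two derived fields concrete, here is a rough sketch of how image count and declared charset might be computed from a raw HTML string. It is plain Java with hypothetical class and method names and uses no Nutch API; the regular expressions are a crude assumption and would miss many real-world cases that a proper HTML parser handles.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class.
final class HtmlFieldSketch {

    // Matches the start of an <img> tag, case-insensitively.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b", Pattern.CASE_INSENSITIVE);

    // Matches charset=... inside a <meta> tag, covering both
    // <meta charset="..."> and the http-equiv Content-Type form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*\\bcharset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Count the <img> start tags in the raw HTML.
    static int countImages(String html) {
        Matcher m = IMG_TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Return the charset declared in a <meta> tag, or null if none is found.
    static String declaredCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-8\"></head>"
                + "<body><img src=\"a.png\"><IMG src=\"b.png\"></body></html>";
        System.out.println("images: " + countImages(html));
        System.out.println("charset: " + declaredCharset(html));
    }
}
```

The results of such helpers could then be stored as additional document fields; where exactly that happens in the indexing pipeline is left open here.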