Hi Prateek,
Please see my comments inline below.
On Thu, Jan 14, 2021 at 6:39 AM wrote:
>
> One of the requirements I have is to extract all
> the image and video links from the html in a separate object. Since I have
> the html content, I can use a library like jsoup to parse the content and
> extract img tags.
> I was wondering if there is a way in nutch to do this?
>
The problem here is your requirement of "... in a separate object". Will
this separate object be a new record?
> I am assuming I will have to override HtmlParseFilter class and then add my
> extraction logic there. Is my understanding correct? Any sample code
> reference will be helpful as well.
>
>
I think you can simply add either parse-html or parse-tika, together with
parse-xsl, to the 'plugin.includes' configuration property and then set the
order in which the HtmlParseFilter implementations run with the
'htmlparsefilter.order' configuration option, which is described here:
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599
You can take a look at the parse-xsl plugin
https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36
N.B. This patch is not yet merged into the Nutch master branch, so it is not
available in an official Nutch release. You would need to build from the
Nutch master branch (1.18-SNAPSHOT) and then apply the patch. Any feedback
would be greatly appreciated.
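As for sample code: if you do end up writing your own HtmlParseFilter, the
filter is handed the parsed page as a DOM DocumentFragment, so the extraction
itself is just a DOM walk. Below is a minimal standalone sketch of that walk
using only the JDK. The class name MediaLinkCollector and the helper
parseXhtml are made up for illustration and are not part of Nutch, and the
choice of img, video and source tags is my assumption about what you want to
collect.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class: collects media links from a DOM tree.
class MediaLinkCollector {

    /** Returns the non-empty src values of img, video and source elements, in document order. */
    static List<String> collect(Node root) {
        List<String> links = new ArrayList<>();
        walk(root, links);
        return links;
    }

    private static void walk(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // Parsers differ in tag-name case, so compare case-insensitively.
            String name = el.getTagName().toLowerCase(Locale.ROOT);
            if (name.equals("img") || name.equals("video") || name.equals("source")) {
                // getAttribute returns an empty string when the attribute is absent.
                String src = el.getAttribute("src");
                if (!src.isEmpty()) {
                    links.add(src);
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, links);
        }
    }

    /** Demo-only helper: parses a well-formed XHTML string into a DOM document. */
    static Node parseXhtml(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Node doc = parseXhtml(
                "<html><body><img src=\"a.png\"/>"
                + "<video><source src=\"clip.mp4\"/></video></body></html>");
        System.out.println(collect(doc));
    }
}
```

Inside a real filter you would call collect() on the DocumentFragment you
are given instead of using parseXhtml, resolve relative links against the
page URL, and store the result wherever your "separate object" ends up
living, for example in the parse metadata. Note that source elements also
appear under audio and picture, so you may want to check the parent tag.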
--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc