Re: Extract all image and video links from a web page

2021-01-14 Thread lewis john mcgibbney
Hi prateek,
Please see my comments inline below.

On Thu, Jan 14, 2021 at 6:39 AM prateek wrote:

>
> One of the requirements I have is to extract all
> the image and video links from the HTML into a separate object. Since I have
> the HTML content, I can use a library like jsoup to parse the content and
> extract img tags.
> I was wondering if there is a way in Nutch to do this?
>

The problem here is your requirement of "... in a separate object". Will
this separate object be a new record?


> I am assuming I will have to override the HtmlParseFilter class and then add my
> extraction logic there. Is my understanding correct? Any sample code
> reference would be helpful as well.
>
>
I think you can simply add (parse-html OR parse-tika) AND parse-xsl to the
'plugin.includes' configuration property and then control the order in which
the HtmlParseFilter implementations run with the 'htmlparsefilter.order'
configuration option, which is documented here:
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599

For sample code, you can take a look at the parse-xsl plugin:
https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36

N.B. This patch is not yet merged into the Nutch master branch, so it is not
available in an official Nutch release. You would need to build from the Nutch
master branch (1.18-SNAPSHOT) and then apply the patch. Any feedback would
be greatly appreciated.
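
As a rough illustration of the extraction logic itself, here is a minimal
sketch in plain Java. It assumes the parsed page is available as a W3C DOM
node (my understanding is that Nutch hands HtmlParseFilter.filter a
DocumentFragment, which would fit). The class name MediaLinkExtractor and the
method extractMediaLinks are hypothetical and not part of Nutch, and the
choice of img, video and source tags is only one possible definition of
"image and video links". It is not the parse-xsl approach from the pull
request above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not a Nutch class: walks a DOM tree and collects
// the src attribute of every img, video and source element.
public class MediaLinkExtractor {

    /** Returns the non-empty src values found under root, in document order. */
    public static List<String> extractMediaLinks(Node root) {
        List<String> links = new ArrayList<>();
        collect(root, links);
        return links;
    }

    private static void collect(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers differ in tag-name case, so compare in lower case.
            String name = el.getNodeName().toLowerCase(Locale.ROOT);
            if (name.equals("img") || name.equals("video") || name.equals("source")) {
                // getAttribute returns an empty string when the attribute is absent.
                String src = el.getAttribute("src");
                if (!src.isEmpty()) {
                    links.add(src);
                }
            }
        }
        // Depth-first traversal of all children.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, links);
        }
    }

    public static void main(String[] args) throws Exception {
        // A well-formed XHTML snippet stands in for the page DOM in this demo.
        String xhtml = "<html><body>"
                + "<img src=\"a.png\"/>"
                + "<video src=\"clip.mp4\"></video>"
                + "<video><source src=\"clip.webm\"/></video>"
                + "<p>no media here</p>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(extractMediaLinks(doc)); // prints [a.png, clip.mp4, clip.webm]
    }
}
```

Note that this sketch also picks up source elements nested in audio tags and
does not resolve relative URLs against the page URL. Both would need handling
before the links are stored anywhere.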

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Extract all image and video links from a web page

2021-01-14 Thread prateek
Hi Folks,

A very happy new year to all of you.

I am currently using Apache Nutch 1.16 and successfully extracting the HTML
content for the given seed URLs. One of the requirements I have is to extract all
the image and video links from the HTML into a separate object. Since I have
the HTML content, I can use a library like jsoup to parse the content and
extract img tags.
I was wondering if there is a way in Nutch to do this?
I am assuming I will have to override the HtmlParseFilter class and then add my
extraction logic there. Is my understanding correct? Any sample code
reference would be helpful as well.

Thanks
Prateek