Re: Extract all image and video links from a web page
Hi Prateek,

are there any URL filters which filter away image links? You can verify this using the URL filter checker:

  echo "https://example.com/image.jpg" \
    | bin/nutch filterchecker -stdin

The default rules in conf/regex-urlfilter.txt exclude common image suffixes. Note that there can be more URL filters activated in the property plugin.includes.

Best,
Sebastian

On 1/26/21 3:14 PM, prateek wrote:
> Hi Lewis,
>
> Thanks for your suggestion. I looked at the class fetching outlinks and saw that "img" is already part of that:
> https://github.com/apache/nutch/blob/680df6ba1dc68ad5ede5fca743304593d4d5b0a3/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L90
> So I am confused as to why I don't see any images in outlinks. I have double checked that the property parser.html.outlinks.ignore_tags is also not set. So ideally images should be part of outlinks already. But when I run "bin/nutch readseg" to see the segments data, I don't see any images being captured. Any idea what I am missing?
>
> If there is a way I can get all images in outlinks, then maybe I don't even need a plugin for that.
>
> Regards
> Prateek
>
> On Wed, Jan 20, 2021 at 5:37 PM Lewis John McGibbney wrote:
> > Hi Prateek,
> >
> > On 2021/01/19 15:58:29, prateek wrote:
> > > Is the only other option to override HtmlParseFilter and add a new plugin?
> >
> > Yes I think it is.
> >
> > > Also regarding separate objects, what I meant is that if I store the image links in Outlink, then those links will also be stored in the DB (because all outlinks are stored for the next crawl of depth > 1). I don't want to store those in crawldb; I just want to output them in some other object within the record. I hope this makes sense.
> >
> > I understand. Seeing as you cannot upgrade then yes I think you need to implement a new plugin to capture the outlinks as a new field in the NutchDocument. You should also look into using the 'parser.html.outlinks.ignore_tags' configuration setting. You can specify which tags are filtered.
> >
> > lewismc
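To illustrate Sebastian's point about URL filters, here is a minimal sketch of how a regex-based URL filter of this kind decides whether a link survives. It assumes the usual rule format of conf/regex-urlfilter.txt (a leading '+' or '-' followed by a regular expression, first matching rule wins, no match means the URL is dropped). The class name UrlFilterSketch and the two rules are made up for illustration; they are not the rules shipped with Nutch, and this is not Nutch's own implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {

    /** One filter rule: '+' means accept, '-' means reject, the rest is a regex. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Hypothetical rules for illustration only, NOT the shipped defaults.
    static final List<Rule> RULES = Arrays.asList(
            new Rule("-(?i)\\.(gif|jpg|jpeg|png)$"), // reject common image suffixes
            new Rule("+."));                          // accept everything else

    /** The first rule whose pattern is found in the URL decides; no match rejects. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The image URL hits the '-' rule first, so it is filtered away.
        System.out.println(accepts(RULES, "https://example.com/image.jpg")); // false
        System.out.println(accepts(RULES, "https://example.com/page.html")); // true
    }
}
```

With a rule like the first one active, image links can be extracted by the parser and still never show up as outlinks, which is why checking the filters first is worthwhile.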
Re: Extract all image and video links from a web page
Hi Lewis,

Thanks for your suggestion. I looked at the class fetching outlinks and saw that "img" is already part of that:
https://github.com/apache/nutch/blob/680df6ba1dc68ad5ede5fca743304593d4d5b0a3/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L90
So I am confused as to why I don't see any images in outlinks. I have double checked that the property parser.html.outlinks.ignore_tags is also not set. So ideally images should be part of outlinks already. But when I run "bin/nutch readseg" to see the segments data, I don't see any images being captured. Any idea what I am missing?

If there is a way I can get all images in outlinks, then maybe I don't even need a plugin for that.

Regards
Prateek

On Wed, Jan 20, 2021 at 5:37 PM Lewis John McGibbney wrote:
> Hi Prateek,
>
> On 2021/01/19 15:58:29, prateek wrote:
> > Is the only other option to override HtmlParseFilter and add a new plugin?
>
> Yes I think it is.
>
> > Also regarding separate objects, what I meant is that if I store the image links in Outlink, then those links will also be stored in the DB (because all outlinks are stored for the next crawl of depth > 1). I don't want to store those in crawldb; I just want to output them in some other object within the record. I hope this makes sense.
>
> I understand. Seeing as you cannot upgrade then yes I think you need to implement a new plugin to capture the outlinks as a new field in the NutchDocument. You should also look into using the 'parser.html.outlinks.ignore_tags' configuration setting. You can specify which tags are filtered.
>
> lewismc
Re: Extract all image and video links from a web page
Hi Prateek,

On 2021/01/19 15:58:29, prateek wrote:
> Is the only other option to override HtmlParseFilter and add a new plugin?

Yes I think it is.

> Also regarding separate objects, what I meant is that if I store the image links in Outlink, then those links will also be stored in the DB (because all outlinks are stored for the next crawl of depth > 1). I don't want to store those in crawldb; I just want to output them in some other object within the record. I hope this makes sense.

I understand. Seeing as you cannot upgrade then yes I think you need to implement a new plugin to capture the outlinks as a new field in the NutchDocument. You should also look into using the 'parser.html.outlinks.ignore_tags' configuration setting. You can specify which tags are filtered.

lewismc
Re: Extract all image and video links from a web page
Hi Lewis,

Thanks for your reply. Unfortunately, I don't have the liberty to update my current version to an unreleased version and hence the suggestion to use parse-xsl won't be useful at this time. Is the only other option to override HtmlParseFilter and add a new plugin?

Also regarding separate objects, what I meant is that if I store the image links in Outlink, then those links will also be stored in the DB (because all outlinks are stored for the next crawl of depth > 1). I don't want to store those in crawldb; I just want to output them in some other object within the record. I hope this makes sense.

Regards
Prateek

On Thu, Jan 14, 2021 at 6:28 PM lewis john mcgibbney wrote:
> Hi prateek,
> Please see my comment inline below
>
> On Thu, Jan 14, 2021 at 6:39 AM wrote:
> > One of the requirements I have is to extract all the image and video links from the html in a separate object. Since I have the html content, I can use a library like jsoup to parse the content and extract img tags.
> > I was wondering if there is a way in Nutch to do this?
>
> The problem here is your requirement of "... in a separate object". Will this separate object be a new record?
>
> > I am assuming I will have to override HtmlParseFilter class and then add my extraction logic there. Is my understanding correct? Any sample code reference will be helpful as well.
>
> I think you can simply add parse-html OR parse-tika AND parse-xsl to the 'plugin.includes' configuration property and then use the ordered HTMLParseFilter configuration option 'htmlparsefilter.order' as follows
> https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599
>
> You can take a look at the parse-xsl plugin
> https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36
>
> N.B. This patch is not yet merged into the Nutch master branch so it is not available in an official Nutch release. You would need to upgrade to Nutch 1.18-SNAPSHOT master branch and then apply the branch. Any feedback would be greatly appreciated.
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
Re: Extract all image and video links from a web page
Hi prateek,
Please see my comment inline below

On Thu, Jan 14, 2021 at 6:39 AM wrote:
> One of the requirements I have is to extract all the image and video links from the html in a separate object. Since I have the html content, I can use a library like jsoup to parse the content and extract img tags.
> I was wondering if there is a way in Nutch to do this?

The problem here is your requirement of "... in a separate object". Will this separate object be a new record?

> I am assuming I will have to override HtmlParseFilter class and then add my extraction logic there. Is my understanding correct? Any sample code reference will be helpful as well.

I think you can simply add parse-html OR parse-tika AND parse-xsl to the 'plugin.includes' configuration property and then use the ordered HTMLParseFilter configuration option 'htmlparsefilter.order' as follows
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599

You can take a look at the parse-xsl plugin
https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36

N.B. This patch is not yet merged into the Nutch master branch so it is not available in an official Nutch release. You would need to upgrade to Nutch 1.18-SNAPSHOT master branch and then apply the branch. Any feedback would be greatly appreciated.

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
Extract all image and video links from a web page
Hi Folks,

A very happy new year to all of you. I am currently using Apache Nutch 1.16 and successfully extracting the html content given seed urls.

One of the requirements I have is to extract all the image and video links from the html in a separate object. Since I have the html content, I can use a library like jsoup to parse the content and extract img tags. I was wondering if there is a way in Nutch to do this?

I am assuming I will have to override HtmlParseFilter class and then add my extraction logic there. Is my understanding correct? Any sample code reference will be helpful as well.

Thanks
Prateek
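As a rough illustration of the extraction logic asked about here, the sketch below walks a W3C DOM tree and collects the src attribute of img, video and source elements. It is only a sketch under stated assumptions: an HtmlParseFilter in Nutch 1.x is handed the parsed page as a DOM DocumentFragment, so a walk like collect() could run on that node, but the wiring into a plugin is not shown. The class MediaLinkSketch and the method names are hypothetical, and the JDK XML parser is used as a stand-in that only works on well-formed markup. Relative URLs are returned as found, without resolving them against the page URL.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class MediaLinkSketch {

    /** Depth-first walk that collects the src attribute of img, video and source elements. */
    static void collect(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            // HTML parsers may report element names in upper case, so compare case-insensitively.
            if (name.equalsIgnoreCase("img")
                    || name.equalsIgnoreCase("video")
                    || name.equalsIgnoreCase("source")) {
                String src = ((Element) node).getAttribute("src"); // "" when the attribute is absent
                if (!src.isEmpty()) {
                    links.add(src);
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, links);
        }
    }

    /** Stand-in for the crawler's HTML parser: parses well-formed markup with the JDK XML parser. */
    static List<String> extract(String xhtml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        List<String> links = new ArrayList<>();
        collect(root, links);
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><img src=\"a.png\"/>"
                + "<video src=\"b.mp4\"><source src=\"c.webm\"/></video>"
                + "<p>no media here</p></body></html>";
        System.out.println(extract(page)); // [a.png, b.mp4, c.webm]
    }
}
```

Keeping the collected links in their own list, rather than adding them as outlinks, matches the requirement of having them "in a separate object"; where that list is then stored (for example in parse metadata) depends on the plugin.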