Hi Yossi,

I have two kinds of RSS links, both of the form http://domain.com/rss/feed.xml. One is the standard RSS feed, which serves as the starting point for further crawling because we can pull the article links from it:

<item>
  <title> <![CDATA[ Article headline ]]> </title>
  <link> article url </link>
  <pubDate> date </pubDate>
  <dc:creator> <![CDATA[ author ]]> </dc:creator>
  <description> <![CDATA[ One line descriptor tag line ]]> </description>
</item>
<item> … </item>

The other kind also includes the article content within the XML itself, so it doesn't need further crawling. I have standalone XML parsers in Java that I can use directly, but crawling is still an important part, because it keeps a record of all the links traversed so far.

What would you advise?

Regards,
Ankit Goel

> On 02-Nov-2017, at 2:04 PM, Yossi Tamari <[email protected]> wrote:
>
> Hi Ankit,
>
> If you are looking for a Sitemap parser, I would suggest moving to 1.14
> (trunk). I've been using it, and it is probably in better shape than 1.13.
> If you need to parse your own format, the answer depends on the details. Do
> you need to crawl pages in this format where each page contains links in XML
> that you need to crawl? Or is this more like Sitemap where the XML is just
> the initial starting point?
> In the second case, maybe just write something outside of Nutch that will
> parse the XML and produce a seed file?
> In the first case, the link you sent is not relevant. You need to implement a
> http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.html.
> I haven't done that myself. My suggestion is that you take a look at
> the built-in parser at
> https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java.
> Google found this article on developing a custom parser, which might be a
> good starting point:
> http://www.treselle.com/blog/apache-nutch-with-custom-parser/.
>
> Yossi.
>
>> -----Original Message-----
>> From: Ankit Goel [mailto:[email protected]]
>> Sent: 02 November 2017 10:24
>> To: [email protected]
>> Subject: Re: sitemap and xml crawl
>>
>> Hi Yossi,
>> So I need to make a custom parser. Where do I start? I found this link:
>> https://wiki.apache.org/nutch/HowToMakeCustomSearch. Is this the right
>> place, or should I be looking at creating a plugin page? Any advice would be
>> helpful.
>>
>> Thank you,
>> Ankit Goel
>>
>>> On 02-Nov-2017, at 1:14 PM, Yossi Tamari <[email protected]> wrote:
>>>
>>> Hi Ankit,
>>>
>>> According to this: https://issues.apache.org/jira/browse/NUTCH-1465,
>>> sitemap is a 1.14 feature.
>>> I just checked, and the command indeed exists in 1.14. I did not test
>>> that it works.
>>>
>>> In general, Nutch supports crawling anything, but you might need to
>>> write your own parser for custom protocols.
>>>
>>> Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Ankit Goel [mailto:[email protected]]
>>>> Sent: 01 November 2017 18:55
>>>> To: [email protected]
>>>> Subject: sitemap and xml crawl
>>>>
>>>> Hi,
>>>> I need to crawl an XML feed, which includes the URL, title and content of
>>>> the articles on the site.
>>>>
>>>> The documentation on the site says that bin/nutch sitemap exists, but on my
>>>> Nutch 1.13, sitemap is not a command in bin/nutch. So does Nutch
>>>> support crawling sitemaps? Or XML links?
>>>>
>>>> Regards,
>>>> Ankit Goel
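
For the second case Yossi mentions, where the XML is only the starting point, producing a seed file outside of Nutch could look roughly like the sketch below. It is a minimal illustration and not Nutch code: the class name RssSeedExtractor and the method extractLinks are hypothetical, and it assumes every <item> carries its article URL in a <link> child, as in the feed excerpt above.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads an RSS feed and collects the article URLs.
final class RssSeedExtractor {

    /** Returns the trimmed text of the first <link> inside every <item>, in document order. */
    static List<String> extractLinks(InputStream feed) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations so an untrusted feed cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(feed);

        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // Only look inside the item, so the channel-level <link> is ignored.
            NodeList linkNodes = item.getElementsByTagName("link");
            if (linkNodes.getLength() > 0) {
                String url = linkNodes.item(0).getTextContent().trim();
                if (!url.isEmpty()) {
                    links.add(url);
                }
            }
        }
        return links;
    }

    /** Usage: args[0] = path to the feed XML, args[1] = path of the seed file to write. */
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Writes one URL per line.
            Files.write(Paths.get(args[1]), extractLinks(in), StandardCharsets.UTF_8);
        }
    }
}
```

Run with the feed file and an output path as the two arguments, this writes one URL per line, which is the shape of a Nutch seed list. The feed that already embeds the article content would still need either a custom Parser or the standalone Java parsers mentioned above.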

