RE: sitemap and xml crawl

Yossi Tamari Thu, 02 Nov 2017 01:35:24 -0700

Hi Ankit,

If you are looking for a Sitemap parser, I would suggest moving to 1.14
(trunk). I've been using it, and it is probably in better shape than 1.13.
If you need to parse your own format, the answer depends on the details. Do
you need to crawl pages in this format, where each page contains XML links
that must themselves be crawled? Or is this more like Sitemap, where the XML
is just the initial starting point?
In the second case, maybe just write something outside of Nutch that will
parse the XML and produce a seed file?
In the first case, the link you sent is not relevant. You need to implement
a Parser:
http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.html
I haven't done that myself. My suggestion is that you take a look at
the built-in parser at
https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
Google found this article on developing a custom parser, which might be a
good starting point:
http://www.treselle.com/blog/apache-nutch-with-custom-parser/
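
For the second case (the XML is only the starting point), a rough sketch of
such an outside-of-Nutch script might look like the following. It is only an
illustration: I am assuming the feed stores each article link in an element
named "url" with no XML namespace, which may not match your feed, and the
function names extract_urls and write_seed_file are made up for this example.

```python
import xml.etree.ElementTree as ET


def extract_urls(xml_data, url_tag="url"):
    """Return the text of every <url_tag> element, in document order.

    Assumption: the feed uses a plain, un-namespaced tag (default "url").
    Duplicates and empty elements are skipped.
    """
    root = ET.fromstring(xml_data)  # accepts str or bytes
    seen = set()
    urls = []
    for elem in root.iter(url_tag):
        url = (elem.text or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def write_seed_file(urls, path):
    """Write one URL per line, the plain-text layout used for seed lists."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
```

You would read the feed as bytes, pass it to extract_urls, and write the
result to a file in your seed directory (for example urls/seed.txt, a
hypothetical path), then inject it as usual with bin/nutch inject. If the
feed uses namespaces or a different element name, the url_tag argument
would need adjusting.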


        Yossi.


> -----Original Message-----
> From: Ankit Goel [mailto:[email protected]]
> Sent: 02 November 2017 10:24
> To: [email protected]
> Subject: Re: sitemap and xml crawl
> 
> Hi Yossi,
> So I need to make a custom parser. Where do I start? I found this link:
> https://wiki.apache.org/nutch/HowToMakeCustomSearch
> Is this the right place, or should I be looking at creating a plugin
> page? Any advice would be helpful.
> 
> Thank you,
> Ankit Goel
> 
> > On 02-Nov-2017, at 1:14 PM, Yossi Tamari <[email protected]> wrote:
> >
> > Hi Ankit,
> >
> > According to this: https://issues.apache.org/jira/browse/NUTCH-1465,
> > sitemap is a 1.14 feature.
> > I just checked, and the command indeed exists in 1.14. I did not test
> > that it works.
> >
> > In general, Nutch supports crawling anything, but you might need to
> > write your own parser for custom protocols.
> >
> >     Yossi.
> >
> >> -----Original Message-----
> >> From: Ankit Goel [mailto:[email protected]]
> >> Sent: 01 November 2017 18:55
> >> To: [email protected]
> >> Subject: sitemap and xml crawl
> >>
> >> Hi,
> >> I need to crawl an XML feed, which includes the URL, title and
> >> content of the articles on the site.
> >>
> >> The documentation on the site says that bin/nutch sitemap exists, but
> >> on my Nutch 1.13 sitemap is not a command in bin/nutch. So does Nutch
> >> support crawling sitemaps? Or XML links?
> >>
> >> Regards,
> >> Ankit Goel
> >
> >
