RE: sitemap and xml crawl

Yossi Tamari Thu, 02 Nov 2017 02:51:13 -0700

Hi Ankit,

So I guess you want to remove the parser that is configured by default (since 
you don't need to parse HTML at all), add the RSS parser that Markus suggested, 
and then you probably need to add either a custom parser for the second XML 
format, or an indexing filter, or both. This would depend on exactly what you 
are trying to achieve at the end of the crawl.
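
For the first feed type (the one that only supplies links), the "parse the XML outside of Nutch and produce a seed file" route mentioned earlier in this thread could look roughly like the sketch below. It is only an illustration in Python: the function names, the sample feed and the URLs in it are made up, and it assumes plain RSS 2.0 where every <item> has an un-namespaced <link> child.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical sample feed, shaped like the <item> snippet quoted below.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
<title><![CDATA[ Article headline ]]></title>
<link>
http://domain.com/articles/1
</link>
<dc:creator><![CDATA[ author ]]></dc:creator>
</item>
<item>
<title>Second headline</title>
<link>http://domain.com/articles/2</link>
</item>
</channel>
</rss>"""


def extract_links(rss_xml: str) -> List[str]:
    """Return the text of every <item><link>, in feed order, whitespace-trimmed."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def write_seed_file(links: List[str], path: str) -> None:
    """Write one URL per line, the layout a plain-text seed list uses."""
    with open(path, "w", encoding="utf-8") as f:
        for url in links:
            f.write(url + "\n")


if __name__ == "__main__":
    for url in extract_links(SAMPLE):
        print(url)
```

Calling write_seed_file(extract_links(feed_text), "seed.txt") would give a file that can be handed to the inject step. An Atom feed, or the second format that carries the article content, would need different element names.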


        Yossi.

> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: 02 November 2017 11:29
> To: [email protected]
> Subject: RE: sitemap and xml crawl
> 
> Hi - Nutch has a parser for RSS and ATOM on-board:
> https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/feed/FeedParser.html
> 
> You must configure it in your plugin.includes to use it.
> 
> Regards,
> Markus
> 
> 
> 
> -----Original message-----
> > From: Ankit Goel <[email protected]>
> > Sent: Thursday 2nd November 2017 10:11
> > To: [email protected]
> > Subject: Re: sitemap and xml crawl
> >
> > Hi Yossi,
> > I have 2 kinds of RSS links, both of the form domain.com/rss/feed.xml
> > <http://domain.com/rss/feed.xml>. One is the standard RSS feed that we
> > usually see, which becomes the starting point for further crawling, as
> > we can pull links from it.
> >
> >
> > <item>
> > <title>
> > <![CDATA[
> > Article headline
> > ]]>
> > </title>
> > <link>
> > article url
> > </link>
> > <pubDate> date </pubDate>
> > <dc:creator>
> > <![CDATA[ author ]]>
> > </dc:creator>
> > <description>
> > <![CDATA[
> > One line descriptor tag line
> > ]]>
> > </description>
> > </item>
> > <item>
> > …
> > </item>
> >
> > The other one also includes the content within the XML itself, so it
> > doesn’t need further crawling.
> > I have standalone XML parsers in Java that I can use directly, but
> > crawling is obviously an important part, because it documents all the
> > links traversed so far.
> >
> > What would you advise?
> >
> > Regards,
> > Ankit Goel
> >
> > > On 02-Nov-2017, at 2:04 PM, Yossi Tamari <[email protected]> wrote:
> > >
> > > Hi Ankit,
> > >
> > > If you are looking for a Sitemap parser, I would suggest moving to
> > > 1.14 (trunk). I've been using it, and it is probably in better shape
> > > than 1.13.
> > > If you need to parse your own format, the answer depends on the
> > > details. Do you need to crawl pages in this format where each page
> > > contains links in XML that you need to crawl? Or is this more like
> > > Sitemap where the XML is just the initial starting point?
> > > In the second case, maybe just write something outside of Nutch that
> > > will parse the XML and produce a seed file?
> > > In the first case, the link you sent is not relevant. You need to
> > > implement a
> > > http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.html.
> > > I haven't done that myself. My suggestion is that you take a look at
> > > the built-in parser at
> > > https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java.
> > > Google found this article on developing a custom parser, which might
> > > be a good starting point:
> > > http://www.treselle.com/blog/apache-nutch-with-custom-parser/.
> > >
> > >   Yossi.
> > >
> > >
> > >> -----Original Message-----
> > >> From: Ankit Goel [mailto:[email protected]]
> > >> Sent: 02 November 2017 10:24
> > >> To: [email protected]
> > >> Subject: Re: sitemap and xml crawl
> > >>
> > >> Hi Yossi,
> > >> So I need to make a custom parser. Where do I start? I found this
> > >> link: https://wiki.apache.org/nutch/HowToMakeCustomSearch
> > >> <https://wiki.apache.org/nutch/HowToMakeCustomSearch>. Is this the
> > >> right place, or should I be looking at creating a plugin page? Any
> > >> advice would be helpful.
> > >>
> > >> Thank you,
> > >> Ankit Goel
> > >>
> > >>> On 02-Nov-2017, at 1:14 PM, Yossi Tamari <[email protected]> wrote:
> > >>>
> > >>> Hi Ankit,
> > >>>
> > >>> According to this:
> > >>> https://issues.apache.org/jira/browse/NUTCH-1465,
> > >>> sitemap is a 1.14 feature.
> > >>> I just checked, and the command indeed exists in 1.14. I did not
> > >>> test that it works.
> > >>>
> > >>> In general, Nutch supports crawling anything, but you might need
> > >>> to write your own parser for custom protocols.
> > >>>
> > >>>         Yossi.
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Ankit Goel [mailto:[email protected]]
> > >>>> Sent: 01 November 2017 18:55
> > >>>> To: [email protected]
> > >>>> Subject: sitemap and xml crawl
> > >>>>
> > >>>> Hi,
> > >>>> I need to crawl an XML feed, which includes the URL, title and
> > >>>> content of the articles on the site.
> > >>>>
> > >>>> The documentation on the site says that bin/nutch sitemap exists,
> > >>>> but in my Nutch 1.13, sitemap is not a command in bin/nutch. So
> > >>>> does Nutch support crawling sitemaps? Or XML links?
> > >>>>
> > >>>> Regards,
> > >>>> Ankit Goel
> > >>>
> > >>>
> > >
> > >
> >
> >
