Re: sitemap and xml crawl

Ankit Goel Thu, 02 Nov 2017 02:11:48 -0700

Hi Yossi,
I have 2 kinds of RSS links, both of the form domain.com/rss/feed.xml 
<http://domain.com/rss/feed.xml>. One is the standard RSS feed, which 
becomes the starting point for further crawling because we can pull links 
from it.



<item>
<title>
<![CDATA[
Article headline
]]>
</title>
<link>
article url
</link>
<pubDate> date </pubDate>
<dc:creator>
<![CDATA[ author ]]>
</dc:creator>
<description>
<![CDATA[
One line descriptor tag line
]]>
</description>
</item>
<item>
…
</item>

The other kind also includes the article content within the XML itself, so it 
doesn’t need further crawling.
I have standalone XML parsers in Java that I can use directly, but crawling is 
still an important part, because it keeps a record of all the links traversed 
so far.
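
For the first kind of feed, pulling the links out into a seed list could look roughly like the sketch below. It assumes the item/link layout shown above. The class FeedSeedExtractor and its method names are made up for illustration and are not part of Nutch; only JDK XML classes are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a Nutch class): collects the <link> of every RSS <item>.
class FeedSeedExtractor {

    static List<String> extractLinks(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Not namespace-aware, so prefixed tags such as dc:creator are read as plain names.
        factory.setNamespaceAware(false);
        // Refuse DOCTYPE declarations, since feeds come from external sites.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

        List<String> links = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList linkNodes = item.getElementsByTagName("link");
            if (linkNodes.getLength() == 0) {
                continue; // item without a link: nothing to seed
            }
            // The feed puts the URL on its own line, so trim surrounding whitespace.
            String url = linkNodes.item(0).getTextContent().trim();
            if (!url.isEmpty()) {
                links.add(url);
            }
        }
        return links;
    }

    // Writes one URL per line, which is the layout of a plain seed list.
    static void writeSeedFile(List<String> links, Path seedFile) throws Exception {
        Files.write(seedFile, links, StandardCharsets.UTF_8);
    }
}
```

The resulting file, one URL per line, could then be placed in the seed directory that is passed to bin/nutch inject.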

What would you advise?

Regards,
Ankit Goel

> On 02-Nov-2017, at 2:04 PM, Yossi Tamari <[email protected]> wrote:
> 
> Hi Ankit,
> 
> If you are looking for a Sitemap parser, I would suggest moving to 1.14
> (trunk). I've been using it, and it is probably in better shape than 1.13.
> If you need to parse your own format, the answer depends on the details. Do
> you need to crawl pages in this format where each page contains links in XML
> that you need to crawl? Or is this more like Sitemap where the XML is just
> the initial starting point?
> In the second case, maybe just write something outside of Nutch that will
> parse the XML and produce a seed file?
> In the first case, the link you sent is not relevant. You need to implement
> a
> http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.html.
> I haven't done that myself. My suggestion is that you take a look at
> the built-in parser at
> https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java.
> Google found this article on developing a custom parser, which might be a
> good starting point:
> http://www.treselle.com/blog/apache-nutch-with-custom-parser/.
> 
>       Yossi.
> 
> 
>> -----Original Message-----
>> From: Ankit Goel [mailto:[email protected]]
>> Sent: 02 November 2017 10:24
>> To: [email protected]
>> Subject: Re: sitemap and xml crawl
>> 
>> Hi Yossi,
>> So I need to make a custom parser. Where do I start? I found this link
>> https://wiki.apache.org/nutch/HowToMakeCustomSearch
>> <https://wiki.apache.org/nutch/HowToMakeCustomSearch>. Is this the right
>> place, or should I be looking at creating a plugin page? Any advice would
>> be helpful.
>> 
>> Thank you,
>> Ankit Goel
>> 
>>> On 02-Nov-2017, at 1:14 PM, Yossi Tamari <[email protected]> wrote:
>>> 
>>> Hi Ankit,
>>> 
>>> According to this: https://issues.apache.org/jira/browse/NUTCH-1465,
>>> sitemap is a 1.14 feature.
>>> I just checked, and the command indeed exists in 1.14. I did not test
>>> that it works.
>>> 
>>> In general, Nutch supports crawling anything, but you might need to
>>> write your own parser for custom protocols.
>>> 
>>>     Yossi.
>>> 
>>>> -----Original Message-----
>>>> From: Ankit Goel [mailto:[email protected]]
>>>> Sent: 01 November 2017 18:55
>>>> To: [email protected]
>>>> Subject: sitemap and xml crawl
>>>> 
>>>> Hi,
>>>> I need to crawl an XML feed, which includes the url, title and content
>>>> of the articles on the site.
>>>> 
>>>> The documentation on the site says that bin/nutch sitemap exists, but
>>>> on my Nutch 1.13, sitemap is not a command in bin/nutch. So does Nutch
>>>> support crawling sitemaps, or XML links?
>>>> 
>>>> Regards,
>>>> Ankit Goel
>>> 
>>> 
> 
>
