Did you check the RSS XML itself? When I was implementing an RSS feed parser,
I came across a lot of feeds that were malformed, and the Nutch parser could
not parse them correctly.
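
A quick way to check whether a feed is at least well-formed XML, independently
of Nutch, is something like the sketch below. It uses only the Python standard
library. The function names check_well_formed and check_feed_url are my own and
have nothing to do with Nutch, and the script only tests XML well-formedness,
not whether the document is valid RSS or Atom.

```python
import sys
import urllib.request
import xml.etree.ElementTree as ET


def check_well_formed(data):
    """Return None if data is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        # ParseError.position holds the (line, column) where parsing failed.
        line, column = err.position
        return (line, column, str(err))
    return None


def check_feed_url(url):
    """Download a feed (hypothetical helper) and check the raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return check_well_formed(response.read())


if __name__ == "__main__":
    # Usage: python check_feed.py <feed-url> [<feed-url> ...]
    for url in sys.argv[1:]:
        problem = check_feed_url(url)
        if problem is None:
            print(url, "is well-formed")
        else:
            print(url, "is NOT well-formed:", problem[2])
```

If a feed fails here as well, the problem is most likely in the feed rather
than in Nutch.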

Can you give me an example URL that doesn't work? I can test it in my
environment.

Best Regards
Alexander Aristov


On 17 July 2010 03:07, Savannah Beckett <[email protected]> wrote:

> I tried to parse RSS/Atom feeds and Nutch 1.1 can't parse 95% of them. I put
> the RSS/Atom feed in seed.txt, set the regex-urlfilter.txt, run on rss/feed
> plugin in nutch-site.xml. It fetches the RSS/Atom feed, but it "almost"
> always finds 0 URLs to fetch after that. I checked the log, and I found a
> whole bunch of these:
>
> WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.feed.FeedParser mapped to contentType
> application/rss+xml via parse-plugins.xml, but its plugin.xml file does
> not claim to support contentType: application/rss+xml
>
> WARN  parse.rss - org.apache.commons.feedparser.FeedParserException:
> org.jdom.input.JDOMParseException: Error on line 497: XML document
> structures must start and end within the same entity.
>
> I searched the web for answers and couldn't find any. All I found was
> something like maybe the web server gives Nutch the wrong type? But the XML
> already says RSS and the version number, so what more does Nutch want?
>
> Any way to solve it? A workaround?
> Thanks.
>
>
>
>
