Did you check the RSS XML itself? When I was implementing an RSS feed parser, I ran into a lot of feeds that were malformed, and the Nutch parser couldn't parse them correctly.
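A quick way to check a feed outside of Nutch is to see whether it is well-formed XML at all. The sketch below is only an illustration and assumes Python 3 with its standard library; the function name `check_feed_xml` and the two sample strings are made up for the example and are not part of Nutch.

```python
import xml.etree.ElementTree as ET


def check_feed_xml(xml_text):
    """Return (True, None) if xml_text is well-formed XML,
    otherwise (False, parser error message)."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return False, str(err)


# Hypothetical sample data: one complete feed and one truncated feed.
good = '<rss version="2.0"><channel><title>t</title></channel></rss>'
bad = '<rss version="2.0"><channel><title>t</title>'

print(check_feed_xml(good))
print(check_feed_xml(bad))
```

If a feed fails this check, the problem is in the feed itself (for example a truncated download or an unclosed tag) and not in the Nutch configuration.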
Can you give me an example URL that doesn't work? I can test it in my environment.

Best Regards
Alexander Aristov

On 17 July 2010 03:07, Savannah Beckett <[email protected]> wrote:

> I tried to parse RSS/atom feeds and nutch 1.1 can't parse 95% of them. I put
> the rss/atom feed in seed.txt, set the regex-urlfilter.txt, run on rss/feed
> plugin in nutch-site.xml. It fetches the rss/atom feed, but it "almost" always
> find 0 urls to fetch after that. I checked the log, and I found whole bunch of
> these:
>
> WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/rss+xml
>
> WARN parse.rss - org.apache.commons.feedparser.FeedParserException: org.jdom.input.JDOMParseException: Error on line 497: XML document structures must start and end within the same entity.
>
> I surf the web to look for answers and can't find any. All I found was
> something like maybe the webserver gives nutch the wrong type?...But the xml
> already says RSS and version # , what more does nutch want?
>
> Anyway to solve it? workaround?
> Thanks.

