I tried to parse RSS/Atom feeds, and Nutch 1.1 can't parse about 95% of them.
I put the RSS/Atom feed URLs in seed.txt, set up regex-urlfilter.txt, and
turned on the rss/feed plugin in nutch-site.xml. It fetches the RSS/Atom feed,
but it almost always finds 0 URLs to fetch after that. I checked the log and
found a whole bunch of these:
WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/rss+xml
WARN parse.rss - org.apache.commons.feedparser.FeedParserException: org.jdom.input.JDOMParseException: Error on line 497: XML document structures must start and end within the same entity.
I searched the web for answers and couldn't find any. All I found was a
suggestion that maybe the web server gives Nutch the wrong content type. But
the XML already says RSS and gives the version number, so what more does Nutch
want?
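The second warning looks like what an XML parser reports when a document ends
before its tags are closed, so one guess (unconfirmed) is that the feed is
malformed or gets cut off before it reaches the parser. Below is a minimal
sketch of how that could be checked outside Nutch. It uses Python's standard
xml.etree.ElementTree rather than anything from Nutch, and the name
check_feed_bytes and the sample feed are made up for illustration:

```python
import xml.etree.ElementTree as ET


def check_feed_bytes(data):
    """Return (True, "") if data is well-formed XML, else (False, parser message).

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    try:
        ET.fromstring(data)
        return True, ""
    except ET.ParseError as err:
        return False, str(err)


# A tiny, made-up RSS document standing in for a real feed.
COMPLETE = b"""<?xml version="1.0"?>
<rss version="2.0"><channel><title>t</title>
<item><link>http://example.com/a</link></item>
</channel></rss>"""

# The same document cut off halfway, as if the download had been truncated.
TRUNCATED = COMPLETE[: len(COMPLETE) // 2]

if __name__ == "__main__":
    for name, data in (("complete", COMPLETE), ("truncated", TRUNCATED)):
        ok, message = check_feed_bytes(data)
        print(name, "well-formed" if ok else "NOT well-formed: " + message)
```

If a feed saved straight from the web server passes this check but Nutch still
logs the JDOMParseException, that would point at the content being shortened
somewhere on the Nutch side rather than at the feed itself.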
Is there any way to solve this, or a workaround?
Thanks.