I tried to parse RSS/Atom feeds with Nutch 1.1, and it fails to parse about 95%
of them.  I put the RSS/Atom feed URLs in seed.txt, set up
regex-urlfilter.txt, and turned on the rss/feed plugin in nutch-site.xml.
Nutch fetches the feed, but it almost always finds 0 URLs to fetch after that.
I checked the log and found a whole bunch of these:

WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/rss+xml

WARN  parse.rss - org.apache.commons.feedparser.FeedParserException: org.jdom.input.JDOMParseException: Error on line 497: XML document structures must start and end within the same entity.
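
The second warning is an XML well-formedness error, which is what a parser
typically reports when a document ends before all of its tags are closed.  A
minimal sketch for checking a saved copy of a feed outside Nutch, assuming
Python is available; the function name check_feed_well_formed is hypothetical
and this is not part of Nutch:

```python
import xml.etree.ElementTree as ET


def check_feed_well_formed(xml_text):
    """Return (True, None) if xml_text parses as well-formed XML,
    otherwise (False, message) with the position of the first error.

    Hypothetical helper for illustration only; it is not a Nutch API.
    """
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # err.position is a (line, column) tuple reported by the parser
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"


if __name__ == "__main__":
    complete = "<rss version='2.0'><channel><title>t</title></channel></rss>"
    truncated = "<rss version='2.0'><channel><title>t</title>"
    print(check_feed_well_formed(complete))
    print(check_feed_well_formed(truncated))
```

If the feed passes this check when downloaded by hand but fails inside Nutch,
that would suggest the content Nutch stored differs from what the server
sends, rather than the feed itself being broken.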

I searched the web for answers and couldn't find any.  All I found was a
suggestion that maybe the web server gives Nutch the wrong content type.  But
the XML already declares RSS and a version number, so what more does Nutch want?

Is there any way to solve this, or a workaround?
Thanks.



      
