Hi everyone,
I'm facing a problem with the RSS connector. The feed I'm crawling is http://blog.fme.de/feed

An NPE occurs at processing time. After some debugging I found the following:

The variable previousContext is null in the method

public final void endElement(String namespace, String localName, String qName)

The parameter qName is content:encoded at that point, but the feed contains many tags like this, so I'm not sure exactly where the error occurs. The null previousContext is then assigned to currentContext. As the stack trace shows, the error happens at org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86), where currentContext.cleanup(); is called with currentContext = null.

manifoldcf.log is attached. Any thoughts on this? I tried different settings regarding dechromed content.

Benjamin
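For readers following along, the failure mode described above can be reproduced in miniature. The sketch below is hypothetical and is not the ManifoldCF source: the names SketchContext, SketchParseState, cleanupUnguarded and cleanupGuarded are invented for illustration, and the real XMLFuzzyHierarchicalParseState may be structured differently. It only assumes what the message reports, namely that endElement assigns previousContext (possibly null) to currentContext and that cleanup then calls currentContext.cleanup() without a null check.

```java
// Hypothetical sketch of the reported failure mode; these classes are
// illustrative stand-ins, not the ManifoldCF implementation.
class SketchContext {
    final String qName;
    final SketchContext previousContext; // null for the outermost element
    boolean cleanedUp = false;

    SketchContext(String qName, SketchContext previousContext) {
        this.qName = qName;
        this.previousContext = previousContext;
    }

    void cleanup() {
        cleanedUp = true;
    }
}

class SketchParseState {
    SketchContext currentContext = null;

    void startElement(String qName) {
        // Push a new context that remembers the one it replaces.
        currentContext = new SketchContext(qName, currentContext);
    }

    // Assumes a matching startElement call came first.
    void endElement(String qName) {
        SketchContext finished = currentContext;
        // previousContext is null when the outermost element closes,
        // so currentContext can legitimately become null here.
        currentContext = finished.previousContext;
        finished.cleanup();
    }

    // Mirrors the reported behaviour: dereferences currentContext blindly.
    void cleanupUnguarded() {
        currentContext.cleanup(); // throws NullPointerException if null
    }

    // One possible guard: release whatever contexts remain, tolerating null.
    // Returns the number of contexts that were still open.
    int cleanupGuarded() {
        int released = 0;
        while (currentContext != null) {
            currentContext.cleanup();
            currentContext = currentContext.previousContext;
            released++;
        }
        return released;
    }
}

class ContextSketchDemo {
    public static void main(String[] args) {
        SketchParseState state = new SketchParseState();
        state.startElement("content:encoded");
        state.endElement("content:encoded"); // currentContext is now null

        try {
            state.cleanupUnguarded();
            System.out.println("unguarded cleanup completed");
        } catch (NullPointerException e) {
            System.out.println("unguarded cleanup threw NullPointerException");
        }
        System.out.println("guarded cleanup released "
                + state.cleanupGuarded() + " contexts");
    }
}
```

Whether a null check of this kind is the right fix for the connector, or whether the null previousContext points to a deeper problem in how the feed is parsed, is not something the sketch can answer.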
[2013-10-30 10:01:26,322|DEBUG|Thread-9300 |org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$Server::beginRead()|1146] - RSS: Performing a read wait on server 'blog.fme.de' of 64 ms.
[2013-10-30 10:01:26,387|DEBUG|Thread-9300 |org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$Server::beginRead()|1146] - RSS: Performing a read wait on server 'blog.fme.de' of 63 ms.
[2013-10-30 10:01:26,454|DEBUG|Thread-9300 |org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$Server::beginRead()|1146] - RSS: Performing a read wait on server 'blog.fme.de' of 60 ms.
[2013-10-30 10:01:26,516|DEBUG|Thread-9300 |org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$Server::beginRead()|1146] - RSS: Performing a read wait on server 'blog.fme.de' of 21 ms.
[2013-10-30 10:01:26,947|INFO |Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$ThrottledConnection::doneFetch()|710 ] - RSS: FETCH Data|http://blog.fme.de/feed|1383123676922+10008|200|526203|
[2013-10-30 10:01:26,948|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector::processDocuments()|1233] - RSS: Processing 'http://blog.fme.de/feed'
[2013-10-30 10:01:26,949|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector::processDocuments()|1240] - RSS: Interpreting document 'http://blog.fme.de/feed' as a feed
[2013-10-30 10:02:28,572|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$OuterContextClass::beginTag()|3578] - RSS: Parsed bottom-level XML for RSS document 'http://blog.fme.de/feed'
[2013-10-30 10:02:28,670|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSItemContextClass::process()|4036] - RSS: In RSS document 'http://blog.fme.de/feed', found a link to 'http://blog.fme.de/allgemein/2013-08/english-version-fme-file-exchange-platform-for-easy-file-exchange-with-external-parties', which has origination date 1377415857000
[2013-10-30 10:02:28,701|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSItemContextClass::process()|4036] - RSS: In RSS document 'http://blog.fme.de/feed', found a link to 'http://blog.fme.de/allgemein/2013-08/how-to-automatically-keep-your-sales-team-up-to-date', which has origination date 1376390547000
[2013-10-30 10:02:28,743|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSItemContextClass::process()|4036] - RSS: In RSS document 'http://blog.fme.de/feed', found a link to 'http://blog.fme.de/allgemein/2013-08/fme-thoughts-about-documentum-d2-4-x', which has origination date 1375864142000
[2013-10-30 10:02:28,773|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSItemContextClass::process()|4036] - RSS: In RSS document 'http://blog.fme.de/feed', found a link to 'http://blog.fme.de/allgemein/2013-07/fme-file-exchange-plattform-austausch-mit-externen-leichtgemacht', which has origination date 1374133143000
[2013-10-30 10:02:28,804|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSChannelContextClass::process()|3796] - RSS: In RSS document 'http://blog.fme.de/feed' setting rescan time to 1383127348804
[2013-10-30 10:16:43,145|DEBUG|Worker thread '2'|org.apache.manifoldcf.crawler.interfaces.QueueTracker::endProcessing()|339 ] - Worker thread done document with bins [blog.fme.de ]
[2013-10-30 10:16:43,207|FATAL|Worker thread '2'|org.apache.manifoldcf.crawler.system.WorkerThread::run()|925 ] - Error tossed: null
java.lang.NullPointerException
    at org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86)
    at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.handleRSSFeedSAX(RSSConnector.java:3487)
    at org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.processDocuments(RSSConnector.java:1256)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:559)
