Hi everyone,


I'm facing a problem with the RSS connector. The feed I'm crawling is:
http://blog.fme.de/feed

An NPE occurs at processing time. After some debugging, I found the
following:



The variable previousContext is null in the method public final void
endElement(String namespace, String localName, String qName).

The qName parameter is content:encoded, but there are many such tags in
the feed, so I'm not sure at which point the error occurs.

That null previousContext is then assigned to currentContext. As the
stack trace shows, the error happens at
org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86),

where currentContext.cleanup(); is called while currentContext is
null.
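
To make the suspected sequence concrete, here is a minimal sketch of it. This is not the ManifoldCF source: ParseStateSketch, SketchContext, unmatchedEndTags and hasOpenContext are hypothetical names, and I am only assuming that the real parse state keeps a chain of contexts in a similar way. It shows how closing the outermost element leaves currentContext null, and how a null check in cleanup() would avoid the exception.

```java
// Minimal sketch, NOT the ManifoldCF source: every class, field and method
// name here is hypothetical and only mirrors the behavior described above.
class ParseStateSketch {

    /** Stand-in for a per-element parsing context. */
    static class SketchContext {
        final String qName;
        final SketchContext previousContext; // null for the outermost element
        boolean cleanedUp = false;

        SketchContext(String qName, SketchContext previousContext) {
            this.qName = qName;
            this.previousContext = previousContext;
        }

        void cleanup() {
            cleanedUp = true;
        }
    }

    private SketchContext currentContext = null;
    int unmatchedEndTags = 0;

    void startElement(String qName) {
        // Each new element remembers the context that was current before it.
        currentContext = new SketchContext(qName, currentContext);
    }

    void endElement(String qName) {
        if (currentContext == null) {
            // Stray end tag with nothing open: count it instead of dereferencing null.
            unmatchedEndTags++;
            return;
        }
        SketchContext finished = currentContext;
        // For the outermost element this assigns null to currentContext.
        currentContext = finished.previousContext;
        finished.cleanup();
    }

    void cleanup() {
        // Guarded: release whatever is still open, and do nothing if nothing is.
        while (currentContext != null) {
            SketchContext open = currentContext;
            currentContext = open.previousContext;
            open.cleanup();
        }
    }

    boolean hasOpenContext() {
        return currentContext != null;
    }

    public static void main(String[] args) {
        ParseStateSketch state = new ParseStateSketch();
        state.startElement("content:encoded");
        state.endElement("content:encoded"); // currentContext is null from here on
        state.cleanup();                      // safe because of the null check
        System.out.println("open context after cleanup: " + state.hasOpenContext());
    }
}
```

Without the null check, the cleanup() call in main would dereference null, which matches what the stack trace suggests. Whether the real fix belongs in cleanup() or in endElement() is something I can't tell from the outside.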



manifoldcf.log is attached.



Any thoughts on this? I tried different settings regarding dechromed
content.



Benjamin
[2013-10-30 10:01:26,322|DEBUG|Thread-9300    
|org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$Server::beginRead()|1146]
 - RSS: Performing a read wait on server 'blog.fme.de' of 64 ms.
[2013-10-30 10:01:26,387|DEBUG|Thread-9300    
|org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$Server::beginRead()|1146]
 - RSS: Performing a read wait on server 'blog.fme.de' of 63 ms.
[2013-10-30 10:01:26,454|DEBUG|Thread-9300    
|org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$Server::beginRead()|1146]
 - RSS: Performing a read wait on server 'blog.fme.de' of 60 ms.
[2013-10-30 10:01:26,516|DEBUG|Thread-9300    
|org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$Server::beginRead()|1146]
 - RSS: Performing a read wait on server 'blog.fme.de' of 21 ms.
[2013-10-30 10:01:26,947|INFO |Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.ThrottledFetcher$ThrottledConnection::doneFetch()|710
 ] - RSS: FETCH Data|http://blog.fme.de/feed|1383123676922+10008|200|526203|
[2013-10-30 10:01:26,948|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector::processDocuments()|1233]
 - RSS: Processing 'http://blog.fme.de/feed'
[2013-10-30 10:01:26,949|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector::processDocuments()|1240]
 - RSS: Interpreting document 'http://blog.fme.de/feed' as a feed
[2013-10-30 10:02:28,572|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$OuterContextClass::beginTag()|3578]
 - RSS: Parsed bottom-level XML for RSS document 'http://blog.fme.de/feed'
[2013-10-30 10:02:28,670|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSItemContextClass::process()|4036]
 - RSS: In RSS document 'http://blog.fme.de/feed', found a link to 
'http://blog.fme.de/allgemein/2013-08/english-version-fme-file-exchange-platform-for-easy-file-exchange-with-external-parties',
 which has origination date 1377415857000
[2013-10-30 10:02:28,701|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSItemContextClass::process()|4036]
 - RSS: In RSS document 'http://blog.fme.de/feed', found a link to 
'http://blog.fme.de/allgemein/2013-08/how-to-automatically-keep-your-sales-team-up-to-date',
 which has origination date 1376390547000
[2013-10-30 10:02:28,743|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSItemContextClass::process()|4036]
 - RSS: In RSS document 'http://blog.fme.de/feed', found a link to 
'http://blog.fme.de/allgemein/2013-08/fme-thoughts-about-documentum-d2-4-x', 
which has origination date 1375864142000
[2013-10-30 10:02:28,773|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSItemContextClass::process()|4036]
 - RSS: In RSS document 'http://blog.fme.de/feed', found a link to 
'http://blog.fme.de/allgemein/2013-07/fme-file-exchange-plattform-austausch-mit-externen-leichtgemacht',
 which has origination date 1374133143000
[2013-10-30 10:02:28,804|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.connectors.rss.RSSConnector$RSSChannelContextClass::process()|3796]
 - RSS: In RSS document 'http://blog.fme.de/feed' setting rescan time to 
1383127348804
[2013-10-30 10:16:43,145|DEBUG|Worker thread 
'2'|org.apache.manifoldcf.crawler.interfaces.QueueTracker::endProcessing()|339 
] - Worker thread done document with bins [blog.fme.de ]
[2013-10-30 10:16:43,207|FATAL|Worker thread 
'2'|org.apache.manifoldcf.crawler.system.WorkerThread::run()|925 ] - Error 
tossed: null
java.lang.NullPointerException
            at 
org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86)
            at 
org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.handleRSSFeedSAX(RSSConnector.java:3487)
            at 
org.apache.manifoldcf.crawler.connectors.rss.RSSConnector.processDocuments(RSSConnector.java:1256)
            at 
org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:559)
