I've checked in a fix to trunk for this issue, and included a second patch in the ticket.

Karl


On Wed, Oct 30, 2013 at 9:02 AM, Karl Wright <[email protected]> wrote:

> Hi Benjamin,
>
> I will have to look at the feed itself to see why only four of the links
> are extracted. It is not likely to be due to the patch, but rather the
> feed format. As you know, RSS standards are fluid at best and feed
> publishers often do things in unique ways.
>
> I can't look at this in detail though until this evening.
>
> Karl
>
>
> On Wed, Oct 30, 2013 at 8:57 AM, Benjamin Brandmeier <[email protected]> wrote:
>
>> I've patched mcf and started the job. The log (attached) doesn't contain
>> any error messages and the documents crawled are indexed in Solr correctly.
>>
>> However, only four(!) documents are crawled/indexed, but 58 items exist
>> in the feed. Could this be a configuration issue or might the patch have
>> led to that?
>>
>> Thanks!
>> Benjamin
>>
>>
>> 2013/10/30 Karl Wright <[email protected]>
>>
>>> I've attached a patch to the ticket, but haven't tried it yet (no access
>>> to outside network right now). Can you try this and see if it works?
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Wed, Oct 30, 2013 at 7:41 AM, Benjamin Brandmeier
>>> <[email protected]> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> the stack trace at the point where the NPE occurs is just as long as
>>>> the one provided in the log.
>>>>
>>>> I've fetched a stack trace at the point where previousContext is null
>>>> for the first time. After that, the currentContext will be set to null and
>>>> this leads to the error described.
>>>>
>>>> Maybe this helps:
>>>>
>>>> Daemon Thread [Worker thread '42'] (Suspended (entry into method endElement in XMLParsingContext))
>>>>     RSSConnector$OuterContextClass(XMLParsingContext).endElement(String, String, String) line: 109
>>>>     XMLFuzzyHierarchicalParseState.noteEndTagEx(String, String, String) line: 110
>>>>     XMLFuzzyHierarchicalParseState(XMLFuzzyParseState).noteEndTag(String) line: 131
>>>>     XMLFuzzyHierarchicalParseState(TagParseState).dealWithCharacter(char) line: 755
>>>>     XMLFuzzyHierarchicalParseState(SingleCharacterReceiver).dealWithCharacters(Reader) line: 51
>>>>     DecodingByteReceiver.dealWithBytes(InputStream) line: 48
>>>>     BOMEncodingDetector.dealWithRemainder(byte[], int, int, InputStream) line: 248
>>>>     BOMEncodingDetector(SingleByteReceiver).dealWithBytes(InputStream) line: 52
>>>>     Parser.parseWithCharsetDetection(String, InputStream, CharacterReceiver) line: 82
>>>>     RSSConnector.handleRSSFeedSAX(String, IProcessActivity, RSSConnector$Filter) line: 3481
>>>>     RSSConnector.processDocuments(String[], String[], IProcessActivity, DocumentSpecification, boolean[], int) line: 1256
>>>>     WorkerThread.run() line: 559
>>>>
>>>>
>>>> I've tested this with MCF 1.3 AND 1.4 (from tag). The same error occurs
>>>> with both versions.
>>>>
>>>> Benjamin
>>>>
>>>>
>>>> 2013/10/30 Karl Wright <[email protected]>
>>>>
>>>>> Hi Benjamin,
>>>>>
>>>>> It may be malformed XML that we don't treat properly. If the log has
>>>>> a full stack trace that would be very helpful. If not can you get one?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Karl
>>>>>
>>>>> Sent from my Windows Phone
>>>>> ------------------------------
>>>>> From: Benjamin Brandmeier
>>>>> Sent: 10/30/2013 6:51 AM
>>>>> To: [email protected]
>>>>> Subject: RSS Crawl -> NullPointerException
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm facing a problem with the RSS connector. The feed I'm crawling is
>>>>> --> http://blog.fme.de/feed
>>>>>
>>>>> A NPE occurs at processing time. After some debugging I've found out
>>>>> the following:
>>>>>
>>>>> Variable previousContext is null in method --> public final void
>>>>> endElement(String namespace, String localName, String qName)
>>>>>
>>>>> Parameter qName is content:encoded, but there are many tags like this
>>>>> in the feed, so I'm not sure about at which point the error occurs.
>>>>>
>>>>> The variable previousContext(=null) is written to currentContext. As
>>>>> the stack trace shows, the error happens at
>>>>> org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86),
>>>>> at this point currentContext.cleanup(); is called with currentContext
>>>>> = null.
>>>>>
>>>>> manifoldcf.log is attached.
>>>>>
>>>>> Any thoughts on this? I tried different settings regarding dechromed
>>>>> content.
>>>>>
>>>>> Benjamin
