Great Job! This fixed the issue and crawling works as expected. Thanks for being super responsive!
Benjamin

2013/10/30 Karl Wright <[email protected]>

> I've checked in a fix to trunk for this issue, and included a second patch
> in the ticket.
>
> Karl
>
>
> On Wed, Oct 30, 2013 at 9:02 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Benjamin,
>>
>> I will have to look at the feed itself to see why only four of the links
>> are extracted. It is not likely to be due to the patch, but rather the
>> feed format. As you know, RSS standards are fluid at best and feed
>> publishers often do things in unique ways.
>>
>> I can't look at this in detail though until this evening.
>>
>> Karl
>>
>>
>> On Wed, Oct 30, 2013 at 8:57 AM, Benjamin Brandmeier <[email protected]> wrote:
>>
>>> I've patched mcf and started the job. The log (attached) doesn't contain
>>> any error messages and the documents crawled are indexed in Solr correctly.
>>>
>>> However, only four(!) documents are crawled/indexed, but 58 items exist
>>> in the feed. Could this be a configuration issue or might the patch have
>>> led to that?
>>>
>>> Thanks!
>>> Benjamin
>>>
>>>
>>> 2013/10/30 Karl Wright <[email protected]>
>>>
>>>> I've attached a patch to the ticket, but haven't tried it yet (no
>>>> access to outside network right now). Can you try this and see if it
>>>> works?
>>>>
>>>> Thanks,
>>>> Karl
>>>>
>>>>
>>>> On Wed, Oct 30, 2013 at 7:41 AM, Benjamin Brandmeier
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> the stack trace at the point where the NPE occurs is just as long as
>>>>> the one provided in the log.
>>>>>
>>>>> I've fetched a stack trace at the point where previousContext is null
>>>>> for the first time. After that, the currentContext will be set to null and
>>>>> this leads to the error described.
>>>>> Maybe this helps:
>>>>>
>>>>> Daemon Thread [Worker thread '42'] (Suspended (entry into method endElement in XMLParsingContext))
>>>>>   RSSConnector$OuterContextClass(XMLParsingContext).endElement(String, String, String) line: 109
>>>>>   XMLFuzzyHierarchicalParseState.noteEndTagEx(String, String, String) line: 110
>>>>>   XMLFuzzyHierarchicalParseState(XMLFuzzyParseState).noteEndTag(String) line: 131
>>>>>   XMLFuzzyHierarchicalParseState(TagParseState).dealWithCharacter(char) line: 755
>>>>>   XMLFuzzyHierarchicalParseState(SingleCharacterReceiver).dealWithCharacters(Reader) line: 51
>>>>>   DecodingByteReceiver.dealWithBytes(InputStream) line: 48
>>>>>   BOMEncodingDetector.dealWithRemainder(byte[], int, int, InputStream) line: 248
>>>>>   BOMEncodingDetector(SingleByteReceiver).dealWithBytes(InputStream) line: 52
>>>>>   Parser.parseWithCharsetDetection(String, InputStream, CharacterReceiver) line: 82
>>>>>   RSSConnector.handleRSSFeedSAX(String, IProcessActivity, RSSConnector$Filter) line: 3481
>>>>>   RSSConnector.processDocuments(String[], String[], IProcessActivity, DocumentSpecification, boolean[], int) line: 1256
>>>>>   WorkerThread.run() line: 559
>>>>>
>>>>>
>>>>> I've tested this with MCF 1.3 AND 1.4 (from tag). The same error
>>>>> occurs with both versions.
>>>>>
>>>>> Benjamin
>>>>>
>>>>>
>>>>> 2013/10/30 Karl Wright <[email protected]>
>>>>>
>>>>>> Hi Benjamin,
>>>>>>
>>>>>> It may be malformed XML that we don't treat properly. If the log has
>>>>>> a full stack trace that would be very helpful. If not can you get one?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> Sent from my Windows Phone
>>>>>> ------------------------------
>>>>>> From: Benjamin Brandmeier
>>>>>> Sent: 10/30/2013 6:51 AM
>>>>>> To: [email protected]
>>>>>> Subject: RSS Crawl -> NullPointerException
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'm facing a problem with the RSS connector.
>>>>>> The feed I'm crawling is
>>>>>> --> http://blog.fme.de/feed
>>>>>>
>>>>>> A NPE occurs at processing time. After some debugging I've found out
>>>>>> the following:
>>>>>>
>>>>>> Variable previousContext is null in method --> public final void
>>>>>> endElement(String namespace, String localName, String qName)
>>>>>>
>>>>>> Parameter qName is content:encoded, but there are many tags like this
>>>>>> in the feed, so I'm not sure about at which point the error occurs.
>>>>>>
>>>>>> The variable previousContext(=null) is written to currentContext. As
>>>>>> the stack trace shows, the error happens at
>>>>>> org.apache.manifoldcf.core.fuzzyml.XMLFuzzyHierarchicalParseState.cleanup(XMLFuzzyHierarchicalParseState.java:86),
>>>>>>
>>>>>> at this point currentContext.cleanup(); is called with currentContext
>>>>>> = null.
>>>>>>
>>>>>> manifoldcf.log is attached.
>>>>>>
>>>>>> Any thoughts on this? I tried different settings regarding dechromed
>>>>>> content.
>>>>>>
>>>>>> Benjamin
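
For readers following the original report: the failure mode it describes (previousContext is null in endElement, gets written to currentContext, and a later cleanup() then dereferences null) can be illustrated with a small stand-alone sketch. This is not ManifoldCF code and not the patch that was checked in. The class `ContextStackSketch`, its methods, and the "#root" marker are hypothetical names made up for illustration, and the assumption that the trigger is an end tag with no matching open context is only one plausible reading of the thread.

```java
/** Hypothetical, simplified stand-in for a hierarchical parse state (not ManifoldCF code). */
class ContextStackSketch {

    /** One open element; remembers the context that was active before it. */
    static final class Context {
        final String qName;
        final Context previous;

        Context(String qName, Context previous) {
            this.qName = qName;
            this.previous = previous;
        }

        void cleanup() {
            // A real context would release buffers or temporary files here.
        }
    }

    // The root context has no predecessor, so its 'previous' is null.
    private Context current = new Context("#root", null);
    private int ignoredEndTags = 0;

    void startElement(String qName) {
        current = new Context(qName, current);
    }

    void endElement(String qName) {
        Context previous = current.previous;
        if (previous == null) {
            // Assumed guard: an end tag arriving at the root has nothing to pop.
            // Without this check, 'current' would become null and any later
            // current.cleanup() call would throw a NullPointerException.
            ignoredEndTags++;
            return;
        }
        current.cleanup();
        current = previous;
    }

    /** Cleans up every context that is still open, without nulling 'current'. */
    void cleanup() {
        for (Context c = current; c != null; c = c.previous) {
            c.cleanup();
        }
    }

    String currentName() {
        return current.qName;
    }

    int ignoredEndTags() {
        return ignoredEndTags;
    }
}
```

The design choice in this sketch is to tolerate the unbalanced end tag and count it rather than fail, which matches the "fuzzy" parsing spirit hinted at by the class names in the stack trace; whether the real fix takes the same approach is not stated in the thread.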
