Well, disabling the code isn't a good idea, as everything gets messed up. I've wrapped the pop in an additional isEmpty check and it's fixed now. The remaining question is: why does this only seem to happen when Boilerpipe parses pages with frames?
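
For anyone following along, here is a minimal sketch of that kind of guard. It is not the actual patch: the class `GuardedPopSketch` and its methods are hypothetical stand-ins, and plain strings replace the DOM nodes that DOMBuilder keeps in m_elemStack and m_currentNode. It only shows that an extra endElement call on an empty stack no longer throws EmptyStackException.

```java
import java.util.Stack;

// Hypothetical stand-in for the relevant part of DOMBuilder; not the real class.
public class GuardedPopSketch {
    // Plays the role of m_elemStack (strings instead of DOM nodes).
    private final Stack<String> elemStack = new Stack<>();
    // Plays the role of m_currentNode.
    private String currentNode;

    public void startElement(String name) {
        elemStack.push(name);
        currentNode = name;
    }

    public void endElement() {
        // Guard: only pop when there is something to pop, so an unbalanced
        // end-of-element event does not throw EmptyStackException.
        if (!elemStack.isEmpty()) {
            elemStack.pop();
        }
        currentNode = elemStack.isEmpty() ? null : elemStack.peek();
    }

    public String currentNode() {
        return currentNode;
    }

    public static void main(String[] args) {
        GuardedPopSketch builder = new GuardedPopSketch();
        builder.startElement("html");
        builder.startElement("frameset");
        builder.endElement();
        System.out.println(builder.currentNode());
        builder.endElement();
        builder.endElement(); // one end event too many, now harmless
        System.out.println(builder.currentNode());
    }
}
```

Note that the guard only hides the unbalanced call. It does not explain why an extra end event arrives for framed pages in the first place.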
Thanks

On Friday 15 July 2011 15:23:18 Markus Jelsma wrote:
> Hi,
>
> With the Boilerpipe patch enabled I get an exception in
> DOMBuilder.endElement when parsing certain pages. Looking at the pages at
> random, it seems the problem is limited to sites with frames.
>
> Commenting out the two lines of code in the method `fixes` the problem, and
> it looks like everything else still works.
>
> m_elemStack.pop();
> m_currentNode = m_elemStack.isEmpty() ? null : (Node)m_elemStack.peek();
>
> But, as I am unsure what this code is doing and, more importantly, why it is
> needed, I'm checking here to see if someone can offer an explanation.
>
> Cheers,
>
> 2011-07-15 15:11:01,095 ERROR tika.TikaParser - Error parsing
> http://www.zeemuseum.nl/
> java.util.EmptyStackException
>     at java.util.Stack.peek(Stack.java:85)
>     at java.util.Stack.pop(Stack.java:67)
>     at org.apache.nutch.parse.tika.DOMBuilder.endElement(DOMBuilder.java:349)
>     at org.apache.tika.parser.html.BoilerpipeContentHandler.endDocument(BoilerpipeContentHandler.java:315)
>     at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
>     at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:212)
>     at org.apache.tika.sax.TextContentHandler.endDocument(TextContentHandler.java:57)
>     at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
>     at org.ccil.cowan.tagsoup.Parser.eof(Parser.java:639)
>     at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:589)
>     at org.apache.tika.parser.html.HtmlParser$1.scan(HtmlParser.java:209)
>     at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:213)
>     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.lang.Thread.run(Thread.java:662)
> 2011-07-15 15:11:01,095 WARN parse.ParseSegment - Error parsing:
> http://www.zeemuseum.nl/: failed(2,0): null

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

