Hi,

With the Boilerpipe patch enabled, I get an exception in DOMBuilder.endElement 
when parsing certain pages. Judging from a random sample of the failing pages, 
the problem seems to be limited to sites with frames.

Commenting out the following two lines of code in that method "fixes" the 
problem, and as far as I can tell everything else still works:

m_elemStack.pop();
m_currentNode = m_elemStack.isEmpty() ? null : (Node)m_elemStack.peek();

But since I am unsure what this code is doing and, more importantly, why it is 
needed, I'm checking here to see if someone can offer an explanation.
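
A less drastic workaround than removing the lines might be to guard the pop, since the trace below shows the exception coming from Stack.pop on an already empty stack. The following is only a sketch under that assumption: the class and the helper name popAndPeek are hypothetical, it uses a plain Stack of strings instead of DOM nodes, and it has not been tried against the actual Nutch DOMBuilder source.

```java
import java.util.Stack;

public class GuardedPopSketch {

    // Hypothetical helper mirroring the two lines quoted above:
    // remove the top element only if there is one, then report the
    // new top of the stack, or null when the stack is empty.
    static <T> T popAndPeek(Stack<T> elemStack) {
        if (!elemStack.isEmpty()) {
            elemStack.pop();
        }
        return elemStack.isEmpty() ? null : elemStack.peek();
    }

    public static void main(String[] args) {
        Stack<String> elemStack = new Stack<>();
        elemStack.push("html");
        elemStack.push("frameset");

        System.out.println(popAndPeek(elemStack)); // prints "html"
        System.out.println(popAndPeek(elemStack)); // prints "null"
        // A third call is one end event too many; the unguarded code
        // would throw EmptyStackException at this point.
        System.out.println(popAndPeek(elemStack)); // prints "null"
    }
}
```

This only hides the symptom. It would not explain why endElement is called more often than startElement for pages with frames.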

Cheers,

2011-07-15 15:11:01,095 ERROR tika.TikaParser - Error parsing http://www.zeemuseum.nl/
java.util.EmptyStackException
        at java.util.Stack.peek(Stack.java:85)
        at java.util.Stack.pop(Stack.java:67)
        at org.apache.nutch.parse.tika.DOMBuilder.endElement(DOMBuilder.java:349)
        at org.apache.tika.parser.html.BoilerpipeContentHandler.endDocument(BoilerpipeContentHandler.java:315)
        at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
        at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:212)
        at org.apache.tika.sax.TextContentHandler.endDocument(TextContentHandler.java:57)
        at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
        at org.ccil.cowan.tagsoup.Parser.eof(Parser.java:639)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:589)
        at org.apache.tika.parser.html.HtmlParser$1.scan(HtmlParser.java:209)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:213)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)
2011-07-15 15:11:01,095 WARN  parse.ParseSegment - Error parsing: http://www.zeemuseum.nl/: failed(2,0): null
