Hello,

I have been running Nutch 1.5.1 without problems, but I have come across a
couple of web pages that throw a NullPointerException in the HTML parser
when I try to crawl them.

2014-08-12 14:01:21,844 ERROR org.apache.nutch.parse.html: Error:
java.lang.NullPointerException
        at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)
2014-08-12 14:01:21,844 WARN org.apache.nutch.parse.ParseSegment: Error parsing: http://www.my-ebenefits.com/PenguinRandomHouse/: failed(2,200): java.lang.NullPointerException


What information should I provide to help debug this issue?

Thanks,
Steve
