Hi Steve,

I tried with Nutch 1.9 RC1 and am not getting this exception.

=> ./nutch parsechecker -D http.agent.name=tralala http://www.my-ebenefits.com/PenguinRandomHouse/

Probably something that we fixed since 1.5.1 which is rather outdated. Why don't you give 1.9 a try instead?

Julien

On 12 August 2014 20:34, Steve Cohen <[email protected]> wrote:

> Hello,
>
> I have been running nutch 1.5.1 without a problem but I have run across a
> couple web pages that are giving me a null pointer exception when I try to
> crawl them.
>
> 2014-08-12 14:01:21,844 ERROR org.apache.nutch.parse.html: Error: java.lang.NullPointerException
>     at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
>     at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
>     at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
>     at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
>     at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
>     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
>     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
>     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
>     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.lang.Thread.run(Thread.java:662)
> 2014-08-12 14:01:21,844 WARN org.apache.nutch.parse.ParseSegment: Error
> parsing: http://www.my-ebenefits.com/PenguinRandomHouse/: failed(2,200):
> java.lang.NullPointerException
>
> What information do I need to provide for you to help me debug the issue?
>
> Thanks,
> Steve

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

