I forgot about the parsechecker and indexchecker command line options. When I run parsechecker with the default Nutch and the standard job file, it works:

14/08/13 11:35:28 INFO http.Http: http.proxy.host = null
14/08/13 11:35:28 INFO http.Http: http.proxy.port = 8080
14/08/13 11:35:28 INFO http.Http: http.timeout = 10000
14/08/13 11:35:28 INFO http.Http: http.content.limit = 65536
14/08/13 11:35:28 INFO http.Http: http.agent = tralala/Nutch-1.5.1 (Lucene Random House Crawler; http://www.randomhouse.com/; [email protected] )
14/08/13 11:35:28 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
14/08/13 11:35:28 INFO http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/08/13 11:35:28 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar7029442108299209520/parse-plugins.xml
14/08/13 11:35:29 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
14/08/13 11:35:29 INFO parse.ParserChecker: parsing: http://www.my-ebenefits.com/PenguinRandomHouse/
14/08/13 11:35:29 INFO parse.ParserChecker: contentType: application/xhtml+xml
14/08/13 11:35:29 INFO parse.ParserChecker: signature: 6ac298a128080fcb51e4c3efa1c040df
--------- Url ---------------
http://www.my-ebenefits.com/PenguinRandomHouse/
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title: Penguin Random House

When I run it with the job file the dev built, it gives me this:

14/08/13 11:35:50 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/08/13 11:35:50 INFO conf.Configuration: found resource httpclient-auth.xml at file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/httpclient-auth.xml
14/08/13 11:35:50 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/parse-plugins.xml
HtmlParser setConf - read rules now in parseNeko now
14/08/13 11:35:51 ERROR parse.html: Error: java.lang.NullPointerException
        at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)

So it is something in the configuration. Does the default job file use Neko or TagSoup? I assume Neko, since that is what is in nutch-default.xml. How do I tell which rules have been changed?

Thanks,
Steve

On Wed, Aug 13, 2014 at 4:16 AM, Julien Nioche <[email protected]> wrote:

> Hi Steve,
>
> I tried with Nutch 1.9 RC1 and am not getting this exception.
> => ./nutch parsechecker -D http.agent.name=tralala http://www.my-ebenefits.com/PenguinRandomHouse/
>
> Probably something that we fixed since 1.5.1 which is rather outdated. Why
> don't you give 1.9 a try instead?
>
> Julien
>
>
> On 12 August 2014 20:34, Steve Cohen <[email protected]> wrote:
>
> > Hello,
> >
> > I have been running nutch 1.5.1 without a problem but I have run across a
> > couple web pages that are giving me a null pointer exception when I try to
> > crawl them.
> >
> > 2014-08-12 14:01:21,844 ERROR org.apache.nutch.parse.html: Error: java.lang.NullPointerException
> >         at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
> >         at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> >         at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
> >         at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
> >         at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
> >         at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> >         at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> >         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> >         at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
> >         at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
> >         at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)
> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >         at java.lang.Thread.run(Thread.java:662)
> > 2014-08-12 14:01:21,844 WARN org.apache.nutch.parse.ParseSegment: Error parsing: http://www.my-ebenefits.com/PenguinRandomHouse/: failed(2,200): java.lang.NullPointerException
> >
> >
> > What information do I need to provide for you to help me debug the issue?
> >
> > Thanks,
> > Steve
> >
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
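
On the question of how to tell which configuration settings differ between the standard job file and the dev-built one, here is a minimal sketch in Python. It assumes the configuration XML (for example nutch-default.xml or nutch-site.xml) has already been extracted from each job file, and that both use the usual Hadoop-style <configuration>/<property>/<name>/<value> layout. The function names and the command-line usage are hypothetical, not part of Nutch.

```python
import sys
import xml.etree.ElementTree as ET


def load_properties(xml_data):
    """Return {name: value} from a Hadoop-style <configuration> document.

    xml_data may be str or bytes. Later duplicates of a name overwrite
    earlier ones within the same file.
    """
    root = ET.fromstring(xml_data)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(base, other):
    """Return {name: (base_value, other_value)} for every differing property.

    A value of None means the property is absent from that side.
    """
    changed = {}
    for name in sorted(set(base) | set(other)):
        if base.get(name) != other.get(name):
            changed[name] = (base.get(name), other.get(name))
    return changed


# Hypothetical usage: python confdiff.py standard/nutch-site.xml dev/nutch-site.xml
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as f:
        base_props = load_properties(f.read())
    with open(sys.argv[2], "rb") as f:
        other_props = load_properties(f.read())
    for name, (old, new) in diff_properties(base_props, other_props).items():
        print(f"{name}: {old!r} -> {new!r}")
```

If the two job files disagree on the HTML parser choice, it should show up here; as far as I know the relevant property is parser.html.impl, but check the nutch-default.xml shipped with your version. This only compares configuration values, so changes the dev made to the HtmlParser code itself (the extra "setConf"/"parseNeko" output suggests there are some) would need a source diff instead.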

