I just untarred apache-nutch 1.5.1, added links to the Hadoop core-site.xml, hdfs-site.xml, and mapred-site.xml, ran ant to build a job file, and ran nutch with parsechecker. It runs fine, so the issue is with something my developer changed. Interestingly, I tried setting parser.html.impl to tagsoup in my dev's nutch-site.xml and it works. For now I am using tagsoup with his configuration until I can figure out what is broken.
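In case it helps with tracking down what differs between the stock configuration and the dev's, here is a minimal Python sketch that compares the properties of two Hadoop-style configuration documents (the format nutch-site.xml and nutch-default.xml use). It is not part of Nutch. The function names are my own, and the sample XML is made up for illustration: only parser.html.impl (neko vs. tagsoup) and http.content.limit = 65536 come from this thread, and the plugin.includes value is hypothetical. It only compares property values, so it would not reveal changes made to plugin source code.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(stock, custom):
    """Map each differing property to (stock_value, custom_value).

    None marks a property that is absent on that side.
    """
    changed = {}
    for name in sorted(set(stock) | set(custom)):
        if stock.get(name) != custom.get(name):
            changed[name] = (stock.get(name), custom.get(name))
    return changed


# Hypothetical sample contents. In practice, read the two real files instead,
# e.g. open("conf/nutch-site.xml").read() for each installation.
STOCK = """<configuration>
  <property><name>parser.html.impl</name><value>neko</value></property>
  <property><name>http.content.limit</name><value>65536</value></property>
</configuration>"""

CUSTOM = """<configuration>
  <property><name>parser.html.impl</name><value>tagsoup</value></property>
  <property><name>http.content.limit</name><value>65536</value></property>
  <property><name>plugin.includes</name><value>protocol-httpclient|parse-html</value></property>
</configuration>"""

if __name__ == "__main__":
    differences = diff_properties(load_properties(STOCK), load_properties(CUSTOM))
    for name, (old, new) in differences.items():
        print(f"{name}: {old!r} -> {new!r}")
```

Properties that are identical in both files are omitted from the result, so the output lists only what was changed, added, or removed.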
[email protected]:~ $ /usr/local/src/apache-nutch-1.5.1/runtime/deploy/bin/nutch parsechecker -D http.agent.name=tralala http://www.my-ebenefits.com/PenguinRandomHouse/
14/08/18 13:37:33 INFO parse.ParserChecker: fetching: http://www.my-ebenefits.com/PenguinRandomHouse/
14/08/18 13:37:33 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-nutch/hadoop-unjar7772543522957888323/classes/plugins
14/08/18 13:37:33 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/08/18 13:37:33 INFO plugin.PluginRepository: Registered Plugins:
14/08/18 13:37:33 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
14/08/18 13:37:33 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
14/08/18 13:37:33 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
14/08/18 13:37:33 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
14/08/18 13:37:33 INFO plugin.PluginRepository: HTTP Framework (lib-http)
14/08/18 13:37:33 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
14/08/18 13:37:33 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
14/08/18 13:37:33 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
14/08/18 13:37:33 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
14/08/18 13:37:33 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
14/08/18 13:37:33 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
14/08/18 13:37:33 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
14/08/18 13:37:33 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
14/08/18 13:37:33 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
14/08/18 13:37:33 INFO plugin.PluginRepository: Registered Extension-Points:
14/08/18 13:37:33 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/08/18 13:37:33 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/08/18 13:37:33 INFO plugin.PluginRepository: Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
14/08/18 13:37:33 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/08/18 13:37:33 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/08/18 13:37:33 INFO plugin.PluginRepository: HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
14/08/18 13:37:33 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
14/08/18 13:37:33 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/08/18 13:37:33 INFO http.Http: http.proxy.host = null
14/08/18 13:37:33 INFO http.Http: http.proxy.port = 8080
14/08/18 13:37:33 INFO http.Http: http.timeout = 10000
14/08/18 13:37:33 INFO http.Http: http.content.limit = 65536
14/08/18 13:37:33 INFO http.Http: http.agent = tralala/Nutch-1.5.1
14/08/18 13:37:33 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
14/08/18 13:37:33 INFO http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/08/18 13:37:33 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar7772543522957888323/parse-plugins.xml
14/08/18 13:37:34 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
14/08/18 13:37:34 INFO parse.ParserChecker: parsing: http://www.my-ebenefits.com/PenguinRandomHouse/
14/08/18 13:37:34 INFO parse.ParserChecker: contentType: application/xhtml+xml
14/08/18 13:37:34 INFO parse.ParserChecker: signature: 6ac298a128080fcb51e4c3efa1c040df
---------
Url
---------------
http://www.my-ebenefits.com/PenguinRandomHouse/
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: Penguin Random House
Outlinks: 41
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/WorkArea/FrameworkUI/js/ektron.javascript.ashx?id=-2084040714+-1231752331+-1851527679+-1305609882+-1996034680 anchor:
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/WorkArea/FrameworkUI/css/ektron.stylesheet.ashx?id=-1371297047+685719339 anchor:
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/css/project.css anchor:
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/css/cms_styles.css anchor:
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/css/mMenu.css anchor:
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/ScriptResource.axd?d=I0CGzPeOQYAWLewyl9fyRxS2_qOlLSunBFjJSIM4pgjOHmIuywndBnF0uG46TBXpiH9CVYdQOXlIf8_i0GO5Mu9vm7TUdT9S0zkZLzrUWc7XvByNYcsRZ3wb2hDVdQB1gwRzlhQtz6uiVrEZWLRDHKhMPi7pngtZgRHDi3EXmr9vYdVeq2q6FnGvUYerlgGo0&t=ca758f3 anchor:
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Home.aspx anchor: Cover to cover logo
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/images/i-cover-to-cover-logo.png anchor: Cover to cover logo
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Home.aspx anchor: Home
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Header.aspx?id=103 anchor: Forms
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Header.aspx?id=104 anchor: Contacts
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/uploadedFiles/Documents_and_Files/FINAL_5-12_Penguin_Random_House_Benefits_Guide.pdf#page=1 anchor: SPDs
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Health_Benefits.aspx anchor: Health Benefits
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Medical.aspx anchor: Medical
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Dental.aspx anchor: Dental
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Vision.aspx anchor: Vision
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/FSA.aspx anchor: Flexible Spending Accounts
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/COBRA.aspx anchor: COBRA
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/FinacialSecurity.aspx anchor: Financial Security
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Disability.aspx anchor: Disability
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/LifeInsurance.aspx anchor: Life Insurance
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/401(K).aspx anchor: 401 (k)
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/FinancialEngines.aspx anchor: Financial Engines
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Pension.aspx anchor: Pension
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/VoluntaryBenefits.aspx anchor: Voluntary Benefits
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/WorkLife.aspx anchor: Work/Life
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/HealthAdvocate.aspx anchor: Health Advocate
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/EmployeeAssistanceProgram.aspx anchor: Employee Assistance Program
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Back-UpChildAdultCare.aspx anchor: Back-up Child & Adult Care
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/TransitParkingBenefit.aspx anchor: Transit & Parking
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/BenefitGuide.aspx anchor: Benefits Guide
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Calculator.aspx anchor: Cost Calculator
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/images/Rotating Banner/banner-img2.jpg anchor: banner
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/images/Rotating Banner/banner-txt2.gif anchor: caption
  outlink: toUrl: https://E12.Ultipro.com anchor: ultipro_Login
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/uploadedImages/usloginlogo.png?n=78 anchor: ultipro_Login
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/WorkArea/DownloadAsset.aspx?id=187 anchor: How to enroll
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Medical.aspx anchor: See what my medical plan pays for
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/HealthAdvocate.aspx anchor: Get help with medical claims
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/Calculator.aspx anchor: Find out the cost of my 2014 benefit choices
  outlink: toUrl: http://www.my-ebenefits.com/PenguinRandomHouse/images/penguin-random-house-logo.gif anchor: Penguin Random House Logo
Content Metadata: Strict-Transport-Security=max-age=31536000 Vary=Accept-Encoding Date=Mon, 18 Aug 2014 17:37:32 GMT Content-Length=5032 X-Robots-Tag=noindex, nofollow, noarchive, nosnippet Content-Encoding=gzip Set-Cookie=icookie=A; path=/ Connection=close Content-Type=text/html; charset=utf-8 Server=nginx/1.4.7 Backend=PEKT1 Cache-Control=private
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

On Sat, Aug 16, 2014 at 11:34 AM, Sebastian Nagel <[email protected]> wrote:

> Hi Steve,
>
> does the job file contain the original parse-html from Nutch 1.5.1?
> I cannot sync the stack with
>
> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?view=markup
> (nor with the current trunk / 1.9), e.g. parseNeko() should be lines
> 228-266:
>
> at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
> at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
> at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)
>
> Sebastian
>
> On 08/13/2014 05:43 PM, Steve Cohen wrote:
> > I forgot about the parsechecker and indexchecker command line options.
> >
> > When I run it parsechecker with the default nutch with the standard job
> > file it works.
> >
> > 14/08/13 11:35:28 INFO http.Http: http.proxy.host = null
> > 14/08/13 11:35:28 INFO http.Http: http.proxy.port = 8080
> > 14/08/13 11:35:28 INFO http.Http: http.timeout = 10000
> > 14/08/13 11:35:28 INFO http.Http: http.content.limit = 65536
> > 14/08/13 11:35:28 INFO http.Http: http.agent = tralala/Nutch-1.5.1 (Lucene Random House Crawler; http://www.randomhouse.com/; [email protected])
> > 14/08/13 11:35:28 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> > 14/08/13 11:35:28 INFO http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > 14/08/13 11:35:28 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar7029442108299209520/parse-plugins.xml
> > 14/08/13 11:35:29 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > 14/08/13 11:35:29 INFO parse.ParserChecker: parsing: http://www.my-ebenefits.com/PenguinRandomHouse/
> > 14/08/13 11:35:29 INFO parse.ParserChecker: contentType: application/xhtml+xml
> > 14/08/13 11:35:29 INFO parse.ParserChecker: signature: 6ac298a128080fcb51e4c3efa1c040df
> > ---------
> > Url
> > ---------------
> > http://www.my-ebenefits.com/PenguinRandomHouse/
> > ---------
> > ParseData
> > ---------
> > Version: 5
> > Status: success(1,0)
> > Title: Penguin Random House
> >
> >
> > When I run it with the job file the dev built it gives me this.
> >
> >
> > 14/08/13 11:35:50 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > 14/08/13 11:35:50 INFO conf.Configuration: found resource httpclient-auth.xml at file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/httpclient-auth.xml
> > 14/08/13 11:35:50 INFO conf.Configuration: found resource parse-plugins.xml at file:/tmp/hadoop-nutch/hadoop-unjar8361088391392178235/parse-plugins.xml
> > HtmlParser setConf - read rules now
> > in parseNeko now
> > 14/08/13 11:35:51 ERROR parse.html: Error:
> > java.lang.NullPointerException
> >     at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
> >     at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> >
> >
> > So it is something with the configuration. Does the default job file use
> > Neko or TagSoup? I assume Neko since that is what it is nutch-default.xml.
> > How do I tell what rules have been changed?
> >
> > Thanks,
> > Steve
> >
> >
> > On Wed, Aug 13, 2014 at 4:16 AM, Julien Nioche <[email protected]> wrote:
> >
> >> Hi Steve,
> >>
> >> I tried with Nutch 1.9 RC1 and am not getting this exception.
> >> => ./nutch parsechecker -D http.agent.name=tralala http://www.my-ebenefits.com/PenguinRandomHouse/
> >>
> >> Probably something that we fixed since 1.5.1 which is rather outdated. Why
> >> don't you give 1.9 a try instead?
> >>
> >> Julien
> >>
> >>
> >>
> >> On 12 August 2014 20:34, Steve Cohen <[email protected]> wrote:
> >>
> >>> Hello,
> >>>
> >>> I have been running nutch 1.5.1 without a problem but I have run across a
> >>> couple web pages that are giving me a null pointer exception when I try to
> >>> crawl them.
> >>>
> >>> 2014-08-12 14:01:21,844 ERROR org.apache.nutch.parse.html: Error:
> >>> java.lang.NullPointerException
> >>>     at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
> >>>     at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
> >>>     at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:2033)
> >>>     at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1836)
> >>>     at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)
> >>>     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
> >>>     at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
> >>>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >>>     at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> >>>     at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:347)
> >>>     at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:244)
> >>>     at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:160)
> >>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> >>>     at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> >>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>     at java.lang.Thread.run(Thread.java:662)
> >>> 2014-08-12 14:01:21,844 WARN org.apache.nutch.parse.ParseSegment: Error parsing: http://www.my-ebenefits.com/PenguinRandomHouse/: failed(2,200): java.lang.NullPointerException
> >>>
> >>>
> >>> What information do I need to provide for you to help me debug the issue?
> >>>
> >>> Thanks,
> >>> Steve
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble
> >>
> >
> >

