AbstractMethodError for cyberneko parser
Hi,

I am running the latest version of Nutch. While crawling one particular site, I get an AbstractMethodError in the cyberneko plugin for all of its pages during the fetch step. As I understand it, this kind of error comes from a mismatch between the library version the code was compiled against and the one loaded at runtime. However, I am running a fresh build after an ant clean. Any suggestions would be helpful.

By the way, I am using Java version 1.6.0_18 in a Windows environment.

java.lang.AbstractMethodError: org.cyberneko.html.HTMLScanner.getCharacterOffset()I
    at org.apache.xerces.xni.parser.XMLParseException.<init>(Unknown Source)
    at org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:673)
    at org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportError(HTMLConfiguration.java:662)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2404)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2360)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2267)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
    at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
    at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
    at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
    at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
Re: AbstractMethodError for cyberneko parser
Replacing the current xercesImpl.jar with the one from Nutch 1.0 seems to fix the problem.

On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch harrynu...@gmail.com wrote:
> Hi, I am running the latest version of Nutch. While crawling one particular site, I get an AbstractMethodError in the cyberneko plugin for all of its pages during the fetch step. However, I am running a fresh build after an ant clean. Any suggestions would be helpful.
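For anyone hitting the same error, it can help to confirm which jar actually supplies the Xerces and NekoHTML classes at runtime before swapping files. The following is only a minimal sketch: the class name WhichJar and its helper methods are made up for illustration, and it assumes you run it with the same classpath that Nutch uses. It prints the location each class was loaded from.

```java
import java.security.CodeSource;

public class WhichJar {
    /** Returns where a loaded class came from, or a note for JDK/bootstrap classes. */
    public static String locate(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "bootstrap/JDK (no code source)";
        }
        return src.getLocation().toString();
    }

    /** Same as locate(Class), but looks the class up by name first. */
    public static String locateByName(String className) {
        try {
            return locate(Class.forName(className));
        } catch (ClassNotFoundException e) {
            return "not found on classpath: " + className;
        } catch (LinkageError e) {
            return "not found on classpath: " + className;
        }
    }

    public static void main(String[] args) {
        // Default to the two classes named in the stack trace above.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.xerces.xni.parser.XMLParseException",
            "org.cyberneko.html.HTMLScanner"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locateByName(name));
        }
    }
}
```

If the two classes resolve to jars from different releases, or the Xerces class resolves to the JDK itself, that would be consistent with the compile-time versus runtime mismatch described in the original message.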
Re: AbstractMethodError for cyberneko parser
Hi Harry,

Could you try using parse-tika instead and see whether you get the same problem? I gather from your email that you are using Nutch 1.1 or the SVN version, so parse-tika should be used by default. Have you deactivated it?

Thanks
Julien

On 21 April 2010 11:58, Harry Nutch harrynu...@gmail.com wrote:
> Replacing the current xercesImpl.jar with the one from Nutch 1.0 seems to fix the problem.

--
DigitalPebble Ltd
http://www.digitalpebble.com
Re: AbstractMethodError for cyberneko parser
Thanks Julien.

I have changed nutch-site.xml so that the plugin.includes property contains only parse-(tika) instead of parse-(text|html|js|tika). It works now, since no parser other than tika is picked up.

On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
> Hi Harry, Could you try using parse-tika instead and see whether you get the same problem? I gather from your email that you are using Nutch 1.1 or the SVN version, so parse-tika should be used by default. Have you deactivated it?
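To illustrate why narrowing plugin.includes drops the other parsers, here is a small sketch. It assumes that Nutch treats the property value as a regular expression that must match a whole plugin id; the class PluginIncludesDemo and the method selected are hypothetical and are not part of Nutch. Only the parse-related part of the property is shown, so the rest of your own plugin.includes value would stay as it is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesDemo {
    /** Returns the plugin ids that are fully matched by the given includes regex. */
    public static List<String> selected(String includesRegex, String... pluginIds) {
        Pattern pattern = Pattern.compile(includesRegex);
        List<String> out = new ArrayList<String>();
        for (String id : pluginIds) {
            // matches() requires the whole id to match, not just a prefix.
            if (pattern.matcher(id).matches()) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] ids = {"parse-text", "parse-html", "parse-js", "parse-tika"};
        // Original setting: all four parser plugins are selected.
        System.out.println(selected("parse-(text|html|js|tika)", ids));
        // Narrowed setting: only parse-tika is selected.
        System.out.println(selected("parse-(tika)", ids));
    }
}
```

With the narrowed pattern, parse-html (the plugin that calls the cyberneko parser in the stack trace) is no longer selected, which would explain why the error disappears.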