AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Hi,

I am running the latest version of Nutch. While crawling one particular
site, I get an AbstractMethodError in the cyberneko plugin for all of its
pages during the fetch step.
As I understand it, this error comes from a mismatch between the library
versions used at compile time and at runtime. However, I am running a fresh
build after an ant clean.

Any suggestions would be helpful. By the way, I am using Java version
1.6.0_18 on Windows.


java.lang.AbstractMethodError: org.cyberneko.html.HTMLScanner.getCharacterOffset()I
    at org.apache.xerces.xni.parser.XMLParseException.<init>(Unknown Source)
    at org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:673)
    at org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportError(HTMLConfiguration.java:662)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2404)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2360)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2267)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
    at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
    at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
    at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
    at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)


Re: AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Replacing the current xercesimpl.jar with the one from Nutch 1.0 seems to
fix the problem.
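
For anyone hitting the same error: a jar swap fixing it suggests that the Xerces classes loaded at runtime are not the ones NekoHTML was compiled against. The trace is consistent with that, since the Xerces XMLParseException constructor calls getCharacterOffset() on the HTMLScanner, which the NekoHTML build in use apparently does not implement. Below is a minimal sketch for checking which jar a class actually comes from. The class name JarLocator and the helper whereLoaded are made up for illustration and are not part of Nutch, Xerces, or NekoHTML.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Nutch): reports where a class was loaded from.
public class JarLocator {

    // Returns the jar or directory URL a class was loaded from, or a short
    // explanation when the class is missing or has no code source.
    static String whereLoaded(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                // Classes from the JDK itself usually have no code source.
                return "bootstrap class path (no code source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found on classpath";
        }
    }

    public static void main(String[] args) {
        // The two classes that appear at the top of the stack trace.
        String[] names = {
            "org.apache.xerces.xni.parser.XMLParseException",
            "org.cyberneko.html.HTMLScanner"
        };
        for (String n : names) {
            System.out.println(n + " -> " + whereLoaded(n));
        }
    }
}
```

Run it with the same classpath as the crawl. Nutch loads plugin jars through its own class loaders, so the NekoHTML class may show up as not found unless the plugin's jars are added to the classpath by hand.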

On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch harrynu...@gmail.com wrote:

 Hi,

 I am running the latest version of Nutch. While crawling one particular
 site, I get an AbstractMethodError in the cyberneko plugin for all of its
 pages during the fetch step.
 As I understand it, this error comes from a mismatch between the library
 versions used at compile time and at runtime. However, I am running a fresh
 build after an ant clean.

 Any suggestions would be helpful. By the way, I am using Java version
 1.6.0_18 on Windows.


Re: AbstractMethodError for cyberneko parser

2010-04-21 Thread Julien Nioche
Hi Harry,

Could you try using parse-tika instead and see if you are getting the same
problem? I gather from your email that you are using Nutch 1.1 or the SVN
version, so parse-tika should be used by default. Have you deactivated it?

Thanks

Julien

On 21 April 2010 11:58, Harry Nutch harrynu...@gmail.com wrote:

 Replacing the current xercesimpl.jar with the one from nutch 1.0 seems to
 fix the problem.


-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Re: AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Thanks Julien.
I have changed nutch-site.xml so that the plugin.includes property contains
only parse-(tika) instead of parse-(text | html | js | tika).
It works now, since no parser other than Tika is picked up.
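
The effect of that change can be illustrated with a small sketch. As far as I know, Nutch treats plugin.includes as a regular expression matched against plugin ids, so narrowing parse-(text|html|js|tika) to parse-(tika) means parse-html (the plugin that calls NekoHTML in the trace above) is no longer loaded. The plugin list and the shortened property values below are assumptions for illustration, not the actual contents of nutch-site.xml, and the class PluginIncludesDemo is hypothetical. Note that the real value has no spaces around the | characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo (not Nutch code): shows which plugin ids a
// plugin.includes-style regular expression would select.
public class PluginIncludesDemo {

    // Returns the plugin ids that fully match the given regex.
    static List<String> included(String includesRegex, List<String> pluginIds) {
        Pattern pattern = Pattern.compile(includesRegex);
        List<String> selected = new ArrayList<>();
        for (String id : pluginIds) {
            if (pattern.matcher(id).matches()) {
                selected.add(id);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Assumed subset of plugin ids, for illustration only.
        List<String> ids = Arrays.asList(
            "protocol-http", "parse-text", "parse-html", "parse-js", "parse-tika");

        String before = "protocol-http|parse-(text|html|js|tika)";
        String after = "protocol-http|parse-(tika)";

        // prints [protocol-http, parse-text, parse-html, parse-js, parse-tika]
        System.out.println(included(before, ids));
        // prints [protocol-http, parse-tika]
        System.out.println(included(after, ids));
    }
}
```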

On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Hi Harry,

 Could you try using parse-tika instead and see if you are getting the same
 problem? I gather from your email that you are using Nutch 1.1 or the SVN
 version, so parse-tika should be used by default. Have you deactivated it?

 Thanks

 Julien
