Hi,

The Nutch 1.4 distribution includes

 - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
 - xercesImpl-2.9.1.jar (under .../runtime/local/lib)

These two JARs appear to be incompatible versions. When the HtmlParser
(configured to use neko) is invoked during a local-mode crawl, the parse
fails due to an AbstractMethodError. (Note: I discovered the
AbstractMethodError by rebuilding the HtmlParser plugin and adding a
catch(Throwable) clause in the getParse method to log the stacktrace. With
the original code, the error is unhandled and simply results in the
unhelpful log message "Unable to successfully parse content".)
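
For anyone who wants to reproduce the diagnosis, here is a minimal standalone sketch of the idea. It is not the actual HtmlParser code: the class name, the ParseStep interface and the runAndCapture helper are all made up for illustration, and the error is simulated rather than triggered by neko. The point is only that AbstractMethodError is an Error, so a catch (Exception) block never sees it, while catch (Throwable) does. It uses a lambda, so it assumes Java 8 or later.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class ThrowableLoggingSketch {

    /** Hypothetical stand-in for the parser call made inside getParse. */
    interface ParseStep {
        void run() throws Exception;
    }

    /**
     * Runs the step and returns the full stack trace as a string if anything
     * at all is thrown, or null if the step completes normally.
     */
    static String runAndCapture(ParseStep step) {
        try {
            step.run();
            return null;
        } catch (Throwable t) {
            // Throwable rather than Exception: AbstractMethodError extends
            // Error, so catch (Exception e) would let it pass straight through.
            StringWriter sw = new StringWriter();
            t.printStackTrace(new PrintWriter(sw, true));
            return sw.toString();
        }
    }

    public static void main(String[] args) {
        // Simulated failure; in the real case the error comes from the
        // parser library, not from an explicit throw.
        String trace = runAndCapture(() -> {
            throw new AbstractMethodError("simulated library mismatch");
        });
        System.out.println(trace);
    }
}
```

In the real plugin the captured trace would go to the plugin's logger instead of standard output.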

I found that substituting a later, compatible version of nekohtml (1.9.11)
fixes the problem.
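
In case it helps anyone check their own setup, below is a rough sketch for printing which JAR a given class is actually loaded from. The class name JarLocator and the locate method are made up for this example. It only reports what is visible on the classpath it is run with, so it would need both JARs on that classpath; it does not go through Nutch's plugin class loading. The two class names in the comment are just examples of classes I would expect to find in the neko and Xerces JARs.

```java
import java.security.CodeSource;

public class JarLocator {

    /** Returns the location a class was loaded from, or a short note if unknown. */
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself usually have no code source.
            return (src == null || src.getLocation() == null)
                    ? "(bootstrap or unknown source)"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Example invocation (class names are assumptions, pass your own):
        //   java -cp <jars> JarLocator org.cyberneko.html.parsers.DOMFragmentParser \
        //        org.apache.xerces.parsers.AbstractSAXParser
        for (String name : args) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```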

Curiously, and in support of the above, the nekohtml plugin.xml file in
Nutch 1.4 contains the following:

<plugin
   id="lib-nekohtml"
   name="CyberNeko HTML Parser"
   version="1.9.11"
   provider-name="org.cyberneko">

   <runtime>
       <library name="nekohtml-0.9.5.jar">
           <export name="*"/>
       </library>
   </runtime>
</plugin>

Note the conflicting version numbers (version tag is "1.9.11" but the
specified library is "nekohtml-0.9.5.jar").

Was the 0.9.5 version included by mistake? Was the intention rather to
include 1.9.11?

I'm a Nutch newbie, so please forgive me if I'm missing something obvious
here... :)
