Hi,

The Nutch 1.4 distribution includes:

- nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
- xercesImpl-2.9.1.jar (under .../runtime/local/lib)

These two JARs appear to be incompatible versions. When the HtmlParser (configured to use neko) is invoked during a local-mode crawl, the parse fails with an AbstractMethodError.

(Note: I discovered the AbstractMethodError by rebuilding the HtmlParser plugin and adding a catch(Throwable) clause in the getParse method to log the stack trace. With the original code, the error is unhandled and simply results in the unhelpful log message "Unable to successfully parse content".)

I found that substituting a later, compatible version of nekohtml (1.9.11) fixes the problem.

Curiously, and in support of the above, the nekohtml plugin.xml file in Nutch 1.4 contains the following:

    <plugin id="lib-nekohtml" name="CyberNeko HTML Parser" version="1.9.11" provider-name="org.cyberneko">
      <runtime>
        <library name="nekohtml-0.9.5.jar">
          <export name="*"/>
        </library>
      </runtime>
    </plugin>

Note the conflicting version numbers: the version attribute is "1.9.11", but the specified library is "nekohtml-0.9.5.jar".

Was the 0.9.5 version included by mistake? Was the intention rather to include 1.9.11?

I'm a Nutch newbie, so please forgive me if I'm missing something obvious here... :)
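
P.S. In case it helps anyone trying to reproduce the diagnosis: AbstractMethodError is a java.lang.Error, not an Exception, so a catch(Exception) block never sees it, which is why a catch(Throwable) was needed to get at the stack trace. Below is a minimal standalone sketch of that idea. It is not the actual Nutch HtmlParser code; the class and method names are made up for illustration, and the error is thrown by hand to simulate the JAR mismatch.

```java
// Hypothetical demo class, NOT part of Nutch. It only illustrates why
// catch(Throwable) surfaces an AbstractMethodError that catch(Exception) misses.
class ThrowableCatchDemo {

    // Stand-in for the failing parser call. The real error came from the
    // nekohtml/xerces mismatch; here it is thrown by hand (an assumption
    // made purely so the example is self-contained).
    static void brokenParse() {
        throw new AbstractMethodError("simulated incompatible JAR versions");
    }

    // catch(Exception) does not match java.lang.Error subclasses,
    // so the AbstractMethodError propagates to the caller.
    static String parseCatchingException() {
        try {
            brokenParse();
            return "ok";
        } catch (Exception e) {
            return "caught " + e.getClass().getName();
        }
    }

    // catch(Throwable) matches Errors too, so the stack trace can be logged.
    static String parseCatchingThrowable() {
        try {
            brokenParse();
            return "ok";
        } catch (Throwable t) {
            t.printStackTrace(); // log the real cause instead of losing it
            return "caught " + t.getClass().getName();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCatchingThrowable());
        try {
            parseCatchingException();
        } catch (AbstractMethodError e) {
            System.out.println("escaped catch(Exception): " + e.getMessage());
        }
    }
}
```

Catching Throwable is generally something to do only temporarily while debugging, since it also swallows errors such as OutOfMemoryError.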