I figured out a workaround for this in case anybody needs it.

Apparently Hadoop extracts only the job JAR. In my case, only Nutch gets extracted to jobcache, since I am just running Nutch jobs from my application.

To make Hadoop extract your entire JAR, you must include the Nutch classes in your own JAR. In other words, you must configure Maven so that it unpacks Nutch into your own JAR at build time (see the shade plugin). You can keep the other dependencies as JARs in a folder like "libs", but the job classes must be extracted into your JAR.
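For anyone who wants a starting point, a minimal sketch of what that shade configuration might look like in pom.xml (the plugin version and the Nutch coordinates are assumptions; adjust them to your build):

```xml
<build>
  <plugins>
    <!-- Unpack the Nutch artifact (including its classes/plugins content)
         into the application JAR, so Hadoop finds the plugins after
         extracting the job JAR into jobcache. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <artifactSet>
              <includes>
                <!-- Assumed groupId:artifactId; check your dependency -->
                <include>org.apache.nutch:nutch</include>
              </includes>
            </artifactSet>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With an artifactSet like this, only Nutch is merged into your JAR while the remaining dependencies can stay external.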

Hope this helps.

Best,

Emre

On 07/16/2012 03:09 PM, Emre Çelikten wrote:
Hello,

I am developing a Java application that uses Nutch as a Maven
dependency. I run Nutch jobs from my application just as Nutch itself
does, by calling them like:

ToolRunner.run(NutchConfiguration.create(), new Injector(), args);

I have been unable to get it to work because it cannot find the
plugins, resulting in "java.lang.RuntimeException: Error in configuring
object" errors. I have been trying without success for the last week. I
think I have narrowed down the problem enough to ask here.

Here are the details.

I am using Nutch 1.5.

When I run Nutch like this:

./hadoop jar /apps/nutchjob/apache-nutch-1.5.job
org.apache.nutch.crawl.Injector crawl/crawldb urls/urls

here's what the logs say about plugins:

2012-07-16 14:19:48,450 INFO  plugin.PluginRepository - Plugins: looking
in:
/hadooptmp/mapred/local/taskTracker/hduser/jobcache/job_201207161219_0026/jars/classes/plugins

2012-07-16 14:19:48,787 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository - Registered Plugins:
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -     the nutch
core extension points (nutch-extensionpoints)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -     Basic URL
Normalizer (urlnormalizer-basic)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -     Html Parse
Plug-in (parse-html)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -     Basic
Indexing Filter (index-basic)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -     HTTP
Framework (lib-http)

...

When I run my own application:

./hadoop jar /apps/myapp/myapp.jar myapp.MyApp

The logs say:

2012-07-16 13:13:38,407 WARN  plugin.PluginRepository - Plugins:
directory not found: plugins
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository - Registered Plugins:
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository -     NONE


Both use the vanilla Nutch configuration. Their folder structure is
almost the same, except that Nutch sits in the lib folder as a JAR
library and the JAR itself contains my own class files. The plugins are
located under classes/plugins in the JAR file.

Strangely, in the second case, Hadoop extracts only the contents of the
Nutch library JAR, which does not contain any plugins, to the jobcache
folder. Nothing from my own JAR file is extracted.

Note that my application is not a MapReduce job itself. My main method
just makes some arrangements and then calls jobs like Injector, Fetcher,
etc. using ToolRunner. I suspect this might have something to do with
it. Should I make my main class implement the Tool interface and then
call it with ToolRunner, making it a custom version of the Crawl class?
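In case it clarifies what I mean, a minimal sketch of such a wrapper might look like this (class name and structure are hypothetical; it assumes the Hadoop and Nutch dependencies are on the classpath):

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical wrapper: the main class itself implements Tool, so
// ToolRunner parses the generic Hadoop options and sets the Configuration.
public class MyApp extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ...make some arrangements, then delegate to the Nutch jobs,
        // reusing the configuration that ToolRunner injected:
        return ToolRunner.run(getConf(), new Injector(), args);
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(NutchConfiguration.create(), new MyApp(), args);
        System.exit(res);
    }
}
```

This mirrors how Nutch's own Crawl class is driven, but wrapped around my application's setup code.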

This might be more of a Hadoop question than a Nutch one, sorry about that.

Also, would it be possible for you to distribute the default Nutch
plugins as a Maven dependency JAR? Nutch 1.5 is unusable for its
standard use case if its default plugins are not included, which defeats
the purpose of Maven, no?

Any help would be really appreciated.

Thanks very much in advance,

Emre
