Hadoop cannot find plugins when I launch my Nutch wrapper application

Emre Çelikten Mon, 16 Jul 2012 05:10:06 -0700

Hello,

I am developing a Java application that uses Nutch as a Maven dependency. I run Nutch jobs from my application the same way Nutch itself does, by calling them like this:


ToolRunner.run(NutchConfiguration.create(), new Injector(), args);

I have been unable to get it to work because it is not able to find the plugins, resulting in "java.lang.RuntimeException: Error in configuring object" errors. I have been trying unsuccessfully for the past week. I think I have narrowed down the problem enough to ask here.


Here are the details.

I am using Nutch 1.5.

When I run Nutch like this:

./hadoop jar /apps/nutchjob/apache-nutch-1.5.job org.apache.nutch.crawl.Injector crawl/crawldb urls/urls


here's what the logs say about plugins:

2012-07-16 14:19:48,450 INFO  plugin.PluginRepository - Plugins: looking in: /hadooptmp/mapred/local/taskTracker/hduser/jobcache/job_201207161219_0026/jars/classes/plugins
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository - Registered Plugins:
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)


...

When I run my own application:

./hadoop jar /apps/myapp/myapp.jar myapp.MyApp

The logs say:

2012-07-16 13:13:38,407 WARN  plugin.PluginRepository - Plugins: directory not found: plugins
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository - Registered Plugins:
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository -         NONE
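
The "directory not found: plugins" warning suggests that the relative folder name "plugins" could not be resolved on the classpath. A quick way to see what the class loader can find is a small probe like the one below. This is only a sketch: it assumes that Nutch looks up the relative plugin folder as a classpath resource, the class name PluginFolderProbe is made up for illustration, and the result is only meaningful when the probe runs in the same JVM and with the same classpath as the failing job.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper, not part of Nutch or Hadoop.
public class PluginFolderProbe {

    // Returns every classpath location where the given resource name is visible.
    public static List<URL> locate(String resourceName) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        return new ArrayList<>(Collections.list(loader.getResources(resourceName)));
    }

    public static void main(String[] args) throws IOException {
        // "plugins" is the folder name from the warning above.
        String name = args.length > 0 ? args[0] : "plugins";
        List<URL> hits = locate(name);
        if (hits.isEmpty()) {
            System.out.println("Not visible on the classpath: " + name);
        }
        for (URL url : hits) {
            System.out.println("Found: " + url);
        }
    }
}
```

An empty result would be consistent with the warning in the log; a result with a non-file URL (for example one pointing inside a jar) would mean the folder is visible but not unpacked to disk.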

Both use the vanilla Nutch configuration. Their folder structure is almost the same, except that in my jar Nutch sits in the lib folder as a jar library and the file also includes my own class files. Plugins are located under classes/plugins in the jar file.
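
The jar layout described above can be checked with a small standalone program. The following is a minimal sketch that uses only java.util.jar; the class name PluginEntryLister is made up for illustration, and the default path and prefix are simply the ones mentioned in this message.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical diagnostic helper, not part of Nutch or Hadoop.
public class PluginEntryLister {

    // Returns the names of all entries in the jar whose path starts with the prefix.
    public static List<String> listEntries(String jarPath, String prefix) throws IOException {
        List<String> matches = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.startsWith(prefix)) {
                    matches.add(name);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Defaults come from the description above; pass other values as arguments.
        String jarPath = args.length > 0 ? args[0] : "/apps/myapp/myapp.jar";
        String prefix = args.length > 1 ? args[1] : "classes/plugins/";
        for (String name : listEntries(jarPath, prefix)) {
            System.out.println(name);
        }
    }
}
```

Running it against both the Nutch job file and the wrapper jar shows whether the plugin entries sit under the same prefix in each.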

Strangely, in the second case, Hadoop only extracts the contents of the Nutch library jar, which does not contain any plugins, to the jobcache folder. Nothing from my own jar file is extracted.

Note that my application is not a MapReduce job itself. My main method just makes some arrangements and then calls jobs like Injector, Fetcher, etc. using ToolRunner. I suspect this might have something to do with it. Should I make my main class implement the Tool interface and then call it with ToolRunner, making it a custom version of the Crawl class?


This might be more of a Hadoop question than a Nutch one; sorry about that.

Also, would it be possible for you to distribute the default Nutch plugins as a Maven dependency jar? Nutch 1.5 is unusable for its standard use case if its default plugins are not included, which defeats the purpose of Maven, no?


Any help would be really appreciated.

Thanks very much in advance,

Emre
