Hello,

On 07/16/2012 03:11 PM, Jim Chandler wrote:
What does your nutch-site.xml look like?

It's the same as the default nutch-site.xml: the <configuration>
element is empty. I am using the default configuration that is
distributed with Nutch. Nutch works with it, whereas my project doesn't.

What about your plugin.xml for your project?

My application is not a plugin, so there is no plugin.xml for my
project. I am just calling Nutch API functions to crawl some websites.
It is similar to the "run" method of Crawl.java in the Nutch sources:
just a plain Java class with a main method.

Thanks,

Emre

Jim

On Mon, Jul 16, 2012 at 8:09 AM, Emre Çelikten <[email protected]>
wrote:

Hello,

I am developing a Java application that uses Nutch as a Maven
dependency. I run Nutch jobs from my application the same way
Nutch itself does, with calls like:

ToolRunner.run(NutchConfiguration.create(), new Injector(), args);

I have been unable to get it to work because it cannot find the
plugins, which results in "java.lang.RuntimeException: Error in
configuring object" errors. I have been trying unsuccessfully for
the past week. I think I have narrowed the problem down enough to
ask here.

Here are the details.

I am using Nutch 1.5.

When I run Nutch like this:

./hadoop jar /apps/nutchjob/apache-nutch-1.5.job org.apache.nutch.crawl.Injector crawl/crawldb urls/urls

here's what the logs say about plugins:

2012-07-16 14:19:48,450 INFO  plugin.PluginRepository - Plugins: looking in: /hadooptmp/mapred/local/taskTracker/hduser/jobcache/job_201207161219_0026/jars/classes/plugins
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository - Registered Plugins:
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2012-07-16 14:19:48,787 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)

...

When I run my own application:

./hadoop jar /apps/myapp/myapp.jar myapp.MyApp

The logs say:

2012-07-16 13:13:38,407 WARN  plugin.PluginRepository - Plugins: directory not found: plugins
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository - Registered Plugins:
2012-07-16 13:13:38,407 INFO  plugin.PluginRepository -         NONE


Both use the vanilla Nutch configuration. Their folder structures
are almost the same, except that in my jar file Nutch is in the
lib folder as a jar library and the file also includes my own
class files. Plugins are located under classes/plugins in the jar
file.

Strangely, in the second case, Hadoop extracts only the contents
of the Nutch library jar, which does not contain any plugins, to
the jobcache folder. Nothing from my own jar file is extracted.
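
The "directory not found: plugins" warning can be narrowed down
with a small classpath check. The sketch below assumes that Nutch
looks up a relative plugin folder name such as "plugins" as a
classpath resource; that is an assumption here, not something
confirmed in this thread. The class name PluginDirCheck is
hypothetical, and only standard Java calls are used.

```java
import java.net.URL;

// Hypothetical diagnostic, not part of Nutch or Hadoop.
public class PluginDirCheck {

    /** Describes how a relative folder name resolves through the given class loader. */
    static String describe(ClassLoader loader, String folderName) {
        URL url = loader.getResource(folderName);
        if (url == null) {
            // Nothing on the classpath has this name.
            return "not found on classpath: " + folderName;
        }
        if (!"file".equals(url.getProtocol())) {
            // The resource exists but lives inside an archive or another
            // non-file location, so it is not a plain directory on disk.
            return "found, but not on the file system: " + url;
        }
        return "found on the file system: " + url;
    }

    public static void main(String[] args) {
        // Assumed default: the folder name that appears in the warning.
        String folderName = args.length > 0 ? args[0] : "plugins";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        System.out.println(describe(loader, folderName));
    }
}
```

Calling describe(...) from the application's main method, just
before the first ToolRunner call, would show which of the three
cases applies on the machine where the job is launched.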

Note that my application is not a MapReduce job itself. My main
method just makes some arrangements and then calls jobs like
Injector, Fetcher, etc. using ToolRunner. I suspect this might
have something to do with it. Should I make my main class
implement the Tool interface and then call it with ToolRunner,
making it a custom version of the Crawl class?

This might be more of a Hadoop question than a Nutch one, sorry
about that.

Also, would it be possible for you to distribute the default Nutch
plugins as a Maven dependency jar? Nutch 1.5 is unusable for its
standard use case if its default plugins are not included, which
defeats the purpose of Maven, no?

Any help would be really appreciated.

Thanks very much in advance,

Emre



