Thanks for a great reply!

Right now I have 4 URLs in my seed file, with domains d1, d2, d3, and d4.

I see that when the Nutch job runs on Hadoop, it only picks up
URLs for d4; there does not seem to be any parallelism.

I am running the Nutch job using the following command:

bin/hadoop jar \
  /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job \
  org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000 \
  -topN 30000




On Mon, Dec 9, 2013 at 8:16 PM, Tejas Patil <[email protected]> wrote:

> When you run Nutch over Hadoop, i.e. in deploy mode, you use the job file
> (apache-nutch-1.X.job). This is nothing but a big zip file
> (you can unzip it and verify this yourself) containing:
> (a) all the nutch classes compiled,
> (b) config files and
> (c) dependent jars
>
> When Hadoop launches map-reduce jobs for Nutch:
> 1. The Nutch job file is copied over to the node where your job is
> executed (say, as a map task).
> 2. It is unpacked.
> 3. Nutch reads nutch-site.xml and nutch-default.xml and loads the configs.
> 4. By default, plugin.folders is set to "plugins", which is a relative path.
> Nutch searches the classpath for the plugin classes under a directory named
> "plugins".
> 5. The "plugins" directory sits under a directory named "classes", which is on
> the classpath (this is inside the extracted job file). The required plugin
> classes are loaded from here and everything runs fine.
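
One way to check steps 4 and 5 without running a crawl is to list what sits under classes/plugins/ inside the job file. The following is a minimal Python sketch, not part of Nutch: the function name plugin_dirs is hypothetical, and the classes/plugins/ prefix is assumed from the layout described above, so adjust it if your job file is laid out differently.

```python
import zipfile


def plugin_dirs(job_path, prefix="classes/plugins/"):
    """Return the sorted names found directly under `prefix` in a job file.

    `prefix` is an assumption based on the layout described in the thread
    (a "plugins" directory under "classes"); it is not read from any config.
    """
    names = set()
    # A .job file is an ordinary zip archive, so zipfile can open it directly.
    with zipfile.ZipFile(job_path) as job:
        for entry in job.namelist():
            if not entry.startswith(prefix):
                continue
            # Keep only the first path component below the prefix,
            # e.g. "classes/plugins/parse-html/plugin.xml" -> "parse-html".
            first = entry[len(prefix):].split("/", 1)[0]
            if first:
                names.add(first)
    return sorted(names)
```

Calling plugin_dirs with the path to your apache-nutch-1.X.job should return one name per plugin directory bundled in it. An empty list would suggest that nothing was packaged under that prefix.
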
>
> In short: Leave it as it is. It should work over Hadoop by default.
>
> Thanks,
> Tejas
>
> On Mon, Dec 9, 2013 at 4:54 PM, S.L <[email protected]> wrote:
>
> > What should the plugins property be set to when running Nutch as a
> > Hadoop job?
> >
> > I just created a deploy mode jar by running the ant script, and I see that
> > the value of the plugins property is copied from the configuration into
> > the Hadoop job. It seems to find the plugins directory because Hadoop is
> > running on the same machine, but I am sure it will fail when moved to a
> > different machine.
> >
> > How should I set the plugins property so that it is relative to the
> > Hadoop job?
> >
> > Thanks
> >
>
