Thanks for a great reply! Right now I have 4 URLs in my seed file, with domains d1, d2, d3 and d4.
I see that when the Nutch job runs on Hadoop it only picks up URLs for d4; there does not seem to be any parallelism. I am running the Nutch job using the following command:

bin/hadoop jar /home/general/workspace/nutch/runtime/deploy/apache-nutch-1.8-SNAPSHOT.job org.apache.nutch.crawl.Crawl urls -dir crawldirectory -depth 1000 -topN 30000

On Mon, Dec 9, 2013 at 8:16 PM, Tejas Patil <[email protected]> wrote:

> When you run Nutch over Hadoop, i.e. deploy mode, you use the job file
> (apache-nutch-1.X.job). This is nothing but a big fat zip file
> containing (you can unzip it and verify yourself):
> (a) all the nutch classes compiled,
> (b) config files and
> (c) dependent jars
>
> When hadoop launches map-reduce jobs for nutch:
> 1. This nutch job file is copied over to the node where your job is
> executed (say map task),
> 2. It is unpacked
> 3. Nutch gets the nutch-site.xml and nutch-default.xml, loads the configs.
> 4. By default, plugin.folders is set to "plugins" which is a relative path.
> It would search the plugin classes in the classpath under a directory named
> "plugins".
> 5. The "plugins" directory is under a directory named "classes" which is in
> the classpath (this is inside the extracted job file). Now, required plugin
> classes are loaded from here and everything runs fine.
>
> In short: Leave it as it is. It should work over Hadoop by default.
>
> Thanks,
> Tejas
>
> On Mon, Dec 9, 2013 at 4:54 PM, S.L <[email protected]> wrote:
>
> > What should the plugins property be set to when running Nutch as a
> > Hadoop job?
> >
> > I just created a deploy mode jar by running the ant script. I see that the
> > value of the plugins property is being copied and used from the
> > configuration into the hadoop job. While it seems to be getting the
> > plugins directory because Hadoop is being run on the same machine, I am
> > sure it will fail when moved to a different machine.
> >
> > How should I set the plugins property so that it is relative to the
> > hadoop job?
> >
> > Thanks
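
For anyone who wants to check the job file layout described in the quoted reply without unpacking it by hand, here is a minimal sketch in Python. It relies only on the statement above that the job file is a zip archive with the "plugins" directory under "classes". The function name plugin_entries and the default prefix "classes/plugins/" are my own assumptions for illustration, not anything Nutch provides.

```python
import zipfile


def plugin_entries(job_path, prefix="classes/plugins/"):
    """Return the archive member names that start with `prefix`.

    Assumption: the job file is an ordinary zip archive and the plugins
    live under classes/plugins/, as described in the quoted reply.
    """
    with zipfile.ZipFile(job_path) as job:
        return [name for name in job.namelist() if name.startswith(prefix)]
```

Calling plugin_entries with the path to the apache-nutch-1.X.job file should return a non-empty list if the plugins were packaged the way the reply describes; an empty list would suggest the layout differs from that assumption.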

