On 13 June 2011 21:15, Jason Stubblefield <[email protected]> wrote:
> Thanks for the help Julien, I'll just copy the files to the hadoop conf
> directory for now while it is a single node.
>
> If I use the job file do I have to have the nutch package on each node in
> the cluster, or just on the master node?

Just on the master - it is sent to all the nodes for you, just like any
normal mapreduce job.

> I'm also curious if it would be possible or practical to declare the
> NUTCH_CONF_DIR in a nutch-env.sh file like hadoop uses, or somewhere in
> the nutch script. Thanks again.

Hmmm. Relying on the conf files on the master only is OK, but that won't
help with the URLFilter files etc. It is much simpler to generate a job
file, use it from the master and let hadoop distribute it to the slaves.

Julien

> ~Jason
>
> On Mon, Jun 13, 2011 at 4:03 PM, Julien Nioche
> <[email protected]> wrote:
>
> > Hi Jason,
> >
> > If you have hadoop running independently from Nutch you should use
> > runtime/deploy/bin. The conf files can go directly in the hadoop/conf
> > dir or in the Nutch job, which you will need to regenerate with
> > 'ant job' so that it reflects the changes you made in NUTCH/conf.
> >
> > Julien
> >
> > On 13 June 2011 11:59, Jason Stubblefield <[email protected]> wrote:
> >
> > > Update: The nutch configuration files need to go in the hadoop conf
> > > directory.
> > >
> > > Maybe someone could recommend some best practices regarding the file
> > > structure? Should all the nutch config files simply be copied to the
> > > hadoop conf directory? Currently I have:
> > >
> > > /webcrawler/hadoop
> > > /webcrawler/nutch
> > >
> > > I guess I'm a bit confused because 1.3 didn't come bundled with
> > > hadoop.
> > >
> > > Thanks!
> > >
> > > ~Jason
> > >
> > > On Mon, Jun 13, 2011 at 12:07 PM, Jason Stubblefield
> > > <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I'm trying to fetch a segment using hadoop on a single node with
> > > > nutch 1.3. I seem to be struggling with the new runtime
> > > > configuration. I have hadoop up and running and have successfully
> > > > run the readdb -stats command and generated a segment, but when I
> > > > run:
> > > >
> > > > runtime/deploy/bin/nutch fetch crawl/segments/20110613103305 -threads 8
> > > >
> > > > I get an error message: No agents listed in 'http.agent.name' property
> > > >
> > > > I noticed there are now 2 conf dirs, one at trunk/conf and the
> > > > other at trunk/runtime/local/conf, and have updated both of them
> > > > with my nutch-site.xml file; both have a properly configured
> > > > http.agent.name.
> > > >
> > > > Do I need to explicitly declare the conf directory somewhere? Do I
> > > > need to move the conf file to trunk/runtime/deploy/conf, or put it
> > > > somewhere else? What am I missing?
> > > >
> > > > Thanks in advance!
> > > >
> > > > ~Jason
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
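For anyone landing here with the same "No agents listed in 'http.agent.name'" error: the property lives in NUTCH/conf/nutch-site.xml, and with the workflow Julien describes it only takes effect on the cluster after the job file is regenerated. A minimal sketch of the relevant fragment (the agent name below is an illustrative placeholder, not a value from this thread):

```xml
<?xml version="1.0"?>
<!-- Sketch of NUTCH/conf/nutch-site.xml: "MyNutchCrawler" is a
     placeholder agent name; substitute your own crawler's name. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
    <description>HTTP User-Agent name sent to crawled sites.</description>
  </property>
</configuration>
```

After editing the file, rerun 'ant job' so the change is baked into the job file, then retry the fetch from runtime/deploy/bin on the master.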

