On 13 June 2011 21:15, Jason Stubblefield
<[email protected]>wrote:

> Thanks for the help Julien, I'll just copy the files to the hadoop conf
> directory for now while it is a single node.
>
> If I use the job file do I have to have the nutch package on each node in
> the cluster, or just on the master node?
>

Just on the master - the job file is shipped to all the nodes for you, just
like any normal MapReduce job.
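If you want to convince yourself, you can list the contents of the job file that gets shipped; this is just a sketch, and the job file name is an assumption based on a stock 1.3 build, so adjust it to whatever your build actually produces:

```shell
# Inspect the job file built by 'ant job' (name assumed) and confirm
# that the conf files and plugins are bundled inside it
jar tf runtime/deploy/apache-nutch-1.3.job | grep -E 'nutch-site.xml|^plugins/' | head
```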


> I'm also curious if it would be possible or practical to declare the
> NUTCH_CONF_DIR in a nutch-env.sh file like hadoop uses, or somewhere in the
> nutch script.  Thanks again.
>

Hmmm. Relying on the conf files on the master only is OK, but that won't help
with the URLFilter files etc. It is much simpler to generate a job file, use
it from the master, and let Hadoop distribute it to the slaves.
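A sketch of that workflow (the trunk path is a placeholder and the segment path is just the one from your earlier command):

```shell
# Rebuild the job file so it picks up the edits made in NUTCH/conf
# (nutch-site.xml, URL filter files, etc.)
cd /path/to/nutch/trunk
ant job

# Run from the master via the deploy runtime; Hadoop ships the job file
# (conf + plugins included) to the slaves for you
runtime/deploy/bin/nutch fetch crawl/segments/20110613103305 -threads 8
```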

Julien



>
> ~Jason
>
> On Mon, Jun 13, 2011 at 4:03 PM, Julien Nioche <
> [email protected]> wrote:
>
> > Hi Jason,
> >
> > If you have hadoop running independently from Nutch you should use
> > runtime/deploy/bin. The conf files can go directly in the hadoop/conf
> > dir, or in the Nutch job, which you will need to regenerate with 'ant
> > job' so that it reflects the changes you made in NUTCH/conf.
> >
> > Julien
> >
> > On 13 June 2011 11:59, Jason Stubblefield
> > <[email protected]> wrote:
> >
> > > Update:  The nutch configuration files need to go in the hadoop conf
> > > directory.
> > >
> > > Maybe someone could recommend some best practices regarding the file
> > > structure?  Should all the nutch config files simply be copied to the
> > > hadoop conf directory?  Currently I have:
> > >
> > > /webcrawler/hadoop
> > > /webcrawler/nutch
> > >
> > > I guess I'm a bit confused because 1.3 didn't come bundled with hadoop.
> > >
> > > Thanks!
> > >
> > > ~Jason
> > >
> > > On Mon, Jun 13, 2011 at 12:07 PM, Jason Stubblefield <
> > > [email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I'm trying to fetch a segment using hadoop on a single node with
> > > > nutch 1.3.  I seem to be struggling with the new runtime
> > > > configuration.  I have hadoop up and running and have successfully
> > > > run the readdb -stats command and generated a segment, but when I run:
> > > >
> > > > runtime/deploy/bin/nutch fetch crawl/segments/20110613103305 -threads 8
> > > >
> > > > I get an error message: No agents listed in 'http.agent.name' property
> > > >
> > > > I noticed there are now two conf directories, one at trunk/conf and
> > > > the other at trunk/runtime/local/conf, and have updated both of them
> > > > with my nutch-site.xml file; both have a properly configured
> > > > http.agent.name.
> > > >
> > > > Do I need to explicitly declare the conf directory somewhere?  Do I
> > > > need to move the conf file to trunk/runtime/deploy/conf, or put it
> > > > somewhere else?  What am I missing?
> > > >
> > > > Thanks in advance!
> > > >
> > > > ~Jason
> > > >
> > >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
