Thanks for the response. I see that when I ran ant package, a nutch-1.2.job file was placed in /opt/nutch, which is the location I specified in build.properties. So you are saying the job file was built with the default conf files that ship in the apache-nutch-1.2/conf directory, since I didn't change the conf files until after I built Nutch.
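Just to double-check that on my end, here is roughly how I have been rebuilding and inspecting the job file (a quick sketch; the source path is from my own setup, so adjust as needed):

    # Rebuild the job file after editing conf/ so the changes get packed in
    cd ~/apache-nutch-1.2      # my source tree (hypothetical path)
    ant package

    # The .job file is an ordinary zip archive, so list its contents to
    # confirm which conf files were actually baked in
    unzip -l /opt/nutch/nutch-1.2.job | grep regex-urlfilter

    # And to check the actual rules inside the job (assuming the conf
    # files sit at the archive root):
    unzip -p /opt/nutch/nutch-1.2.job regex-urlfilter.txt

That at least lets me see which copy of regex-urlfilter.txt ended up inside the job.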
What do I need to do to set up two different crawls? Can I put two conf directories into the source tree (call them conf1 and conf2), build Nutch with ant package so that both conf directories and their files get written into nutch-1.2.job, and then set NUTCH_CONF_DIR=conf1 for the first crawl script and NUTCH_CONF_DIR=conf2 for the second? (Roughly like the sketch below the quoted thread.)

On Thu, Dec 23, 2010 at 2:02 PM, Claudio Martella <[email protected]> wrote:

> When you run Nutch on a Hadoop cluster you use the nutch.job file. The
> config files are all packed into it, so you should repack the job with
> ant every time you change one of the config files.
>
> On 12/23/10 5:11 PM, Steve Cohen wrote:
> > To run Nutch fetches on multiple sites you can set a NUTCH_CONF_DIR
> > environment variable, as it says in the FAQ:
> >
> > http://wiki.apache.org/nutch/FAQ
> >
> > "How can I force the fetcher to use a custom Nutch config?
> >
> >   - Create a new sub-directory under $NUTCH_HOME/conf, like
> >     conf/myconfig
> >   - Copy these files from $NUTCH_HOME/conf to the new directory:
> >     common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml,
> >     regex-normalize.xml, regex-urlfilter.txt
> >   - Modify the nutch-default.xml to suit your needs
> >   - Set the NUTCH_CONF_DIR environment variable to point to the
> >     directory you created
> >   - Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR
> >     environment variable. Check the command output for the lines where
> >     the configs are loaded, to confirm they are really loaded from
> >     your custom dir.
> >   - Happy using."
> >
> > This worked fine when I wasn't using a Hadoop cluster. Once I started
> > a Hadoop cluster, however, it ignored regex-urlfilter.txt. When I
> > included regex-urlfilter.txt with the Hadoop configuration files and
> > restarted the cluster, the Nutch fetch followed the rules. The problem
> > with this is that if you are doing two different fetches with
> > different regex-urlfilter.txt files, you need to stop the HDFS
> > cluster, put in the new configuration files, and restart the cluster.
> >
> > Is there any other way of setting this up?
> >
> > Thanks,
> > Steve
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax +39 0471 068 129
> [email protected]  http://www.tis.bz.it
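Here is roughly what I am picturing for the two crawls (untested; the conf1/conf2 directories, seed lists, and script contents are all hypothetical, and whether the tasks on the cluster will actually honor NUTCH_CONF_DIR once both directories are packed into the job is exactly what I am unsure about):

    # In the source tree: two conf directories, then one build packing both
    cp -r conf conf1       # regex-urlfilter.txt etc. tuned for crawl #1
    cp -r conf conf2       # tuned for crawl #2
    ant package            # hoping both directories end up in nutch-1.2.job

    # crawl1.sh
    export NUTCH_CONF_DIR=conf1
    $NUTCH_HOME/bin/nutch crawl urls1 -dir crawl1 -depth 3

    # crawl2.sh
    export NUTCH_CONF_DIR=conf2
    $NUTCH_HOME/bin/nutch crawl urls2 -dir crawl2 -depth 3

Does that match what you had in mind, or would I need to repack a separate job file per crawl?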

