Thanks for the response. I see that when I ran ant package, a nutch-1.2.job file was placed in /opt/nutch, which is the location I specified in build.properties. So you are saying the job file was built with the default conf files that ship in the apache-nutch-1.2/conf directory, since I didn't change the conf files until after I built Nutch.
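Just to double-check that on my end, here is roughly how I have been rebuilding and inspecting the job file (a quick sketch; the source path is from my own setup, so adjust as needed):

    # Rebuild the job file after editing conf/ so the changes get packed in
    cd ~/apache-nutch-1.2      # my source tree (hypothetical path)
    ant package

    # The .job file is an ordinary zip archive, so list its contents to
    # confirm which conf files were actually baked in
    unzip -l /opt/nutch/nutch-1.2.job | grep regex-urlfilter

    # And to check the actual rules inside the job (assuming the conf
    # files sit at the archive root):
    unzip -p /opt/nutch/nutch-1.2.job regex-urlfilter.txt

That at least lets me see which copy of regex-urlfilter.txt ended up inside the job.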
What do I need to do to set up two different crawls? Can I put two conf directories into the source tree (call them conf1 and conf2), build Nutch with ant package so that both conf directories and their files get written into nutch-1.2.job, and then set NUTCH_CONF_DIR=conf1 for the first crawl script and NUTCH_CONF_DIR=conf2 for the second? (Roughly like the sketch below the quoted thread.)

On Thu, Dec 23, 2010 at 2:02 PM, Claudio Martella <[email protected]> wrote:

> When you run Nutch on a Hadoop cluster you use the nutch.job file. The
> config files are all packed into it, so you should repack the job with
> ant every time you change one of the config files.
>
> On 12/23/10 5:11 PM, Steve Cohen wrote:
> > To run Nutch fetches on multiple sites you can set a NUTCH_CONF_DIR
> > environment variable, as it says in the FAQ:
> >
> > http://wiki.apache.org/nutch/FAQ
> >
> > "How can I force the fetcher to use a custom Nutch config?
> >
> >   - Create a new sub-directory under $NUTCH_HOME/conf, like
> >     conf/myconfig
> >   - Copy these files from $NUTCH_HOME/conf to the new directory:
> >     common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml,
> >     regex-normalize.xml, regex-urlfilter.txt
> >   - Modify the nutch-default.xml to suit your needs
> >   - Set the NUTCH_CONF_DIR environment variable to point to the
> >     directory you created
> >   - Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR
> >     environment variable. Check the command output for the lines where
> >     the configs are loaded, to confirm they are really loaded from
> >     your custom dir.
> >   - Happy using."
> >
> > This worked fine when I wasn't using a Hadoop cluster. Once I started
> > a Hadoop cluster, however, it ignored regex-urlfilter.txt. When I
> > included regex-urlfilter.txt with the Hadoop configuration files and
> > restarted the cluster, the Nutch fetch followed the rules. The problem
> > with this is that if you are doing two different fetches with
> > different regex-urlfilter.txt files, you need to stop the HDFS
> > cluster, put in the new configuration files, and restart the cluster.
> >
> > Is there any other way of setting this up?
> >
> > Thanks,
> > Steve
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax +39 0471 068 129
> [email protected]  http://www.tis.bz.it
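Here is roughly what I am picturing for the two crawls (untested; the conf1/conf2 directories, seed lists, and script contents are all hypothetical, and whether the tasks on the cluster will actually honor NUTCH_CONF_DIR once both directories are packed into the job is exactly what I am unsure about):

    # In the source tree: two conf directories, then one build packing both
    cp -r conf conf1       # regex-urlfilter.txt etc. tuned for crawl #1
    cp -r conf conf2       # tuned for crawl #2
    ant package            # hoping both directories end up in nutch-1.2.job

    # crawl1.sh
    export NUTCH_CONF_DIR=conf1
    $NUTCH_HOME/bin/nutch crawl urls1 -dir crawl1 -depth 3

    # crawl2.sh
    export NUTCH_CONF_DIR=conf2
    $NUTCH_HOME/bin/nutch crawl urls2 -dir crawl2 -depth 3

Does that match what you had in mind, or would I need to repack a separate job file per crawl?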

