I'll answer my own question, since I got it working.
1) Put the conf files you want to use into src/nutch/conf/.
2) In build.xml, change the job file name in the "jar jarfile" line quoted
below so that the line reads something like
<jar jarfile="${build.dir}/nutch-site1.job">.
3) Run ant and copy the job file to /opt/nutch/ or wherever you are
installing nutch.
4) Make a copy of /opt/nutch/bin/nutch called /opt/nutch/bin/nutch_site1 and
edit the nutch-*.job section like this:
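if [ $IS_CORE == 0 ]
then
  # for developers, add the site-specific job from the build directory
  for f in $NUTCH_HOME/build/nutch-site1.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done
  # for releases, add Nutch job to CLASSPATH
  for f in $NUTCH_HOME/nutch-site1.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done
# (the rest of the original if block is unchanged)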
5) Point your crawl script at /opt/nutch/bin/nutch_site1 instead of
/opt/nutch/bin/nutch.
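
With more than a couple of sites, keeping one copy of bin/nutch per site gets
tedious. As a sketch only, the edited section could take the site name as a
parameter instead. The function name add_site_job and its two arguments are
hypothetical, not part of the stock nutch script, and this variant skips job
files that do not exist, which the plain for-loops above do not do.

```shell
#!/usr/bin/env bash
# Sketch only: add_site_job is a hypothetical helper, not part of bin/nutch.

# Append the job file for one site to CLASSPATH. It checks the developer
# build directory first and then the release location, in the same order
# as the edited section of the script.
add_site_job() {
  local nutch_home=$1 site=$2 f
  for f in "$nutch_home/build/nutch-$site.job" "$nutch_home/nutch-$site.job"; do
    # Only add job files that are actually present.
    if [ -f "$f" ]; then
      # Avoid a leading colon when CLASSPATH starts out empty.
      CLASSPATH="${CLASSPATH:+$CLASSPATH:}$f"
    fi
  done
}
```

Calling add_site_job "$NUTCH_HOME" site1 in place of the two loops would then
pick up nutch-site1.job, and a second crawl script could pass site2.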
On Fri, Dec 24, 2010 at 11:38 AM, Steve Cohen <[email protected]> wrote:
> Thanks for the response.
>
> I am looking at the build.xml file and I see the make job jar section of
> it.
>
> <!-- ================================================================== -->
> <!-- Make job jar                                                       -->
> <!-- ================================================================== -->
> <!--                                                                    -->
> <!-- ================================================================== -->
> <target name="job" depends="compile">
>   <jar jarfile="${build.dir}/${final.name}.job">
>     <!-- If the build.classes has the nutch config files because the jar
>          command has run, exclude them. The conf directory has them.
>     -->
>     <zipfileset dir="${build.classes}"
>                 excludes="nutch-default.xml,nutch-site.xml"/>
>     <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
>     <zipfileset dir="${lib.dir}" prefix="lib"
>                 includes="**/*.jar" excludes="hadoop-*.jar"/>
>     <zipfileset dir="${build.plugins}" prefix="plugins"/>
>   </jar>
> </target>
>
>
> Are you saying that instead of the
> "<jar jarfile="${build.dir}/${final.name}.job">" line I can make it
> something like this:
>
> "<jar jarfile="${build.dir}/crawl1.job">" and change the ${conf.dir} line
> to be the conf directory of the first crawl and then add a second
> "make job jar section" with a jar jarfile line like this
> "<jar jarfile="${build.dir}/crawl2.job">" and change the ${conf.dir} line
> to be the conf directory of the second crawl?
>
> I guess I would then put the classpath into the crawl script and specify
> which job file I am using. Would this allow me to run both crawl scripts at
> the same time, assuming I have enough system resources to do so?
>
> Thanks,
> Steve
>
> On Fri, Dec 24, 2010 at 9:52 AM, Claudio Martella <
> [email protected]> wrote:
>
>> Try checking the build file to see where it takes the files from when it
>> builds the job file.
>> When you run nutch in hadoop it's not going to use the config files in
>> NUTCH_CONF_DIR but those in the job file.
>> So you should forget about NUTCH_CONF_DIR at run time, and instead
>> build 2 job files, one with one set of configuration files and one with
>> the other.
>>
>>
>> On 12/24/10 7:40 AM, Steve Cohen wrote:
>> > Thanks for the response. I see that when I ran ant package, a nutch-1.2.job
>> > file was placed in /opt/nutch, which was what I specified in the
>> > build.properties file. You are saying that the job file was built with the
>> > default conf files that come in the apache-nutch-1.2/conf directory since I
>> > didn't change the conf files until after I built nutch.
>> >
>> > What do I need to do for a setup with two different crawls? Can I put
>> > two conf directories into the source (call the directories conf1 and conf2),
>> > build nutch with ant package, which writes the conf directories and files
>> > into nutch-1.2.job, and then specify NUTCH_CONF_DIR=conf1 for the first
>> > crawl script and specify NUTCH_CONF_DIR=conf2 for the second crawl script?
>> >
>> > On Thu, Dec 23, 2010 at 2:02 PM, Claudio Martella <
>> > [email protected]> wrote:
>> >
>> >> When you run nutch on a hadoop cluster you use the nutch.job file. The
>> >> config files are all packed in it. So you should repack the job with ant
>> >> every time you change one of the config files.
>> >>
>> >>
>> >> On 12/23/10 5:11 PM, Steve Cohen wrote:
>> >>> To run nutch fetches on multiple sites you can set a NUTCH_CONF_DIR
>> >>> environment variable, as it says in the FAQ:
>> >>>
>> >>> http://wiki.apache.org/nutch/FAQ
>> >>>
>> >>> "How can I force fetcher to use custom nutch-config?
>> >>>
>> >>> - Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
>> >>> - Copy these files from $NUTCH_HOME/conf to the new directory:
>> >>>   common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml,
>> >>>   regex-normalize.xml, regex-urlfilter.txt
>> >>> - Modify the nutch-default.xml to suit your needs
>> >>> - Set NUTCH_CONF_DIR environment variable to point into the directory
>> >>>   you created
>> >>> - Run $NUTCH_HOME/bin/nutch so that it gets the NUTCH_CONF_DIR
>> >>>   environment variable. You should check the command outputs for lines
>> >>>   where the configs are loaded, that they are really loaded from your
>> >>>   custom dir.
>> >>> - Happy using."
>> >>>
>> >>>
>> >>> This worked fine when I wasn't using a hadoop cluster. However, when I
>> >>> started a hadoop cluster, it ignored the regex-urlfilter.txt. When I
>> >>> included the regex-urlfilter.txt with the hadoop configuration files and
>> >>> restarted the cluster, the nutch fetch followed the rules. The problem
>> >>> with this is that if you are doing two different fetches with different
>> >>> regex-urlfilter.txt files, you need to stop the hdfs cluster, put in new
>> >>> configuration files, and restart the cluster.
>> >>>
>> >>> Is there any other way of setting this up?
>> >>>
>> >>> Thanks,
>> >>> Steve
>> >>>
>> >>
>> >> --
>> >> Claudio Martella
>> >> Digital Technologies
>> >> Unit Research & Development - Analyst
>> >>
>> >> TIS innovation park
>> >> Via Siemens 19 | Siemensstr. 19
>> >> 39100 Bolzano | 39100 Bozen
>> >> Tel. +39 0471 068 123
>> >> Fax +39 0471 068 129
>> >> [email protected] http://www.tis.bz.it
>> >>
>> >> Short information regarding use of personal data. According to Section 13
>> >> of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we
>> >> process your personal data in order to fulfil contractual and fiscal
>> >> obligations and also to send you information regarding our services and
>> >> events. Your personal data are processed with and without electronic means
>> >> and by respecting data subjects' rights, fundamental freedoms and dignity,
>> >> particularly with regard to confidentiality, personal identity and the right
>> >> to personal data protection. At any time and without formalities you can
>> >> write an e-mail to [email protected] in order to object the processing of
>> >> your personal data for the purpose of sending advertising materials and also
>> >> to exercise the right to access personal data and other rights referred to
>> >> in Section 7 of Decree 196/2003. The data controller is TIS Techno
>> >> Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the
>> >> complete information on the web site www.tis.bz.it.
>> >>
>> >>
>> >>