Re: How do you run multi-site nutch in a hadoop cluster?

Steve Cohen Sat, 25 Dec 2010 19:13:44 -0800

I'll answer my own question, since I got it working.

1) Grab the conf files you want to use and put them into src/nutch/conf/.
2) In build.xml, change the name of the job file in the jar jarfile line
(quoted below) so that the line is something like
<jar jarfile="${build.dir}/nutch-site1.job">.
3) Run ant and copy the job file to /opt/nutch/ or wherever you are
installing nutch.
4) Make a copy of /opt/nutch/bin/nutch called /opt/nutch/bin/nutch_site1 and
edit the nutch-*.job section like this:


if [ $IS_CORE == 0 ]
then
  for f in $NUTCH_HOME/build/nutch-site1.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done

  # for releases, add Nutch job to CLASSPATH
  for f in $NUTCH_HOME/nutch-site1.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done

Then use /opt/nutch/bin/nutch_site1 in place of /opt/nutch/bin/nutch in your
crawl script.
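
For anyone scripting step 4, here is a minimal sketch that generates the
per-site wrapper with sed instead of editing the copy by hand. It assumes
bin/nutch still contains the literal nutch-*.job globs shown above. The
function name make_site_wrapper and the "site1" label are made up for
illustration and are not part of nutch.

```shell
#!/usr/bin/env bash
# make_site_wrapper NUTCH_HOME SITE   (hypothetical helper, not part of nutch)
# Copies NUTCH_HOME/bin/nutch to NUTCH_HOME/bin/nutch_SITE, replacing each
# literal "nutch-*.job" glob with the site-specific "nutch-SITE.job".
make_site_wrapper() {
  local nutch_home="$1"
  local site="$2"
  local src="$nutch_home/bin/nutch"
  local dst="$nutch_home/bin/nutch_$site"

  # \* and \. match a literal asterisk and dot in the original script.
  sed "s/nutch-\*\.job/nutch-$site.job/g" "$src" > "$dst" || return 1

  # The copy has to be executable, like the original launcher.
  chmod +x "$dst"
}
```

For example, make_site_wrapper /opt/nutch site1 should leave you with
/opt/nutch/bin/nutch_site1. It is worth diffing the result against bin/nutch
before relying on it.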

On Fri, Dec 24, 2010 at 11:38 AM, Steve Cohen <[email protected]> wrote:

> Thanks for the response.
>
> I am looking at the build.xml file and I see the make job jar section of
> it.
>
>   <!-- ================================================================== -->
>   <!-- Make job jar                                                       -->
>   <!-- ================================================================== -->
>   <!--                                                                    -->
>   <!-- ================================================================== -->
>   <target name="job" depends="compile">
>     <jar jarfile="${build.dir}/${final.name}.job">
>       <!-- If the build.classes has the nutch config files because the jar
>            command command has run, exclude them.  The conf directory has
>            them.
>       -->
>       <zipfileset dir="${build.classes}"
>                   excludes="nutch-default.xml,nutch-site.xml"/>
>       <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
>       <zipfileset dir="${lib.dir}" prefix="lib"
>                   includes="**/*.jar" excludes="hadoop-*.jar"/>
>       <zipfileset dir="${build.plugins}" prefix="plugins"/>
>     </jar>
>   </target>
>
>
> Are you saying that instead of the
> <jar jarfile="${build.dir}/${final.name}.job"> line I can make it something
> like <jar jarfile="${build.dir}/crawl1.job"> and change the ${conf.dir} line
> to be the conf directory of the first crawl, and then add a second "make job
> jar" section with a jar jarfile line like
> <jar jarfile="${build.dir}/crawl2.job"> and change its ${conf.dir} line to
> be the conf directory of the second crawl?
>
> I guess I would then put the classpath into the crawl script and specify
> which job file I am using. Would this allow me to run both crawl scripts at
> the same time, assuming I have enough system resources to do so?
>
> Thanks,
> Steve
>
> On Fri, Dec 24, 2010 at 9:52 AM, Claudio Martella <
> [email protected]> wrote:
>
>> Check the build file and see where it takes the config files from when it
>> builds the job file.
>> When you run nutch in hadoop, it's not going to use the config files in
>> NUTCH_CONF_DIR but those in the job file.
>> So you should forget about NUTCH_CONF_DIR at run time and instead build
>> 2 job files, one with one set of configuration files and one with the
>> other.
>>
>>
>> On 12/24/10 7:40 AM, Steve Cohen wrote:
>> > Thanks for the response. I see that when I ran ant package, a
>> > nutch-1.2.job file was placed in /opt/nutch, which was what I specified
>> > in the build.properties file.  You are saying that the job file was
>> > built with the default conf files that come in the
>> > apache-nutch-1.2/conf directory, since I didn't change the conf files
>> > until after I built nutch.
>> >
>> > What do I need to do for a setup with two different crawls? Can I put
>> > two conf directories into the source (call the directories conf1 and
>> > conf2), build nutch with ant package, which writes the conf directories
>> > and files into nutch-1.2.job, and then specify NUTCH_CONF_DIR=conf1 for
>> > the first crawl script and NUTCH_CONF_DIR=conf2 for the second crawl
>> > script?
>> >
>> > On Thu, Dec 23, 2010 at 2:02 PM, Claudio Martella <
>> > [email protected]> wrote:
>> >
>> >> When you run nutch on a hadoop cluster you use the nutch.job file. The
>> >> config files are all packed in it. So you should repack the job with
>> >> ant every time you change one of the config files.
>> >>
>> >>
>> >> On 12/23/10 5:11 PM, Steve Cohen wrote:
>> >>> To run nutch fetches on multiple sites you can set a NUTCH_CONF_DIR
>> >>> environment variable, as it says in the FAQ:
>> >>>
>> >>> http://wiki.apache.org/nutch/FAQ
>> >>>
>> >>> "How can I force fetcher to use custom nutch-config?
>> >>>
>> >>>    - Create a new sub-directory under $NUTCH_HOME/conf, like
>> >>>    conf/myconfig
>> >>>    - Copy these files from $NUTCH_HOME/conf to the new directory:
>> >>>    common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml,
>> >>>    regex-normalize.xml, regex-urlfilter.txt
>> >>>    - Modify the nutch-default.xml to suit your needs
>> >>>    - Set NUTCH_CONF_DIR environment variable to point into the
>> >>>    directory you created
>> >>>    - run $NUTCH_HOME/bin/nutch so that it gets the NUTCH_CONF_DIR
>> >>>    environment variable. You should check the command outputs for
>> >>>    lines where the configs are loaded, that they are really loaded
>> >>>    from your custom dir.
>> >>>    - Happy using."
>> >>>
>> >>>
>> >>> This worked fine when I wasn't using a hadoop cluster. However, when I
>> >>> started a hadoop cluster, it ignored the regex-urlfilter.txt. When I
>> >>> included the regex-urlfilter.txt with the hadoop configuration files
>> >>> and restarted the cluster, the nutch fetch followed the rules. The
>> >>> problem with this is that if you are doing two different fetches with
>> >>> different regex-urlfilter.txts, you need to stop the hdfs cluster, put
>> >>> in new configuration files, and restart the cluster.
>> >>>
>> >>> Is there any other way of setting this up?
>> >>>
>> >>> Thanks,
>> >>> Steve
>> >>>
>> >>
>> >> --
>> >> Claudio Martella
>> >> Digital Technologies
>> >> Unit Research & Development - Analyst
>> >>
>> >> TIS innovation park
>> >> Via Siemens 19 | Siemensstr. 19
>> >> 39100 Bolzano | 39100 Bozen
>> >> Tel. +39 0471 068 123
>> >> Fax  +39 0471 068 129
>> >> [email protected] http://www.tis.bz.it
>> >>
>> >> Short information regarding use of personal data. According to Section
>> >> 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you
>> >> that we process your personal data in order to fulfil contractual and
>> >> fiscal obligations and also to send you information regarding our
>> >> services and events. Your personal data are processed with and without
>> >> electronic means and by respecting data subjects' rights, fundamental
>> >> freedoms and dignity, particularly with regard to confidentiality,
>> >> personal identity and the right to personal data protection. At any
>> >> time and without formalities you can write an e-mail to
>> >> [email protected] in order to object the processing of your personal
>> >> data for the purpose of sending advertising materials and also to
>> >> exercise the right to access personal data and other rights referred to
>> >> in Section 7 of Decree 196/2003. The data controller is TIS Techno
>> >> Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the
>> >> complete information on the web site www.tis.bz.it.
>> >>
>> >>
>> >>
>>
>>
>> --
>> Claudio Martella
>
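
An alternative to editing build.xml for each site (step 2 at the top) might
be to override the properties on the ant command line, since ant gives -D
properties precedence over those set in the build file. The sketch below
assumes the job target quoted above still uses ${final.name} and ${conf.dir},
and that build.dir is ./build under the source directory. The function
build_site_job and its arguments are hypothetical names, and this approach
was not tried in the thread.

```shell
#!/usr/bin/env bash
# build_site_job SRC_DIR SITE CONF_DIR INSTALL_DIR   (hypothetical helper)
# Builds nutch-SITE.job from SRC_DIR using the conf files in CONF_DIR, then
# copies the job file to INSTALL_DIR. Pass CONF_DIR as an absolute path.
build_site_job() {
  local src_dir="$1"
  local site="$2"
  local conf_dir="$3"
  local install_dir="$4"
  (
    # Run in a subshell so the caller's working directory is untouched.
    cd "$src_dir" || exit 1

    # Command-line -D properties win over the ones defined by the build.
    ant job "-Dfinal.name=nutch-$site" "-Dconf.dir=$conf_dir" || exit 1

    # Assumes build.dir is ./build, as the wrapper script paths suggest.
    cp "build/nutch-$site.job" "$install_dir/"
  )
}
```

Something like build_site_job ~/apache-nutch-1.2 site1 /path/to/conf-site1
/opt/nutch would then replace steps 1 to 3 for one site, where the conf path
is a placeholder. Listing the contents of the resulting job file is a quick
way to confirm that the intended regex-urlfilter.txt was packed.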
