Thanks for the response.

I am looking at the build.xml file and I see the "Make job jar" section in it:

  <!-- ================================================================== -->
  <!-- Make job jar                                                       -->
  <!-- ================================================================== -->
  <!--                                                                    -->
  <!-- ================================================================== -->
  <target name="job" depends="compile">
    <jar jarfile="${build.dir}/${final.name}.job">
      <!-- If the build.classes has the nutch config files because the jar
           command has run, exclude them.  The conf directory has them.
      -->
      <zipfileset dir="${build.classes}"
                  excludes="nutch-default.xml,nutch-site.xml"/>
      <zipfileset dir="${conf.dir}" excludes="*.template,hadoop*.*"/>
      <zipfileset dir="${lib.dir}" prefix="lib"
                  includes="**/*.jar" excludes="hadoop-*.jar"/>
      <zipfileset dir="${build.plugins}" prefix="plugins"/>
    </jar>
  </target>


Are you saying that instead of the line

  <jar jarfile="${build.dir}/${final.name}.job">

I can make it something like

  <jar jarfile="${build.dir}/crawl1.job">

and change the ${conf.dir} line to point at the first crawl's conf directory,
and then add a second "Make job jar" target with a jar line like

  <jar jarfile="${build.dir}/crawl2.job">

whose ${conf.dir} line points at the second crawl's conf directory?
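
If it helps, here is a rough sketch of the two targets I have in mind. The
target names job-crawl1/job-crawl2 and the conf1/conf2 directory names are
just placeholders for wherever the two sets of config files would actually
live; everything else is copied from the existing target:

  <target name="job-crawl1" depends="compile">
    <jar jarfile="${build.dir}/crawl1.job">
      <zipfileset dir="${build.classes}"
                  excludes="nutch-default.xml,nutch-site.xml"/>
      <!-- first crawl's config files instead of ${conf.dir} -->
      <zipfileset dir="${basedir}/conf1" excludes="*.template,hadoop*.*"/>
      <zipfileset dir="${lib.dir}" prefix="lib"
                  includes="**/*.jar" excludes="hadoop-*.jar"/>
      <zipfileset dir="${build.plugins}" prefix="plugins"/>
    </jar>
  </target>

  <target name="job-crawl2" depends="compile">
    <jar jarfile="${build.dir}/crawl2.job">
      <zipfileset dir="${build.classes}"
                  excludes="nutch-default.xml,nutch-site.xml"/>
      <!-- second crawl's config files -->
      <zipfileset dir="${basedir}/conf2" excludes="*.template,hadoop*.*"/>
      <zipfileset dir="${lib.dir}" prefix="lib"
                  includes="**/*.jar" excludes="hadoop-*.jar"/>
      <zipfileset dir="${build.plugins}" prefix="plugins"/>
    </jar>
  </target>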

I guess I would then put the classpath into the crawl script and specify
which job file I am using. Would this allow me to run both crawl scripts at
the same time, assuming I have enough system resources to do so?
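
(My assumption is that each crawl script would then submit its own job file to
Hadoop, something along the lines of "hadoop jar /opt/nutch/crawl1.job
org.apache.nutch.crawl.Crawl ..." for the first crawl and crawl2.job for the
second, but I have not tried that yet.)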

Thanks,
Steve

On Fri, Dec 24, 2010 at 9:52 AM, Claudio Martella <[email protected]> wrote:

> Try checking out the build file and see where it takes the files from when
> it builds the job file.
> When you run Nutch on Hadoop it's not going to use the config files in
> NUTCH_CONF_DIR but the ones in the job file.
> So you should forget about NUTCH_CONF_DIR at run time and instead properly
> build 2 job files, one with one set of configuration files and one with
> the other.
>
>
> On 12/24/10 7:40 AM, Steve Cohen wrote:
> > Thanks for the response. I see that when I ran ant package, a
> > nutch-1.2.job file was placed in /opt/nutch, which was what I specified
> > in the build.properties file.  You are saying that the job file was
> > built with the default conf files that come in the apache-nutch-1.2/conf
> > directory since I didn't change the conf files until after I built nutch.
> >
> > What do I need to do for a setup with two different crawls? Can I put
> > two conf directories into the source (call the directories conf1 and
> > conf2), build nutch with ant package, which writes the conf directories
> > and files into nutch-1.2.job, and then specify NUTCH_CONF_DIR=conf1 for
> > the first crawl script and NUTCH_CONF_DIR=conf2 for the second crawl
> > script?
> >
> > On Thu, Dec 23, 2010 at 2:02 PM, Claudio Martella <
> > [email protected]> wrote:
> >
> >> When you run Nutch on a Hadoop cluster you use the nutch.job file. The
> >> config files are all packed in it. So you should repack the job with ant
> >> every time you change one of the config files.
> >>
> >>
> >> On 12/23/10 5:11 PM, Steve Cohen wrote:
> >>> To run nutch fetches on multiple sites you can set the NUTCH_CONF_DIR
> >>> environment variable as it says in the FAQ:
> >>>
> >>> http://wiki.apache.org/nutch/FAQ
> >>>
> >>> " How can I force fetcher to use custom nutch-config?
> >>>
> >>>    - Create a new sub-directory under $NUTCH_HOME/conf, like
> >> conf/myconfig
> >>>    - Copy these files from $NUTCH_HOME/conf to the new directory:
> >>>    common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml,
> >>>    regex-normalize.xml, regex-urlfilter.txt
> >>>    - Modify the nutch-default.xml to suite your needs
> >>>    - Set NUTCH_CONF_DIR environment variable to point into the
> directory
> >> you
> >>>    created
> >>>    - run $NUTCH_HOME/bin/nutch so that it gets the NUTCH_CONF_DIR
> >>>    environment variable. You should check the command outputs for lines
> >> where
> >>>    the configs are loaded, that they are really loaded from your custom
> >> dir.
> >>>    - Happy using."
> >>>
> >>>
> >>> This worked fine when I wasn't using a hadoop cluster. However, when
> >>> I started a hadoop cluster, it ignored the regex-urlfilter.txt. But
> >>> when I included the regex-urlfilter.txt with the hadoop configuration
> >>> files and restarted the cluster, the nutch fetch followed the rules.
> >>> The problem with this is that if you are doing two different fetches
> >>> with different regex-urlfilter.txt files, you need to stop the hdfs
> >>> cluster, put in new configuration files, and restart the cluster.
> >>>
> >>> Is there any other way of setting this up?
> >>>
> >>> Thanks,
> >>> Steve
> >>>
> >>
> >> --
> >> Claudio Martella
> >> Digital Technologies
> >> Unit Research & Development - Analyst
> >>
> >> TIS innovation park
> >> Via Siemens 19 | Siemensstr. 19
> >> 39100 Bolzano | 39100 Bozen
> >> Tel. +39 0471 068 123
> >> Fax  +39 0471 068 129
> >> [email protected] http://www.tis.bz.it
> >>
>
>
> --
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> [email protected] http://www.tis.bz.it
>
