Hi Jiuling,

It should suffice to recompile! You don't have to unpack your job.



I start the job with the command

'runtime/deploy/bin/nutch crawl your_seeds_dir -depth 1'

which does nothing other than call

'hadoop jar apache-nutch-1.5.1.job ....'!

That should suffice.


To make the plugins accessible from within the job, the parameter

<property>
  <name>plugin.folders</name>
  <!-- value>plugins</value -->
  <value>classes/plugins</value>
</property>

might need to be adjusted as in the example above.


Please check the structure of the plugins directory
in your job.
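
One way to check that layout without unpacking the job by hand is to read the .job file as a zip archive. This is only an illustrative sketch (the `JobLayout` class name and the default job path are my assumptions, not part of Nutch):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class JobLayout {

  // Lists every entry under 'classes/plugins/' in a .job (zip) file,
  // i.e. the layout that plugin.folders=classes/plugins expects.
  public static List<String> pluginEntries(String jobPath) throws Exception {
    List<String> found = new ArrayList<>();
    try (ZipFile zip = new ZipFile(jobPath)) {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        String name = entries.nextElement().getName();
        if (name.startsWith("classes/plugins/")) {
          found.add(name);
        }
      }
    }
    return found;
  }

  public static void main(String[] args) throws Exception {
    // The path is an assumption; point it at your own build's job file.
    String path = args.length > 0 ? args[0]
        : "runtime/deploy/apache-nutch-1.5.1.job";
    if (!new File(path).isFile()) {
      System.out.println("no job file at " + path);
      return;
    }
    for (String name : pluginEntries(path)) {
      System.out.println(name);
    }
  }
}
```

If the listing shows the plugin folders under a different prefix than the one configured in plugin.folders, that mismatch is the first thing to fix.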



I made one further modification, which arose from the need to set
Hadoop parameters for the jobs.


I modified class ./src/java/org/apache/nutch/util/NutchJob.java to



import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class NutchJob extends JobConf {
  public NutchJob(Configuration conf) {
    super(conf, NutchJob.class);
    checkMyOpts();
  }

  /** Copies key=value pairs from MY_CRAWLER_OPTS into the job configuration. */
  public void checkMyOpts() {
    Map<String, String> env = System.getenv();
    String myOpts = env.get("MY_CRAWLER_OPTS");
    if (null != myOpts) {
      for (String opt : myOpts.split(" ")) {
        String[] keyval = opt.split("=");
        // Keep only well-formed key=value pairs.
        if (keyval.length == 2) {
          set(keyval[0], keyval[1]);
        }
      }
    }
  }
}


so that Hadoop parameters for the jobs can be set from the command line,
because I had problems with the default settings for the Hadoop child
processes.
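
For reference, the parsing in checkMyOpts() can be exercised on its own,
outside of Hadoop. This is a minimal standalone sketch of the same split
logic (the `MyOptsDemo` class and `parse` method names are only illustrative;
note that values containing '=' are silently skipped, as in the code above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MyOptsDemo {

  // Same parsing as checkMyOpts(): split on spaces, then on '=',
  // and keep only well-formed key=value pairs.
  static Map<String, String> parse(String myOpts) {
    Map<String, String> opts = new LinkedHashMap<>();
    if (myOpts == null) {
      return opts;
    }
    for (String token : myOpts.split(" ")) {
      String[] keyval = token.split("=");
      if (keyval.length == 2) {
        opts.put(keyval[0], keyval[1]);
      }
    }
    return opts;
  }

  public static void main(String[] args) {
    Map<String, String> opts =
        parse("mapreduce.map.java.opts=-Xmx4096m mapreduce.job.maps=21");
    System.out.println(opts.get("mapreduce.map.java.opts")); // -Xmx4096m
    System.out.println(opts.get("mapreduce.job.maps"));      // 21
  }
}
```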


If you add the code above, you can set an environment variable like


export MY_CRAWLER_OPTS="mapreduce.map.java.opts=-Xmx4096m
mapreduce.reduce.java.opts=-Xmx4096m mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=4096 mapreduce.job.maps=21
mapreduce.job.reduces=21"


which sets the Java parameter -Xmx of the YarnChild processes to 4 GB
and requests 21 maps and 21 reduces for the crawl.

These variables become important when, for example, you want to generate
a Nutch webgraph and the Hadoop default settings are chosen for
'normally' sized jobs.


Please note that if Hadoop unpacks the job, the container must have at
least enough space for the unpacked files and enough memory to load
the jars into the JVMs of the child processes.


Hope this helps!




Cheers, Walter



On 18.09.2012 04:39, jiuling wrote:
> Dir Walter:
> 
>     I am sorry for I want your more help. 
> 
>      I have update the corresponding java and recompiled. At the first step,
> I do not unpack the job and directly excute hadoop jar *.job ..., it still
> not work. 
>     Finally, I unpacked the job, but don't known how to compile the command?
> Can you help me for more information  about "Something one can do, is to
> unpack the job in the Nodemanager manually 
> and to load the classes from within the code into the current 
> classloader. "?
> 
>     Thank you a lot.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/problem-running-Nutch-1-5-1-in-distributed-mode-simple-crawl-tp4008073p4008512.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


-- 

--------------------------------
Walter Tietze
Senior Softwareengineer
Research

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T +49.30 24627 318
F +49.30 24627 120

[email protected]
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung:
Thomas Kitlitschko
--------------------------------
