Hi Ameer,

(bringing this back to user@nutch - sorry, I hit the wrong reply to)

> So, does that mean we do not have the standalone mode anymore as it used to 
> be in the past?

Nutch has been based on Hadoop from the beginning, and "local" mode is an 
emulated Hadoop system running in a single process/JVM.
There has been no change to this behavior in recent Nutch versions.

> Any thoughts on getting back the old behavior with no jobs being created in 
> the /tmp directory?

The issues with the /tmp directory have always been present in local mode, see
  http://lucene.472066.n3.nabble.com/tmp-folder-problem-td4008834.html

In local mode, you can change the temporary folder used by Hadoop via the Java
option
  -Dhadoop.tmp.dir

With bin/nutch or bin/crawl this is done by setting the environment variable 
NUTCH_OPTS:

  export NUTCH_OPTS=-Dhadoop.tmp.dir=/my/nutch/tmpdir

Then all temporary data is written to /my/nutch/tmpdir, but you're still 
responsible for cleaning up this folder.
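Putting the pieces together, a per-crawl wrapper could look like the sketch 
below. The directory name and the commented-out crawl invocation are 
placeholders, not taken from the original mail:

```shell
#!/bin/sh
# Sketch: point Hadoop's temp dir at a dedicated location and remove it
# after the crawl cycle, so job data cannot accumulate under /tmp.
NUTCH_TMP="${TMPDIR:-/tmp}/nutch-tmpdir"    # example location, adjust to taste
mkdir -p "$NUTCH_TMP"
export NUTCH_OPTS="-Dhadoop.tmp.dir=$NUTCH_TMP"
# bin/crawl -i -s urls/ crawldir/ 2         # placeholder crawl invocation
rm -rf "$NUTCH_TMP"                         # reclaim the disk space afterwards
```

This keeps the clean-up in one place instead of relying on the system's /tmp 
housekeeping.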


> It confuses me to see these messages

You can suppress them by removing the following lines from
conf/log4j.properties (or by changing the level from INFO to WARN):

# log mapreduce job messages and counters
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO

However, these messages are really useful for debugging, especially the job 
counters.
See https://issues.apache.org/jira/browse/NUTCH-2519


Best,
Sebastian



On 2/19/19 11:01 PM, Ameer Tawfik wrote:
> Thanks Sebastian for the reply.
> 
> So, does that mean we do not have the standalone mode anymore as it used to 
> be in the past? It confuses
> me to see these messages:
> 
>  The url to track the job: http://localhost:8080/
> 2019-02-20 04:48:08,156 INFO  mapreduce.Job - Running job: 
> job_local2035597620_0001
> 2019-02-20 04:48:09,159 INFO  mapreduce.Job - Job job_local2035597620_0001 
> running in uber mode :
> false
> 2019-02-20 04:48:09,161 INFO  mapreduce.Job -  map 0% reduce 100%
> 2019-02-20 04:48:09,163 INFO  mapreduce.Job - Job job_local2035597620_0001 
> completed successfully
> 2019-02-20 04:48:09,194 INFO  mapreduce.Job - Counters: 24
> 
> In addition, it starts to create problems as these jobs accumulate in
> the /tmp/hadoop-ameer/mapred/local/localRunner/ameer/jobcache/ directory 
> and eat up the
> hard disk space. Any thoughts on getting back the old behavior with no jobs 
> being created in the
> /tmp directory? It also seems slow to me.
> 
> Regards
> Ameer
> 
> 
> 
> On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <wastl.na...@googlemail.com
> <mailto:wastl.na...@googlemail.com>> wrote:
> 
>     Hi Ameer,
> 
>     yes, you're correct.  If launched by
>       runtime/local/bin/nutch
>     or
>       runtime/local/bin/crawl
>     Nutch runs in "local" mode - Hadoop is "emulated" by running HDFS, job 
> and task clients
>     in a single process (JVM).
> 
>     The other options are:
>      - pseudo-distributed mode: HDFS namenode and datanode, job and task 
> clients
>        as multiple processes on a single node
>      - fully distributed mode: multiple processes on multiple nodes
> 
>     Best,
>     Sebastian
> 
> 
> 
>     On 2/19/19 7:03 PM, atawfik wrote:
>     > Hi all,
>     >
>     > I downloaded Nutch 1.15 and built using *ant runtime*. When I issue the
>     > following crawl command from *runtime/local*
>     >
>     > 
>     >
>     > Nutch generates Hadoop jobs and Hadoop single-node logs. See the 
> content of
>     > the *hadoop.log* file below:
>     >
>     >
>     >
>     > If I understand right, it seems that nutch is running in a SingleNode 
> mode.
>     > We are not running Nutch in a cluster. We are just running locally.
>     >
>     > Please correct me if I misunderstood anything.
>     >
>     > Regards
>     > Ameer
>     >
>     >
>     >
>     >
>     > --
>     > Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
>     >
> 
