Hi Ameer, (bringing this back to user@nutch - sorry, I hit the wrong reply-to)

> So, does that mean we do not have the standalone mode anymore as it used
> to be in the past

Nutch has been based on Hadoop since the beginning, and "local" mode is an
emulated Hadoop system running in a single process/JVM. There has been no
change to this behavior in recent Nutch versions.

> Any thoughts on getting back the old behavior with no jobs being created
> in the *tmp* directory.

The issues with the /tmp directory have always been there in local mode, see
http://lucene.472066.n3.nabble.com/tmp-folder-problem-td4008834.html

In local mode, you can change the temporary folder used by Hadoop via the
Java option

  -Dhadoop.tmp.dir

With bin/nutch or bin/crawl this is done by setting the environment
variable NUTCH_OPTS:

  export NUTCH_OPTS=-Dhadoop.tmp.dir=/my/nutch/tmpdir

Then all temporary data is written to /my/nutch/tmpdir, but you're still
responsible for cleaning up this folder.

> It confuses me to see these messages

You can suppress them by removing the following lines in
conf/log4j.properties:

  # log mapreduce job messages and counters
  log4j.logger.org.apache.hadoop.mapreduce.Job=INFO

However, for debugging these messages are really useful, especially the
job counters. See https://issues.apache.org/jira/browse/NUTCH-2519

Best,
Sebastian

On 2/19/19 11:01 PM, Ameer Tawfik wrote:
> Thanks Sebastian for the reply.
>
> So, does that mean we do not have the standalone mode anymore as it used
> to be in the past. It confuses me to see these messages
>
> The url to track the job: http://localhost:8080/
> 2019-02-20 04:48:08,156 INFO mapreduce.Job - Running job: job_local2035597620_0001
> 2019-02-20 04:48:09,159 INFO mapreduce.Job - Job job_local2035597620_0001 running in uber mode : false
> 2019-02-20 04:48:09,161 INFO mapreduce.Job -  map 0% reduce 100%
> 2019-02-20 04:48:09,163 INFO mapreduce.Job - Job job_local2035597620_0001 completed successfully
> 2019-02-20 04:48:09,194 INFO mapreduce.Job - Counters: 24
>
> In addition, it starts to create problems as these jobs accumulate in the
> /tmp/hadoop-ameer/mapred/local/localRunner/ameer/jobcache/ directory and
> eat up the harddisk space. Any thoughts on getting back the old behavior
> with no jobs being created in the *tmp* directory. It also seems slow to me.
>
> Regards
> Ameer
>
> On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel
> <wastl.na...@googlemail.com> wrote:
>
> > Hi Ameer,
> >
> > yes, you're correct. If launched by
> >   runtime/local/bin/nutch
> > resp.
> >   runtime/local/bin/crawl
> > Nutch runs in "local" mode - Hadoop is "emulated" running HDFS, job and
> > task clients in a single process (JVM).
> >
> > The other options are:
> > - pseudo-distributed mode: HDFS namenode and datanode, job and task
> >   clients as multiple processes on a single node
> > - fully distributed mode: multiple processes on multiple nodes
> >
> > Best,
> > Sebastian
> >
> > On 2/19/19 7:03 PM, atawfik wrote:
> > > Hi all,
> > >
> > > I downloaded Nutch 1.15 and built it using *ant runtime*. When I issue
> > > the following crawl command from *runtime/local*
> > >
> > > Nutch generates hadoop jobs and hadoop single node logs. See the
> > > content of the *hadoop.log* file below:
> > >
> > > If I understand right, it seems that nutch is running in a SingleNode
> > > mode. We are not running Nutch in a cluster. We are just running
> > > locally.
> > >
> > > Please correct me if I misunderstood anything.
> > >
> > > Regards
> > > Ameer
> > >
> > > --
> > > Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
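
[Editor's note] For anyone hitting the same /tmp growth, the temp-dir
workaround described in the reply can be sketched as a short shell session.
The directory path and the bin/crawl arguments are illustrative assumptions,
not taken from this thread:

```shell
# Dedicated temp dir for Hadoop's local-mode job data
# (path is a placeholder -- substitute your own).
NUTCH_TMP=/my/nutch/tmpdir

# bin/nutch and bin/crawl pass NUTCH_OPTS through to the JVM, so
# hadoop.tmp.dir ends up pointing at this directory instead of /tmp:
export NUTCH_OPTS="-Dhadoop.tmp.dir=$NUTCH_TMP"

# Run the crawl as usual (arguments here are illustrative):
# bin/crawl -i -s urls/ crawldb 2

# Hadoop's local job runner does not clean up after itself, so remove
# the accumulated job data once the crawl is done:
rm -rf "$NUTCH_TMP"/mapred
```

Note that the clean-up step deletes intermediate job data only; crawl
output (crawldb, segments, etc.) lives wherever you pointed it and is not
affected.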