Thanks for your reply, Sebastian. I asked this question for the following
reasons:

* We were running the crawl script using nohup and redirected the output to
a local log file. In some weird/rare scenario (maybe our master node went
down at that time, I am not sure), the log file stopped growing but the
Nutch process was still running, so we could not see what Nutch was doing.

* I see that the Nutch code logs via log4j, so I would expect it all to go
to a log4j-rotated log file instead of just the console. The same works
well in local mode. Can you please explain why in distributed mode it
writes only to the console and not to a file?

* It also puzzles me that the running process shows
"-Dhadoop.root.logger=INFO,console" even though I changed
conf/log4j.properties to "log4j.rootLogger=INFO,DRFA". My guess at what is
going on, and a sketch of a workaround, follows below.

Thanks
Srini

On Tue, Aug 1, 2017 at 7:51 AM, Sebastian Nagel <[email protected]>
wrote:

> Hi Srini,
>
> > I am referring to the INFO messages that are printed in console when nutch
> > 1.14 is running in distributed mode. For example
>
> Afaics, the only way to get the logs of the job client is to redirect the
> console output to a file,
> e.g.,
>
> /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb
> seed.txt &>inject.log
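>
> If the crawl runs in the background via nohup, the same redirection should
> still capture the client output, e.g. (untested sketch):
>
>   nohup /mnt/nutch/runtime/deploy/bin/nutch inject \
>     /user/hadoop/crawlDIR/crawldb seed.txt &>inject.log &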
>
> > I am running Nutch from an EMR cluster.
>
> If you're interested in the logs of task attempts, see:
>
> http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
>
>
> Sebastian
>
> On 07/29/2017 09:38 AM, Srinivasan Ramaswamy wrote:
> > Hi Sebastin
> >
> > I am referring to the INFO messages that are printed in console when nutch
> > 1.14 is running in distributed mode. For example
> >
> > Injecting seed URLs
> > /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb
> > seed.txt
> > 17/07/29 06:51:18 INFO crawl.Injector: Injector: starting at 2017-07-29
> > 06:51:18
> > 17/07/29 06:51:18 INFO crawl.Injector: Injector: crawlDb:
> > /user/hadoop/crawlDIR/crawldb
> > 17/07/29 06:51:18 INFO crawl.Injector: Injector: urlDir: seed.txt
> > 17/07/29 06:51:18 INFO crawl.Injector: Injector: Converting injected urls
> > to crawl db entries.
> > 17/07/29 06:51:19 INFO client.RMProxy: Connecting to ResourceManager at
> > ip-*-*-*-*.ec2.internal/*.*.*.*:8032
> > 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process : 0
> > 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process : 1
> > .
> > .
> > 17/07/29 06:51:20 INFO mapreduce.Job: Running job: job_1500749038440_0003
> > 17/07/29 06:51:28 INFO mapreduce.Job: Job job_1500749038440_0003 running in
> > uber mode : false
> > 17/07/29 06:51:28 INFO mapreduce.Job:  map 0% reduce 0%
> > 17/07/29 06:51:33 INFO mapreduce.Job:  map 100% reduce 0%
> > 17/07/29 06:51:38 INFO mapreduce.Job:  map 100% reduce 4%
> > 17/07/29 06:51:40 INFO mapreduce.Job:  map 100% reduce 6%
> > 17/07/29 06:51:41 INFO mapreduce.Job:  map 100% reduce 49%
> > 17/07/29 06:51:42 INFO mapreduce.Job:  map 100% reduce 66%
> > 17/07/29 06:51:43 INFO mapreduce.Job:  map 100% reduce 87%
> > 17/07/29 06:51:44 INFO mapreduce.Job:  map 100% reduce 100%
> >
> > I am running Nutch from an EMR cluster. I did check around the log
> > directories and I don't see the messages from the console anywhere else.
> >
> > One more thing I noticed: when I issue the command
> >
> > *ps -ef | grep nutch*
> >
> > hadoop    21616  18344  2 06:59 pts/1    00:00:09
> > /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java -Xmx1000m -server
> > -XX:OnOutOfMemoryError=kill -9 %p *-Dhadoop.log.dir=/usr/lib/hadoop/logs*
> > *-Dhadoop.log.file=hadoop.log* -Dhadoop.home.dir=/usr/lib/hadoop
> > -Dhadoop.id.str= *-Dhadoop.root.logger=INFO,console*
> > -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native
> > -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
> > -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30
> > org.apache.hadoop.util.RunJar
> > /mnt/nutch/runtime/deploy/apache-nutch-1.14-SNAPSHOT.job
> > org.apache.nutch.fetcher.Fetcher -D mapreduce.map.java.opts=-Xmx2304m -D
> > mapreduce.map.memory.mb=2880 -D mapreduce.reduce.java.opts=-Xmx4608m -D
> > mapreduce.reduce.memory.mb=5760 -D mapreduce.job.reduces=12 -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> > /user/hadoop/crawlDIR/segments/20170729065841 -noParsing -threads 100
> >
> > The logger mentioned in the running process is console. How do I change it
> > to the log file rotated by log4j?
> >
> > I tried modifying the conf/log4j.properties file to use the DRFA appender
> > instead of the cmdstdout logger, but that did not help either; see the
> > sketch of what I tried below.
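> >
> > For reference, here is roughly the appender setup I tried, a minimal sketch
> > modeled on the DRFA definition shipped in Hadoop's stock log4j.properties
> > (it assumes hadoop.log.dir and hadoop.log.file are set by the launcher):
> >
> >   log4j.rootLogger=INFO,DRFA
> >   # Daily rolling file appender, writing to ${hadoop.log.dir}/${hadoop.log.file}
> >   log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
> >   log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
> >   # Roll the file over at midnight
> >   log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
> >   log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
> >   log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n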
> >
> > Any help would be appreciated.
> >
> > Thanks
> > Srini
> >
> > On Mon, Jul 24, 2017 at 12:52 AM, Sebastian Nagel <
> > [email protected]> wrote:
> >
> >> Hi Srini,
> >>
> >> in distributed mode the bulk of Nutch's log output is kept in the Hadoop
> >> task logs. Whether, for how long, and where these logs are kept depends on
> >> the configuration of your Hadoop cluster.  You can easily find tutorials
> >> and examples of how to configure this if you google for "hadoop task logs".
> >>
> >> Be careful, the Nutch logs are usually huge.  The easiest way to get them
> >> for a job is to run the following command on the master node:
> >>
> >>   yarn logs -applicationId <app_id>
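> >>
> >> For example, redirecting the output to a file (the application id below is
> >> illustrative; it is the MapReduce job id with the job_ prefix replaced by
> >> application_):
> >>
> >>   yarn logs -applicationId application_1500749038440_0003 > nutch-tasks.log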
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 07/21/2017 10:09 PM, Srinivasan Ramaswamy wrote:
> >>> Hi
> >>>
> >>> I am running Nutch in distributed mode. I would like to see all Nutch logs
> >>> written to files. I only see the console output. Can I see the same
> >>> information logged to some log files?
> >>>
> >>> When I run Nutch in local mode I do see the logs in the runtime/local/logs
> >>> directory. But when I run Nutch in distributed mode, I don't see them
> >>> anywhere except the console.
> >>>
> >>> Can anyone help me with the settings that I need to change?
> >>>
> >>> Thanks
> >>> Srini
> >>>
> >>
> >>
> >
>
>
