Thanks for your reply, Sebastian. I asked this question for the following reasons:
* We were running the crawl script under nohup and redirected its output to
a local log file. In some weird/rare scenario (maybe our master node went
down at that time, I am not sure), the log file stopped growing but the
Nutch process was still running, and we could not see what Nutch was doing.
(A sketch of how we launch it is at the bottom of this mail, below the
quoted thread.)
* I see that the Nutch code uses log4j to log, so I would expect it all to
go to a log4j-rotated log file instead of just the console. The same works
well in local mode. Can you please explain why it writes only to the
console and not to a file?
* It also puzzles me why the running process shows
"-Dhadoop.root.logger=INFO,console" even though I changed
conf/log4j.properties to "log4j.rootLogger=INFO,DRFA" (see the sketch
right below).
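
For reference, here is roughly what my conf/log4j.properties change looks
like. The DRFA appender definition itself is the stock one shipped with
Nutch, so treat the exact property values below as approximate (quoted
from memory):

  log4j.rootLogger=INFO,DRFA

  # Daily rolling file appender; the target file comes from the
  # hadoop.log.dir / hadoop.log.file system properties set by the launcher
  log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
  log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
  log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
  log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
  log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n

Even with this in place, the distributed-mode client still writes only to
the console.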
Thanks
Srini

On Tue, Aug 1, 2017 at 7:51 AM, Sebastian Nagel <[email protected]> wrote:

> Hi Srini,
>
> > I am referring to the INFO messages that are printed in the console when
> > nutch 1.14 is running in distributed mode. For example
>
> Afaics, the only way to get the logs of the job client is to redirect the
> console output to a file, e.g.,
>
>   /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt &>inject.log
>
> > I am running nutch from an EMR cluster.
>
> If you're interested in the logs of task attempts, see:
>
> http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
>
> Sebastian
>
> On 07/29/2017 09:38 AM, Srinivasan Ramaswamy wrote:
> > Hi Sebastian
> >
> > I am referring to the INFO messages that are printed in the console when
> > nutch 1.14 is running in distributed mode. For example:
> >
> > Injecting seed URLs
> > /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt
> > 17/07/29 06:51:18 INFO crawl.Injector: Injector: starting at 2017-07-29 06:51:18
> > 17/07/29 06:51:18 INFO crawl.Injector: Injector: crawlDb: /user/hadoop/crawlDIR/crawldb
> > 17/07/29 06:51:18 INFO crawl.Injector: Injector: urlDir: seed.txt
> > 17/07/29 06:51:18 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
> > 17/07/29 06:51:19 INFO client.RMProxy: Connecting to ResourceManager at ip-*-*-*-*.ec2.internal/*.*.*.*:8032
> > 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process : 0
> > 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process : 1
> > .
> > .
> > 17/07/29 06:51:20 INFO mapreduce.Job: Running job: job_1500749038440_0003
> > 17/07/29 06:51:28 INFO mapreduce.Job: Job job_1500749038440_0003 running in uber mode : false
> > 17/07/29 06:51:28 INFO mapreduce.Job: map 0% reduce 0%
> > 17/07/29 06:51:33 INFO mapreduce.Job: map 100% reduce 0%
> > 17/07/29 06:51:38 INFO mapreduce.Job: map 100% reduce 4%
> > 17/07/29 06:51:40 INFO mapreduce.Job: map 100% reduce 6%
> > 17/07/29 06:51:41 INFO mapreduce.Job: map 100% reduce 49%
> > 17/07/29 06:51:42 INFO mapreduce.Job: map 100% reduce 66%
> > 17/07/29 06:51:43 INFO mapreduce.Job: map 100% reduce 87%
> > 17/07/29 06:51:44 INFO mapreduce.Job: map 100% reduce 100%
> >
> > I am running nutch from an EMR cluster. I did check around the log
> > directories and I don't see the messages I see in the console anywhere else.
> >
> > One more thing I noticed is when I issue the command
> >
> >   *ps -ef | grep nutch*
> >
> > hadoop 21616 18344 2 06:59 pts/1 00:00:09 /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java
> > -Xmx1000m -server -XX:OnOutOfMemoryError=kill -9 %p *-Dhadoop.log.dir=/usr/lib/hadoop/logs*
> > *-Dhadoop.log.file=hadoop.log* -Dhadoop.home.dir=/usr/lib/hadoop
> > -Dhadoop.id.str= *-Dhadoop.root.logger=INFO,console*
> > -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native
> > -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
> > -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30
> > org.apache.hadoop.util.RunJar /mnt/nutch/runtime/deploy/apache-nutch-1.14-SNAPSHOT.job
> > org.apache.nutch.fetcher.Fetcher -D mapreduce.map.java.opts=-Xmx2304m
> > -D mapreduce.map.memory.mb=2880 -D mapreduce.reduce.java.opts=-Xmx4608m
> > -D mapreduce.reduce.memory.mb=5760 -D mapreduce.job.reduces=12
> > -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false
> > -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> > /user/hadoop/crawlDIR/segments/20170729065841 -noParsing -threads 100
> >
> > The logger mentioned in the running process is "console". How do I change
> > it to the log file rotated by log4j?
> >
> > I tried modifying the conf/log4j.properties file to use DRFA instead
> > of the cmdstdout logger, but that did not help either.
> >
> > Any help would be appreciated.
> >
> > Thanks
> > Srini
> >
> > On Mon, Jul 24, 2017 at 12:52 AM, Sebastian Nagel <[email protected]> wrote:
> >
> >> Hi Srini,
> >>
> >> in distributed mode the bulk of Nutch's log output is kept in the Hadoop
> >> task logs. Whether, how long, and where these logs are kept depends on
> >> the configuration of your Hadoop cluster. You can easily find tutorials
> >> and examples of how to configure this if you google for "hadoop task logs".
> >>
> >> Be careful: the Nutch logs are usually huge. The easiest way to get them
> >> for a job is to run the following command on the master node:
> >>
> >>   yarn logs -applicationId <app_id>
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 07/21/2017 10:09 PM, Srinivasan Ramaswamy wrote:
> >>> Hi
> >>>
> >>> I am running Nutch in distributed mode. I would like to see all Nutch
> >>> logs written to files. I only see the console output. Can I see the
> >>> same information logged to some log files?
> >>>
> >>> When I run Nutch in local mode I do see the logs in the
> >>> runtime/local/logs directory. But when I run Nutch in distributed mode,
> >>> I don't see it anywhere except the console.
> >>>
> >>> Can anyone help me with the settings that I need to change?
> >>>
> >>> Thanks
> >>> Srini
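
P.S. For completeness, the way we launch things under nohup (my first
bullet above) follows this pattern, shown here with the inject step as an
example:

  nohup /mnt/nutch/runtime/deploy/bin/nutch inject \
      /user/hadoop/crawlDIR/crawldb seed.txt &> inject.log &

Both stdout and stderr go to the local file; in the failure case I
described, that file simply stopped growing while the Nutch process kept
running.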

