Hi Srini,

> I am referring to the INFO messages that are printed to the console when
> Nutch 1.14 is running in distributed mode. For example

Afaics, the only way to get the logs of the job client is to redirect the
console output to a file, e.g.:

  /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt &> inject.log
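If you want to keep the output in a file and still watch it scroll by, piping
through tee works as well (a minimal sketch, not a Nutch-provided wrapper; the
log file name is just an example):

  # capture stdout and stderr of the job client while still printing to the terminal
  /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt 2>&1 \
    | tee inject-$(date +%Y%m%d-%H%M%S).log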
> I am running Nutch from an EMR cluster.

If you're interested in the logs of task attempts, see:
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html

Sebastian

On 07/29/2017 09:38 AM, Srinivasan Ramaswamy wrote:
> Hi Sebastian,
>
> I am referring to the INFO messages that are printed to the console when
> Nutch 1.14 is running in distributed mode. For example:
>
> Injecting seed URLs
> /mnt/nutch/runtime/deploy/bin/nutch inject /user/hadoop/crawlDIR/crawldb seed.txt
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: starting at 2017-07-29 06:51:18
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: crawlDb: /user/hadoop/crawlDIR/crawldb
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: urlDir: seed.txt
> 17/07/29 06:51:18 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
> 17/07/29 06:51:19 INFO client.RMProxy: Connecting to ResourceManager at ip-*-*-*-*.ec2.internal/*.*.*.*:8032
> 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process : 0
> 17/07/29 06:51:20 INFO input.FileInputFormat: Total input paths to process : 1
> .
> .
> 17/07/29 06:51:20 INFO mapreduce.Job: Running job: job_1500749038440_0003
> 17/07/29 06:51:28 INFO mapreduce.Job: Job job_1500749038440_0003 running in uber mode : false
> 17/07/29 06:51:28 INFO mapreduce.Job:  map 0% reduce 0%
> 17/07/29 06:51:33 INFO mapreduce.Job:  map 100% reduce 0%
> 17/07/29 06:51:38 INFO mapreduce.Job:  map 100% reduce 4%
> 17/07/29 06:51:40 INFO mapreduce.Job:  map 100% reduce 6%
> 17/07/29 06:51:41 INFO mapreduce.Job:  map 100% reduce 49%
> 17/07/29 06:51:42 INFO mapreduce.Job:  map 100% reduce 66%
> 17/07/29 06:51:43 INFO mapreduce.Job:  map 100% reduce 87%
> 17/07/29 06:51:44 INFO mapreduce.Job:  map 100% reduce 100%
>
> I am running Nutch from an EMR cluster. I did check around the log
> directories and I don't see the messages I see in the console anywhere else.
>
> One more thing I noticed is when I issue the command
>
>   ps -ef | grep nutch
>
> hadoop 21616 18344 2 06:59 pts/1 00:00:09 /usr/lib/jvm/java-1.8.0-openjdk.x86_64/bin/java
> -Xmx1000m -server -XX:OnOutOfMemoryError=kill -9 %p -Dhadoop.log.dir=/usr/lib/hadoop/logs
> -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
> -Dhadoop.root.logger=INFO,console
> -Djava.library.path=:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native
> -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
> -Dhadoop.security.logger=INFO,NullAppender -Dsun.net.inetaddr.ttl=30
> org.apache.hadoop.util.RunJar /mnt/nutch/runtime/deploy/apache-nutch-1.14-SNAPSHOT.job
> org.apache.nutch.fetcher.Fetcher -D mapreduce.map.java.opts=-Xmx2304m -D mapreduce.map.memory.mb=2880
> -D mapreduce.reduce.java.opts=-Xmx4608m -D mapreduce.reduce.memory.mb=5760 -D mapreduce.job.reduces=12
> -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false
> -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> /user/hadoop/crawlDIR/segments/20170729065841 -noParsing -threads 100
>
> The logger mentioned in the running process is "console" (-Dhadoop.root.logger=INFO,console).
> How do I change it to the log file rotated by log4j?
>
> I tried modifying the conf/log4j.properties file to use the DRFA appender instead
> of the cmdstdout logger, but that did not help either.
>
> Any help would be appreciated.
>
> Thanks
> Srini
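(For reference, the kind of change described above would look roughly like the
following in conf/log4j.properties. This is only a sketch, assuming the stock
log4j 1.x setup where DRFA is a DailyRollingFileAppender and the file location
is taken from the hadoop.log.dir/hadoop.log.file system properties visible in
the ps output above.)

  # route the root logger to the rolling-file appender instead of stdout
  log4j.rootLogger=INFO,DRFA
  # daily-rolling file appender; file location comes from hadoop.log.dir/hadoop.log.file
  log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
  log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
  log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
  log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
  log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n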
>
> On Mon, Jul 24, 2017 at 12:52 AM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi Srini,
>>
>> in distributed mode the bulk of Nutch's log output is kept in the Hadoop task logs.
>> The configuration whether, how long and where these logs are kept depends on the
>> configuration of your Hadoop cluster. You can easily find tutorials and examples
>> of how to configure this if you google for "hadoop task logs".
>>
>> Be careful, the Nutch logs are usually huge. The easiest way to get them for a job
>> is to run the following command on the master node:
>>
>>   yarn logs -applicationId <app_id>
>>
>> Best,
>> Sebastian
>>
>> On 07/21/2017 10:09 PM, Srinivasan Ramaswamy wrote:
>>> Hi,
>>>
>>> I am running Nutch in distributed mode. I would like to see all Nutch logs
>>> written to files. I only see the console output. Can I see the same
>>> information logged to some log files?
>>>
>>> When I run Nutch in local mode I do see the logs in the runtime/local/logs
>>> directory. But when I run Nutch in distributed mode, I don't see it anywhere
>>> except the console.
>>>
>>> Can anyone help me with the settings that I need to change?
>>>
>>> Thanks
>>> Srini
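For reference, retrieving the task logs with the command mentioned above usually
looks like this on the master node (it requires YARN log aggregation to be
enabled; the application id below is inferred from the job id shown earlier in
the console output, so treat it as an example):

  # find the YARN application id of the Nutch job
  yarn application -list -appStates FINISHED
  # dump all container logs of that application into a local file
  yarn logs -applicationId application_1500749038440_0003 > nutch-job-logs.txt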

