Add some logging to your Map task and retry?
St.Ack
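For reference, "adding logging" to a map task normally just means writing log lines that end up in the task attempt's syslog file shown below. A minimal sketch of where such calls would go, using plain java.util.logging and a hypothetical record-counting class instead of the real Hadoop 0.20 Mapper API (the class, method, and messages are illustrative, not from the original job):

```java
import java.util.logging.Logger;

// Stand-in for a map task. The real code would extend the Hadoop
// Mapper class; this only illustrates where the log calls would go.
public class LoggingMapSketch {
    private static final Logger LOG =
            Logger.getLogger(LoggingMapSketch.class.getName());
    private long records = 0;

    // Hypothetical per-record hook, analogous to map(key, value, ...).
    // Logging every record would be too chatty, so log every 1000th.
    public void map(String value) {
        records++;
        if (records % 1000 == 0) {
            LOG.info("processed " + records + " records");
        }
    }

    // Hypothetical end-of-task hook. A line logged here shows whether
    // the task ever reached its shutdown path.
    public long finish() {
        LOG.info("map task finishing after " + records + " records");
        return records;
    }

    public static void main(String[] args) {
        LoggingMapSketch m = new LoggingMapSketch();
        for (int i = 0; i < 2500; i++) {
            m.map("row-" + i);
        }
        System.out.println(m.finish());
    }
}
```

If the "finishing" line appears in syslog but the Child JVM still lingers, the map code itself completed and the problem is after task commit.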
On Tue, Dec 7, 2010 at 10:28 PM, Hari Sreekumar <[email protected]> wrote:
> Hi Stack,
>
> The logs don't show anything nasty. E.g., I ran a job which spawned
> 5 mappers. All of the Child processes spawned by them remained even after
> the job completed. 3 map tasks got completed, and they have the following
> log:
>
> *stdout logs*
>
> ------------------------------
>
> *stderr logs*
>
> ------------------------------
>
> *syslog logs*
>
> 2010-12-08 11:43:28,358 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=hadoop1
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.6.0_22
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_22/jre
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/opt/hadoop/bin/../conf:/usr/java/jdk1.6.0_22/lib/tools.jar:/opt/hadoop/bin/..:/opt/hadoop/bin/../hadoop-0.20.2-core.jar:/opt/hadoop/bin/../lib/commons-cli-1.2.jar:/opt/hadoop/bin/../lib/commons-codec-1.3.jar:/opt/hadoop/bin/../lib/commons-el-1.0.jar:/opt/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/opt/hadoop/bin/../lib/commons-logging-1.0.4.jar:/opt/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/opt/hadoop/bin/../lib/commons-net-1.4.1.jar:/opt/hadoop/bin/../lib/core-3.1.1.jar:/opt/hadoop/bin/../lib/hsqldb-1.8.0.10.jar:/opt/hadoop/bin/../lib/jasper-compiler-5.5.12.jar:/opt/hadoop/bin/../lib/jasper-runtime-5.5.12.jar:/opt/hadoop/bin/../lib/jets3t-0.6.1.jar:/opt/hadoop/bin/../lib/jetty-6.1.14.jar:/opt/hadoop/bin/../lib/jetty-util-6.1.14.jar:/opt/hadoop/bin/../lib/junit-3.8.1.jar:/opt/hadoop/bin/../lib/kfs-0.2.2.jar:/opt/hadoop/bin/../lib/log4j-1.2.15.jar:/opt/hadoop/bin/../lib/mockito-all-1.8.0.jar:/opt/hadoop/bin/../lib/oro-2.0.8.jar:/opt/hadoop/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/hadoop/bin/../lib/slf4j-api-1.4.3.jar:/opt/hadoop/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/hadoop/bin/../lib/xmlenc-0.52.jar:/opt/hadoop/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar:/hbase-0.20.6.jar:/hbase-0.20.6-test.jar:/conf:/lib/zookeeper-3.2.2.jar::/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/jars/classes:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/jars:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/opt/hadoop/bin/../lib/native/Linux-amd64-64:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work/tmp
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.18-194.26.1.el5
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=hadoop
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/home/hadoop
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,689 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop2:2181 sessionTimeout=60000 watcher=org.apache.hadoop.hbase.client.hconnectionmanager$clientzkwatc...@4d3c7378
> 2010-12-08 11:43:28,690 INFO org.apache.zookeeper.ClientCnxn: zookeeper.disableAutoWatchReset is false
> 2010-12-08 11:43:28,703 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hadoop2/192.168.1.111:2181
> 2010-12-08 11:43:28,722 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.1.55:41757 remote=hadoop2/192.168.1.111:2181]
> 2010-12-08 11:43:28,724 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> 2010-12-08 11:46:51,985 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201012071556_0020_m_000000_0 is done. And is in the process of commiting
> 2010-12-08 11:46:51,989 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201012071556_0020_m_000000_0' done.
>
> 2 of the jobs got killed and picked up by another machine, and their
> log looks exactly the same as above. Out of these, I see a cleanup
> attempt on one of the killed jobs. This cleanup attempt has this log:
>
> *stdout logs*
>
> ------------------------------
>
> *stderr logs*
>
> ------------------------------
>
> *syslog logs*
>
> 2010-12-08 11:44:58,313 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=CLEANUP, sessionId=
> 2010-12-08 11:44:58,435 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
> 2010-12-08 11:44:58,435 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201012071556_0020_m_000003_0 is done. And is in the process of commiting
> 2010-12-08 11:44:58,439 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201012071556_0020_m_000003_0' done.
>
> One of the killed jobs did not have a cleanup attempt associated with it.
>
> hari
>
> On Tue, Dec 7, 2010 at 9:43 PM, Stack <[email protected]> wrote:
>> Can you dig in more Hari? When a child process won't go down, try
>> figuring out what it's doing? Thread-dump it or study its logs?
>> St.Ack
>>
>> On Tue, Dec 7, 2010 at 4:36 AM, Hari Sreekumar <[email protected]> wrote:
>> > Hi,
>> >
>> > My cluster was running great till yesterday. Today, I submitted some
>> > jobs and I saw that the jobs were taking way too long. On investigation, I
>> > saw that the "Child" processes created by previous MR jobs were not getting
>> > killed, even though no jobs were running on the cluster, and there were like
>> > 40-50 child processes, each consuming memory, leading to huge swapping.
>> > When I kill -9 the child processes and re-run the jobs, I don't encounter this
>> > problem for some time, and then again the child processes don't get killed
>> > and eventually swapping happens. What could be the reason/solution?
>> >
>> > Thanks,
>> > Hari
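On Stack's earlier suggestion to thread-dump a stuck Child: from the shell that is typically `jstack <pid>` or `kill -QUIT <pid>` (the QUIT dump goes to the process's stdout log). The same information is available from inside a JVM; a self-contained sketch, not part of the original job, that prints roughly what such a dump shows:

```java
import java.util.Map;

// Prints a stack dump of every live thread in the current JVM --
// roughly the information a SIGQUIT (kill -QUIT <pid>) or jstack <pid>
// dump of a lingering Child process would show.
public class ThreadDumpSketch {
    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

If a lingering Child's dump shows all threads parked in WAITING while its syslog (like the logs above) already says the attempt is done, the JVM is idle rather than hung in map code, which points at shutdown behavior rather than the job itself.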
