Hi Stack,
The logs don't show anything nasty. e.g, I ran a job which spawned
5 mappers. All of the Child processes spawned by them remained even after
the job completed. 3 map tasks got completed, and they have the following
log:
*stdout logs*
------------------------------
*stderr logs*
------------------------------
*syslog logs*
2010-12-08 11:43:28,358 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51
GMT
2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
environment:host.name=hadoop1
2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.version=1.6.0_22
2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.vendor=Sun Microsystems Inc.
2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.home=/usr/java/jdk1.6.0_22/jre
2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.class.path=/opt/hadoop/bin/../conf:/usr/java/jdk1.6.0_22/lib/tools.jar:/opt/hadoop/bin/..:/opt/hadoop/bin/../hadoop-0.20.2-core.jar:/opt/hadoop/bin/../lib/commons-cli-1.2.jar:/opt/hadoop/bin/../lib/commons-codec-1.3.jar:/opt/hadoop/bin/../lib/commons-el-1.0.jar:/opt/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/opt/hadoop/bin/../lib/commons-logging-1.0.4.jar:/opt/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/opt/hadoop/bin/../lib/commons-net-1.4.1.jar:/opt/hadoop/bin/../lib/core-3.1.1.jar:/opt/hadoop/bin/../lib/hsqldb-1.8.0.10.jar:/opt/hadoop/bin/../lib/jasper-compiler-5.5.12.jar:/opt/hadoop/bin/../lib/jasper-runtime-5.5.12.jar:/opt/hadoop/bin/../lib/jets3t-0.6.1.jar:/opt/hadoop/bin/../lib/jetty-6.1.14.jar:/opt/hadoop/bin/../lib/jetty-util-6.1.14.jar:/opt/hadoop/bin/../lib/junit-3.8.1.jar:/opt/hadoop/bin/../lib/kfs-0.2.2.jar:/opt/hadoop/bin/../lib/log4j-1.2.15.jar:/opt/hadoop/bin/../lib/mockito-all-1.8.0.jar:/opt/hadoop/bin/../lib/oro-2.0.8.jar:/opt/hadoop/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/hadoop/bin/../lib/slf4j-api-1.4.3.jar:/opt/hadoop/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/hadoop/bin/../lib/xmlenc-0.52.jar:/opt/hadoop/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar:/hbase-0.20.6.jar:/hbase-0.20.6-test.jar:/conf:/lib/zookeeper-3.2.2.jar::/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/jars/classes:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/jars:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.library.path=/opt/hadoop/bin/../lib/native/Linux-amd64-64:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.io.tmpdir=/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work/tmp
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:java.compiler=<NA>
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:os.name=Linux
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:os.arch=amd64
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:os.version=2.6.18-194.26.1.el5
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:user.name=hadoop
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:user.home=/home/hadoop
2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
environment:user.dir=/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
2010-12-08 11:43:28,689 INFO org.apache.zookeeper.ZooKeeper:
Initiating client connection, connectString=hadoop2:2181
sessionTimeout=60000
watcher=org.apache.hadoop.hbase.client.hconnectionmanager$clientzkwatc...@4d3c7378
2010-12-08 11:43:28,690 INFO org.apache.zookeeper.ClientCnxn:
zookeeper.disableAutoWatchReset is false
2010-12-08 11:43:28,703 INFO org.apache.zookeeper.ClientCnxn:
Attempting connection to server hadoop2/192.168.1.111:2181
2010-12-08 11:43:28,722 INFO org.apache.zookeeper.ClientCnxn: Priming
connection to java.nio.channels.SocketChannel[connected
local=/192.168.1.55:41757 remote=hadoop2/192.168.1.111:2181]
2010-12-08 11:43:28,724 INFO org.apache.zookeeper.ClientCnxn: Server
connection successful
2010-12-08 11:46:51,985 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201012071556_0020_m_000000_0 is done. And is in the
process of commiting
2010-12-08 11:46:51,989 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201012071556_0020_m_000000_0' done.
2 of the jobs got killed and picked up by another machine, and their
log looks exactly the same as above. Out of these, I see a cleanup
attempt on one of the killed jobs. This cleanup job has this log:
*stdout logs*
------------------------------
*stderr logs*
------------------------------
*syslog logs*
2010-12-08 11:44:58,313 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=CLEANUP, sessionId=
2010-12-08 11:44:58,435 INFO org.apache.hadoop.mapred.TaskRunner:
Runnning cleanup for the task
2010-12-08 11:44:58,435 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201012071556_0020_m_000003_0 is done. And is in the
process of commiting
2010-12-08 11:44:58,439 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201012071556_0020_m_000003_0' done.
One of the killed jobs did not have a cleanup attempt associated with it.
hari
On Tue, Dec 7, 2010 at 9:43 PM, Stack <[email protected]> wrote:
> Can you dig in more Hari? When a child process won't go down, try
> figuring what its doing? Thread-dump it or study its logs?
> St.Ack
>
> On Tue, Dec 7, 2010 at 4:36 AM, Hari Sreekumar <[email protected]>
> wrote:
> > Hi,
> >
> > My cluster was running great till yesterday. Today, I submitted
> some
> > jobs and I saw that the jobs were taking way too long. On investigation,
> I
> > saw that the "Child" processes created by previous MR jobs were not
> getting
> > killed, even though no jobs were running on the cluster, and there were
> like
> > 40-50 child processes, each consuming memory, leading to huge swapping.
> When
> > I kill -9 the child processes and re-run the jobs, I don't encounter this
> > problem for some time, and then again the child processes don't get
> killed
> > and eventually swapping happens. What could be the reason/solution?
> >
> > Thanks,
> > Hari
> >
>