If not already, check your code, it is possible that you might have missed on 
close call on Scanner that may lead to the connection still hanging and so is 
the process(es).

Ravi
________________________________________
From: [email protected] [[email protected]] On Behalf Of Stack 
[[email protected]]
Sent: Wednesday, December 08, 2010 11:08 AM
To: [email protected]
Subject: Re: "Child" processes not getting killed

Add some logging to your Map task and retry?
St.Ack

On Tue, Dec 7, 2010 at 10:28 PM, Hari Sreekumar
<[email protected]> wrote:
> Hi Stack,
>
>          The logs don't show anything nasty. e.g, I ran a job which spawned
> 5 mappers. All of the Child processes spawned by them remained even after
> the job completed. 3 map tasks got completed, and they have the following
> log:
>
> *stdout logs*
>
> ------------------------------
>
>
> *stderr logs*
>
> ------------------------------
>
>
> *syslog logs*
>
> 2010-12-08 11:43:28,358 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=MAP, sessionId=
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51
> GMT
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:host.name=hadoop1
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:java.version=1.6.0_22
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:java.vendor=Sun Microsystems Inc.
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:java.home=/usr/java/jdk1.6.0_22/jre
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:java.class.path=/opt/hadoop/bin/../conf:/usr/java/jdk1.6.0_22/lib/tools.jar:/opt/hadoop/bin/..:/opt/hadoop/bin/../hadoop-0.20.2-core.jar:/opt/hadoop/bin/../lib/commons-cli-1.2.jar:/opt/hadoop/bin/../lib/commons-codec-1.3.jar:/opt/hadoop/bin/../lib/commons-el-1.0.jar:/opt/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/opt/hadoop/bin/../lib/commons-logging-1.0.4.jar:/opt/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/opt/hadoop/bin/../lib/commons-net-1.4.1.jar:/opt/hadoop/bin/../lib/core-3.1.1.jar:/opt/hadoop/bin/../lib/hsqldb-1.8.0.10.jar:/opt/hadoop/bin/../lib/jasper-compiler-5.5.12.jar:/opt/hadoop/bin/../lib/jasper-runtime-5.5.12.jar:/opt/hadoop/bin/../lib/jets3t-0.6.1.jar:/opt/hadoop/bin/../lib/jetty-6.1.14.jar:/opt/hadoop/bin/../lib/jetty-util-6.1.14.jar:/opt/hadoop/bin/../lib/junit-3.8.1.jar:/opt/hadoop/bin/../lib/kfs-0.2.2.jar:/opt/hadoop/bin/../lib/log4j-1.2.15.jar:/opt/hadoop/bin/../lib/mockito-all-1.8.0.jar:/opt/hadoop/bin/../lib/oro-2.0.8.jar:/opt/hadoop/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/hadoop/bin/../lib/slf4j-api-1.4.3.jar:/opt/hadoop/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/hadoop/bin/../lib/xmlenc-0.52.jar:/opt/hadoop/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar:/hbase-0.20.6.jar:/hbase-0.20.6-test.jar:/conf:/lib/zookeeper-3.2.2.jar::/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/jars/classes:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/jars:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:java.library.path=/opt/hadoop/bin/../lib/native/Linux-amd64-64:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:java.io.tmpdir=/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work/tmp
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:java.compiler=<NA>
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:os.name=Linux
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:os.arch=amd64
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:os.version=2.6.18-194.26.1.el5
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:user.name=hadoop
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:user.home=/home/hadoop
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client
> environment:user.dir=/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,689 INFO org.apache.zookeeper.ZooKeeper:
> Initiating client connection, connectString=hadoop2:2181
> sessionTimeout=60000
> watcher=org.apache.hadoop.hbase.client.hconnectionmanager$clientzkwatc...@4d3c7378
> 2010-12-08 11:43:28,690 INFO org.apache.zookeeper.ClientCnxn:
> zookeeper.disableAutoWatchReset is false
> 2010-12-08 11:43:28,703 INFO org.apache.zookeeper.ClientCnxn:
> Attempting connection to server hadoop2/192.168.1.111:2181
> 2010-12-08 11:43:28,722 INFO org.apache.zookeeper.ClientCnxn: Priming
> connection to java.nio.channels.SocketChannel[connected
> local=/192.168.1.55:41757 remote=hadoop2/192.168.1.111:2181]
> 2010-12-08 11:43:28,724 INFO org.apache.zookeeper.ClientCnxn: Server
> connection successful
> 2010-12-08 11:46:51,985 INFO org.apache.hadoop.mapred.TaskRunner:
> Task:attempt_201012071556_0020_m_000000_0 is done. And is in the
> process of commiting
> 2010-12-08 11:46:51,989 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_201012071556_0020_m_000000_0' done.
>
>
> 2 of the jobs got killed and picked up by another machine, and their
> log looks exactly the same as above. Out of these, I see a cleanup
> attempt on one of the killed jobs. This cleanup job has this log:
>
> *stdout logs*
>
> ------------------------------
>
>
> *stderr logs*
>
> ------------------------------
>
>
> *syslog logs*
>
> 2010-12-08 11:44:58,313 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=CLEANUP, sessionId=
> 2010-12-08 11:44:58,435 INFO org.apache.hadoop.mapred.TaskRunner:
> Runnning cleanup for the task
> 2010-12-08 11:44:58,435 INFO org.apache.hadoop.mapred.TaskRunner:
> Task:attempt_201012071556_0020_m_000003_0 is done. And is in the
> process of commiting
> 2010-12-08 11:44:58,439 INFO org.apache.hadoop.mapred.TaskRunner: Task
> 'attempt_201012071556_0020_m_000003_0' done.
>
> One of the killed jobs did not have a cleanup attempt associated with it.
>
> hari
>
>
>
> On Tue, Dec 7, 2010 at 9:43 PM, Stack <[email protected]> wrote:
>
>> Can you dig in more Hari?  When a child process won't go down, try
>> figuring what its doing?  Thread-dump it or study its logs?
>> St.Ack
>>
>> On Tue, Dec 7, 2010 at 4:36 AM, Hari Sreekumar <[email protected]>
>> wrote:
>> > Hi,
>> >
>> >       My cluster was running great till yesterday. Today, I submitted
>> some
>> > jobs and I saw that the jobs were taking way too long. On investigation,
>> I
>> > saw that the "Child" processes created by previous MR jobs were not
>> getting
>> > killed, even though no jobs were running on the cluster, and there were
>> like
>> > 40-50 child processes, each consuming memory, leading to huge swapping.
>> When
>> > I kill -9 the child processes and re-run the jobs, I don't encounter this
>> > problem for some time, and then again the child processes don't get
>> killed
>> > and eventually swapping happens. What could be the reason/solution?
>> >
>> > Thanks,
>> > Hari
>> >
>>
>

The information contained in this communication may be CONFIDENTIAL and is 
intended only for the use of the recipient(s) named above.  If you are not the 
intended recipient, you are hereby notified that any dissemination, 
distribution, or copying of this communication, or any of its contents, is 
strictly prohibited.  If you have received this communication in error, please 
notify the sender and delete/destroy the original message and any copy of it 
from your computer or paper files.

Reply via email to