Add some logging to your Map task and retry?
St.Ack
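For reference, "adding logging" to a map task normally just means writing log lines that end up in the task attempt's syslog file shown below. A minimal sketch of where such calls would go, using plain java.util.logging and a hypothetical record-counting class instead of the real Hadoop 0.20 Mapper API (the class, method, and messages are illustrative, not from the original job):

```java
import java.util.logging.Logger;

// Stand-in for a map task. The real code would extend the Hadoop
// Mapper class; this only illustrates where the log calls would go.
public class LoggingMapSketch {
    private static final Logger LOG =
            Logger.getLogger(LoggingMapSketch.class.getName());
    private long records = 0;

    // Hypothetical per-record hook, analogous to map(key, value, ...).
    // Logging every record would be too chatty, so log every 1000th.
    public void map(String value) {
        records++;
        if (records % 1000 == 0) {
            LOG.info("processed " + records + " records");
        }
    }

    // Hypothetical end-of-task hook. A line logged here shows whether
    // the task ever reached its shutdown path.
    public long finish() {
        LOG.info("map task finishing after " + records + " records");
        return records;
    }

    public static void main(String[] args) {
        LoggingMapSketch m = new LoggingMapSketch();
        for (int i = 0; i < 2500; i++) {
            m.map("row-" + i);
        }
        System.out.println(m.finish());
    }
}
```

If the "finishing" line appears in syslog but the Child JVM still lingers, the map code itself completed and the problem is after task commit.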
On Tue, Dec 7, 2010 at 10:28 PM, Hari Sreekumar <[email protected]> wrote:
> Hi Stack,
>
> The logs don't show anything nasty. E.g., I ran a job which spawned
> 5 mappers. All of the Child processes spawned by them remained even after
> the job completed. 3 map tasks got completed, and they have the following
> log:
>
> *stdout logs*
>
> ------------------------------
>
> *stderr logs*
>
> ------------------------------
>
> *syslog logs*
>
> 2010-12-08 11:43:28,358 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=hadoop1
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.6.0_22
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.6.0_22/jre
> 2010-12-08 11:43:28,687 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/opt/hadoop/bin/../conf:/usr/java/jdk1.6.0_22/lib/tools.jar:/opt/hadoop/bin/..:/opt/hadoop/bin/../hadoop-0.20.2-core.jar:/opt/hadoop/bin/../lib/commons-cli-1.2.jar:/opt/hadoop/bin/../lib/commons-codec-1.3.jar:/opt/hadoop/bin/../lib/commons-el-1.0.jar:/opt/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/opt/hadoop/bin/../lib/commons-logging-1.0.4.jar:/opt/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/opt/hadoop/bin/../lib/commons-net-1.4.1.jar:/opt/hadoop/bin/../lib/core-3.1.1.jar:/opt/hadoop/bin/../lib/hsqldb-1.8.0.10.jar:/opt/hadoop/bin/../lib/jasper-compiler-5.5.12.jar:/opt/hadoop/bin/../lib/jasper-runtime-5.5.12.jar:/opt/hadoop/bin/../lib/jets3t-0.6.1.jar:/opt/hadoop/bin/../lib/jetty-6.1.14.jar:/opt/hadoop/bin/../lib/jetty-util-6.1.14.jar:/opt/hadoop/bin/../lib/junit-3.8.1.jar:/opt/hadoop/bin/../lib/kfs-0.2.2.jar:/opt/hadoop/bin/../lib/log4j-1.2.15.jar:/opt/hadoop/bin/../lib/mockito-all-1.8.0.jar:/opt/hadoop/bin/../lib/oro-2.0.8.jar:/opt/hadoop/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/hadoop/bin/../lib/slf4j-api-1.4.3.jar:/opt/hadoop/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/hadoop/bin/../lib/xmlenc-0.52.jar:/opt/hadoop/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar:/hbase-0.20.6.jar:/hbase-0.20.6-test.jar:/conf:/lib/zookeeper-3.2.2.jar::/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/jars/classes:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/jars:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/opt/hadoop/bin/../lib/native/Linux-amd64-64:/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work/tmp
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=2.6.18-194.26.1.el5
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=hadoop
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/home/hadoop
> 2010-12-08 11:43:28,688 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/home/hadoop/DFS/MultiNode/mapred/local/taskTracker/jobcache/job_201012071556_0020/attempt_201012071556_0020_m_000000_0/work
> 2010-12-08 11:43:28,689 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop2:2181 sessionTimeout=60000 watcher=org.apache.hadoop.hbase.client.hconnectionmanager$clientzkwatc...@4d3c7378
> 2010-12-08 11:43:28,690 INFO org.apache.zookeeper.ClientCnxn: zookeeper.disableAutoWatchReset is false
> 2010-12-08 11:43:28,703 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hadoop2/192.168.1.111:2181
> 2010-12-08 11:43:28,722 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.1.55:41757 remote=hadoop2/192.168.1.111:2181]
> 2010-12-08 11:43:28,724 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> 2010-12-08 11:46:51,985 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201012071556_0020_m_000000_0 is done. And is in the process of commiting
> 2010-12-08 11:46:51,989 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201012071556_0020_m_000000_0' done.
>
> 2 of the jobs got killed and picked up by another machine, and their
> log looks exactly the same as above. Out of these, I see a cleanup
> attempt on one of the killed jobs. This cleanup attempt has this log:
>
> *stdout logs*
>
> ------------------------------
>
> *stderr logs*
>
> ------------------------------
>
> *syslog logs*
>
> 2010-12-08 11:44:58,313 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=CLEANUP, sessionId=
> 2010-12-08 11:44:58,435 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task
> 2010-12-08 11:44:58,435 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_201012071556_0020_m_000003_0 is done. And is in the process of commiting
> 2010-12-08 11:44:58,439 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_201012071556_0020_m_000003_0' done.
>
> One of the killed jobs did not have a cleanup attempt associated with it.
>
> hari
>
> On Tue, Dec 7, 2010 at 9:43 PM, Stack <[email protected]> wrote:
>> Can you dig in more Hari? When a child process won't go down, try
>> figuring out what it's doing? Thread-dump it or study its logs?
>> St.Ack
>>
>> On Tue, Dec 7, 2010 at 4:36 AM, Hari Sreekumar <[email protected]> wrote:
>> > Hi,
>> >
>> > My cluster was running great till yesterday. Today, I submitted some
>> > jobs and I saw that the jobs were taking way too long. On investigation, I
>> > saw that the "Child" processes created by previous MR jobs were not getting
>> > killed, even though no jobs were running on the cluster, and there were like
>> > 40-50 child processes, each consuming memory, leading to huge swapping.
>> > When I kill -9 the child processes and re-run the jobs, I don't encounter this
>> > problem for some time, and then again the child processes don't get killed
>> > and eventually swapping happens. What could be the reason/solution?
>> >
>> > Thanks,
>> > Hari
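On Stack's earlier suggestion to thread-dump a stuck Child: from the shell that is typically `jstack <pid>` or `kill -QUIT <pid>` (the QUIT dump goes to the process's stdout log). The same information is available from inside a JVM; a self-contained sketch, not part of the original job, that prints roughly what such a dump shows:

```java
import java.util.Map;

// Prints a stack dump of every live thread in the current JVM --
// roughly the information a SIGQUIT (kill -QUIT <pid>) or jstack <pid>
// dump of a lingering Child process would show.
public class ThreadDumpSketch {
    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

If a lingering Child's dump shows all threads parked in WAITING while its syslog (like the logs above) already says the attempt is done, the JVM is idle rather than hung in map code, which points at shutdown behavior rather than the job itself.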
