Re: ConnectionException in container, happens only sometimes

Andrei Wed, 10 Jul 2013 06:22:21 -0700

Hi Devaraj,

Thanks for your answer. Yes, I suspected it could be caused by host
mapping, so I have already checked (and have just re-checked) the settings in
/etc/hosts on each machine, and they are all OK. I use both fully-qualified
names (e.g. `master-host.company.com`) and their short forms (e.g.
`master-host`), so it shouldn't depend on the notation either.
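
For reference, this kind of check can also be scripted. The following is only a
minimal Python sketch, assuming the standard hosts-file layout (address first,
then names, `#` starting a comment). The function name `loopback_hostnames` and
the set of expected localhost aliases are hypothetical, chosen for illustration,
and nothing here is part of Hadoop. It reports every non-localhost name that is
mapped to a loopback address, which is the kind of entry that would make a node
resolve its own name to 127.0.0.1:

```python
import ipaddress

# Assumption: these aliases are expected to point at loopback and are not reported.
EXPECTED_LOOPBACK_NAMES = {
    "localhost",
    "localhost.localdomain",
    "ip6-localhost",
    "ip6-loopback",
}


def loopback_hostnames(hosts_text):
    """Return (line_number, address, name) for each unexpected loopback mapping."""
    findings = []
    for line_number, line in enumerate(hosts_text.splitlines(), start=1):
        # Drop trailing comments, then split into "address name [alias ...]".
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2:
            continue
        address, names = entry[0], entry[1:]
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # not a parseable IP address; skip the line
        if not is_loopback:
            continue
        for name in names:
            if name.lower() not in EXPECTED_LOOPBACK_NAMES:
                findings.append((line_number, address, name))
    return findings
```

Calling `loopback_hostnames(open("/etc/hosts").read())` on each node should
return an empty list if no cluster hostname is bound to a loopback address there.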


I have also checked the AM syslog. There's nothing about the network, but there
are several messages like the following:

ERROR [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container
complete event for unknown container id
container_1373460572360_0001_01_000088


As I understand it, the container just doesn't get registered in the AM
(probably because of the same issue). Is that correct? If so, I wonder who
sends the "container complete event" to the ApplicationMaster.
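
To work out which application and attempt such an id belongs to, it can be
split into its parts. This is a minimal Python sketch, assuming the layout
`container_<cluster timestamp>_<application number>_<attempt number>_<container number>`
that the id in the message above appears to follow; `ContainerId` and
`parse_container_id` are hypothetical names used only for this illustration,
not Hadoop classes:

```python
import re
from typing import NamedTuple

# Assumption: four numeric fields after the "container_" prefix.
CONTAINER_ID_PATTERN = re.compile(r"^container_(\d+)_(\d+)_(\d+)_(\d+)$")


class ContainerId(NamedTuple):
    cluster_timestamp: int  # ResourceManager start time in milliseconds
    app_sequence: int       # application number within that RM run
    attempt: int            # application attempt number
    container: int          # container number within the attempt

    @property
    def application_id(self):
        """Rebuild the matching application id string."""
        return f"application_{self.cluster_timestamp}_{self.app_sequence:04d}"


def parse_container_id(text):
    """Parse a container id string into its numeric fields."""
    match = CONTAINER_ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised container id: {text!r}")
    return ContainerId(*(int(group) for group in match.groups()))
```

For the id in the message above, `parse_container_id(...).application_id` gives
the application whose logs the container should appear in.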





On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <[email protected]> wrote:

> >1. I assume this is the task (container) that tries to establish the
> connection, but what does it want to connect to?
>
> It is trying to connect to the MRAppMaster in order to execute the actual task.
>
> >1. I assume this is the task (container) that tries to establish the
> connection, but what does it want to connect to?
>
> It seems the container is not getting the correct MRAppMaster address for
> some reason, or the AM is crashing before giving the task to the container.
> This is probably caused by an invalid host mapping. Can you check that the
> host mapping is correct on both machines, and also check the AM log from
> that time for any clues?
>
> Thanks
>
> Devaraj k
>
> *From:* Andrei [mailto:[email protected]]
> *Sent:* 10 July 2013 17:32
> *To:* [email protected]
> *Subject:* ConnectionException in container, happens only sometimes
>
> Hi,
>
> I'm running a CDH4.3 installation of Hadoop with the following simple setup:
>
> master-host: runs NameNode, ResourceManager and JobHistoryServer
>
> slave-1-host and slave-2-host: DataNodes and NodeManagers.
>
> When I run a simple MapReduce job (either using the streaming API or the Pi
> example from the distribution), I see on the client that some tasks fail:
>
> 13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000003_0, Status : FAILED
> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000005_0, Status : FAILED
> ...
> 13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
> ...
>
> A different set of tasks/attempts fails every time. In some cases the number
> of failed attempts becomes critical and the whole job fails; in other cases
> the job finishes successfully. I can't see any pattern, but I noticed the
> following.
>
> Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on
> _slave-2-host_ there will be a corresponding syslog with the following
> contents:
>
> ...
>
> 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> ...
> 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1229)
>         at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
>         at com.sun.proxy.$Proxy6.getTask(Unknown Source)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
>         at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1196)
>         ... 3 more
>
> Notice several things:
>
> 1. This exception always happens on a different host than the one the
> ApplicationMaster runs on.
>
> 2. It always tries to connect to localhost, not to another host in the cluster.
>
> 3. The port number (11812 in this case) is different every time.
>
> My questions are:
>
> 1. I assume this is the task (container) that tries to establish the
> connection, but what does it want to connect to?
>
> 2. Why does this error happen, and how can I fix it?
>
> Any suggestions are welcome.
>
> Thanks,
>
> Andrei
>
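
The second observation in the quoted message (the task always dials localhost)
can be checked mechanically across many task logs. Below is a minimal Python
sketch, assuming the `server: <host>/<IPv4>:<port>` fragment seen in the quoted
retry lines; `retry_targets` is a hypothetical helper name, and IPv6 addresses
are not handled:

```python
import ipaddress
import re

# Assumption: the target appears as "server: <host>/<IPv4 address>:<port>".
SERVER_PATTERN = re.compile(
    r"server:\s*(?P<host>[^/\s]+)/(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)"
)


def retry_targets(log_text):
    """Return sorted unique (host, ip, port, is_loopback) tuples found in log_text."""
    targets = set()
    for match in SERVER_PATTERN.finditer(log_text):
        ip = match.group("ip")
        try:
            is_loopback = ipaddress.ip_address(ip).is_loopback
        except ValueError:
            continue  # looked like an address but is not a valid one
        targets.add((match.group("host"), ip, int(match.group("port")), is_loopback))
    return sorted(targets)
```

Any tuple whose last field is `True` for a cluster hostname points at a name
that resolved to a loopback address on the node that wrote the log.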
