Hi Devaraj, thanks for your answer. Yes, I suspected it could be because of host mapping, so I have already checked (and have just re-checked) the settings in /etc/hosts on each machine, and they are all OK. I use both fully-qualified names (e.g. `master-host.company.com`) and their short forms (e.g. `master-host`), so it shouldn't depend on the notation either.
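A minimal sketch of a resolution check that could be run on each node, assuming Python is available there. It reports which of the given names resolve to a loopback address, which is what the `slave-2-host/127.0.0.1` in the quoted log would suggest. The function name `loopback_hosts` and its injectable `resolve` parameter are hypothetical, introduced only for illustration; the resolver used by the JVM may behave differently from the one Python uses.

```python
import ipaddress
import socket


def loopback_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the names that resolve to a loopback address (127.0.0.0/8)."""
    suspicious = []
    for name in hostnames:
        try:
            address = resolve(name)
        except OSError:
            # Names that do not resolve at all are a separate problem; skip them.
            continue
        if ipaddress.ip_address(address).is_loopback:
            suspicious.append(name)
    return suspicious


# Example call with the host names from this thread (replace with your own):
#   loopback_hosts(["master-host", "slave-1-host", "slave-2-host"])
```

An empty result on every node would mean that, at least for this resolver, none of the names map to the loopback interface.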
I have also checked the AM syslog. There's nothing about the network, but there are several messages like the following:

ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Container complete event for unknown container id container_1373460572360_0001_01_000088

I understand the container just doesn't get registered in the AM (probably because of the same issue). Is that correct? If so, I wonder who sends the "container complete event" to the ApplicationMaster.

On Wed, Jul 10, 2013 at 3:19 PM, Devaraj k <[email protected]> wrote:

> > 1. I assume this is the task (container) that tries to establish
> > connection, but what it wants to connect to?
>
> It is trying to connect to MRAppMaster for executing the actual task.
>
> > 1. I assume this is the task (container) that tries to establish
> > connection, but what it wants to connect to?
>
> It seems Container is not getting the correct MRAppMaster address due to
> some reason or AM is crashing before giving the task to Container. Probably
> it is coming due to invalid host mapping. Can you check the host mapping
> is proper in both the machines and also check the AM log that time for any
> clue.
>
> Thanks
> Devaraj k
>
> *From:* Andrei [mailto:[email protected]]
> *Sent:* 10 July 2013 17:32
> *To:* [email protected]
> *Subject:* ConnectionException in container, happens only sometimes
>
> Hi,
>
> I'm running CDH4.3 installation of Hadoop with the following simple setup:
>
> master-host: runs NameNode, ResourceManager and JobHistoryServer
> slave-1-host and slave-2-host: DataNodes and NodeManagers.
>
> When I run simple MapReduce job (both - using streaming API or Pi example
> from distribution) on client I see that some tasks fail:
>
> 13/07/10 14:40:10 INFO mapreduce.Job: map 60% reduce 0%
> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000003_0, Status : FAILED
> 13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000005_0, Status : FAILED
> ...
> 13/07/10 14:40:23 INFO mapreduce.Job: map 60% reduce 20%
> ...
>
> Every time different set of tasks/attempts fails. In some cases number of
> failed attempts becomes critical, and the whole job fails, in other cases
> job is finished successfully. I can't see any dependency, but I noticed the
> following.
>
> Let's say, ApplicationMaster runs on _slave-1-host_. In this case on
> _slave-2-host_ there will be corresponding syslog with the following
> contents:
>
> ...
> 2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> ...
> 2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1229)
>     at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
>     at com.sun.proxy.$Proxy6.getTask(Unknown Source)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
>     at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1196)
>     ... 3 more
>
> Notice several things:
>
> 1. This exception always happens on the different host than
> ApplicationMaster runs on.
> 2. It always tries to connect to localhost, not other host in cluster.
> 3. Port number (11812 in this case) is always different.
>
> My questions are:
>
> 1. I assume this is the task (container) that tries to establish
> connection, but what it wants to connect to?
> 2. Why this error happens and how can I fix it?
>
> Any suggestions are welcome.
>
> Thanks,
> Andrei
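The second observation in the quoted message (the task always tries to connect to localhost) can be checked across many container syslogs with a small script. The following is only a sketch, assuming the log lines have the same `Retrying connect to server: host/ip:port` shape as the ones quoted in this thread; the names `RETRY_RE` and `loopback_targets` are hypothetical.

```python
import ipaddress
import re

# Matches the "host/ip:port" part of the retry messages quoted in this thread.
RETRY_RE = re.compile(
    r"Retrying connect to server: "
    r"(?P<host>[^/\s]+)/(?P<ip>\d+(?:\.\d+){3}):(?P<port>\d+)"
)


def loopback_targets(log_lines):
    """Return (host, ip, port) for every retry whose target IP is loopback."""
    found = []
    for line in log_lines:
        match = RETRY_RE.search(line)
        if match is None:
            continue
        try:
            is_loopback = ipaddress.ip_address(match.group("ip")).is_loopback
        except ValueError:
            # Not a valid IPv4 address (e.g. an octet above 255); ignore the line.
            continue
        if is_loopback:
            found.append(
                (match.group("host"), match.group("ip"), int(match.group("port")))
            )
    return found
```

Each tuple it returns names a host that was resolved to a loopback address at the moment the task tried to reach it, together with the port it was trying.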
