[ 
https://issues.apache.org/jira/browse/HDFS-14857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri reassigned HDFS-14857:
----------------------------------

    Assignee: Jeff Saremi

> FS operations fail in HA mode: DataNode fails to connect to NameNode
> --------------------------------------------------------------------
>
>                 Key: HDFS-14857
>                 URL: https://issues.apache.org/jira/browse/HDFS-14857
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.1.0
>            Reporter: Jeff Saremi
>            Assignee: Jeff Saremi
>            Priority: Major
>
> In an HA configuration, if the NameNodes get restarted and are assigned 
> new IP addresses, any client FS operation, such as a {{copyFromLocal}}, 
> will fail with a message like the following:
> {noformat}
> 2019-09-12 18:47:30,544 WARN hdfs.DataStreamer: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/init.sh._COPYING_ could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2211)
>         ...
> {noformat}
>  
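> For context, the failing client-side write path boils down to the sketch 
> below (a minimal sketch; the paths and the {{fs.defaultFS}} value are 
> hypothetical, and it assumes an HA nameservice is configured in 
> {{hdfs-site.xml}}):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class CopyRepro {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Hypothetical HA nameservice URI; the concrete NameNode addresses are
>     // resolved from the dfs.ha.namenodes.* / dfs.namenode.rpc-address.*
>     // entries of the cluster configuration.
>     conf.set("fs.defaultFS", "hdfs://ns1");
>
>     try (FileSystem fs = FileSystem.get(conf)) {
>       // Equivalent of `hdfs dfs -copyFromLocal`; this is the write that
>       // fails with "could only be written to 0 of the 1 minReplication
>       // nodes" once the DataNodes can no longer reach the NameNodes.
>       fs.copyFromLocalFile(new Path("/tmp/init.sh"), new Path("/tmp/init.sh"));
>     }
>   }
> }
> {code}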
> Looking at DataNode's stderr shows the following:
>  * The heartbeat service detects the IP change and recovers (almost)
>  * At this stage, an *hdfs dfsadmin -report* reports all datanodes correctly
>  * Once the write begins, the following exception shows up in the datanode 
> log: *no route to host*
> {noformat}
> 2019-09-12 01:35:11,251 WARN datanode.DataNode: IOException in offerService
> java.io.EOFException: End of File Exception between local host is: "storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211"; destination host is: "nmnode-0-0.nmnode-0-svc.test.svc.cluster.local":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:789)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1491)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1388)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:166)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:516)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:646)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:847)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1850)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1183)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1079)
> {noformat}
> {noformat}
> 2019-09-12 01:41:12,273 WARN ipc.Client: Address change detected. Old: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 New: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.220:9000
> ...
> {noformat}
>  
> {noformat}
> 2019-09-12 01:41:12,482 INFO datanode.DataNode: Block pool BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 7673ef28-957a-439f-a721-d47a4a6adb7b) service to nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 beginning handshake with NN
> 2019-09-12 01:41:12,534 INFO datanode.DataNode: Block pool BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 7673ef28-957a-439f-a721-d47a4a6adb7b) service to nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 successfully registered with NN
> {noformat}
>  
> *NOTE*: When the "Address change detected" message appears, the printout 
> correctly shows both the old and the new address ({{10.244.0.220}}). 
> However, when the registration with the NN completes, the old IP address 
> ({{10.244.0.217}}) is still being printed, which shows that cached copies 
> of the IP addresses linger on.
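>
> To illustrate the suspected mechanism (a sketch, not the actual DataNode 
> code): a {{java.net.InetSocketAddress}} resolves its hostname once, at 
> construction time, so any component that holds on to the resolved object 
> keeps the stale IP. Picking up the new IP requires constructing a fresh 
> address from the hostname:
> {code:java}
> import java.net.InetSocketAddress;
>
> public class StaleAddressSketch {
>   public static void main(String[] args) {
>     // Resolved once at construction; the IP is baked into the object.
>     InetSocketAddress cached = new InetSocketAddress(
>         "nmnode-0-1.nmnode-0-svc.test.svc.cluster.local", 9000);
>
>     // ... the NameNode pod restarts and gets a new IP;
>     // `cached` still holds the old one ...
>
>     // Re-resolving means building a new address from the hostname,
>     // which triggers a fresh DNS lookup:
>     InetSocketAddress fresh =
>         new InetSocketAddress(cached.getHostName(), cached.getPort());
>
>     System.out.println("cached -> " + cached.getAddress());
>     System.out.println("fresh  -> " + fresh.getAddress());
>   }
> }
> {code}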
>  
> The following is where the actual error occurs, preventing any writes to 
> the FS:
>  
> {noformat}
> 2019-09-12 18:45:29,843 INFO retry.RetryInvocationHandler: java.net.NoRouteToHostException: No Route to Host from storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211 to nmnode-0-1.nmnode-0-svc:50200 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost, while invoking InMemoryAliasMapProtocolClientSideTranslatorPB.read over nmnode-0-1.nmnode-0-svc/10.244.0.217:50200 after 3 failover attempts. Trying to failover after sleeping for 4452ms.
> {noformat}
>  
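> A possible mitigation worth checking (an assumption on my part, not 
> verified on this cluster): independently of any caching inside Hadoop, the 
> JVM caches successful DNS lookups according to the 
> {{networkaddress.cache.ttl}} security property, so even code that 
> re-resolves can keep seeing a stale IP for the lifetime of that cache. 
> Lowering the TTL early at process start bounds that window:
> {code:java}
> import java.security.Security;
>
> public class DnsTtlWorkaround {
>   public static void main(String[] args) {
>     // Cache successful DNS lookups for at most 30 seconds instead of the
>     // security-provider default, so a restarted NameNode pod's new IP is
>     // picked up on a subsequent resolution. Must run before the first
>     // lookup to take effect.
>     Security.setProperty("networkaddress.cache.ttl", "30");
>     // Failed (negative) lookups can be tuned the same way:
>     Security.setProperty("networkaddress.cache.negative.ttl", "0");
>   }
> }
> {code}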


