[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928128#comment-16928128
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/12/19 1:39 AM:


{quote}Let's see if we can fix the count of active threads.
{quote}
[~elgoiri], HDFS-12288 won't help in this case. When a client is writing data to
1 DN, that DN starts 2 threads ({{DataXceiver}} and {{PacketResponder}}). After
the HDFS-12288 patch, the {{DataXceiverServer}} thread will no longer be included
in the {{xceiverCount}} of the heartbeat, so in this case there will still be 5 DNs
with {{xceiverCount}} equal to 0 and 1 DN with {{xceiverCount}} equal to 2 (which
is overloaded).
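
To make the arithmetic explicit, here is a rough sketch of the default load check
(simplified names, not the actual {{BlockPlacementPolicyDefault}} code) applied to
the scenario above:
{code:java}
// 5 idle DNs report xceiverCount = 0; the writing DN reports 2 (DataXceiver + PacketResponder).
int[] xceiverCounts = {0, 0, 0, 0, 0, 2};
double avg = 2.0 / 6;                          // cluster-wide average load ~= 0.33
double maxLoad = 2.0 * avg;                    // default considerLoad factor is 2.0 -> threshold ~= 0.67
boolean tooBusy = xceiverCounts[5] > maxLoad;  // 2 > 0.67, so the writing DN is still rejected
{code}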


was (Author: zhangchen):
{quote}Let's see if we can fix the count of active threads.
{quote}
[~elgoiri],HDFS-12288 won't help in this case. when some client writing data on 
1 DN, it will start 2 threads(\{{DatanodeXceiver}} and {{PacketResponder}}), 
after the patch of HDFS-12288, {{DatanodeXceiverServer}} will not included in 
{{xceiverCount}} of heartbeat. In this case, there will be 5 DN with 
{{xceiverCount}} equals to 0 and 1 DN equals to 2(which is overloaded).

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch, HDFS-14811.002.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> 

[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-09 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925202#comment-16925202
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/9/19 7:27 AM:
---

Hi [~ayushtkn], I've gone through the discussion in HDFS-12288. The latest
conclusion there is to modify the getXceiverCount() method to return the real
number of DataXceiver threads (currently it reports much more than the real
number), but the load of each DN is still unchanged (it still uses the active
number of threads), so when a DN starts writing a block, its load would still be
3, which makes it overloaded.

My initial idea is quite similar to what [~lukmajercak] mentioned in HDFS-12288:
do not count the PacketResponder thread when calculating a DN's load. But that
solution doesn't look like a good choice.


was (Author: zhangchen):
Hi [~ayushtkn], I've gone through the discussion in HDFS-12288, the latest 
conclusion is to modify getXceiverCount() method to return real number of 
DataXceiver threads (current is much more than the real number), but the load 
of each DN is still not changed (using the activeNumberOfThread instead), so 
when a DN start writing a block, the load would still be 3, which makes it 
overloaded.

My initially idea is quite same as [~lukmajercak] mentioned at HDFS-12288: do 
not consider packetResponder thread when calculating DN's load. But this 
solution looks not a good choice.


[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-08 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924765#comment-16924765
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/8/19 2:32 PM:
---

Uploaded patch v2 to disable the considerLoad option. I've run the whole test
class (using {{mvn -Dtest=TestRouterRpc test}}) 50 times locally, and all of them
passed with the patch.

I've filed another Jira, HDFS-14830, to track the xceiverCount problem.
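
For reference, disabling the load check in a MiniDFSCluster-based test usually
looks like the sketch below (an illustration of the option, not necessarily the
exact change in patch v2):
{code:java}
Configuration conf = new HdfsConfiguration();
// Do not reject DNs as "too busy" during block placement in this test cluster.
conf.setBoolean("dfs.namenode.redundancy.considerLoad", false);
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(6).build();
{code}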


was (Author: zhangchen):
Uploaded patch v2 to disable considerLoad option. I've run the whole class 
test(using {{mvn -Dtest=TestRouterRpc test}}) 50 times in local, all of them 
passed after patch.

I've filed another Jira HDFS-14803 to track the xceiverCount problem.


[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-05 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923661#comment-16923661
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/6/19 12:44 AM:


Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count is not accurate.
 ## When the BPServiceActor sends a heartbeat to the NN, it uses
{{DataNode#getXceiverCount()}} to get the xceiver count.
 ## {{DataNode#getXceiverCount()}} actually uses {{ThreadGroup#activeCount()}} as
the xceiver count (see the sketch after this list).
 ## But the threadGroup does not only contain DataXceiver threads; the
DataXceiverServer and PacketResponder threads are also added to the same
threadGroup.
 ## So if a DN sends a heartbeat while it is receiving 1 block, the reported
xceiver count is 3. If the DN is idle, the reported xceiver count is 1 (only the
DataXceiverServer thread).
 ## Here are some tracing logs taken before sending the heartbeat; I dumped all
the thread information in the threadGroup and the xceiverCount from
{{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
   
Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002, 
type=LAST_IN_PIPELINE,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56108, xceiverCount: 3{code}

 # When allocating new blocks, the xceiver count in the NN may not be updated yet.
 ## In this test, the createFile method sets the replication factor to 1 for each
file, so every file writes only 1 replica.
 ## So the NN may receive heartbeats like this: 5 DNs have 1 xceiver and 1 DN has
3 xceivers.
 ## If we complete a regular file and start another EC file, the NN may not have
received the latest heartbeat, so the DN with 3 xceivers will be considered
overloaded (the threshold, 2x the average load, is (5*1+3)/6*2 = 2.666...).
 ## 
{code:java}
Datanode 127.0.0.1:56108 is not chosen since the node is too busy (load: 3 > 
2.6665){code}
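
For context, the counter mentioned in item 1 is essentially just the size of the
thread group; a paraphrase of {{DataNode#getXceiverCount()}} as described above
(a sketch, not a verbatim copy of the source):
{code:java}
// Reports the active thread count of the dataXceiverServer thread group, which also
// contains the DataXceiverServer acceptor thread and any PacketResponder threads,
// not only the DataXceiver threads themselves.
public int getXceiverCount() {
  return threadGroup == null ? 0 : threadGroup.activeCount();
}
{code}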

If we want to work around this, we can either simply not consider load, or
trigger and wait for heartbeats before creating a new EC file (a sketch of the
latter follows). I think both are OK, what's your opinion?
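
A minimal sketch of the second workaround, assuming a MiniDFSCluster-based test
where {{cluster}} and {{fs}} are the test's cluster and file system (illustrative
only, not part of the current patch):
{code:java}
// Ask every DN to heartbeat immediately so the NN refreshes its xceiver counts,
// then create the next EC file once the counts have settled.
cluster.triggerHeartbeats();
Thread.sleep(1000);  // crude wait; a GenericTestUtils.waitFor() condition would be cleaner
fs.create(new Path("/testec/another-file")).close();  // hypothetical follow-up file
{code}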

Furthermore, should we fix the inaccuracy of the xceiver count? In the worst
case, the reported xceiver count may be double the actual number of xceivers
(every xceiver processing a WRITE_BLOCK op creates an extra PacketResponder
thread).


was (Author: zhangchen):
Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count number is not accurate.
 ## When bpServiceActor send heartbeat to NN, it use 
\{{DataNode#getXceiverCount()}} to get xceiver count
 ## {\{DataNode#getXceiverCount()}} actually use the 
\{{ThreadGroup#activeCount()}} as the xceiver count
 ## But threadGroup is not only related with DataXceiver threads, the 
DataXceiverServer and PacketResponder is also added to the same threadGroup.
 ## So if a DN sends heartbeat when it's receiving 1 block, the reported 
xceiver count would be 3. If DN is free, the reported xceiver count would be 1 
(only includes DataXceiverServer thread)
 ## Here is some tracing logs before sending heartbeat, I dumped all the treads 
information in threadGroup and the xcevierCount from 
\{{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
   
Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002, 

[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-05 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923661#comment-16923661
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/5/19 6:02 PM:
---

Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count is not accurate.
 ## When the BPServiceActor sends a heartbeat to the NN, it uses
{{DataNode#getXceiverCount()}} to get the xceiver count.
 ## {{DataNode#getXceiverCount()}} actually uses {{ThreadGroup#activeCount()}} as
the xceiver count.
 ## But the threadGroup does not only contain DataXceiver threads; the
DataXceiverServer and PacketResponder threads are also added to the same
threadGroup.
 ## So if a DN sends a heartbeat while it is receiving 1 block, the reported
xceiver count is 3. If the DN is idle, the reported xceiver count is 1 (only the
DataXceiverServer thread).
 ## Here are some tracing logs taken before sending the heartbeat; I dumped all
the thread information in the threadGroup and the xceiverCount from
{{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
   
Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002, 
type=LAST_IN_PIPELINE,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56108, xceiverCount: 3{code}

 # When allocating new blocks, the xceiver count in the NN may not be updated yet.
 ## In this test, the createFile method sets the replication factor to 1 for each
file, so every file writes only 1 replica.
 ## So the NN may receive heartbeats like this: 5 DNs have 1 xceiver and 1 DN has
3 xceivers.
 ## If we complete a regular file and start another EC file, the NN may not have
received the latest heartbeat, so the DN with 3 xceivers will be considered
overloaded (the threshold, 2x the average load, is (5*1+3)/6*2 = 2.666...).
 ## 
{code:java}
Datanode 127.0.0.1:56108 is not chosen since the node is too busy (load: 3 > 
2.6665){code}

If we want to work around this, we can either simply not consider load, or
trigger and wait for heartbeats before creating a new EC file. I think both are
OK, what's your opinion?

Furthermore, should we fix the inaccuracy of the xceiver count? In the worst
case, the reported xceiver count may be double the actual number of xceivers.


was (Author: zhangchen):
Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count number is not accurate.
 ## When bpServiceActor send heartbeat to NN, it use 
\{{DataNode#getXceiverCount()}} to get xceiver count
 ## {\{DataNode#getXceiverCount()}} actually use the 
\{{ThreadGroup#activeCount()}} as the xceiver count
 ## But threadGroup is not only related with DataXceiver threads, the 
DataXceiverServer and PacketResponder is also added to the same threadGroup.
 ## So if a DN sends heartbeat when it's receiving 1 block, the reported 
xceiver count would be 3. If DN is free, the reported xceiver count would be 1 
(only includes DataXceiverServer thread)
 ## Here is some tracing logs before sending heartbeat, I dumped all the treads 
information in threadGroup and the xcevierCount from 
\{{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 

[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-04 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922646#comment-16922646
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/5/19 2:37 AM:
---

From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can deduce that there are 8 xceivers running (2.6665 / 2 * 6 = 8): 1 node has
3 xceivers and the other 5 nodes have 5 xceivers in total. I think this can
easily happen when 2 clients are writing and 2 clients are reading; since the
write targets and read targets are chosen randomly, it is possible that 2 read
clients read from the same DN while that DN is writing a block at the same time.

In a normal cluster, the NN would choose another DN, but in this special case
there is no other choice once any DN is overloaded. As I described above, we
can't prevent some read clients from concentrating on 1 DN, so we can't avoid
the allocation failure if we consider load.

You are right, marking {{dfs.namenode.redundancy.considerLoad}} as false is the
right solution.
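
The arithmetic behind that deduction, made explicit (a worked sketch, not actual
Hadoop code):
{code:java}
// The log shows the rejected DN's load of 3 compared against a threshold of 2.6665.
// The threshold is 2 x the average xceiver count over the 6 DNs, so the cluster-wide
// total must be about threshold / 2 * 6 = 8 xceivers: 3 on the rejected DN and
// 5 spread over the other 5 DNs.
double threshold = 2.6665;
int numDataNodes = 6;
double totalXceivers = threshold / 2 * numDataNodes;  // ~= 8
{code}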


was (Author: zhangchen):
>From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can conduct that there is 8 xceiver running (2./2*6 = 8), and 1 node 
have 3 xceivers and other 5 node have 5 xceiver in total. I think it's easy to 
happen when 2 client is writing and 2 client is reading, since the write target 
and read target is chosen randomly, it is possible that 2 read clients read on 
the same DN and that DN is writing a block at the same time.

In normal cluster, NN will chose other DN, but in this special case, there is 
no other choice when any DN overloaded. As I described above, we can't avoid 
some read clients concentrate on 1 DN, then we can't avoid the allocation 
failure if we consider load.

You are right, marking {{dfs.namenode.redundancy.considerLoad}} as false is the 
right solution.


[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-04 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922646#comment-16922646
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/4/19 4:33 PM:
---

From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can deduce that there are 8 xceivers running (2.6665 / 2 * 6 = 8): 1 node has
3 xceivers and the other 5 nodes have 5 xceivers in total. I think this can
easily happen when 2 clients are writing and 2 clients are reading; since the
write targets and read targets are chosen randomly, it is possible that 2 read
clients read from the same DN while that DN is writing a block at the same time.

In a normal cluster, the NN would choose another DN, but in this special case
there is no other choice once any DN is overloaded. As I described above, we
can't prevent some read clients from concentrating on 1 DN, so we can't avoid
the allocation failure if we consider load.

You are right, marking {{dfs.namenode.redundancy.considerLoad}} as false is the
right solution.


was (Author: zhangchen):
>From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can conduct that there is 8 xceiver running (2./2*6 = 8), and 1 node 
have 3 xceivers and other 5 node have 5 xceiver in total. I think it's easy to 
happen when 2 client is writing and 2 client is reading, since the write target 
and read target is chosen randomly, it is possible that 2 read clients read on 
the same DN and that DN is writing a block at the same time.

In normal cluster, NN will chose other DN, but in this special case, there is 
no other choice when any DN overloaded. As I described above, we can't avoid 
some read clients concentrate on 1 DN, then we can't avoid the allocation 
failure if we consider load.


[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-04 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922557#comment-16922557
 ] 

Ayush Saxena edited comment on HDFS-14811 at 9/4/19 2:40 PM:
-

If load is the reason, I guess marking {{dfs.namenode.redundancy.considerLoad}}
as false will never let that happen. But I'm not sure we should do it like this;
ideally we should make sure this doesn't happen and find out why it is actually
happening.


was (Author: ayushtkn):
I load is the reason, I guess marking {{dfs.namenode.redundancy.considerLoad}} 
as false, will never let that happen. But I am ain't sure we should do like 
this, Ideally we should make sure this shouldn't happen and find why actually 
this is happening?


[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-03 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921450#comment-16921450
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/3/19 2:38 PM:
---

I've verified the patch by running the test ({{mvn -Dtest=TestRouterRpc test}})
50 times with a shell script. Before the patch, the test failed 8 times; after
the patch, all runs succeeded.


was (Author: zhangchen):
I've verify the patch, run the test by call {{mvn -Dtest=TestRouterRpc test}} 
50 times using shell script. Before the patch, the test failed 8 times, after 
the patch they all succeed.
