Lei (Eddy) Xu created HDFS-12044:
------------------------------------

             Summary: Mismatch between BlockManager#maxReplicationStreams and 
ErasureCodingWorker.stripedReconstructionPool pool size causes slow and bursty 
recovery. 
                 Key: HDFS-12044
                 URL: https://issues.apache.org/jira/browse/HDFS-12044
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: erasure-coding
    Affects Versions: 3.0.0-alpha3
            Reporter: Lei (Eddy) Xu


{{ErasureCodingWorker#stripedReconstructionPool}} defaults to {{corePoolSize=2}} 
and {{maxPoolSize=8}}, and it rejects additional tasks once its queue is full.
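
As an illustration (a minimal sketch, not the actual {{ErasureCodingWorker}} 
code): a plain {{ThreadPoolExecutor}} configured with the same 
{{corePoolSize}}/{{maxPoolSize}} and an assumed bounded queue of capacity 8 
rejects every submission beyond queue capacity + {{maxPoolSize}}, which is the 
rejection behavior described above.

{code}
import java.util.concurrent.*;

public class PoolRejectionSketch {
  public static void main(String[] args) {
    // corePoolSize=2, maxPoolSize=8 as in stripedReconstructionPool; the
    // bounded queue of capacity 8 is an assumption for illustration only.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(8));

    int rejected = 0;
    for (int i = 0; i < 18; i++) {        // ~18 tasks arriving in one heartbeat
      try {
        pool.submit(() -> {
          // simulate a long-running reconstruction task
          try { Thread.sleep(10_000); } catch (InterruptedException ignored) { }
        });
      } catch (RejectedExecutionException e) {
        rejected++;                       // dropped once queue + max threads are busy
      }
    }
    System.out.println("rejected " + rejected + " of 18 tasks");
    pool.shutdownNow();
  }
}
{code}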

The problem shows up when {{BlockManager#maxReplicationStreams}} is larger than 
{{ErasureCodingWorker#stripedReconstructionPool}}'s {{corePoolSize}}/{{maxPoolSize}}, 
for example {{maxReplicationStreams=20}} with {{corePoolSize=2, maxPoolSize=8}}. 
Meanwhile, the NN sends up to {{maxTransfer}} reconstruction tasks to the DN on 
each heartbeat, where {{maxTransfer}} is calculated in {{FSNamesystem}}:

{code}
final int maxTransfer = blockManager.getMaxReplicationStreams() - xmitsInProgress;
{code}

However, at any given time {{ErasureCodingWorker#stripedReconstructionPool}} only 
contributes 2 to {{xmitsInProgress}}. So on each 3-second heartbeat the NN sends 
about {{20 - 2 = 18}} reconstruction tasks to the DN, and the DN throws most of 
them away if there are already 8 tasks in the queue. The NN then has to wait 
until it re-detects these blocks as under-replicated before it can schedule new 
tasks, which makes recovery slow and bursty.
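
A back-of-the-envelope sketch (plain Java, not HDFS code) that puts the example 
numbers together; the queue capacity of 8 is an assumption taken from the 
description above, and the heartbeat interval is the default 3 seconds:

{code}
public class ReconstructionBacklogSketch {
  public static void main(String[] args) {
    final int maxReplicationStreams = 20; // example NN-side limit
    final int xmitsInProgress = 2;        // only the two core pool threads are counted
    final int maxPoolSize = 8;            // DN reconstruction pool threads
    final int queueCapacity = 8;          // assumed bounded queue

    int maxTransfer = maxReplicationStreams - xmitsInProgress; // 18 per heartbeat
    int dnCapacity = maxPoolSize + queueCapacity;              // 16 when completely idle

    System.out.println("NN schedules per 3s heartbeat: " + maxTransfer);
    System.out.println("DN can hold at most          : " + dnCapacity);
    // Reconstruction tasks run far longer than one heartbeat, so on subsequent
    // heartbeats the pool and queue are still occupied and nearly all of the 18
    // newly scheduled tasks are rejected; the NN only reschedules them after it
    // re-detects the blocks as under-replicated.
  }
}
{code}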





