Max Mizikar created HDFS-15420:
----------------------------------

             Summary: approx scheduled blocks not resetting over time
                 Key: HDFS-15420
                 URL: https://issues.apache.org/jira/browse/HDFS-15420
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: block placement
    Affects Versions: 3.0.0, 2.6.0
         Environment: Our 2.6.0 environment is a 3 node cluster running 
cdh5.15.0.
Our 3.0.0 environment is a 4 node cluster running cdh6.3.0.
            Reporter: Max Mizikar
         Attachments: Screenshot from 2020-06-18 09-29-57.png, Screenshot from 
2020-06-18 09-31-15.png

We have been experiencing large numbers of scheduled blocks that never get 
cleared out. This prevents blocks from being placed even when there is 
plenty of space on the system.
Here is an example of the block growth over 24 hours on one of our systems 
running 2.6.0
 !Screenshot from 2020-06-18 09-29-57.png! 
Here is an example of the block growth over 24 hours on one of our systems 
running 3.0.0
 !Screenshot from 2020-06-18 09-31-15.png! 
https://issues.apache.org/jira/browse/HDFS-1172 appears to be the main issue we 
were hitting on 2.6.0, and the growth has decreased since upgrading to 3.0.0. 
However, scheduled blocks still grow systemically over time, and we still need 
to restart the namenode on occasion to reset the count. I have not yet 
determined what is causing the leaked blocks in 3.0.0.

Looking into the issue, I discovered that the intention is for the scheduled 
block count to slowly drain back to 0 after errors cause blocks to be leaked:
{code}
  /** Increment the number of blocks scheduled. */
  void incrementBlocksScheduled(StorageType t) {
    currApproxBlocksScheduled.add(t, 1);
  }
  
  /** Decrement the number of blocks scheduled. */
  void decrementBlocksScheduled(StorageType t) {
    if (prevApproxBlocksScheduled.get(t) > 0) {
      prevApproxBlocksScheduled.subtract(t, 1);
    } else if (currApproxBlocksScheduled.get(t) > 0) {
      currApproxBlocksScheduled.subtract(t, 1);
    } 
    // its ok if both counters are zero.
  }
  
  /** Adjusts curr and prev number of blocks scheduled every few minutes. */
  private void rollBlocksScheduled(long now) {
    if (now - lastBlocksScheduledRollTime > BLOCKS_SCHEDULED_ROLL_INTERVAL) {
      prevApproxBlocksScheduled.set(currApproxBlocksScheduled);
      currApproxBlocksScheduled.reset();
      lastBlocksScheduledRollTime = now;
    }
  }
{code}

However, this code does not do what is intended when the system has a constant 
flow of written blocks. Once a leaked block rolls into 
prevApproxBlocksScheduled, the next scheduled block increments 
currApproxBlocksScheduled, and when that block completes, the decrement is 
applied to prevApproxBlocksScheduled instead, preventing the leaked block from 
ever being removed from the approximate count. So, for errors to be corrected, 
no data may be written for the full roll period of 10 minutes. The number of 
blocks we write per 10 minutes is quite high, so the error in the approximate 
counts grows to very large numbers.
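The behavior described above can be reproduced with a simplified model of the 
two counters (a hypothetical sketch, not the actual DatanodeDescriptor code; 
the class and method names here are invented for illustration). One block is 
leaked (incremented but never decremented), and as long as at least one block 
is scheduled and completed in every roll interval, each completion's decrement 
lands on the prev counter while the new block's increment stays in curr, so 
the leaked unit is carried across every roll:

{code}
/** Simplified model of the curr/prev approx-scheduled counters (hypothetical). */
public class ScheduledBlocksModel {
  int curr = 0;
  int prev = 0;

  void schedule() { curr++; }

  /** Mirrors decrementBlocksScheduled: prev is drained before curr. */
  void complete() {
    if (prev > 0) {
      prev--;
    } else if (curr > 0) {
      curr--;
    }
  }

  /** Mirrors rollBlocksScheduled: prev takes curr's value, curr resets. */
  void roll() {
    prev = curr;
    curr = 0;
  }

  int total() { return curr + prev; }

  public static void main(String[] args) {
    ScheduledBlocksModel m = new ScheduledBlocksModel();
    m.schedule(); // this block is "leaked": its report never arrives
    for (int interval = 0; interval < 10; interval++) {
      m.roll();
      // constant write traffic: one block scheduled and completed per interval
      m.schedule();
      m.complete(); // decrement hits prev, sparing the leaked unit in curr
    }
    System.out.println(m.total()); // leaked unit still counted after 10 rolls: 1
  }
}
{code}

With no write traffic, the same model does drain: the leaked unit moves to prev 
on the first roll and is overwritten by curr (0) on the second, which matches 
the "usually 10 min" claim in the original ticket. It is the constant traffic 
that keeps the leak alive indefinitely.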

The comments on the ticket for the original implementation, 
https://issues.apache.org/jira/browse/HADOOP-3707, suggest this issue was 
known. However, it's not clear to me whether its severity was understood at 
the time:
> So if there are some blocks that are not reported back by the datanode, they 
> will eventually get adjusted (usually 10 min; bit longer if datanode is 
> continuously receiving blocks).
The comment suggests the count will eventually get cleared out, but in our 
case, it never does.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)