Kihwal Lee created HDFS-7809:
--------------------------------

             Summary: Block and lease recovery failure caused by snapshot issue
                 Key: HDFS-7809
                 URL: https://issues.apache.org/jira/browse/HDFS-7809
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.5.0
            Reporter: Kihwal Lee
            Priority: Critical


On a cluster running 2.5, we have observed a decommissioning failure due to a 
file that had been under construction for 3 days.  It turned out that the file 
was abandoned and a lease recovery was carried out by the name node 3 days ago.

The block recovery failed because the name node threw a quota exception while 
serving {{commitBlockSynchronization()}}. After this failure, no further 
recovery attempt was made, leaving the file in the under-construction state 
forever.

Furthermore, the nature of the recovery failure is very strange. Even though 
*snapshot was never used* in the cluster, the name node was trying to record 
a diff, and that required incrementing the {{nsquota}} usage by 1. The user 
happened to have run out of {{nsquota}} at that time, so the update failed 
and caused {{commitBlockSynchronization()}} to fail.  We do see quota 
discrepancies occasionally. Perhaps those were caused by something like this 
all along?

A few observations:
- Lease recovery did not complete, yet it was not retried.
- No snapshot was in use, but somehow the recovery went through a 
snapshot-related code path.
- The quota update during {{commitBlockSynchronization()}} should be done 
unconditionally.
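
To illustrate the last observation, here is a toy model of the failure, not the actual name node code. Every name in it ({{LeaseRecoverySketch}}, {{Dir}}, {{ToyFile}}, {{commitBlockSync}}, {{recoverOnce}}) is hypothetical, and it assumes that "unconditionally" means applying the usage update without verifying it against the quota. With verification, a user who is already at {{nsquota}} leaves the file under construction after a single failed attempt; without it, the file gets closed and the usage is simply recorded.

```java
// Toy model only: none of these classes are HDFS classes.
public class LeaseRecoverySketch {

    /** Hypothetical stand-in for a namespace quota violation. */
    static class QuotaExceededException extends Exception {
        QuotaExceededException(String msg) { super(msg); }
    }

    /** Hypothetical directory carrying a namespace quota and its usage. */
    static class Dir {
        final long nsQuota;
        long nsUsed;

        Dir(long nsQuota, long nsUsed) {
            this.nsQuota = nsQuota;
            this.nsUsed = nsUsed;
        }

        /** Adds delta to the usage; checks the quota first only if verify is true. */
        void addNamespaceUsage(long delta, boolean verify) throws QuotaExceededException {
            if (verify && nsUsed + delta > nsQuota) {
                throw new QuotaExceededException(
                        "nsquota exceeded: " + (nsUsed + delta) + " > " + nsQuota);
            }
            nsUsed += delta;
        }
    }

    /** Hypothetical file that starts out under construction. */
    static class ToyFile {
        boolean underConstruction = true;
    }

    /** Simplified commit: charge one namespace unit (the diff record), then close the file. */
    static void commitBlockSync(Dir dir, ToyFile file, boolean verifyQuota)
            throws QuotaExceededException {
        dir.addNamespaceUsage(1, verifyQuota);
        file.underConstruction = false;
    }

    /** A single recovery attempt with no retry; returns true if the file was closed. */
    static boolean recoverOnce(Dir dir, ToyFile file, boolean verifyQuota) {
        try {
            commitBlockSync(dir, file, verifyQuota);
            return true;
        } catch (QuotaExceededException e) {
            return false; // nothing retries, so the file stays under construction
        }
    }

    public static void main(String[] args) {
        // The user is already at the namespace quota in both scenarios.
        Dir verifiedDir = new Dir(10, 10);
        ToyFile stuck = new ToyFile();
        boolean closedVerified = recoverOnce(verifiedDir, stuck, true);
        System.out.println("verified update:      closed=" + closedVerified
                + ", underConstruction=" + stuck.underConstruction);

        Dir unconditionalDir = new Dir(10, 10);
        ToyFile closed = new ToyFile();
        boolean closedUnconditional = recoverOnce(unconditionalDir, closed, false);
        System.out.println("unconditional update: closed=" + closedUnconditional
                + ", underConstruction=" + closed.underConstruction
                + ", nsUsed=" + unconditionalDir.nsUsed);
    }
}
```

The model says nothing about why a snapshot-related path was taken in the first place, or about how a retry should be scheduled; it only shows how a verified quota update can turn into a permanently open file when there is no retry.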



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
