Kihwal Lee created HDFS-7809:
--------------------------------

             Summary: Block and lease recovery failure caused by snapshot issue
                 Key: HDFS-7809
                 URL: https://issues.apache.org/jira/browse/HDFS-7809
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.5.0
            Reporter: Kihwal Lee
            Priority: Critical

On a cluster running 2.5, we observed a decommissioning failure caused by a file that had been under construction for 3 days. It turned out that the file had been abandoned and the name node had carried out a lease recovery 3 days earlier. The block recovery failed because the name node threw a quota exception while serving {{commitBlockSynchronization()}}. After this failure, no further recovery attempt was made, leaving the file in the under-construction state forever.

Furthermore, the nature of the recovery failure is very strange. Even though *snapshot was never used* in the cluster, the name node was trying to record the diff, which required incrementing {{nsquota}} by 1. The user happened to have run out of {{nsquota}} at that time, so the update failed and caused {{commitBlockSynchronization()}} to fail.

We do see quota discrepancies occasionally. Perhaps those were caused by something like this all along?

A few observations:
- Lease recovery did not complete, yet it was not retried.
- No snapshot was in use, but the recovery somehow went through a snapshot-related code path.
- The quota update during {{commitBlockSynchronization()}} should be done unconditionally.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)