Hi Yi, I went through HDFS-4516, and it really solves our problem, thanks very much!
2014-09-10 16:39 GMT+08:00 Zesheng Wu <[email protected]>:

> Thanks Yi, I will look into HDFS-4516.
>
> 2014-09-10 15:03 GMT+08:00 Liu, Yi A <[email protected]>:
>
>> Hi Zesheng,
>>
>> I learned from an offline email of yours that your Hadoop version is
>> 2.0.0-alpha, and that "The block is allocated successfully in NN,
>> but isn't created in DN".
>>
>> Yes, we may have this issue in 2.0.0-alpha. I suspect your issue is
>> similar to HDFS-4516. Can you try Hadoop 2.4 or later? You should not
>> be able to reproduce it on those versions.
>>
>> From your description, the second block was created successfully, and
>> the NN would flush the edit log entry to the shared journal. The shared
>> storage may have persisted the entry, but the RPC back to the NN may
>> have timed out before reporting success. So the block exists in the
>> shared edit log, but the DN never created it. On restart, the client
>> could fail, because in that Hadoop version the client would retry only
>> when the last block size reported by the NN was non-zero, i.e. it had
>> been synced (see HDFS-4516 for details).
>>
>> Regards,
>>
>> Yi Liu
>>
>> *From:* Zesheng Wu [mailto:[email protected]]
>> *Sent:* Tuesday, September 09, 2014 6:16 PM
>> *To:* [email protected]
>> *Subject:* HDFS: Couldn't obtain the locations of the last block
>>
>> Hi,
>>
>> These days we encountered a critical bug in HDFS which prevents HBase
>> from starting normally.
>>
>> The scenario is as follows:
>>
>> 1. rs1 writes data to HDFS file f1, and the first block is written
>> successfully.
>>
>> 2. rs1 requests the creation of the second block successfully; at this
>> point nn1 (the active NN) crashes due to a journal write timeout.
>>
>> 3. nn2 (the standby NN) does not become active because zkfc2 is in an
>> abnormal state.
>>
>> 4. nn1 is restarted and becomes active.
>>
>> 5. While nn1 is restarting, rs1 crashes because it is writing to a NN
>> (nn1) that is still in safemode.
>>
>> 6.
>> As a result, the file f1 is in an abnormal state and the HBase cluster
>> can't serve any more.
>>
>> We can use the command-line shell to list the file; it looks like the
>> following:
>>
>> -rw------- 3 hbase_srv supergroup 134217728 2014-09-05 11:32
>> /hbase/lgsrv-push/xxx
>>
>> But when we try to download the file from HDFS, the DFS client
>> complains:
>>
>> 14/09/09 18:12:11 WARN hdfs.DFSClient: Last block locations not available.
>> Datanodes might not have reported blocks completely. Will retry for 3 times
>>
>> 14/09/09 18:12:15 WARN hdfs.DFSClient: Last block locations not available.
>> Datanodes might not have reported blocks completely. Will retry for 2 times
>>
>> 14/09/09 18:12:19 WARN hdfs.DFSClient: Last block locations not available.
>> Datanodes might not have reported blocks completely. Will retry for 1 times
>>
>> get: Could not obtain the last block locations.
>>
>> Can anyone help with this?
>>
>> --
>> Best Wishes!
>>
>> Yours, Zesheng
>
> --
> Best Wishes!
>
> Yours, Zesheng

--
Best Wishes!

Yours, Zesheng
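[Editor's note] The client behavior visible in the log above (retry fetching the last block's locations a fixed number of times, then give up) can be sketched roughly as follows. This is a simplified illustration, not the actual DFSClient code; the function name, the `get_locations` callable, and the `retries` parameter are all hypothetical stand-ins:

```python
def fetch_last_block_locations(get_locations, retries=3):
    """Rough sketch of the retry loop seen in the DFSClient log above.

    `get_locations` is a hypothetical callable standing in for the
    NameNode query: it returns a list of DataNode locations, or an
    empty list when no DataNode has reported the block yet.
    """
    for remaining in range(retries, 0, -1):
        locations = get_locations()
        if locations:
            return locations
        print("WARN hdfs.DFSClient: Last block locations not available. "
              "Datanodes might not have reported blocks completely. "
              "Will retry for %d times" % remaining)
        # the real client also sleeps between attempts; omitted here
    raise IOError("Could not obtain the last block locations.")
```

In the failure mode Yi describes, the block exists only in the shared edit log and no DataNode ever created it, so `get_locations` keeps returning nothing and the loop exhausts its retries, matching the `get: Could not obtain the last block locations.` error Zesheng saw.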
