[ https://issues.apache.org/jira/browse/HDFS-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liying updated HDFS-14736:
--------------------------
Description: 
If a subdirectory of a datanode data directory is corrupted for some reason (for example, after a sudden power failure in the machine room), the datanode fails to restart. The error information in the datanode log is as follows:
{panel:title=datanode log:}
2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data06/block/current...
2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data07/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data08/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data09/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data10/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data11/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data12/block/current...
2019-08-09 10:01:06,707 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while scanning /data05/block/current. Will throw later.
*java.io.IOException: Mkdirs failed to create /data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp*
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:138)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:837)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:406)
2019-08-09 10:01:15,330 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data06/block/current: 8627ms
2019-08-09 10:01:15,348 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data11/block/current: 8645ms
2019-08-09 10:01:15,352 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data01/block/current: 8649ms
2019-08-09 10:01:15,361 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data12/block/current: 8658ms
2019-08-09 10:01:15,362 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data03/block/current: 8659ms
{panel}
I checked the code for the whole process and found the following suspicious code in DataNode and FsVolumeImpl:
{code:java}
void initBlockPool(BPOfferService bpos) throws IOException {
  NamespaceInfo nsInfo = bpos.getNamespaceInfo();
  if (nsInfo == null) {
    throw new IOException("NamespaceInfo not found: Block pool " + bpos
        + " should have retrieved namespace info before initBlockPool.");
  }

  setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());

  // Register the new block pool with the BP manager.
  blockPoolManager.addBlockPool(bpos);

  // In the case that this is the first block pool to connect, initialize
  // the dataset, block scanners, etc.
  initStorage(nsInfo);

  // Exclude failed disks before initializing the block pools to avoid startup
  // failures.
  checkDiskError();

  data.addBlockPool(nsInfo.getBlockPoolID(), conf);
  blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
  initDirectoryScanner(conf);
}
{code}
{code:java}
void checkDirs() throws DiskErrorException {
  // TODO:FEDERATION valid synchronization
  for (BlockPoolSlice s : bpSlices.values()) {
    s.checkDirs();
  }
}
{code}
While the datanode is restarting, BPServiceActor invokes initBlockPool to initialize the data storage for this block pool. It executes checkDirs (via checkDiskError) before addBlockPool, yet I found that bpSlices is empty when checkDirs runs, which is very strange. So I looked at the following code:
{code:java}
void addBlockPool(String bpid, Configuration conf) throws IOException {
  File bpdir = new File(currentDir, bpid);
  BlockPoolSlice bp = new BlockPoolSlice(bpid, this, bpdir, conf);
  bpSlices.put(bpid, bp);
}
{code}
As you can see, addBlockPool is executed after checkDirs, so bpSlices is still empty when the directory check runs and the corrupted directory is never excluded. The datanode then throws *java.io.IOException: Mkdirs failed to create /data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp* and fails to restart.
For example, the tmp dir was corrupted, with ls reporting:
*ls: cannot access tmp: Input/output error*
*total 0*
*d????????? ? ? ? ? ? tmp*
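The ordering problem described above can be illustrated with a minimal, self-contained sketch. All class and field names here are illustrative stand-ins, not the actual Hadoop sources: a checkDirs()-style scan that iterates bpSlices has nothing to inspect when it runs before addBlockPool(), so a bad block-pool directory only surfaces later, as the Mkdirs failure in the log above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal model of the behaviour described in this report.
// Only the call ordering matters; the names are hypothetical.
public class VolumeOrderingSketch {
    // Stands in for FsVolumeImpl#bpSlices: empty until addBlockPool() runs.
    final Map<String, String> bpSlices = new ConcurrentHashMap<>();
    int dirsChecked = 0;

    // Stands in for FsVolumeImpl#checkDirs(): it can only inspect
    // slices that have already been registered.
    void checkDirs() {
        for (String slice : bpSlices.values()) {
            dirsChecked++; // the real code would call s.checkDirs() here
        }
    }

    // Stands in for FsVolumeImpl#addBlockPool(): registers the slice.
    void addBlockPool(String bpid) {
        bpSlices.put(bpid, bpid + "/current");
    }

    public static void main(String[] args) {
        VolumeOrderingSketch v = new VolumeOrderingSketch();
        // initBlockPool order: the disk check runs BEFORE addBlockPool,
        // so it iterates an empty map and checks nothing.
        v.checkDirs();
        System.out.println("slices checked before addBlockPool: " + v.dirsChecked); // 0
        v.addBlockPool("BP-1");
        v.checkDirs();
        System.out.println("slices checked after addBlockPool: " + v.dirsChecked); // 1
    }
}
```

Under that reading, the directory check is a no-op at the only point where it could have excluded the bad volume; whether the right fix is to register the block pools before the check or to tolerate a failed Mkdirs per volume is a question for the patch, not this sketch.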
> Starting the datanode unsuccessfully because of the corrupted sub dir in the
> data directory
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-14736
>                 URL: https://issues.apache.org/jira/browse/HDFS-14736
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.2
>            Reporter: liying
>            Assignee: liying
>            Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org