[ https://issues.apache.org/jira/browse/HDFS-15732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang reassigned HDFS-15732:
--------------------------------------

    Assignee: gaozhan ding

> EC client will not retry get block token when block token expired  in 
> kerberized cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: HDFS-15732
>                 URL: https://issues.apache.org/jira/browse/HDFS-15732
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsclient, ec, erasure-coding
>    Affects Versions: 3.1.1
>         Environment: hadoop 3.1.1
> kerberos
> ec RS-3-2-1024k
>            Reporter: gaozhan ding
>            Assignee: gaozhan ding
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When we enabled an EC policy for HBase tables, we ran into the following issue. Once a 
> block token has expired on the DataNode side, the client cannot recognize the 
> InvalidToken error because it is hidden behind the SASL negotiation failure. As a 
> result, the EC client does not retry by refetching the token when it creates a block 
> reader. The peer DataNode is then added to the dead-node set, and every subsequent 
> createBlockReader call targeting that DataNode in the current DFSStripedInputStream 
> treats it as dead and returns false. The final result is a read failure.
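> Below is a minimal, self-contained sketch (not the actual Hadoop client code; the class 
> and method names are hypothetical, with InvalidTokenException standing in for 
> org.apache.hadoop.security.token.SecretManager.InvalidToken) of the "refetch the token 
> and retry once" pattern that the non-striped read path applies when an expired block 
> token is recognized. In the EC path the expiry only surfaces as the generic SASL 
> IOException shown in the logs below, so no equivalent retry is triggered.
> {code:java}
> import java.io.IOException;
> 
> /**
>  * Hypothetical sketch of the token-refetch retry pattern. Only the
>  * control flow is meant to be illustrative; none of these names are
>  * part of the HDFS client API.
>  */
> public final class BlockTokenRetrySketch {
> 
>   /** Stand-in for SecretManager.InvalidToken. */
>   static class InvalidTokenException extends IOException {
>     InvalidTokenException(String msg) { super(msg); }
>   }
> 
>   interface BlockRead { void run() throws IOException; }
> 
>   /** True if any cause in the chain is an expired/invalid block token. */
>   static boolean causedByInvalidToken(Throwable t) {
>     for (Throwable c = t; c != null; c = c.getCause()) {
>       if (c instanceof InvalidTokenException) {
>         return true;
>       }
>     }
>     return false;
>   }
> 
>   /**
>    * Attempt a block read; on an expired token, refetch the located blocks
>    * (and with them a fresh token) from the NameNode and retry once against
>    * the same DataNode instead of marking it dead.
>    */
>   static void readWithTokenRetry(BlockRead read, Runnable refetchBlockToken)
>       throws IOException {
>     boolean retried = false;
>     while (true) {
>       try {
>         read.run();
>         return;
>       } catch (IOException e) {
>         if (!retried && causedByInvalidToken(e)) {
>           refetchBlockToken.run(); // refresh the token, keep the same DataNode
>           retried = true;
>           continue;
>         }
>         throw e; // anything else: let the caller treat the node as failed
>       }
>     }
>   }
> }
> {code}
> Because the client-side exception in this report is only "java.io.IOException: 
> DIGEST-MD5: IO error acquiring password", a check like causedByInvalidToken() never 
> matches on the EC path, which is why the DataNode ends up in the dead-node set instead.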
> Some logs:
> HBase RegionServer log:
> 2020-12-17 10:00:24,291 WARN 
> [RpcServer.default.FPBQ.Fifo.handler=15,queue=0,port=16020] hdfs.DFSClient: 
> Failed to connect to /10.65.19.41:9866 for 
> blockBP-1601568648-10.65.19.12-1550823043026:blk_-9223372036813273566_672859566
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:421)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:479)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>  at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:647)
>  at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2936)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:821)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:746)
>  at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>  at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:647)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:272)
>  at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:333)
>  at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:365)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:514)
>  at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1354)
>  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1318)
>  at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:808)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1568)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:340)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:856)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:806)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:327)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:228)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:395)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:250)
>  at 
> org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2031)
>  at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2022)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6408)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:6388)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2926)
>  
>  
> DataNode log:
> 2020-12-17 10:00:24,290 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hadoop-btdn0005.eniot.io:9866:DataXceiver error processing unknown operation 
> src: /10.65.19.42:53894 dst: /10.65.19.41:9866
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block 
> token with block_token_identifier (expiryDate=1608192045256, 
> keyId=1449537289, userId=hbase, 
> blockPoolId=BP-1601568648-10.65.19.12-1550823043026, 
> blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, 
> DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, 
> DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, 
> DS-a1ee0117-7430-4181-b279-3f1b16541567, 
> DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, 
> DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.]
>  at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>  at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block 
> token with block_token_identifier (expiryDate=1608192045256, 
> keyId=1449537289, userId=hbase, 
> blockPoolId=BP-1601568648-10.65.19.12-1550823043026, 
> blockId=-9223372036813273566, access modes=[READ], storageTypes= [DISK, DISK, 
> DISK, DISK, DISK], storageIds= [DS-604a9eaf-94ba-4127-b70d-436361c49ecd, 
> DS-b53bf503-84c1-4e9e-aede-a19f88e3fc9a, 
> DS-a1ee0117-7430-4181-b279-3f1b16541567, 
> DS-e8459904-e1e1-44a6-affa-9ce9fc56f877, 
> DS-9f54936c-96cc-4c85-91e6-a6f8e00f246a]) is expired.
>  at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:479)
>  at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:81)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>  at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>  ... 7 more
>  
> Final result:
>  
>  
> java.io.IOException: Could not iterate StoreFileScanner[HFileScanner for 
> reader 
> reader=hdfs://azbeta/hbase/data/o15427722038191/RAW_569dfc35-10dc-4fe4-bb8c-5c403368d6c1/ccf146fb5c852d751b6b2f8307f81fa0/d/386bdb6978d348f4b4e168e728762842,
>  compression=snappy, cacheConf=blockCache=LruBlockCache\{blockCount=7915, 
> currentSize=489.13 MB, freeSize=2.71 GB, maxSize=3.19 GB, heapSize=489.13 MB, 
> minSize=3.03 GB, minFactor=0.95, multiSize=1.52 GB, multiFactor=0.5, 
> singleSize=775.81 MB, singleFactor=0.25}, cacheDataOnRead=true, 
> cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, 
> cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, 
> firstKey=Optional[\x01\x00\x00\x13\xBD\x00\x00\x00T\x070\x97\xF7/d:arm_xingneng_test_model@AI1/1603445985858/Put/seqid=0],
>  
> lastKey=Optional[\x01\x00\x00\x16\xF9\x00\x00\x00T\x073-8/d:arm_xingneng_test_model@AI99/1603446144699/Put/seqid=0],
>  avgKeyLen=53, avgValueLen=4, entries=2016557, length=22957603, 
> cur=\x01\x00\x00\x14#\x00\x00\x00T\x072\xCC\xF7/d:arm_xingneng_test_model@AI84/1603446131146/Put/vlen=4/seqid=17230816]
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:217)
>  at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:120)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:653)
>  at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:388)
>  at 
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.compact(Compactor.java:327)
>  at 
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:65)
>  at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:126)
>  at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1410)
>  at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:2187)
>  at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.doCompaction(CompactSplit.java:596)
>  at 
> org.apache.hadoop.hbase.regionserver.CompactSplit$CompactionRunner.run(CompactSplit.java:638)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: 3 missing blocks, the stripe is: 
> AlignedStripe(Offset=1048576, length=1048576, fetchedChunksNum=0, 
> missingChunksNum=3); locatedBlocks is: LocatedBlocks\{; fileLength=22957603; 
> underConstruction=false; 
> blocks=[LocatedStripedBlock{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750;
>  getBlockSize()=22957603; corrupt=false; offset=0; 
> locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK],
>  
> DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK],
>  
> DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK],
>  
> DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK],
>  
> DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]];
>  indices=[0, 1, 2, 3, 4]}]; 
> lastLocatedBlock=LocatedStripedBlock\{BP-1197414916-10.27.20.30-1535978156945:blk_-9223372036830849056_317784750;
>  getBlockSize()=22957603; corrupt=false; offset=0; 
> locs=[DatanodeInfoWithStorage[10.27.20.42:9866,DS-d0d5a7ce-8280-45fe-b910-1e5c7b579367,DISK],
>  
> DatanodeInfoWithStorage[10.27.22.86:9866,DS-b018c729-66d5-4953-94f1-ec7664a46cb7,DISK],
>  
> DatanodeInfoWithStorage[10.27.22.79:9866,DS-d79c402a-1845-4b3f-893e-f84d94085b2a,DISK],
>  
> DatanodeInfoWithStorage[10.27.20.41:9866,DS-b07ba1e0-9a34-4caf-ad51-6d5f302c08ea,DISK],
>  
> DatanodeInfoWithStorage[10.27.20.39:9866,DS-8a059d87-f122-4fed-a6a2-e45662692305,DISK]];
>  indices=[0, 1, 2, 3, 4]}; isLastBlockComplete=true; 
> ecPolicy=ErasureCodingPolicy=[Name=RS-3-2-1024k, Schema=[ECSchema=[Codec=rs, 
> numDataUnits=3, numParityUnits=2]], CellSize=1048576, Id=2]}
>  at 
> org.apache.hadoop.hdfs.StripeReader.checkMissingBlocks(StripeReader.java:177)
>  at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:372)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readOneStripe(DFSStripedInputStream.java:324)
>  at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.readWithStrategy(DFSStripedInputStream.java:397)
>  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
>  at java.io.DataInputStream.read(DataInputStream.java:149)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:765)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1562)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1772)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1597)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1496)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.readNextDataBlock(HFileReaderImpl.java:931)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.isNextBlock(HFileReaderImpl.java:1064)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.positionForNextBlock(HFileReaderImpl.java:1058)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl._next(HFileReaderImpl.java:1076)
>  at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.next(HFileReaderImpl.java:1097)
>  at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:208)
>  ... 13 more


