[ https://issues.apache.org/jira/browse/HDFS-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531436#comment-16531436 ]

Kihwal Lee edited comment on HDFS-13236 at 7/3/18 1:58 PM:
-----------------------------------------------------------

The "restart fails after upgrade" issue is being addressed in HDFS-13596. 
Workaround is to do "saveNamespace" against the active NN after an upgrade from 
2.x to 3.x. The Standby NN will need to be re-bootstrapped.
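
For reference, a minimal sketch of that workaround using the stock hdfs CLI 
(the sequence, including the safe-mode step that saveNamespace requires, is 
illustrative and should be adapted to the deployment):

{code}
# On the active NN: force a checkpoint so a fresh fsimage is written by the
# 3.x code (saveNamespace requires safe mode).
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave

# On the standby NN (with its NameNode process stopped): re-bootstrap from
# the freshly saved image on the active NN.
hdfs namenode -bootstrapStandby
{code}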


was (Author: kihwal):
The restart after upgrade issue is being addressed in HDFS-13596. 

> Standby NN down with error encountered while tailing edits
> ----------------------------------------------------------
>
>                 Key: HDFS-13236
>                 URL: https://issues.apache.org/jira/browse/HDFS-13236
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node, namenode
>    Affects Versions: 3.0.0
>            Reporter: Yuriy Malygin
>            Priority: Major
>
> After upgrading Hadoop from 2.7.3 to 3.0.0, the standby NN went down with 
> an error encountered while tailing edits from the JN:
> {code:java}
> Feb 28 01:58:31 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:31,594 
> INFO [FSImageSaver for /one/hadoop-data/dfs of type IMAGE_AND_EDITS] 
> FSImageFormatProtobuf - Image file 
> /one/hadoop-data/dfs/current/fsimage.ckpt_0000000001274897998 of size 
> 4595971949 bytes saved in 93 seconds.
> Feb 28 01:58:33 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:33,445 
> INFO [Standby State Checkpointer] NNStorageRetentionManager - Going to retain 
> 2 images with txid >= 1274897935
> Feb 28 01:58:33 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:33,445 
> INFO [Standby State Checkpointer] NNStorageRetentionManager - Purging old 
> image 
> FSImageFile(file=/one/hadoop-data/dfs/current/fsimage_0000000001274897875, 
> cpktTxId=0000000001274897875)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,660 
> INFO [Edit log tailer] FSImage - Reading 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6a168e6f 
> expecting start txid #1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,660 
> INFO [Edit log tailer] FSImage - Start loading edits file 
> http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true, 
> http://srve2916.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,661 
> INFO [Edit log tailer] RedundantEditLogInputStream - Fast-forwarding stream 
> 'http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true, 
> http://srve2916.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true' to transaction ID 1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,661 
> INFO [Edit log tailer] RedundantEditLogInputStream - Fast-forwarding stream 
> 'http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup&segmentTxId=1274897999&storageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848&inProgressOk=true'
>  to transaction ID 1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,680 
> ERROR [Edit log tailer] FSEditLogLoader - Encountered exception on operation 
> AddOp [length=0, inodeId=145550319, 
> path=/kafka/parquet/infrastructureGrace/date=2018-02-28/_temporary/1/_temporary/attempt_1516181147167_20856_r_000098_0/part-r-00098.gz.parquet,
>  replication=3, mtime=1519772206615, atime=1519772206615, 
> blockSize=134217728, blocks=[], permissions=root:supergroup:rw-r--r--, 
> aclEntries=null, 
> clientName=DFSClient_attempt_1516181147167_20856_r_000098_0_1523538799_1, 
> clientMachine=10.137.2.142, overwrite=false, RpcClientId=, 
> RpcCallId=271996603, storagePolicyId=0, erasureCodingPolicyId=0, 
> opCode=OP_ADD, txid=1274898002]
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
> length 16
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:946)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:393)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,688 
> FATAL [Edit log tailer] EditLogTailer - Unknown error encountered while 
> tailing edits. Shutting down standby NN.
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: java.io.IOException: 
> java.lang.IllegalStateException: Cannot skip to less than the current value 
> (=145550319), where newValue=145550318
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1941)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:882)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:863)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:293)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:427)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:380)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:397)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:393)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: Caused by: 
> java.lang.IllegalStateException: Cannot skip to less than the current value 
> (=145550319), where newValue=145550318
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1939)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: ... 10 more
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,708  
> INFO [Edit log tailer] ExitUtil - Exiting with status 1: java.io.IOException: 
> java.lang.IllegalStateException: Cannot skip to less than the current value 
> (=145550319), where newValue=145550318
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,722 
> INFO [pool-1-thread-1] NameNode - SHUTDOWN_MSG:
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 
> /************************************************************
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: SHUTDOWN_MSG: Shutting down 
> NameNode at srvd2135.local/10.137.2.39
> {code}
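> For context on the two exceptions above: the first is thrown by a sanity 
> check in the IPC retry cache (RetryCache.java:74 in the stack trace), which 
> requires a 16-byte client id, while the replayed AddOp was recorded without 
> one (note the empty RpcClientId= field). The subsequent fatal error comes 
> from SequentialNumber.skipTo, which refuses to move the last-inode-id 
> counter backwards after the failed op. A minimal sketch of the first check, 
> assuming Guava's Preconditions as in the trace (the actual Hadoop source 
> may differ):
> {code:java}
> import com.google.common.base.Preconditions;
> 
> public class ClientIdCheck {
>   public static void main(String[] args) {
>     // The client id comes from the edit-log record; for this AddOp it is
>     // empty (RpcClientId= in the log), so the check throws
>     // IllegalArgumentException exactly as in the stack trace above.
>     byte[] clientId = new byte[0];
>     Preconditions.checkArgument(clientId.length == 16,
>         "Invalid clientId - length is %s expected length %s",
>         clientId.length, 16);
>   }
> }
> {code}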
>  
> Additional information about errors after the update to 3.0.0:
> The active NN logs contain many messages about replication errors (caused 
> by Rack Awareness):
> {code:java}
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: 2018-02-28 01:57:20,804 WARN 
> [IPC Server handler 10 on 8020] PmsRackMapping - Got empty rack for 
> 10.136.2.149, reverting to default.
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: 2018-02-28 01:57:20,806 DEBUG 
> [IPC Server handler 10 on 8020] BlockPlacementPolicy - Failed to choose from 
> local rack (location = /default-rack); the second replica is not found, retry 
> choosing randomly
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:792)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:691)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:598)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalStorage(BlockPlacementPolicyDefault.java:558)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:392)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:268)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:121)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:137)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2093)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:287)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2602)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:864)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:549)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> java.security.AccessController.doPrivileged(Native Method)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> javax.security.auth.Subject.doAs(Subject.java:422)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> Feb 28 01:57:20 srvg671 datalab-namenode[1807]: at 
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
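> PmsRackMapping is a site-local topology plugin, and the "Got empty rack ... 
> reverting to default" warning suggests the usual DNSToSwitchMapping 
> fallback: when every host resolves to /default-rack, the default placement 
> policy cannot find a second rack and retries randomly, producing the 
> NotEnoughReplicasException noise above. A hedged sketch of that fallback 
> pattern (an illustration of the interface contract, not PmsRackMapping's 
> actual source):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.net.DNSToSwitchMapping;
> 
> public class ExampleRackMapping implements DNSToSwitchMapping {
>   @Override
>   public List<String> resolve(List<String> names) {
>     List<String> racks = new ArrayList<>();
>     for (String name : names) {
>       String rack = lookupRack(name);  // hypothetical site-specific lookup
>       // An empty result maps to /default-rack; with all nodes on one
>       // rack, BlockPlacementPolicyDefault logs "Failed to choose from
>       // local rack ... retry choosing randomly" as seen above.
>       racks.add(rack == null || rack.isEmpty() ? "/default-rack" : rack);
>     }
>     return racks;
>   }
> 
>   private String lookupRack(String name) {
>     return null;  // placeholder for the real lookup
>   }
> 
>   @Override
>   public void reloadCachedMappings() {}
> 
>   @Override
>   public void reloadCachedMappings(List<String> names) {}
> }
> {code}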
> {code:java}
> $ zgrep -c NotEnoughReplicasException archive/datalab-namenode.log-20180{22*,3*}.gz
> archive/datalab-namenode.log-20180228.gz:0       #hadoop 2.7.3 
> archive/datalab-namenode.log-20180301.gz:173492  #hadoop 3.0.0
> archive/datalab-namenode.log-20180302.gz:153192  #hadoop 3.0.0
> archive/datalab-namenode.log-20180303.gz:0       #hadoop 2.7.3
> {code}
> In the JN logs:
> {code:java}
> Feb 28 01:57:51 srvd87 datalab-journalnode[29960]: 2018-02-28 01:57:51,552 
> INFO [IPC Server handler 4 on 8485] FileJournalManager - Finalizing edits 
> file 
> /one/hadoop-data/journal/datalab-hadoop-backup/current/edits_inprogress_0000000001274897999 -> 
> /one/hadoop-data/journal/datalab-hadoop-backup/current/edits_0000000001274897999-0000000001274898515
> Feb 28 01:58:34 srvd87 datalab-journalnode[29960]: 2018-02-28 01:58:34,671 
> INFO [qtp414690789-164] TransferFsImage - Sending fileName: 
> /one/hadoop-data/journal/datalab-hadoop-backup/current/edits_0000000001274897999-0000000001274898515,
>  fileSize: 80772. Sent total: 80772 bytes. Size of last segment intended to 
> send: -1 bytes.
> {code}
> {code:java}
> $ zgrep -c 'Size of last segment intended to send' archive/datalab-journalnode.log-20180{22*,3*}.gz
> archive/datalab-journalnode.log-20180227.gz:0    #hadoop 2.7.3
> archive/datalab-journalnode.log-20180301.gz:109  #hadoop 3.0.0
> archive/datalab-journalnode.log-20180302.gz:111  #hadoop 3.0.0
> archive/datalab-journalnode.log-20180304.gz:0    #hadoop 2.7.3
> {code}


