[jira] [Commented] (HDFS-13236) Standby NN down with error encountered while tailing edits

2019-04-22 Thread Yuriy Malygin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822989#comment-16822989
 ] 

Yuriy Malygin commented on HDFS-13236:
--

This problem is tracked in another ticket: HDFS-13596.

> Standby NN down with error encountered while tailing edits
> --
>
> Key: HDFS-13236
> URL: https://issues.apache.org/jira/browse/HDFS-13236
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: journal-node, namenode
> Affects Versions: 3.0.0
> Reporter: Yuriy Malygin
> Priority: Major
>
> After updating Hadoop from 2.7.3 to 3.0.0, the standby NN went down with an 
> error encountered while tailing edits from the JN:
> {code:java}
> Feb 28 01:58:31 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:31,594 
> INFO [FSImageSaver for /one/hadoop-data/dfs of type IMAGE_AND_EDITS] 
> FSImageFormatProtobuf - Image file 
> /one/hadoop-data/dfs/current/fsimage.ckpt_01274897998 of size 4595971949 bytes saved in 93 seconds.
> Feb 28 01:58:33 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:33,445 
> INFO [Standby State Checkpointer] NNStorageRetentionManager - Going to retain 
> 2 images with txid >= 1274897935
> Feb 28 01:58:33 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:33,445 
> INFO [Standby State Checkpointer] NNStorageRetentionManager - Purging old 
> image 
> FSImageFile(file=/one/hadoop-data/dfs/current/fsimage_01274897875, 
> cpktTxId=01274897875)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,660 
> INFO [Edit log tailer] FSImage - Reading 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6a168e6f 
> expecting start txid #1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,660 
> INFO [Edit log tailer] FSImage - Start loading edits file 
> http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup=1274897999=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848=true, 
> http://srve2916.local:8480/getJournal?jid=datalab-hadoop-backup=1274897999=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848;
> inProgressOk=true
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,661 
> INFO [Edit log tailer] RedundantEditLogInputStream - Fast-forwarding stream 
> 'http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup=1274897999
> torageInfo=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848=true,
>  
> http://srve2916.local:8480/getJournal?jid=datalab-hadoop-backup=1274897999=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848=true' to transaction ID 1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,661 
> INFO [Edit log tailer] RedundantEditLogInputStream - Fast-forwarding stream 
> 'http://srvd87.local:8480/getJournal?jid=datalab-hadoop-backup=1274897999=-64%3A1056233980%3A0%3ACID-1fba08aa-c8bd-4217-aef5-6ed206893848=true'
>  to transaction ID 1274897999
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 2018-02-28 01:58:34,680 
> ERROR [Edit log tailer] FSEditLogLoader - Encountered exception on operation 
> AddOp [length=0, inodeId=145550319, 
> path=/kafka/parquet/infrastructureGrace/date=2018-02-28/_temporary/1/_temporary/attempt_1516181147167_20856_r_98_0/part-r-00098.gz.parquet,
>  replication=3, mtime=1519772206615, atime=1519772206615, 
> blockSize=134217728, blocks=[], permissions=root:supergroup:rw-r--r--, 
> aclEntries=null, 
> clientName=DFSClient_attempt_1516181147167_20856_r_98_0_1523538799_1, 
> clientMachine=10.137.2.142, overwrite=false, RpcClientId=, 
> RpcCallId=271996603, storagePolicyId=0, erasureCodingPolicyId=0, 
> opCode=OP_ADD, txid=1274898002]
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: 
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
> length 16
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:946)
> Feb 28 01:58:34 srvd2135 datalab-namenode[15566]: at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
> Feb 28 01:58:34 

[jira] [Commented] (HDFS-13236) Standby NN down with error encountered while tailing edits

2018-07-04 Thread Yuriy Malygin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532302#comment-16532302
 ] 

Yuriy Malygin commented on HDFS-13236:
--

Hi [~linyiqun], I created an additional issue about 
_NotEnoughReplicasException_: HDFS-13718.


[jira] [Commented] (HDFS-13236) Standby NN down with error encountered while tailing edits

2018-07-03 Thread Kihwal Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531436#comment-16531436
 ] 

Kihwal Lee commented on HDFS-13236:
---

The restart after upgrade issue is being addressed in HDFS-13596. 


[jira] [Commented] (HDFS-13236) Standby NN down with error encountered while tailing edits

2018-07-03 Thread Yiqun Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531135#comment-16531135
 ] 

Yiqun Lin commented on HDFS-13236:
--

Hi [~_ph], as I see it, you reported two problems when upgrading the cluster, 
and they look unrelated. Could you file a new JIRA to track the Rack Awareness 
problem you found, and let this JIRA focus on the SBN tailing edits error?


[jira] [Commented] (HDFS-13236) Standby NN down with error encountered while tailing edits

2018-07-03 Thread Francisco (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531085#comment-16531085
 ] 

Francisco commented on HDFS-13236:
--

Hi all,

We are having the same issue after upgrading from 2.8.2 to 3.1.0. Our cluster 
has one NameNode, three DataNodes, and no JournalNode, and it ran without 
issues for an entire year. The upgrade itself went fine, but the problem 
started after a system maintenance that restarted all nodes. Here are the main 
entries from the log:

2018-07-03 10:57:35,543 INFO 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: 
Fast-forwarding stream 
'/space/hadoop/hadoop_run/head_node/current/edits_00023174228-00023184599' 
to transaction ID 23174224 
2018-07-03 10:57:35,575 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation AddOp [length=0, inodeId=6383051, 
path=/spark/.sparkStaging/application_1530536780991_0001/commons-compress-1.14.jar,
 replication=3, mtime=1530538999482, atime=1530538999482, blockSize=134217728, 
blocks=[], permissions=spark:hadoop:rw-r--r--, aclEntries=null, 
clientName=DFSClient_NONMAPREDUCE_291933719_1, clientMachine=10.1.19.65, 
overwrite=true, RpcClientId=, RpcCallId=269330502, storagePolicyId=0, 
erasureCodingPolicyId=0, opCode=OP_ADD, txid=23174233]
java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
length 16
 at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
 at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:74)
 at org.apache.hadoop.ipc.RetryCache$CacheEntry.(RetryCache.java:86)
 at 
org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.(RetryCache.java:163)
 at 
org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
 at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:968)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:947)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1674)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1741)
2018-07-03 10:57:35,577 WARN 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
loading fsimage
java.io.IOException: java.lang.IllegalStateException: Cannot skip to less than 
the current value (=6383051), where newValue=6383050
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
 at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
 at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
 at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:669)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:731)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:968)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:947)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1674)
 at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1741)
Caused by: java.lang.IllegalStateException: Cannot skip to less than the 
current value (=6383051), where newValue=6383050
 at org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1943)
 ... 13 more
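
To find other edits with the same symptom, the op descriptions in a NameNode log can be scanned for an empty RpcClientId. The sketch below is a hypothetical helper, not part of Hadoop. It assumes the op text keeps the "RpcClientId=..., RpcCallId=..., ... opCode=..., txid=...]" layout seen in the log above, and the name `ops_missing_client_id` is made up for illustration.

```python
import re

# Assumed layout of the tail of an op description, taken from the log above:
#   ... RpcClientId=<hex or empty>, RpcCallId=<int>, ... opCode=<name>, txid=<int>]
OP_TAIL = re.compile(
    r"RpcClientId=(?P<client>[^,]*),\s*RpcCallId=(?P<call>-?\d+)"
    r".*?opCode=(?P<op>\w+),\s*txid=(?P<txid>\d+)\]"
)


def ops_missing_client_id(log_text):
    """Return (opCode, txid, callId) for each logged op whose RpcClientId is empty."""
    # Log lines are wrapped in the mail archive, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        (m["op"], int(m["txid"]), int(m["call"]))
        for m in OP_TAIL.finditer(flat)
        if m["client"].strip() == ""
    ]
```

Running it over the log excerpt above should report the OP_ADD at txid 23174233 with call ID 269330502. Ops that carry a non-empty client ID are skipped.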


[jira] [Commented] (HDFS-13236) Standby NN down with error encountered while tailing edits

2018-03-08 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391622#comment-16391622
 ] 

Kihwal Lee commented on HDFS-13236:
---

For the block placement differences, {{DFSNetworkTopology}} is new in Hadoop 3 
and might be related. See HDFS-11419. [~vagarychen] might be able to tell 
whether it is related.

A missing client ID is very strange, given that the op has a call ID. Both come 
from the handler's thread-local variable, which is set by the RPC server. The 
client ID field isn't the last field in the edit either. The client ID is 
created when an RPC client is created and is set in the connection context 
header. It is all internal and automatic. I don't know what happens when the 
connection is dropped before edit logging while the call is still being 
processed. [~daryn], does the server reset the client ID in this case?

