[ 
https://issues.apache.org/jira/browse/HDFS-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570770#comment-17570770
 ] 

Jan Van Besien edited comment on HDFS-4957 at 7/25/22 11:24 AM:
----------------------------------------------------------------

I am also facing this problem (in a Hadoop deployment on Kubernetes). It is 
not limited to namenode failover either: simply restarting a namenode fails 
as well (cf. the problem described in HDFS-10719).

In contrast to what [~jzhuge] wrote earlier, can't the solution simply be:
 * when formatting a namenode, all journal nodes need to be available (cf. 
HDFS-4210)
 * in all other operations, including namenode failover, only a majority of 
journal nodes needs to be available

That sounds reasonably straightforward to implement?
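
To make the proposal concrete, here is a minimal sketch of the rule I have in 
mind. It is not actual Hadoop code: the class JournalQuorumPolicy, the 
Operation enum and all method names are hypothetical, and treating an 
unresolvable hostname as just another unavailable journal node is my 
assumption about how the DNS case would fit in.

```java
import java.net.InetSocketAddress;
import java.util.List;

/** Hypothetical sketch of the proposed rule; none of these names exist in Hadoop. */
final class JournalQuorumPolicy {

    /** Hypothetical operation kinds that the proposal distinguishes. */
    enum Operation { FORMAT, FAILOVER, RESTART }

    private JournalQuorumPolicy() {}

    /** Number of journal nodes that must be available for the given operation. */
    static int requiredNodes(Operation op, int totalNodes) {
        if (totalNodes <= 0) {
            throw new IllegalArgumentException("totalNodes must be positive");
        }
        // Formatting needs every journal node; everything else needs a strict majority.
        return op == Operation.FORMAT ? totalNodes : totalNodes / 2 + 1;
    }

    /** True if enough journal nodes are available for the operation to proceed. */
    static boolean canProceed(Operation op, int totalNodes, int availableNodes) {
        return availableNodes >= requiredNodes(op, totalNodes);
    }

    /** Counts addresses whose hostname resolved (assumption: unresolved means unavailable). */
    static int countResolved(List<InetSocketAddress> nodes) {
        int resolved = 0;
        for (InetSocketAddress node : nodes) {
            if (!node.isUnresolved()) {
                resolved++;
            }
        }
        return resolved;
    }
}
```

With three journal nodes, requiredNodes(Operation.FAILOVER, 3) is 2, so a 
failover or restart could proceed with one node down or unresolvable, while 
FORMAT would still need all 3.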

I understand there is also a problem with journal nodes not immediately 
rejoining the quorum after a journal node restart (cf. HDFS-3867), but that 
seems to be a separate problem that we should not take into account here?


> NameNode failover should not fail because a DNS entry for a quorum node 
> cannot be resolved
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4957
>                 URL: https://issues.apache.org/jira/browse/HDFS-4957
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: qjm
>    Affects Versions: 2.3.0, 2.6.0
>            Reporter: Colin McCabe
>            Assignee: John Zhuge
>            Priority: Major
>
> When a StandbyNameNode is becoming active, we should not bail out because a 
> DNS entry for a quorum node cannot be resolved.  Currently it does fail in 
> this scenario, with a message like this:
> {code}
> 2013-07-03 21:28:40,576 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
> 2013-07-03 21:28:40,579 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Error encountered requiring NN shutdown. Shutting down immediately.
> java.lang.IllegalArgumentException: Unable to construct journal, qjournal://hadoop-mm:8485;hadoop-nn-0:8485;hadoop-nn-1:8485/hadoop
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1254)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournals(FSEditLog.java:226)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournalsForWrite(FSEditLog.java:193)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:722)
> <etc>
> {code}
> reported by Matt Bookman


