[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538930#comment-17538930 ]

ZanderXu commented on HDFS-15562:
-

[~shv] [~aihuaxu] are you still following up on this issue? In our production environment, one NameService contains multiple ObserverNameNodes. We stopped one Observer because we planned to take it offline, and that caused the Standby to checkpoint abnormally.

bq. We may add a logic for the Checkpointer to not re-create an image if it was created recently

`lastCheckpointTime` already exists, but it is not updated when an exception occurs.

bq. we see transfers fail once in a while, so just ignoring image transfer failures isn't right.

The Standby can upload the latest fsImage to all NameNodes on a best-effort basis. If the upload to an abnormal NameNode still fails after multiple retries, the Standby can simply ignore that NameNode.

[~shv] [~ferhui] [~hexiaoqiao] do you have any good ideas about this? I will be happy to work on it.

BTW, do we need a mechanism to actively trigger a checkpoint?

> StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
> --
>
> Key: HDFS-15562
> URL: https://issues.apache.org/jira/browse/HDFS-15562
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: SunHao
> Assignee: Aihua Xu
> Priority: Major
> Labels: pull-request-available
> Attachments: HDFS-15562.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> We found that the standby NameNode performs checkpoints over and over when it fails to connect to an observer/active NameNode.
> StandbyCheckpointer does not update “lastCheckpointTime” when uploading the new fsimage to another NameNode fails, so the standby NameNode keeps checkpointing repeatedly.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
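The failure mode in the issue description, and the retry-then-skip alternative proposed in the comment above, can be sketched as a small Java simulation. This is a hypothetical model, not Hadoop's StandbyCheckpointer: the class and method names, the tick-based clock, the period of 5 ticks, and the 3 upload attempts per target are all illustrative assumptions. One of three upload targets is treated as permanently stopped, as in the decommissioned-Observer scenario.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/**
 * Hypothetical model of the checkpoint loop discussed in HDFS-15562.
 * None of these names are Hadoop APIs; time is measured in abstract ticks.
 */
public class CheckpointRetrySketch {
  private final long checkpointPeriod; // ticks between scheduled checkpoints
  private final int maxAttempts;       // upload attempts per target before skipping it
  private long lastCheckpointTime = 0L;

  public CheckpointRetrySketch(long checkpointPeriod, int maxAttempts) {
    this.checkpointPeriod = checkpointPeriod;
    this.maxAttempts = maxAttempts;
  }

  public boolean isCheckpointDue(long now) {
    return now - lastCheckpointTime >= checkpointPeriod;
  }

  /** Reported behavior: the timestamp only advances if every upload succeeds. */
  public void checkpointAsReported(long now, List<String> targets, Predicate<String> upload) {
    boolean allUploaded = true;
    for (String target : targets) {
      if (!upload.test(target)) {
        allUploaded = false;
      }
    }
    if (allUploaded) {
      lastCheckpointTime = now;
    }
  }

  /** Proposed behavior: retry each target, skip persistent failures, always advance. */
  public List<String> checkpointWithRetry(long now, List<String> targets, Predicate<String> upload) {
    List<String> skipped = new ArrayList<>();
    for (String target : targets) {
      boolean uploaded = false;
      for (int attempt = 0; attempt < maxAttempts && !uploaded; attempt++) {
        uploaded = upload.test(target);
      }
      if (!uploaded) {
        skipped.add(target); // this target gets the image at the next scheduled checkpoint
      }
    }
    lastCheckpointTime = now;
    return skipped;
  }

  /** Counts checkpoints over the given number of ticks while one observer is down. */
  public static int simulate(boolean withRetry, int ticks) {
    CheckpointRetrySketch cp = new CheckpointRetrySketch(5, 3);
    List<String> targets = List.of("active", "observer-1", "observer-2");
    Predicate<String> upload = target -> !target.equals("observer-2"); // observer-2 is stopped
    int checkpoints = 0;
    for (long now = 1; now <= ticks; now++) {
      if (!cp.isCheckpointDue(now)) {
        continue;
      }
      checkpoints++;
      if (withRetry) {
        cp.checkpointWithRetry(now, targets, upload);
      } else {
        cp.checkpointAsReported(now, targets, upload);
      }
    }
    return checkpoints;
  }

  public static void main(String[] args) {
    System.out.println("reported=" + simulate(false, 20) + ", withRetry=" + simulate(true, 20));
  }
}
```

Running `main` prints `reported=16, withRetry=4`: with the reported behavior, a checkpoint is due on every tick from tick 5 onward because `lastCheckpointTime` never moves. Bounding retries per target keeps occasional transfer failures (the concern quoted above) from being given up on at the first error, while a permanently stopped Observer no longer prevents the timestamp from advancing.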
[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233178#comment-17233178 ]

Aihua Xu commented on HDFS-15562:
-

Thanks [~shv] for your comment. When I get time, I will focus on not recreating the image if there is a recent one.
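Not recreating the image when a recent one exists could be expressed as a decision made before each checkpoint attempt. The sketch below is hypothetical: `ImageReuseSketch`, `SavedImage`, and the two actions are invented names, and defining "recent" with a transaction threshold plus a time period (loosely mirroring `dfs.namenode.checkpoint.txns` and `dfs.namenode.checkpoint.period`) is an assumption, not the logic of the attached patch.

```java
/**
 * Hypothetical helper for "don't recreate the image if there is a recent one".
 * Not a Hadoop API; the thresholds are illustrative assumptions.
 */
public class ImageReuseSketch {
  /** Minimal description of the most recently saved fsimage. */
  public static final class SavedImage {
    final long txId;
    final long savedAtMs;

    public SavedImage(long txId, long savedAtMs) {
      this.txId = txId;
      this.savedAtMs = savedAtMs;
    }
  }

  public enum Action { SAVE_NEW_IMAGE, REUPLOAD_EXISTING_IMAGE }

  public static Action decide(SavedImage last, long currentTxId, long nowMs,
                              long txnThreshold, long periodMs) {
    if (last == null) {
      return Action.SAVE_NEW_IMAGE; // nothing saved yet, so nothing to reuse
    }
    long newTxns = currentTxId - last.txId;
    long ageMs = nowMs - last.savedAtMs;
    // If a recent image exists (e.g. one saved by an attempt whose upload failed),
    // re-send it instead of paying for another full save.
    if (newTxns < txnThreshold && ageMs < periodMs) {
      return Action.REUPLOAD_EXISTING_IMAGE;
    }
    return Action.SAVE_NEW_IMAGE;
  }

  public static void main(String[] args) {
    SavedImage last = new SavedImage(1_000L, 0L);
    long hourMs = 3_600_000L;
    System.out.println(decide(last, 1_050L, 60_000L, 1_000_000L, hourMs));
    System.out.println(decide(last, 1_050L, hourMs, 1_000_000L, hourMs));
  }
}
```

`main` prints `REUPLOAD_EXISTING_IMAGE` and then `SAVE_NEW_IMAGE`: one minute after the last save, with only 50 new transactions, the existing image is re-sent; an hour later a fresh image is saved. This only spares the Standby the cost of rebuilding the image; the repeated attempts themselves continue until an upload succeeds or the checkpoint timestamp is advanced.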
[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228786#comment-17228786 ]

Konstantin Shvachko commented on HDFS-15562:

Hey guys, I think generally the checkpointer should persist until the checkpoint completes and the image is transferred. With large images we see transfers fail once in a while, so just ignoring image transfer failures isn't right.

I understand that with multiple ObserverNodes some of them can be down. We already have logic for ActiveNN and ObserverNodes to reject an image if they already have one recent enough, so frequent checkpoints should not overwhelm the Active or the Observers. We may add a logic for the Checkpointer to not re-create an image if it was created recently. But this does not seem to be a big concern.
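The receiver-side protection mentioned above (ActiveNN and ObserverNodes rejecting an image when they already have one recent enough) can be sketched as a simple acceptance check. The class name, the method, its parameters, and the exact rule (accept only a strictly newer image that is either at least `txnThreshold` transactions ahead or arrives at least `periodMs` after the local image was taken) are assumptions for illustration, not Hadoop's image transfer code. The example values echo the defaults of `dfs.namenode.checkpoint.txns` (1,000,000) and `dfs.namenode.checkpoint.period` (3600 seconds).

```java
/**
 * Hypothetical receiver-side guard: accept an uploaded fsimage only if it is
 * sufficiently newer than the receiver's latest image. Not Hadoop's actual code.
 */
public class ImageUploadGuardSketch {
  public static boolean shouldAccept(long localTxId, long localImageTimeMs,
                                     long uploadedTxId, long nowMs,
                                     long txnThreshold, long periodMs) {
    if (uploadedTxId <= localTxId) {
      return false; // the receiver already has an image at least this new
    }
    boolean enoughNewTxns = uploadedTxId - localTxId >= txnThreshold;
    boolean localImageIsOld = nowMs - localImageTimeMs >= periodMs;
    return enoughNewTxns || localImageIsOld;
  }

  public static void main(String[] args) {
    long txns = 1_000_000L;     // cf. default of dfs.namenode.checkpoint.txns
    long periodMs = 3_600_000L; // cf. default of dfs.namenode.checkpoint.period (3600 s)
    // A Standby stuck in the retry loop re-sends a barely newer image a minute later: rejected.
    System.out.println(shouldAccept(100L, 0L, 150L, 60_000L, txns, periodMs));
    // An hour after the local image was taken, the same upload is accepted.
    System.out.println(shouldAccept(100L, 0L, 150L, periodMs, txns, periodMs));
  }
}
```

`main` prints `false` and then `true`. A check of this kind is what would let frequent checkpoints arrive without overwhelming the Active or the Observers; the Standby, however, still pays for saving and sending each image.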
[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226420#comment-17226420 ]

Wei-Chiu Chuang commented on HDFS-15562:

[~cliang] [~shv] do you think you can help Aihua with the review?
[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225803#comment-17225803 ]

Hadoop QA commented on HDFS-15562:
--

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 1m 10s | | Docker mode activated. |
|| || || || Prechecks || ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests || ||
| +1 | mvninstall | 36m 3s | | trunk passed |
| +1 | compile | 1m 20s | | trunk passed with JDK Ubuntu-11.0.9+11-Ubuntu-0ubuntu1.18.04.1 |
| +1 | compile | 1m 9s | | trunk passed with JDK Private Build-1.8.0_272-8u272-b10-0ubuntu1~18.04-b10 |
| +1 | checkstyle | 0m 48s | | trunk passed |
| +1 | mvnsite | 1m 18s | | trunk passed |
| +1 | shadedclient | 18m 43s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 53s | | trunk passed with JDK Ubuntu-11.0.9+11-Ubuntu-0ubuntu1.18.04.1 |
| +1 | javadoc | 1m 31s | | trunk passed with JDK Private Build-1.8.0_272-8u272-b10-0ubuntu1~18.04-b10 |
| 0 | spotbugs | 3m 11s | | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 3m 9s | | trunk passed |
|| || || || Patch Compile Tests || ||
| +1 | mvninstall | 1m 17s | | the patch passed |
| +1 | compile | 1m 14s | | the patch passed with JDK Ubuntu-11.0.9+11-Ubuntu-0ubuntu1.18.04.1 |
| +1 | javac | 1m 14s | | the patch passed |
| +1 | compile | 1m 5s | | the patch passed with JDK Private Build-1.8.0_272-8u272-b10-0ubuntu1~18.04-b10 |
| +1 | javac | 1m 5s | | the patch passed |
| -0 | checkstyle | 0m 47s | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2430/1/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 35 unchanged - 0 fixed = 37 total (was 35) |
| +1 | mvnsite | 1m 10s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 16m 13s | | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 48s |
[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223389#comment-17223389 ]

Hadoop QA commented on HDFS-15562:
--

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 2m 11s | | Docker mode activated. |
|| || || || Prechecks || ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests || ||
| +1 | mvninstall | 41m 40s | | trunk passed |
| +1 | compile | 1m 42s | | trunk passed with JDK Ubuntu-11.0.9+11-Ubuntu-0ubuntu1.18.04.1 |
| +1 | compile | 1m 30s | | trunk passed with JDK Private Build-1.8.0_272-8u272-b10-0ubuntu1~18.04-b10 |
| +1 | checkstyle | 1m 2s | | trunk passed |
| +1 | mvnsite | 1m 42s | | trunk passed |
| +1 | shadedclient | 19m 33s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 53s | | trunk passed with JDK Ubuntu-11.0.9+11-Ubuntu-0ubuntu1.18.04.1 |
| +1 | javadoc | 1m 19s | | trunk passed with JDK Private Build-1.8.0_272-8u272-b10-0ubuntu1~18.04-b10 |
| 0 | spotbugs | 3m 13s | | Used deprecated FindBugs config; considering switching to SpotBugs. |
| +1 | findbugs | 3m 10s | | trunk passed |
|| || || || Patch Compile Tests || ||
| +1 | mvninstall | 1m 10s | | the patch passed |
| +1 | compile | 1m 13s | | the patch passed with JDK Ubuntu-11.0.9+11-Ubuntu-0ubuntu1.18.04.1 |
| +1 | javac | 1m 13s | | the patch passed |
| +1 | compile | 1m 5s | | the patch passed with JDK Private Build-1.8.0_272-8u272-b10-0ubuntu1~18.04-b10 |
| +1 | javac | 1m 5s | | the patch passed |
| +1 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 | checkstyle | 0m 40s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt|https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/277/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt] | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 34 unchanged - 0 fixed = 36 total (was 34) |
| +1 | mvnsite | 1m 10s | | the patch passed |
| +1 | shadedclient | 18m 51s | | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 49s | | the patch passed with JDK Ubuntu-11.0.9+11-Ubuntu-0ubuntu1.18.04.1 |
| +1 | javadoc | 1m 16s | | the patch passed with JDK Private Build-1.8.0_272-8u272-b10-0ubuntu1~18.04-b10 |
| +1 | findbugs | 3m 13s | | the patch passed |
[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17223287#comment-17223287 ]

Aihua Xu commented on HDFS-15562:
-

[~weichiu], [~csun] could you help review the change? Thanks.
[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206838#comment-17206838 ]

Aihua Xu commented on HDFS-15562:
-

[~aswqazxsd] I will take a look. If you can provide more details, such as the version and a stack trace, that would be helpful.
[jira] [Commented] (HDFS-15562) StandbyCheckpointer will do checkpoint repeatedly while connecting observer/active namenode failed
[ https://issues.apache.org/jira/browse/HDFS-15562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192719#comment-17192719 ]

jianghua zhu commented on HDFS-15562:
-

[~aswqazxsd], can you tell us which version you are using?