[jira] [Updated] (HDFS-14272) [SBN read] ObserverReadProxyProvider should sync with active txnID on startup
[ https://issues.apache.org/jira/browse/HDFS-14272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-14272:
------------------------------
    Fix Version/s: 2.10.0
                   3.2.1
                   3.1.3

> [SBN read] ObserverReadProxyProvider should sync with active txnID on startup
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-14272
>                 URL: https://issues.apache.org/jira/browse/HDFS-14272
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>         Environment: CDH6.1 (Hadoop 3.0.x) + Consistency Reads from Standby + SSL + Kerberos + RPC encryption
>            Reporter: Wei-Chiu Chuang
>            Assignee: Erik Krogen
>            Priority: Major
>             Fix For: 2.10.0, 3.3.0, 3.2.1, 3.1.3
>
>         Attachments: HDFS-14272.000.patch, HDFS-14272.001.patch, HDFS-14272.002.patch
>
>
> It is typical for integration tests to create some files and then check their existence, for example with a simple bash script like the following:
> {code:java}
> # hdfs dfs -touchz /tmp/abc
> # hdfs dfs -ls /tmp/abc
> {code}
> The test executes the HDFS bash commands sequentially, but it may fail with Consistent Standby Read because the -ls does not find the file.
> Analysis: the second bash command, while launched sequentially after the first one, is not aware of the state id returned by the first bash command. So the ObserverNode would not wait for the edits to get propagated, and the -ls fails.
> I've got a cluster where the Observer has tens of seconds of RPC latency, and this becomes very annoying. (I am still trying to figure out why this Observer has such a long RPC latency. But that's another story.)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15665) Balancer logging improvement
[ https://issues.apache.org/jira/browse/HDFS-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225636#comment-17225636 ]

Chen Liang commented on HDFS-15665:
-----------------------------------

Thanks for the clarification [~shv], +1 on the v002 patch.

> Balancer logging improvement
> ----------------------------
>
>                 Key: HDFS-15665
>                 URL: https://issues.apache.org/jira/browse/HDFS-15665
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>            Priority: Major
>         Attachments: HDFS-15665.001.patch, HDFS-15665.002.patch
>
>
> It would be good to have the Balancer log all relevant configuration parameters on each iteration, along with some data reflecting its progress and the amount of resources it involves.
[jira] [Commented] (HDFS-15665) Balancer logging improvement
[ https://issues.apache.org/jira/browse/HDFS-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225011#comment-17225011 ]

Chen Liang commented on HDFS-15665:
-----------------------------------

Thanks for working on this [~shv]! The v001 patch looks good to me. Just two minor comments:
1. The {{getInt}} call at Balancer.java L#286 seems redundant: no variable takes its value.
2. Balancer.java L#663 and L#665, the two LOG.info calls: would it be better to merge them into one line?
[jira] [Commented] (HDFS-15567) [SBN Read] HDFS should expose msync() API to allow downstream applications call it explicetly.
[ https://issues.apache.org/jira/browse/HDFS-15567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212685#comment-17212685 ]

Chen Liang commented on HDFS-15567:
-----------------------------------

Thanks [~shv]. Makes sense, +1 on the v002 patch.

> [SBN Read] HDFS should expose msync() API to allow downstream applications call it explicetly.
> ----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15567
>                 URL: https://issues.apache.org/jira/browse/HDFS-15567
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, hdfs-client
>    Affects Versions: 2.10.0
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>            Priority: Major
>         Attachments: HDFS-15567.001.patch, HDFS-15567.002.patch
>
>
> Consistent reads from Standby introduced the {{msync()}} API (HDFS-13688), which updates the client's state ID with the current state of the Active NameNode to guarantee consistency of subsequent calls to an ObserverNode. Currently this API is exposed via {{DFSClient}} only, which makes it hard for applications to access {{msync()}}. One way is to use something like this:
> {code}
> if (fs instanceof DistributedFileSystem) {
>   ((DistributedFileSystem) fs).getClient().msync();
> }
> {code}
> This should be exposed both for {{FileSystem}} and {{FileContext}}.
[jira] [Commented] (HDFS-15567) [SBN Read] HDFS should expose msync() API to allow downstream applications call it explicetly.
[ https://issues.apache.org/jira/browse/HDFS-15567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210499#comment-17210499 ]

Chen Liang commented on HDFS-15567:
-----------------------------------

Thanks for working on this [~shv]! Some comments:
1. Currently calling {{AbstractFileSystem}}'s msync throws UnsupportedOperationException. I was wondering whether it should throw UnsupportedOperationException or just be a no-op (similarly for {{FileSystem}}'s {{msync}}). I think making it a no-op might be better; any thoughts?
2. Is the change in {{MiniDFSCluster.java}} really needed?
3. testMsyncFileContext has a LOG.info call that seems unnecessary. Also, it looks like it only tests FileContext; should we also test FileSystem?
[jira] [Commented] (HDFS-15545) (S)Webhdfs will not use updated delegation tokens available in the ugi after the old ones expire
[ https://issues.apache.org/jira/browse/HDFS-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189568#comment-17189568 ]

Chen Liang commented on HDFS-15545:
-----------------------------------

Thanks [~ibuenros], I agree that HDFS-6222 looks to be about a different scenario. I'm +1 on the patch. I will take another pass at the failed tests; if they look good I will commit the change, given no other concerns/objections from other folks.

> (S)Webhdfs will not use updated delegation tokens available in the ugi after the old ones expire
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15545
>                 URL: https://issues.apache.org/jira/browse/HDFS-15545
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Issac Buenrostro
>            Assignee: Issac Buenrostro
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDFS-15545.001.patch, HDFS-15545.002.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> WebHdfsFileSystem can select a delegation token to use from the current user UGI. The token selection is sticky, and WebHdfsFileSystem will re-use it every time without searching the UGI again.
> If the previous token expires, WebHdfsFileSystem will catch the exception and attempt to get a new token. However, the mechanism to get a new token bypasses searching for one on the UGI, so even if there is external logic that has retrieved a new token, it is not possible to make the FileSystem use the new, valid token, rendering the FileSystem object unusable.
> A simple fix would allow WebHdfsFileSystem to re-search the UGI, and if it finds a different token than the cached one, try to use it.
[jira] [Commented] (HDFS-15545) (S)Webhdfs will not use updated delegation tokens available in the ugi after the old ones expire
[ https://issues.apache.org/jira/browse/HDFS-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188775#comment-17188775 ]

Chen Liang commented on HDFS-15545:
-----------------------------------

Thanks for working on this [~ibuenros]! The change makes sense to me. But I noticed that HDFS-6222 seems to raise concerns about how Webhdfs should renew the token. It seems to me to be a different scenario, so we should be fine, and TestWebHdfsTokens was passing here. [~daryn], do you have any thoughts on this change?
[jira] [Commented] (HDFS-15290) NPE in HttpServer during NameNode startup
[ https://issues.apache.org/jira/browse/HDFS-15290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181494#comment-17181494 ]

Chen Liang commented on HDFS-15290:
-----------------------------------

I have committed the v03 patch to trunk and branch-3.x. Thanks for the contribution [~simbadzina]! There is a conflict when backporting to branch-2.10 though, due to log4j usage. Would you mind providing a version for branch-2.10?

> NPE in HttpServer during NameNode startup
> -----------------------------------------
>
>                 Key: HDFS-15290
>                 URL: https://issues.apache.org/jira/browse/HDFS-15290
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.10.0, 2.7.8, 3.3.0
>            Reporter: Konstantin Shvachko
>            Assignee: Simbarashe Dzinamarira
>            Priority: Major
>         Attachments: HDFS-15290.001.patch, HDFS-15290.002.patch, HDFS-15290.003.patch
>
>
> When the NameNode starts, it first starts the HttpServer, then starts loading the fsImage and edits. While loading, the namesystem field in NameNode is null. I saw a StandbyNode send a checkpoint request, which failed with an NPE because NNStorage was not instantiated yet.
> We should check the NameNode startup status before accepting checkpoint requests.
[jira] [Commented] (HDFS-15290) NPE in HttpServer during NameNode startup
[ https://issues.apache.org/jira/browse/HDFS-15290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178072#comment-17178072 ]

Chen Liang commented on HDFS-15290:
-----------------------------------

Thanks for working on this [~simbadzina]! The v002 patch seems to be missing something: the method {{getAndSetFSImageInHttpServer}} is not called anywhere. Also two nits:
1. One extra space on the import line (NameNodeAdapter.java L#59)
2. One extra newline in the test (TestStandbyCheckpoints.java L#311/312)
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-15404:
------------------------------
    Fix Version/s: 3.1.5
                   3.4.0
                   3.3.1
                   2.10.1
                   3.2.2
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> ShellCommandFencer should expose info about source
> --------------------------------------------------
>
>                 Key: HDFS-15404
>                 URL: https://issues.apache.org/jira/browse/HDFS-15404
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>            Priority: Major
>             Fix For: 3.2.2, 2.10.1, 3.3.1, 3.4.0, 3.1.5
>
>         Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, HDFS-15404.003.patch, HDFS-15404.004.patch, HDFS-15404.005.patch, HDFS-15404.006.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment variables about only the fencing target, i.e. the $target_* variables mentioned in this [document page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
> Since only the fencing target variables are exposed, it is sometimes useful to expose info about the fencing source node as well. One use case: it would allow the source and target nodes to identify themselves separately and run different commands/scripts.
[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161585#comment-17161585 ]

Chen Liang commented on HDFS-15404:
-----------------------------------

I have committed the v006 patch to trunk, branch-3.3/3.2/3.1 and branch-2.10. Thanks Konstantin for the review!
[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160291#comment-17160291 ]

Chen Liang commented on HDFS-15404:
-----------------------------------

Uploaded the v006 patch to address the remaining checkstyle issues. There is one issue I did not change, in order to stay consistent with the style of other lines in the class.
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-15404:
------------------------------
    Attachment: HDFS-15404.006.patch
[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159408#comment-17159408 ]

Chen Liang commented on HDFS-15404:
-----------------------------------

Thanks for taking a look [~shv]! In general, I agree that tooling should be fine; I don't have a specific example, I was mainly brainstorming potential concerns. Uploaded the v005 patch.
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-15404:
------------------------------
    Attachment: HDFS-15404.005.patch
[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148160#comment-17148160 ]

Chen Liang commented on HDFS-15404:
-----------------------------------

Uploaded the v004 patch to fix {{TestDFSHAAdminMiniCluster}}, which was expecting to see {{target_*}} environment variables, while with this change the fencing on the source uses {{source_*}} variables.

This brings another question to mind: there could be existing tooling that relies on {{target_*}} on both the source and the destination of the fencing. That might not be the right use of the variables, but if such tooling exists, it may break. Do we plan to support such use cases? Any thoughts, [~shv]?

{{TestRollingUpgrade}} passes in my local run. The other test failures are known flaky tests AFAIK.
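The source/target split discussed above can be sketched as a fence script. This is a hypothetical sketch, not the patch's actual contract: it assumes the fencer exports {{source_host}} and {{target_host}} environment variables (by analogy with the documented {{$target_*}} convention), and the per-side actions shown in comments are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical fence script: decide which side of the failover this host is
# on, using the environment the fencer exports. The variable names
# source_host/target_host are assumptions modeled on the $target_* variables
# in the HA documentation; HDFS-15404 adds the source_* counterparts.
fence_side() {
  local me="$1"   # hostname the script is running as
  if [ "$me" = "${source_host:-}" ]; then
    echo "source"   # source-side steps would go here
  elif [ "$me" = "${target_host:-}" ]; then
    echo "target"   # target-side steps would go here
  else
    echo "unknown"  # neither side: refuse to fence
  fi
}
```

A wrapper configured as the fencing method could then call {{fence_side "$(hostname)"}} and branch to different commands/scripts per side, which is the use case described in the issue.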
[jira] [Comment Edited] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148160#comment-17148160 ]

Chen Liang edited comment on HDFS-15404 at 6/29/20, 9:53 PM:
-------------------------------------------------------------

Uploaded the v004 patch to fix {{TestDFSHAAdminMiniCluster}}, which was expecting to see {{target_}} environment variables, while with this change the fencing on the source uses {{source_}} variables.

This brings another question to mind: there could be existing tooling that relies on {{target_*}} on both the source and the destination of the fencing. That might not be the right use of the variables, but if such tooling exists, it may break. Do we plan to support such use cases? Any thoughts, [~shv]?

{{TestRollingUpgrade}} passes in my local run. The other test failures are known flaky tests AFAIK.
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-15404:
------------------------------
    Attachment: HDFS-15404.004.patch
[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144249#comment-17144249 ]

Chen Liang commented on HDFS-15404:
-----------------------------------

Thanks for checking [~shv]! These three tests might have slipped through my previous local testing somehow. Updated with the v03 patch to fix them. At a high level, the fixes are:
1. Some cases mock fencing with a null target HA state, which the new change treated as an illegal state.
2. In the new fencing logic, for a successful failover two tryFence calls are made, no longer just one; for a failed failover, if the failure happens when fencing the target, fencing on the source is skipped. TestFailoverController needs to be changed to reflect this new logic.
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Liang updated HDFS-15404:
------------------------------
    Attachment: HDFS-15404.003.patch
[jira] [Commented] (HDFS-15421) IBR leak causes standby NN to be stuck in safe mode
[ https://issues.apache.org/jira/browse/HDFS-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143504#comment-17143504 ] Chen Liang commented on HDFS-15421: --- Thanks for reporting [~kihwal] and thanks [~aajisaka] for working on this! Good catch on the missing updates; the change looks good to me. > IBR leak causes standby NN to be stuck in safe mode > --- > > Key: HDFS-15421 > URL: https://issues.apache.org/jira/browse/HDFS-15421 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Kihwal Lee >Assignee: Akira Ajisaka >Priority: Blocker > Labels: release-blocker > Attachments: HDFS-15421-000.patch, HDFS-15421-001.patch, > HDFS-15421.002.patch, HDFS-15421.003.patch > > > After HDFS-14941, the update of the global gen stamp is delayed in certain > situations. This makes the last set of incremental block reports from append > appear to be "from the future", which causes them to be simply re-queued to the > pending DN message queue rather than processed to complete the block. The last set of > IBRs will leak and never be cleaned up until the NN transitions to active. The size of > {{pendingDNMessages}} constantly grows until then. > If a leak happens while in startup safe mode, the namenode will never be > able to come out of safe mode on its own.
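The "from the future" condition in the description above reduces to a generation-stamp comparison. The sketch below is illustrative only; the real BlockManager logic is considerably more involved, and the names here are not the actual API.

```java
// Sketch: an incremental block report (IBR) whose generation stamp is ahead
// of the namenode's current global gen stamp is treated as "from the future"
// and re-queued to the pending DN message queue instead of being processed.
// If the global gen stamp update is delayed (as after HDFS-14941), such
// re-queued reports can leak until a transition to active.
public class IbrGenStampSketch {

    public static boolean isFromFuture(long reportedGenStamp, long globalGenStamp) {
        return reportedGenStamp > globalGenStamp;
    }

    public static String handleIbr(long reportedGenStamp, long globalGenStamp) {
        return isFromFuture(reportedGenStamp, globalGenStamp) ? "requeue" : "process";
    }
}
```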
[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140052#comment-17140052 ] Chen Liang commented on HDFS-15404: --- Uploaded v002 patch to fix the bug that caused the failed tests. The bug is that parseArgs should allow an argument containing only a command, in which case both the source and target will execute the same command/script. > ShellCommandFencer should expose info about source > -- > > Key: HDFS-15404 > URL: https://issues.apache.org/jira/browse/HDFS-15404 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch
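The parseArgs behavior described in the comment might look roughly like the following. The separator character and method shape below are assumptions of this sketch, not the actual patch.

```java
// Hypothetical sketch of the parseArgs fix: a fencing argument may carry
// separate source/target commands, or a single command that both the source
// and target execute. The ':' separator is invented for this sketch only.
public class FencerParseSketch {

    /** Returns { sourceCommand, targetCommand }. */
    public static String[] parseArgs(String arg) {
        int sep = arg.indexOf(':');
        if (sep < 0) {
            // Only a command was given: src and dst run the same one.
            return new String[] { arg, arg };
        }
        return new String[] { arg.substring(0, sep), arg.substring(sep + 1) };
    }
}
```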
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15404: -- Attachment: HDFS-15404.002.patch > ShellCommandFencer should expose info about source > -- > > Key: HDFS-15404 > URL: https://issues.apache.org/jira/browse/HDFS-15404 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15404: -- Status: Patch Available (was: Open) > ShellCommandFencer should expose info about source > -- > > Key: HDFS-15404 > URL: https://issues.apache.org/jira/browse/HDFS-15404 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15404.001.patch
[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source
[ https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15404: -- Attachment: HDFS-15404.001.patch > ShellCommandFencer should expose info about source > -- > > Key: HDFS-15404 > URL: https://issues.apache.org/jira/browse/HDFS-15404 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15404.001.patch
[jira] [Created] (HDFS-15404) ShellCommandFencer should expose info about source
Chen Liang created HDFS-15404: - Summary: ShellCommandFencer should expose info about source Key: HDFS-15404 URL: https://issues.apache.org/jira/browse/HDFS-15404 Project: Hadoop HDFS Issue Type: Improvement Reporter: Chen Liang Assignee: Chen Liang Currently the HA fencing logic in ShellCommandFencer exposes environment variables about only the fencing target, i.e. the $target_* variables mentioned in this [document page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html]. Sometimes it is useful to also expose info about the fencing source node. One use case: it would allow the source and target nodes to identify themselves separately and run different commands/scripts.
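A hedged sketch of the proposal: alongside the existing $target_* variables, the fencer would also export $source_* variables into the fencing command's environment. The $source_host name below simply mirrors the target convention and is an assumption of this sketch, not the final variable name.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of building the environment map passed to the fencing shell
// command. target_host follows the existing $target_* convention;
// source_host is the proposed addition (name assumed for illustration).
public class FencerEnvSketch {

    public static Map<String, String> buildEnv(String targetHost, String sourceHost) {
        Map<String, String> env = new HashMap<>();
        env.put("target_host", targetHost); // existing convention
        env.put("source_host", sourceHost); // proposed by this JIRA
        return env;
    }
}
```

A fencing script could then branch on whether it is running against the source or the target, e.g. by comparing $source_host to its own hostname, which is the use case the description mentions.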
[jira] [Commented] (HDFS-15368) TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally
[ https://issues.apache.org/jira/browse/HDFS-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114247#comment-17114247 ] Chen Liang commented on HDFS-15368: --- Thanks [~hexiaoqiao]! Will look into it, but one quick question: was the run based on trunk? The line numbers in the trace do not seem to match the trunk code. > TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally > > > Key: HDFS-15368 > URL: https://issues.apache.org/jira/browse/HDFS-15368 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Labels: balancer, test > Attachments: HDFS-15368.001.patch, > TestBalancerWithHANameNodes.testBalancerObserver.log > > > While working on HDFS-13183, I found that > TestBalancerWithHANameNodes#testBalancerWithObserver fails occasionally > because of the following code segment. Consider 1 ANN + 1 SBN + 2 ONN: > when invoking getBlocks with the Observer Read feature enabled, the request can go to > either of the two ObserverNNs based on my observation. So verifying only the first > ObserverNN and checking the number of #getBlocks invocations is not reliable. > {code:java} > for (int i = 0; i < cluster.getNumNameNodes(); i++) { > // First observer node is at idx 2, or 3 if 2 has been shut down > // It should get both getBlocks calls, all other NNs should see 0 > calls > int expectedObserverIdx = withObserverFailure ? 3 : 2; > int expectedCount = (i == expectedObserverIdx) ? 2 : 0; > verify(namesystemSpies.get(i), times(expectedCount)) > .getBlocks(any(), anyLong(), anyLong()); > } > {code} > cc [~xkrogen],[~weichiu]. I am not very familiar with the Observer Read feature; > would you like to give some suggestions?
[jira] [Commented] (HDFS-15368) TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally
[ https://issues.apache.org/jira/browse/HDFS-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113424#comment-17113424 ] Chen Liang commented on HDFS-15368: --- [~hexiaoqiao] thanks for reporting and looking into this! It is actually expected to always hit the idx=2 observer as long as it's running. The reason is that, without NameNode randomization, the client will always try the first Observer (idx 2 in this case) before the second (idx 3 here), unless the first observer fails to respond. So in the case of withObserverFailure = false, it should be the Observer with idx=2 responding all the time. I will need to look into this. It would be helpful if you have an error stack trace. > TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally > > > Key: HDFS-15368 > URL: https://issues.apache.org/jira/browse/HDFS-15368 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Labels: balancer, test > Attachments: HDFS-15368.001.patch
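The observer selection order described in the comment can be sketched as below. This is illustrative only; the real ObserverReadProxyProvider also handles msync, retries, and failover, and these names are not its API.

```java
// Sketch: without NameNode randomization the client walks the observer list
// in order and uses the first observer that responds; a later observer is
// only reached if every earlier one fails to respond.
public class ObserverOrderSketch {

    /** Index of the observer that serves the read, or -1 if none respond. */
    public static int chosenObserver(boolean[] observerResponding) {
        for (int i = 0; i < observerResponding.length; i++) {
            if (observerResponding[i]) {
                return i;
            }
        }
        return -1;
    }
}
```

This is why, with withObserverFailure = false, the test expects the first observer (idx 2) to serve every getBlocks call.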
[jira] [Updated] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15293: -- Fix Version/s: 3.1.5 3.3.1 2.10.1 3.2.2 Resolution: Fixed Status: Resolved (was: Patch Available) > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Critical > Labels: multi-sbnn, release-blocker > Fix For: 3.2.2, 2.10.1, 3.3.1, 3.1.5 > Attachments: HDFS-15293.001.patch, HDFS-15293.002.patch > > > HDFS-12979 introduced the logic that, if the ANN sees a consecutive fsimage upload > from a Standby with a small delta compared to the previous fsimage, the ANN will > reject the image. This is to avoid overly frequent fsimage uploads when > there are multiple Standby nodes. However, this check can be too stringent.
[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110561#comment-17110561 ] Chen Liang commented on HDFS-15293: --- I have committed this to trunk, branch-3.2, branch-3.1 and branch-2.10. Thanks to the reviewers! > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Critical > Labels: multi-sbnn, release-blocker > Attachments: HDFS-15293.001.patch, HDFS-15293.002.patch
[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108491#comment-17108491 ] Chen Liang commented on HDFS-15293: --- Updated v002 patch to address Akira's comments. > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Critical > Labels: multi-sbnn, release-blocker > Attachments: HDFS-15293.001.patch, HDFS-15293.002.patch
[jira] [Updated] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15293: -- Attachment: HDFS-15293.002.patch > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Critical > Labels: multi-sbnn, release-blocker > Attachments: HDFS-15293.001.patch, HDFS-15293.002.patch
[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108470#comment-17108470 ] Chen Liang commented on HDFS-15293: --- Hi [~aajisaka], sorry, I have been busy dealing with some internal work. Will update the patch later today. Also [~shv], I would like to get your thoughts on this, as you have been looking into our internal version of this fix. > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Critical > Labels: multi-sbnn, release-blocker > Attachments: HDFS-15293.001.patch
[jira] [Updated] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15293: -- Status: Patch Available (was: Open) > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Labels: multi-sbnn > Attachments: HDFS-15293.001.patch
[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101136#comment-17101136 ] Chen Liang commented on HDFS-15293: --- Had some offline discussion with [~shv]; the txnid check does not seem to be relevant here actually. Posted the v001 patch. This is based on our internal version of this fix, with some additional logging added to capture this behavior. > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Labels: multi-sbnn > Attachments: HDFS-15293.001.patch
[jira] [Updated] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15293: -- Attachment: HDFS-15293.001.patch > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Labels: multi-sbnn > Attachments: HDFS-15293.001.patch
[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing
[ https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097821#comment-17097821 ] Chen Liang commented on HDFS-15323: --- Thanks for the finding [~shv]! This is a tricky issue. In HDFS-14806, I encountered a similar issue during bootstrap standby, where one call limited by QJM_RPC_MAX_TXNS is not sufficient to catch up. In HDFS-14806, the approach we took was to just disable in-progress tailing during bootstrap standby and fall back to the HTTP-based edit tailing. The reasoning there was that in-progress tailing was not meant to handle rare cases such as startup/failover, and we can avoid making multiple RPC calls. Do you think the same idea can apply here? Apart from this, one minor comment: can we add some logging around this logic, so we can more easily identify issues like this in the future? > StandbyNode fails transition to active due to insufficient transaction tailing > -- > > Key: HDFS-15323 > URL: https://issues.apache.org/jira/browse/HDFS-15323 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode, qjm >Affects Versions: 2.7.7 >Reporter: Konstantin Shvachko >Priority: Major > Attachments: HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch > > > StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind > in tailing journal transactions (from QJM) it can crash with > {{IllegalStateException}}.
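The catch-up limitation discussed above can be illustrated with a small calculation. This is a sketch only: it assumes each edit-tailing RPC returns at most some capped number of transactions (the role QJM_RPC_MAX_TXNS plays in the comment), so a single call cannot catch up a standby that is further behind than the cap.

```java
// Sketch: if each edit-tailing RPC returns at most maxTxnsPerCall
// transactions, catching up txnsBehind transactions takes
// ceil(txnsBehind / maxTxnsPerCall) calls. A transition that assumes one
// call suffices can therefore fail when the standby fell far behind.
public class EditTailCatchUpSketch {

    public static long callsNeeded(long txnsBehind, long maxTxnsPerCall) {
        return (txnsBehind + maxTxnsPerCall - 1) / maxTxnsPerCall;
    }
}
```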
[jira] [Commented] (HDFS-14647) NPE during secure namenode startup
[ https://issues.apache.org/jira/browse/HDFS-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097589#comment-17097589 ] Chen Liang commented on HDFS-14647: --- Thanks Konstantin, I have committed the branch-2 patch to branch-2.10. > NPE during secure namenode startup > -- > > Key: HDFS-14647 > URL: https://issues.apache.org/jira/browse/HDFS-14647 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.8.2 >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14647-2.002.patch, HDFS-14647-trunk.001.patch, > HDFS-14647-trunk.002.patch, HDFS-14647-trunk.003.patch, > HDFS-14647-trunk.004.patch, HDFS-14647.001.patch > > > In secure HDFS, during Namenode loading fsimage, when hitting Namenode > through the REST API, below exception would be thrown out. (This is in > version 2.8.2) > {quote}org.apache.hadoop.hdfs.web.resources.ExceptionHandler: > INTERNAL_SERVER_ERROR > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.common.JspHelper.getTokenUGI(JspHelper.java:283) > at org.apache.hadoop.hdfs.server.common.JspHelper.getUGI(JspHelper.java:226) > at > org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:54) > at > org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:42) > at > com.sun.jersey.server.impl.inject.InjectableValuesProvider.getInjectableValues(InjectableValuesProvider.java:46) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$EntityParamInInvoker.getParams(AbstractResourceMethodDispatchProvider.java:153) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:203) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:87) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1353) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at 
org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) > at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) > at org.mortbay.jetty.Server.handle(Server.java:326) >
[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094792#comment-17094792 ] Chen Liang commented on HDFS-15293: --- [~shv] I don't think the issue you mentioned will actually happen currently, because the checks only skip an image if BOTH conditions are met: 1. the time delta is too small AND 2. the txnid delta is too small. It's an AND, not an OR. So in the case you mentioned, it is true that the time delta will always be considered too small due to the ridiculously large interval, but if configured with a small txnid threshold, it is easy to accumulate enough txnids, so the txnid delta won't be considered too small. A small time delta alone does not lead to rejecting an image. But indeed, it is possible that in a cluster with a ridiculously large interval, plus an extremely light load (so the txnid barely makes progress), both conditions will always be true. In that case all checkpoints will be rejected. Although realistically I don't think there is much value in doing a checkpoint in such a situation anyway, it is probably not a good idea to change the behavior of the system by effectively rejecting all images. Because of this, I'm thinking of removing the txnid condition altogether, so the check only looks at the time delta and allows any txnid delta. It seems trickier to justify blocking all the use cases with slow txnid increase. (Time always proceeds, but not necessarily the txnid.) I think we were targeting mainly the time condition originally. > Relax the condition for accepting a fsimage when receiving a checkpoint > > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Labels: multi-sbnn
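The dual condition described in the comment can be sketched as follows. The thresholds and names are illustrative, not the actual ImageServlet code.

```java
// Sketch: an uploaded checkpoint is rejected only when BOTH deltas are
// below their thresholds; a large enough time gap OR a large enough txnid
// gap lets the image through. This is the AND (not OR) semantics discussed
// in the comment above.
public class CheckpointAcceptSketch {

    public static boolean shouldReject(long timeDeltaMs, long timeThresholdMs,
                                       long txnDelta, long txnThreshold) {
        return timeDeltaMs < timeThresholdMs && txnDelta < txnThreshold;
    }
}
```

With a ridiculously large time threshold but a healthy transaction rate, the txnid condition fails and the image is still accepted; only a cluster where both the interval is huge and the txnid barely moves would reject every checkpoint.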
[jira] [Comment Edited] (HDFS-15287) HDFS rollingupgrade prepare never finishes
[ https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094722#comment-17094722 ] Chen Liang edited comment on HDFS-15287 at 4/28/20, 5:42 PM: - Thanks for the update [~kihwal]. Will follow up on HDFS-15293; the issue to resolve there should be relatively straightforward though. The issue mentioned in HDFS-15293 does not happen consistently and can lead to missing at most one periodic image upload. And just to clarify, the improvement from HDFS-15036 is not specific to Observer; it was for multiple SBNs in general. Even without an Observer, as long as there are multiple SBNs, there can be frequent image uploads. While even with an Observer, if there is only one SBN, frequent uploads would not be an issue. Regarding making this configurable, I would like to have [~shv]'s thoughts here, as Konstantin was opposing adding this new config. was (Author: vagarychen): Thanks for the update [~kihwal]. Will follow up on HDFS-15293; the issue to resolve there should be relatively straightforward though. The issue mentioned in HDFS-15293 does not happen consistently and can lead to missing at most one periodic image upload. And just to clarify, the improvement from HDFS-15036 is not specific to Observer; it was for multiple SBNs in general. Even without an Observer, as long as there are multiple SBNs, there can be frequent image uploads. While even with an Observer, if there is only one SBN, frequent uploads would not be an issue. > HDFS rollingupgrade prepare never finishes > -- > > Key: HDFS-15287 > URL: https://issues.apache.org/jira/browse/HDFS-15287 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0, 3.3.0 >Reporter: Kihwal Lee >Priority: Blocker > > After HDFS-12979, the prepare step of rolling upgrade does not work. This is > because it added an additional check for sufficient time passing since the last > checkpoint. Since RU rollback image creation and upload can happen at any time, > uploading of the rollback image never succeeds. For a new cluster deployed for > testing, it might work since it has never checkpointed before. > It was found that this check is disabled for unit tests, defeating the very > purpose of testing.
[jira] [Commented] (HDFS-15287) HDFS rollingupgrade prepare never finishes
[ https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094722#comment-17094722 ] Chen Liang commented on HDFS-15287: --- Thanks for the update [~kihwal]. Will follow up on HDFS-15293; the issue to resolve there should be relatively minor though. The issue mentioned in HDFS-15293 does not happen consistently and can lead to missing at most one periodic image upload. And just to clarify, the improvement from HDFS-15036 is not specific to Observer. It was for multiple SBNs in general. Even without an Observer, as long as there are multiple SBNs, there can be frequent image uploads. Conversely, even with an Observer, if there is only one SBN, frequent uploads would not be an issue. > HDFS rollingupgrade prepare never finishes > -- > > Key: HDFS-15287 > URL: https://issues.apache.org/jira/browse/HDFS-15287 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0, 3.3.0 >Reporter: Kihwal Lee >Priority: Blocker > > After HDFS-12979, the prepare step of rolling upgrade does not work. This is > because it added an additional check for sufficient time having passed since the last > checkpoint. Since RU rollback image creation and upload can happen at any time, > uploading of the rollback image never succeeds. For a new cluster deployed for > testing, it might work since it has never checkpointed before. > It was found that this check is disabled for unit tests, defeating the very > purpose of testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14647) NPE during secure namenode startup
[ https://issues.apache.org/jira/browse/HDFS-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093983#comment-17093983 ] Chen Liang commented on HDFS-14647: --- [~ayushtkn] [~fengnanli], thanks for working on this issue. I see there is a patch that applies to branch-2, but I didn't see this fix in branch-2.10. Is the branch-2 version ready to be committed to branch-2.10? > NPE during secure namenode startup > -- > > Key: HDFS-14647 > URL: https://issues.apache.org/jira/browse/HDFS-14647 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.8.2 >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14647-2.002.patch, HDFS-14647-trunk.001.patch, > HDFS-14647-trunk.002.patch, HDFS-14647-trunk.003.patch, > HDFS-14647-trunk.004.patch, HDFS-14647.001.patch > > > In secure HDFS, while the Namenode is loading the fsimage, hitting the Namenode > through the REST API would throw the exception below. 
(This is in > version 2.8.2) > {quote}org.apache.hadoop.hdfs.web.resources.ExceptionHandler: > INTERNAL_SERVER_ERROR > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.common.JspHelper.getTokenUGI(JspHelper.java:283) > at org.apache.hadoop.hdfs.server.common.JspHelper.getUGI(JspHelper.java:226) > at > org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:54) > at > org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:42) > at > com.sun.jersey.server.impl.inject.InjectableValuesProvider.getInjectableValues(InjectableValuesProvider.java:46) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$EntityParamInInvoker.getParams(AbstractResourceMethodDispatchProvider.java:153) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:203) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:87) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1353) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) > at >
[jira] [Commented] (HDFS-15287) HDFS rollingupgrade prepare never finishes
[ https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091752#comment-17091752 ] Chen Liang commented on HDFS-15287: --- Thanks for the elaboration [~kihwal]. I did some testing in our 2.10 cluster a couple of times; the rollingUpgrade prepare then finalize worked fine (i.e. the rollback image did get uploaded to ANN successfully). So I'm still not able to reproduce this issue... But one quick question: any chance you were running a 2.10 version without HDFS-15036? That was the jira that added the exclusion of the rollback image. A quick way to check whether it is there is that, after HDFS-15036, there should be a log message on SBN like "Image upload rejected by the other NameNode: ". This being said, I don't have a strong preference about making it configurable (or not) or defaulting it to false, considering this feature is supposed to be an improvement only in the multiple-SBN scenario, which may not be the case for most deployments. In my initial patch under HDFS-12979 it was configurable. At the time [~shv] had concerns about HDFS having more and more configurations. So [~shv], it would be good to have your thoughts here. > HDFS rollingupgrade prepare never finishes > -- > > Key: HDFS-15287 > URL: https://issues.apache.org/jira/browse/HDFS-15287 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0, 3.3.0 >Reporter: Kihwal Lee >Priority: Blocker > > After HDFS-12979, the prepare step of rolling upgrade does not work. This is > because it added an additional check for sufficient time having passed since the last > checkpoint. Since RU rollback image creation and upload can happen at any time, > uploading of the rollback image never succeeds. For a new cluster deployed for > testing, it might work since it has never checkpointed before. > It was found that this check is disabled for unit tests, defeating the very > purpose of testing. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15293) Relax FSImage upload time delta check restriction
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088924#comment-17088924 ] Chen Liang edited comment on HDFS-15293 at 4/21/20, 6:09 PM: - The reason this check might be too stringent is that, for example, say the fsImage interval is configured to 6 hours. Consider the case where SBN uploads image A at time 00:00, but there is a minor time skew before ANN actually sees this fsImage, so ANN actually sees it at 00:00.010. The next time, SBN uploads the next image at 06:00, and ANN sees this one with a smaller skew, at 06:00.005. ANN would then consider the time delta to be smaller than the configured delta of 6 hours and would reject this image, despite there being only a 5 ms difference, which should be acceptable. Essentially, the current check for exact timestamps can be too susceptible to random timing conditions. The consequence of this issue is that ANN might miss one image once in a while, because even if ANN rejects the image at 06:00, the next time SBN uploads at 12:00, ANN will not reject it, as by that time the delta is guaranteed to be > 6 hours. This means there will never be more than one consecutive missing image. was (Author: vagarychen): The reason this check might be too stringent is that, for example, say the fsImage interval is configured to 6 hours. Consider the case where SBN uploads image A at time 00:00, but there is a minor time skew before ANN actually sees this fsImage, so ANN actually sees it at 00:00.010. The next time, SBN uploads the next image at 06:00, and ANN sees this one with a smaller skew, at 06:00.005. ANN would then consider the time delta to be smaller than the configured delta of 6 hours and would reject this image, despite there being only a 5 ms difference, which should be acceptable. Essentially, the current check for exact timestamps can be too susceptible to random timing conditions. 
The consequence of this issue is that ANN might miss one image once in a while, because even if ANN rejects the image at 06:00, the next time SBN uploads at 12:00, ANN will not reject it. So there will never be more than one consecutive missing image. > Relax FSImage upload time delta check restriction > - > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > > HDFS-12979 introduced the logic that, if ANN sees a consecutive fsImage upload > from Standby with a small time delta compared to the previous fsImage, ANN would > reject this image. This is to avoid overly frequent fsImage uploads in the case where > there are multiple Standby nodes. However, this check could be too stringent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
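The timing scenario described in this comment can be sketched with a toy check (hypothetical code, not the actual Hadoop implementation; the class, method names, and margin value are all illustrative): comparing raw observed timestamps makes acceptance hinge on a few milliseconds of observation skew, while allowing a small tolerance below the required interval does not.

```java
// Toy model of the ANN-side fsImage upload recency check (illustration only).
public class ImageUploadDeltaCheck {
    // Configured checkpoint interval: 6 hours, in milliseconds.
    static final long INTERVAL_MS = 6L * 60 * 60 * 1000;

    // Strict check: accept only if a full interval elapsed between the
    // timestamps at which ANN *observed* the two uploads.
    static boolean acceptStrict(long prevSeenMs, long nowSeenMs) {
        return nowSeenMs - prevSeenMs >= INTERVAL_MS;
    }

    // Relaxed check: tolerate a small margin so millisecond-level
    // observation jitter cannot cause a rejection on its own.
    static boolean acceptRelaxed(long prevSeenMs, long nowSeenMs, long marginMs) {
        return nowSeenMs - prevSeenMs >= INTERVAL_MS - marginMs;
    }

    public static void main(String[] args) {
        long firstSeen = 10;               // SBN uploads at 00:00, ANN sees it at 00:00.010
        long secondSeen = INTERVAL_MS + 5; // SBN uploads at 06:00, ANN sees it at 06:00.005
        // Observed delta is 5 ms short of 6 hours: the strict check rejects the image.
        System.out.println(acceptStrict(firstSeen, secondSeen));        // false
        // A 1-second margin absorbs the 5 ms skew.
        System.out.println(acceptRelaxed(firstSeen, secondSeen, 1000)); // true
    }
}
```

The point of the sketch is only that the rejection flips on sub-second noise in when ANN happens to observe each upload, not on any real change in checkpoint frequency.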
[jira] [Comment Edited] (HDFS-15293) Relax FSImage upload time delta check restriction
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088924#comment-17088924 ] Chen Liang edited comment on HDFS-15293 at 4/21/20, 6:08 PM: - The reason this check might be too stringent is that, for example, say the fsImage interval is configured to 6 hours. Consider the case where SBN uploads image A at time 00:00, but there is a minor time skew before ANN actually sees this fsImage, so ANN actually sees it at 00:00.010. The next time, SBN uploads the next image at 06:00, and ANN sees this one with a smaller skew, at 06:00.005. ANN would then consider the time delta to be smaller than the configured delta of 6 hours and would reject this image, despite there being only a 5 ms difference, which should be acceptable. Essentially, the current check for exact timestamps can be too susceptible to random timing conditions. The consequence of this issue is that ANN might miss one image once in a while, because even if ANN rejects the image at 06:00, the next time SBN uploads at 12:00, ANN will not reject it. So there will never be more than one consecutive missing image. was (Author: vagarychen): The reason this check might be too stringent is that, for example, say the fsImage interval is configured to 6 hours. Consider the case where SBN uploads image A at time 00:00, but there is a minor time skew before ANN actually sees this fsImage, so ANN actually sees it at 00:00.010. The next time, SBN uploads the next image at 06:00, and ANN sees this one with a smaller skew, at 06:00.005. ANN would then consider the time delta to be smaller than the configured delta of 6 hours and would reject this image, despite there being only a 5 ms difference, which should be acceptable. Essentially, the current check for exact timestamps can be too susceptible to random timing conditions. 
> Relax FSImage upload time delta check restriction > - > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > > HDFS-12979 introduced the logic that, if ANN sees a consecutive fsImage upload > from Standby with a small time delta compared to the previous fsImage, ANN would > reject this image. This is to avoid overly frequent fsImage uploads in the case where > there are multiple Standby nodes. However, this check could be too stringent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15287) HDFS rollingupgrade prepare never finishes
[ https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088926#comment-17088926 ] Chen Liang commented on HDFS-15287: --- I filed HDFS-15293 to relax the time interval condition. But again, it is not related to RU. > HDFS rollingupgrade prepare never finishes > -- > > Key: HDFS-15287 > URL: https://issues.apache.org/jira/browse/HDFS-15287 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0, 3.3.0 >Reporter: Kihwal Lee >Priority: Blocker > > After HDFS-12979, the prepare step of rolling upgrade does not work. This is > because it added an additional check for sufficient time having passed since the last > checkpoint. Since RU rollback image creation and upload can happen at any time, > uploading of the rollback image never succeeds. For a new cluster deployed for > testing, it might work since it has never checkpointed before. > It was found that this check is disabled for unit tests, defeating the very > purpose of testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15293) Relax FSImage upload time delta check restriction
[ https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088924#comment-17088924 ] Chen Liang commented on HDFS-15293: --- The reason this check might be too stringent is that, for example, say the fsImage interval is configured to 6 hours. Consider the case where SBN uploads image A at time 00:00, but there is a minor time skew before ANN actually sees this fsImage, so ANN actually sees it at 00:00.010. The next time, SBN uploads the next image at 06:00, and ANN sees this one with a smaller skew, at 06:00.005. ANN would then consider the time delta to be smaller than the configured delta of 6 hours and would reject this image, despite there being only a 5 ms difference, which should be acceptable. Essentially, the current check for exact timestamps can be too susceptible to random timing conditions. > Relax FSImage upload time delta check restriction > - > > Key: HDFS-15293 > URL: https://issues.apache.org/jira/browse/HDFS-15293 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > > HDFS-12979 introduced the logic that, if ANN sees a consecutive fsImage upload > from Standby with a small time delta compared to the previous fsImage, ANN would > reject this image. This is to avoid overly frequent fsImage uploads in the case where > there are multiple Standby nodes. However, this check could be too stringent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15293) Relax FSImage upload time delta check restriction
Chen Liang created HDFS-15293: - Summary: Relax FSImage upload time delta check restriction Key: HDFS-15293 URL: https://issues.apache.org/jira/browse/HDFS-15293 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Chen Liang Assignee: Chen Liang HDFS-12979 introduced the logic that, if ANN sees a consecutive fsImage upload from Standby with a small time delta compared to the previous fsImage, ANN would reject this image. This is to avoid overly frequent fsImage uploads in the case where there are multiple Standby nodes. However, this check could be too stringent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12979) StandbyNode should upload FsImage to ObserverNode after checkpointing.
[ https://issues.apache.org/jira/browse/HDFS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088127#comment-17088127 ] Chen Liang commented on HDFS-12979: --- [~kihwal], it is disabled in MiniDFSCluster because there are existing tests that rely on having small-delta fsimages. The newly added tests that exercise this feature have the flag explicitly enabled to override MiniDFSCluster's disabling (such as in TestRollingUpgrade, as introduced in HDFS-15036). And I just did some quick testing in our 2.10 testing cluster; the RU image upload worked fine for me. Still, I acknowledge that this would be a blocker if it breaks RU, so I am continuing the investigation. > StandbyNode should upload FsImage to ObserverNode after checkpointing. > -- > > Key: HDFS-12979 > URL: https://issues.apache.org/jira/browse/HDFS-12979 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-12979-branch-2.001.patch, HDFS-12979.001.patch, > HDFS-12979.002.patch, HDFS-12979.003.patch, HDFS-12979.004.patch, > HDFS-12979.005.patch, HDFS-12979.006.patch, HDFS-12979.007.patch, > HDFS-12979.008.patch, HDFS-12979.009.patch, HDFS-12979.010.patch, > HDFS-12979.011.patch, HDFS-12979.012.patch, HDFS-12979.013.patch, > HDFS-12979.014.patch, HDFS-12979.015.patch > > > ObserverNode does not create checkpoints, so its fsimage file can get very > old, making bootstrap of the ObserverNode too long. A StandbyNode should copy the > latest fsimage to the ObserverNode(s) along with ANN. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15287) HDFS rollingupgrade prepare never finishes
[ https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088079#comment-17088079 ] Chen Liang commented on HDFS-15287: --- In the case of a rollback image, this check {code} NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile()) {code} makes sure the rollback image is not subject to the time-delta check, because for a rollback image, getNameNodeFile would return {{NameNodeFile.IMAGE_ROLLBACK}} rather than {{NameNodeFile.IMAGE}}. So this check should not apply to the RU case at all. In the case of regular, periodic image uploads though, I agree that the current check might be too stringent, as there can be cases where ANN sees an image with a very minor time-gap difference compared to SBN. We have also recently been thinking of relaxing this check. Pinging [~shv]. > HDFS rollingupgrade prepare never finishes > -- > > Key: HDFS-15287 > URL: https://issues.apache.org/jira/browse/HDFS-15287 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0, 3.3.0 >Reporter: Kihwal Lee >Priority: Blocker > > After HDFS-12979, the prepare step of rolling upgrade does not work. This is > because it added an additional check for sufficient time having passed since the last > checkpoint. Since RU rollback image creation and upload can happen at any time, > uploading of the rollback image never succeeds. For a new cluster deployed for > testing, it might work since it has never checkpointed before. > It was found that this check is disabled for unit tests, defeating the very > purpose of testing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
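The file-type gating described in the comment above can be illustrated with a minimal sketch (hypothetical types that only mirror the names quoted in the comment; this is not the actual Hadoop source): the recency check is applied solely to regular checkpoint images, so a rollback image uploaded during rolling-upgrade prepare bypasses it regardless of timing.

```java
// Minimal sketch of the file-type gate on the delta check (illustration only).
public class RollbackImageGate {
    // Mirrors the two file types named in the quoted condition.
    enum NameNodeFile { IMAGE, IMAGE_ROLLBACK }

    // Only regular checkpoint images (NameNodeFile.IMAGE) are subject to
    // the "sufficient time since last checkpoint" delta check.
    static boolean subjectToDeltaCheck(NameNodeFile file) {
        return NameNodeFile.IMAGE.equals(file);
    }

    public static void main(String[] args) {
        System.out.println(subjectToDeltaCheck(NameNodeFile.IMAGE));          // true
        System.out.println(subjectToDeltaCheck(NameNodeFile.IMAGE_ROLLBACK)); // false
    }
}
```

Under this gating, the rolling-upgrade failure reported in HDFS-15287 would require the delta check to somehow fire for `IMAGE_ROLLBACK` uploads, which is exactly what the comment argues should not happen.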
[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier
[ https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071188#comment-17071188 ] Chen Liang commented on HDFS-15191: --- [~Steven Rand] I think this patch is already in branch-3.3; the branch was cut from trunk after this got committed. Or are you referring to another branch? > EOF when reading legacy buffer in BlockTokenIdentifier > -- > > Key: HDFS-15191 > URL: https://issues.apache.org/jira/browse/HDFS-15191 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.2.1 >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Fix For: 3.2.2, 3.3.1 > > Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, > HDFS-15191.003.patch, HDFS-15191.004.patch > > > We have an HDFS client application which recently upgraded from 3.2.0 to > 3.2.1. After this upgrade (but not before), we sometimes see these errors > when this application is used with clusters still running Hadoop 2.x (more > specifically CDH 5.12.1): > {code} > WARN [2020-02-24T00:54:32.856Z] > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing > remote block reader. 
(_sampled: true) > java.io.EOFException: > at java.io.DataInputStream.readByte(DataInputStream.java:272) > at > org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) > at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329) > at > org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240) > at > org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221) > at > org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730) > at > org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644) > at > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575) > at > 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829) > at java.io.DataInputStream.read(DataInputStream.java:100) > at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314) > at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270) > at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291) > at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246) > at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765) > {code} > We get this warning for all DataNodes with a copy of the block, so the read > fails. > I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to > cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging > [~vagarychen] in case you have any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier
[ https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15191: -- Fix Version/s: 3.3.1 3.2.2 Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk and branch-3.2, thanks [~Steven Rand] for the contribution. > EOF when reading legacy buffer in BlockTokenIdentifier > -- > > Key: HDFS-15191 > URL: https://issues.apache.org/jira/browse/HDFS-15191 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.2.1 >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Fix For: 3.2.2, 3.3.1 > > Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, > HDFS-15191.003.patch, HDFS-15191.004.patch > > > We have an HDFS client application which recently upgraded from 3.2.0 to > 3.2.1. After this upgrade (but not before), we sometimes see these errors > when this application is used with clusters still running Hadoop 2.x (more > specifically CDH 5.12.1): > {code} > WARN [2020-02-24T00:54:32.856Z] > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing > remote block reader. 
(_sampled: true) > java.io.EOFException: > at java.io.DataInputStream.readByte(DataInputStream.java:272) > at > org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) > at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329) > at > org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240) > at > org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221) > at > org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730) > at > org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644) > at > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575) > at > 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829) > at java.io.DataInputStream.read(DataInputStream.java:100) > at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314) > at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270) > at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291) > at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246) > at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765) > {code} > We get this warning for all DataNodes with a copy of the block, so the read > fails. > I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to > cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging > [~vagarychen] in case you have any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier
[ https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069016#comment-17069016 ] Chen Liang commented on HDFS-15191: --- [~Steven Rand] the change makes sense to me, nice catch on the issue! +1 to v004 patch, will commit soon. > EOF when reading legacy buffer in BlockTokenIdentifier > -- > > Key: HDFS-15191 > URL: https://issues.apache.org/jira/browse/HDFS-15191 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.2.1 >Reporter: Steven Rand >Assignee: Steven Rand >Priority: Major > Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, > HDFS-15191.003.patch, HDFS-15191.004.patch > > > We have an HDFS client application which recently upgraded from 3.2.0 to > 3.2.1. After this upgrade (but not before), we sometimes see these errors > when this application is used with clusters still running Hadoop 2.x (more > specifically CDH 5.12.1): > {code} > WARN [2020-02-24T00:54:32.856Z] > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing > remote block reader. 
(_sampled: true) > java.io.EOFException: > at java.io.DataInputStream.readByte(DataInputStream.java:272) > at > org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308) > at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329) > at > org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240) > at > org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221) > at > org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730) > at > org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644) > at > org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575) > at > 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757) > at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829) > at java.io.DataInputStream.read(DataInputStream.java:100) > at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314) > at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270) > at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291) > at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246) > at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765) > {code} > We get this warning for all DataNodes with a copy of the block, so the read > fails. > I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to > cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging > [~vagarychen] in case you have any ideas.
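The stack trace above bottoms out in {{WritableUtils.readVLong}}, which hits end-of-stream while decoding the legacy token buffer. The sketch below is a minimal, hypothetical stand-in for that failure mode — it is not Hadoop's actual vint encoding or BlockTokenIdentifier parsing, just a reader that runs out of bytes the same way a truncated legacy buffer would:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class TruncatedTokenSketch {
    // Hypothetical stand-in for a variable-length reader: one length byte,
    // then that many payload bytes. Hadoop's real vint encoding differs;
    // this only illustrates the EOF failure mode seen in the stack trace.
    static long readVLongLike(DataInputStream in) throws IOException {
        byte lengthByte = in.readByte();   // throws EOFException on empty stream
        long value = 0;
        for (int i = 0; i < lengthByte; i++) {
            // throws EOFException if the buffer ends before the promised length
            value = (value << 8) | (in.readByte() & 0xff);
        }
        return value;
    }

    public static void main(String[] args) {
        // A buffer that ends earlier than the reader expects: the length byte
        // promises 4 payload bytes, but only 2 follow.
        byte[] truncated = {4, 0x12, 0x34};
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(truncated))) {
            readVLongLike(in);
            System.out.println("parsed");
        } catch (EOFException e) {
            System.out.println("EOFException while parsing legacy buffer");
        } catch (IOException e) {
            System.out.println("IOException: " + e.getMessage());
        }
    }
}
```

This matches the shape of the report: the exception surfaces at the very first point where the reader asks for more bytes than the buffer holds, before any token field is fully decoded.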
[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier
[ https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067897#comment-17067897 ] Chen Liang commented on HDFS-15191: --- Hey [~Steven Rand], Sorry I did plan to take another look, but have been busy recently. Will take a look today or tomorrow
[jira] [Updated] (HDFS-15197) Change ObserverRetryOnActiveException log to debug
[ https://issues.apache.org/jira/browse/HDFS-15197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15197: -- Status: Patch Available (was: Open) > Change ObserverRetryOnActiveException log to debug > -- > > Key: HDFS-15197 > URL: https://issues.apache.org/jira/browse/HDFS-15197 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Minor > Attachments: HDFS-15197.001.patch > > > Currently in ObserverReadProxyProvider, when an ObserverRetryOnActiveException > happens, ObserverReadProxyProvider logs a message at INFO level. This can produce > a large volume of logs in some scenarios. For example, when some job tries to > access lots of files that haven't been accessed for a long time, all these > accesses may trigger atime updates, which leads to > ObserverRetryOnActiveException. We should change this log to DEBUG.
[jira] [Updated] (HDFS-15197) Change ObserverRetryOnActiveException log to debug
[ https://issues.apache.org/jira/browse/HDFS-15197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15197: -- Attachment: HDFS-15197.001.patch
[jira] [Created] (HDFS-15197) Change ObserverRetryOnActiveException log to debug
Chen Liang created HDFS-15197: - Summary: Change ObserverRetryOnActiveException log to debug Key: HDFS-15197 URL: https://issues.apache.org/jira/browse/HDFS-15197 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Reporter: Chen Liang Assignee: Chen Liang Currently in ObserverReadProxyProvider, when an ObserverRetryOnActiveException happens, ObserverReadProxyProvider logs a message at INFO level. This can produce a large volume of logs in some scenarios. For example, when some job tries to access lots of files that haven't been accessed for a long time, all these accesses may trigger atime updates, which leads to ObserverRetryOnActiveException. We should change this log to DEBUG.
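The change described above is a log-level adjustment at one call site. As a rough, self-contained sketch (plain Java, not the actual Hadoop or slf4j code), the effect of moving the message from INFO to DEBUG is that it is suppressed under the default threshold unless an operator deliberately lowers it:

```java
public class LogLevelSketch {
    enum Level { DEBUG, INFO, WARN }

    // The logger's configured threshold; assume the common default of INFO,
    // under which DEBUG messages are dropped.
    static Level threshold = Level.INFO;

    static boolean isLoggable(Level msgLevel) {
        return msgLevel.ordinal() >= threshold.ordinal();
    }

    public static void main(String[] args) {
        // Before the change: every ObserverRetryOnActiveException logs at INFO,
        // so a job touching many cold files floods the client log.
        System.out.println(isLoggable(Level.INFO));   // true: emitted

        // After the change: the same message at DEBUG is suppressed by default.
        System.out.println(isLoggable(Level.DEBUG));  // false: suppressed
    }
}
```

The message is still available for debugging sessions where the threshold is lowered to DEBUG; only the default-configuration noise goes away.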
[jira] [Comment Edited] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier
[ https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043864#comment-17043864 ] Chen Liang edited comment on HDFS-15191 at 2/24/20 9:01 PM: --- There could be a token compatibility issue, though, if you only have HDFS-13617 but not HDFS-14611. If both changes are there, this should be fine. But even if HDFS-14611 is missing, I would expect a different error, because it seems the error happened at the very first call of {{readVLong}} when parsing the token, and those two Jiras only change the behavior of the tail of the block token. Also, even if we hit a compatibility issue, I expect it to only affect the selective SASL feature. Will be watching this issue.
[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier
[ https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043864#comment-17043864 ] Chen Liang commented on HDFS-15191: --- There could be a token compatibility issue, though, if you only have HDFS-13617 but not HDFS-14611. If both changes are there, this should be fine. But even if HDFS-14611 is missing, I would expect a different error, because it seems the error happened at the very first call of {{readVLong}} when parsing the token, and those two Jiras only change the behavior of the tail of the block token.
[jira] [Commented] (HDFS-15185) StartupProgress reports edits segments until the entire startup completes
[ https://issues.apache.org/jira/browse/HDFS-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041290#comment-17041290 ] Chen Liang commented on HDFS-15185: --- I have tested this fix on a real cluster; the patch did get rid of the excessive ByteString displays. +1 with the Jenkins warnings addressed. > StartupProgress reports edits segments until the entire startup completes > - > > Key: HDFS-15185 > URL: https://issues.apache.org/jira/browse/HDFS-15185 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Attachments: HDFS-15185.001.patch > > > The Startup Progress page keeps reporting edits segments after the {{LOAD_EDITS}} > stage is complete. New steps are added to StartupProgress during journal > tailing until all startup phases are completed. This adds a lot of edits > steps, since the {{SAFEMODE}} phase can take a long time on a large cluster. > With fast tailing the segments are small, but the number of them is large - > 160K. This makes the page load forever.
[jira] [Assigned] (HDFS-15168) ABFS driver enhancement - Translate AAD Service Principal and Security Group To Linux user and group
[ https://issues.apache.org/jira/browse/HDFS-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang reassigned HDFS-15168: - Assignee: Karthik Amarnath > ABFS driver enhancement - Translate AAD Service Principal and Security Group > To Linux user and group > > > Key: HDFS-15168 > URL: https://issues.apache.org/jira/browse/HDFS-15168 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Reporter: Karthik Amarnath >Assignee: Karthik Amarnath >Priority: Major > > The ABFS driver does not support translation of an AAD Service Principal (SPI) > to a Linux identity, causing metadata operation failures. The Hadoop MapReduce > client > [JobSubmissionFiles|https://github.com/apache/hadoop/blob/d842dfffa53c8b565f3d65af44ccd7e1cc706733/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmissionFiles.java#L138] > expects the file owner permission to be the Linux identity, but the > underlying ABFS driver returns the AAD object identity. Hence the ABFS > driver needs this enhancement.
[jira] [Commented] (HDFS-15153) TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035573#comment-17035573 ] Chen Liang commented on HDFS-15153: --- Closing this ticket as a duplicate of HDFS-15164 > TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails > intermittently > --- > > Key: HDFS-15153 > URL: https://issues.apache.org/jira/browse/HDFS-15153 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > > The unit test TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT is > failing consistently. This seems to be due to a log message change. We should > fix it.
[jira] [Resolved] (HDFS-15153) TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails intermittently
[ https://issues.apache.org/jira/browse/HDFS-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang resolved HDFS-15153. --- Resolution: Duplicate
[jira] [Commented] (HDFS-15164) Fix TestDelegationTokensWithHA
[ https://issues.apache.org/jira/browse/HDFS-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035560#comment-17035560 ] Chen Liang commented on HDFS-15164: --- Initially, based on my quick check, I thought this was due to the change in HDFS-15099 to return RetryOnActiveException, which changed the error message; the test asserts on captured logs, so it got a different message and failed the assertion. On a second look, it seems the actual issue is that, due to the probing period, the client was not connecting to the Standby, so the expected Standby exception was not happening. In this case, I think disabling the probing period for this test is the right fix. So +1 on the 01 patch, pending Jenkins. And thanks [~ayushtkn] again for taking a look! > Fix TestDelegationTokensWithHA > -- > > Key: HDFS-15164 > URL: https://issues.apache.org/jira/browse/HDFS-15164 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15164-01.patch > > > {noformat} > java.lang.AssertionError > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertTrue(Assert.java:52) > at > org.apache.hadoop.hdfs.server.namenode.ha.TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT(TestDelegationTokensWithHA.java:156){noformat}
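The probing-period behavior discussed in the comment can be sketched in isolation. This is a hypothetical model, not the actual ObserverReadProxyProvider implementation: within the retry period the client skips re-probing namenode states (so a test can keep talking to the wrong node), and setting the period to zero, as the test fix does, makes every call probe:

```java
public class ProbeGateSketch {
    // Hypothetical probe retry period: within this window the client reuses
    // its last view of namenode states instead of probing again.
    static long probeRetryPeriodMs = 60_000;
    // Sentinel far in the past so the very first call always probes.
    static long lastProbeMs = Long.MIN_VALUE / 2;

    static boolean shouldProbe(long nowMs) {
        if (nowMs - lastProbeMs < probeRetryPeriodMs) {
            return false;              // still inside the period: skip probing
        }
        lastProbeMs = nowMs;           // record the probe time
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldProbe(0));      // true: first probe happens
        System.out.println(shouldProbe(1_000));  // false: inside the 60s period
        probeRetryPeriodMs = 0;                  // the test fix: disable the period
        System.out.println(shouldProbe(1_001));  // true: probes on every call
    }
}
```

With the period disabled, the test client re-evaluates namenode states on each attempt, so it reliably reaches the Standby and triggers the exception the test asserts on.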
[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication
[ https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035548#comment-17035548 ] Chen Liang commented on HDFS-15086: --- The TestDelegationTokensWithHA failure is tracked under HDFS-15153; I think it broke due to a log message change. > Block scheduled counter never gets decremented if the block got deleted before > replication > --- > > Key: HDFS-15086 > URL: https://issues.apache.org/jira/browse/HDFS-15086 > Project: Hadoop HDFS > Issue Type: Improvement > Components: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch, > HDFS-15086.003.patch, HDFS-15086.004.patch, HDFS-15086.005.patch > > > If a block is scheduled for replication and the same file gets deleted, then this > block will be reported as a bad block from the DN. > For this failed replication work, the scheduled block counter never gets decremented.
[jira] [Comment Edited] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035537#comment-17035537 ] Chen Liang edited comment on HDFS-15118 at 2/12/20 5:37 PM: --- [~ayushtkn] I already filed HDFS-15153 for fixing this test, just haven't had the bandwidth to work on it. I have taken a quick look and I think it was caused by a previous fix, not this one. UPDATE: Looks like you already started working on this in HDFS-15164 and have a patch there already, so I guess we can just follow up under HDFS-15164. Thanks for picking this up! > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster > --- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch > > > We see substantial degradation in the performance of HDFS clients when Observer > reads are enabled via {{ObserverReadProxyProvider}} but there are no > ObserverNodes on the cluster.
[jira] [Commented] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035537#comment-17035537 ] Chen Liang commented on HDFS-15118: --- [~ayushtkn] I already filed HDFS-15153 for fixing this test, just haven't got the bandwidth to work on it. I have taken a quick look and I think it was caused by a previous fix, not this one.
[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15148: -- Fix Version/s: 3.3.1 2.10.1 3.2.2 3.1.4 Resolution: Fixed Status: Resolved (was: Patch Available) > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Fix For: 3.1.4, 3.2.2, 2.10.1, 3.3.1 > > Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, > HDFS-15148.003.patch, HDFS-15148.004.patch > > > In HDFS-13617, the NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message, which the DataNode later uses > to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may conflict with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration was introduced for auxiliary ports only, we should restrict the > new behavior so it does not apply to the primary port.
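The restriction described above amounts to a port check before wrapping the QOP into the block access token. A hedged sketch of that gating logic follows; the method name and port values are hypothetical illustrations, not the actual NameNode code:

```java
import java.util.Set;

public class QopPortGate {
    // Hypothetical helper illustrating the fix: wrap the established QOP into
    // the block access token only when the request arrived on an auxiliary
    // NameNode port, never on the primary (fs.defaultFS) port.
    static boolean shouldWrapQop(boolean sendQopEnabled,
                                 int requestPort,
                                 int primaryPort,
                                 Set<Integer> auxiliaryPorts) {
        return sendQopEnabled
            && requestPort != primaryPort
            && auxiliaryPorts.contains(requestPort);
    }

    public static void main(String[] args) {
        Set<Integer> aux = Set.of(8040, 8041); // assumed auxiliary ports
        int primary = 8020;                    // assumed fs.defaultFS port

        // Request on an auxiliary port: QOP is wrapped into the token.
        System.out.println(shouldWrapQop(true, 8040, primary, aux)); // true
        // Request on the primary port: skip wrapping, so existing SASL
        // settings like dfs.data.transfer.protection keep working unchanged.
        System.out.println(shouldWrapQop(true, 8020, primary, aux)); // false
    }
}
```

Keeping the primary port out of the new behavior is what avoids the conflict with pre-existing SASL configuration on clients that only know the fs.defaultFS address.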
[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030085#comment-17030085 ] Chen Liang commented on HDFS-15148: --- Thanks [~shv]! I have filed HDFS-15146 to fix the test. Will commit v04 patch shortly.
[jira] [Created] (HDFS-15153) TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails intermittently
Chen Liang created HDFS-15153: - Summary: TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails intermittently Key: HDFS-15153 URL: https://issues.apache.org/jira/browse/HDFS-15153 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Chen Liang Assignee: Chen Liang The unit test TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT is failing consistently. This seems to be due to a log message change. We should fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029185#comment-17029185 ] Chen Liang commented on HDFS-15148: --- {{testObserverReadProxyProviderWithDT}} failure is unrelated and happens even without this patch. We should look into fixing this test failure, but that should be in another jira. [~shv], mind taking a look at the v004 patch? > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, > HDFS-15148.003.patch, HDFS-15148.004.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15148: -- Attachment: HDFS-15148.004.patch > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, > HDFS-15148.003.patch, HDFS-15148.004.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028536#comment-17028536 ] Chen Liang commented on HDFS-15148: --- Thanks for taking a look [~shv]! Posted the v004 patch. > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, > HDFS-15148.003.patch, HDFS-15148.004.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026282#comment-17026282 ] Chen Liang commented on HDFS-15148: --- The failed test TestMultipleNNPortQOP seems unrelated to the change in this jira, and has been passing in my local runs. I think it failed because the hard-coded 100ms sleep may not be long enough for a Jenkins run, so this is a test that may randomly fail if unlucky. Although I updated the patch here with a fix, since it is a separate issue, maybe this test fix should be in another Jira. [~shv], please let me know if you have a preference. > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, > HDFS-15148.003.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
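The fix idea in the comment above (a fixed 100ms sleep can be too short on a loaded Jenkins host) is usually addressed by polling for the condition with a bounded timeout instead of sleeping once. This is a hypothetical sketch, not the actual TestMultipleNNPortQOP change; the `wait_for` helper below is only modeled loosely on Hadoop's GenericTestUtils.waitFor.

```python
import time

def wait_for(check, interval_ms=50, timeout_ms=5000):
    """Poll `check` until it returns True or the timeout expires.

    Replaces a fixed sleep with a bounded wait: fast machines pass as soon
    as the condition holds, slow machines get up to `timeout_ms` instead of
    failing after a single hard-coded delay.
    """
    deadline = time.monotonic() + timeout_ms / 1000.0
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_ms / 1000.0)
    return False
```

A test would then call `wait_for(lambda: condition_holds(), timeout_ms=10000)` and assert the result, rather than `Thread.sleep(100)` followed by an immediate assertion.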
[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15148: -- Attachment: HDFS-15148.003.patch > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, > HDFS-15148.003.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15118: -- Fix Version/s: 2.10.1 3.2.2 3.1.4 3.3.0 Resolution: Fixed Status: Resolved (was: Patch Available) > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026199#comment-17026199 ] Chen Liang commented on HDFS-15118: --- The failed tests are unrelated. I've committed to trunk, branch-3.2, branch-3.1 and branch-2.10, with the checkstyle issue fixed at commit time. Thanks for the review [~shv]! > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15148: -- Attachment: HDFS-15148.002.patch > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025434#comment-17025434 ] Chen Liang commented on HDFS-15148: --- {{TestBlockTokenWrappingQOP}} test failure is actually related; updated with the v02 patch to fix it. > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024808#comment-17024808 ] Chen Liang commented on HDFS-15118: --- Thanks for the catch [~shv]! Updated in v02 patch > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15118: -- Attachment: HDFS-15118.002.patch > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15148: -- Status: Patch Available (was: Open) > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
[ https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15148: -- Attachment: HDFS-15148.001.patch > dfs.namenode.send.qop.enabled should not apply to primary NN port > - > > Key: HDFS-15148 > URL: https://issues.apache.org/jira/browse/HDFS-15148 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Chen Liang >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15148.001.patch > > > In HDFS-13617, NameNode can be configured to wrap its established QOP into > the block access token as an encrypted message. Later on, DataNode will use this > message to create the SASL connection. But this new behavior should only apply to > the new auxiliary NameNode ports, not the primary port (the one configured in > fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related > configuration (e.g. dfs.data.transfer.protection). Since this > configuration is introduced for auxiliary ports only, we should restrict this > new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port
Chen Liang created HDFS-15148: - Summary: dfs.namenode.send.qop.enabled should not apply to primary NN port Key: HDFS-15148 URL: https://issues.apache.org/jira/browse/HDFS-15148 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.10.1, 3.3.1 Reporter: Chen Liang Assignee: Chen Liang In HDFS-13617, NameNode can be configured to wrap its established QOP into the block access token as an encrypted message. Later on, DataNode will use this message to create the SASL connection. But this new behavior should only apply to the new auxiliary NameNode ports, not the primary port (the one configured in fs.defaultFS), as it may cause conflicting behavior with other existing SASL-related configuration (e.g. dfs.data.transfer.protection). Since this configuration is introduced for auxiliary ports only, we should restrict this new behavior to not apply to the primary port. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
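The gating described in this issue (wrap the QOP into the block access token only for auxiliary ports, never for the port in fs.defaultFS) can be sketched as a simple predicate. This is a hypothetical illustration, not Hadoop's actual NameNode code; the function name and parameters are invented for clarity.

```python
def should_wrap_qop(send_qop_enabled, request_port, primary_port):
    """Decide whether to embed the established QOP in the block access token.

    Hypothetical sketch: the feature flag corresponds to
    dfs.namenode.send.qop.enabled, and the check excludes the primary RPC
    port so existing SASL settings (e.g. dfs.data.transfer.protection)
    keep governing connections on that port.
    """
    return send_qop_enabled and request_port != primary_port
```

With this shape, enabling the flag changes behavior only for requests that arrived on an auxiliary listener, which is the restriction the issue asks for.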
[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15118: -- Status: Patch Available (was: Open) > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15118.001.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15118: -- Attachment: HDFS-15118.001.patch > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Priority: Major > Attachments: HDFS-15118.001.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.
[ https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang reassigned HDFS-15118: - Assignee: Chen Liang > [SBN Read] Slow clients when Observer reads are enabled but there are no > Observers on the cluster. > -- > > Key: HDFS-15118 > URL: https://issues.apache.org/jira/browse/HDFS-15118 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15118.001.patch > > > We see substantial degradation in performance of HDFS clients, when Observer > reads are enabled via {{ObserverReadProxyProvider}}, but there are no > ObserverNodes on the cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12943) Consistent Reads from Standby Node
[ https://issues.apache.org/jira/browse/HDFS-12943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015234#comment-17015234 ] Chen Liang commented on HDFS-12943: --- [~lindy_hopper] access time update is a write call so it can not be processed by Observer. Access time should be turned off on Observer, as mentioned in HDFS-14959. > Consistent Reads from Standby Node > -- > > Key: HDFS-12943 > URL: https://issues.apache.org/jira/browse/HDFS-12943 > Project: Hadoop HDFS > Issue Type: New Feature > Components: hdfs >Reporter: Konstantin Shvachko >Assignee: Konstantin Shvachko >Priority: Major > Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2 > > Attachments: ConsistentReadsFromStandbyNode.pdf, > ConsistentReadsFromStandbyNode.pdf, HDFS-12943-001.patch, > HDFS-12943-002.patch, HDFS-12943-003.patch, HDFS-12943-004.patch, > TestPlan-ConsistentReadsFromStandbyNode.pdf > > > StandbyNode in HDFS is a replica of the active NameNode. The states of the > NameNodes are coordinated via the journal. It is natural to consider > StandbyNode as a read-only replica. As with any replicated distributed system > the problem of stale reads should be resolved. Our main goal is to provide > reads from standby in a consistent way in order to enable a wide range of > existing applications running on top of HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
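For reference on the comment above: turning off access time updates, so that reads never trigger a write that the Observer cannot handle, is done with the standard precision setting in hdfs-site.xml, where a value of 0 disables aTime updates entirely (the default is 1 hour).

```xml
<!-- hdfs-site.xml: 0 disables INode access time updates; default is 3600000 ms -->
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>0</value>
</property>
```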
[jira] [Updated] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode
[ https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15099: -- Attachment: HDFS-15099-branch-2.10.003.patch > [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on > an attempt to change aTime on ObserverNode > > > Key: HDFS-15099 > URL: https://issues.apache.org/jira/browse/HDFS-15099 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15099-branch-2.10.001.patch, > HDFS-15099-branch-2.10.002.patch, HDFS-15099-branch-2.10.003.patch > > > The precision of updating an INode's aTime while executing > {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by > ObserverNode, so the call should be redirected to the Active NameNode. In order > to redirect to the active, the ObserverNode should throw > {{ObserverRetryOnActiveException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode
[ https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013159#comment-17013159 ] Chen Liang commented on HDFS-15099: --- Checked with Konstantin offline; a better approach for the test seems to be to not rely on an unpredictable time diff, but rather to manipulate {{fs.setTimes()}}. Posted the v003 patch. > [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on > an attempt to change aTime on ObserverNode > > > Key: HDFS-15099 > URL: https://issues.apache.org/jira/browse/HDFS-15099 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15099-branch-2.10.001.patch, > HDFS-15099-branch-2.10.002.patch, HDFS-15099-branch-2.10.003.patch > > > The precision of updating an INode's aTime while executing > {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by > ObserverNode, so the call should be redirected to the Active NameNode. In order > to redirect to the active, the ObserverNode should throw > {{ObserverRetryOnActiveException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode
[ https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013146#comment-17013146 ] Chen Liang commented on HDFS-15099: --- Thanks for the great suggestions [~shv]! Posted the v002 patch. The only difference is that I removed {code:java} +dfs.open(testPath).close(); +assertSentTo(2); {code} Because in my test, it seems that if the machine is slow enough, the 200ms access time may have already passed here, and this call went to the active already, failing the assertion. I removed this check completely, as I think it is just to verify that an open can go to the observer, which is already covered by other tests, so there should be no need to have it here. > [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on > an attempt to change aTime on ObserverNode > > > Key: HDFS-15099 > URL: https://issues.apache.org/jira/browse/HDFS-15099 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15099-branch-2.10.001.patch, > HDFS-15099-branch-2.10.002.patch > > > The precision of updating an INode's aTime while executing > {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by > ObserverNode, so the call should be redirected to the Active NameNode. In order > to redirect to the active, the ObserverNode should throw > {{ObserverRetryOnActiveException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
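The redirect logic this issue describes can be sketched as follows: when a read would require an aTime update (the configured precision has elapsed) and the serving node is an Observer, the call must be bounced to the Active. This is a hypothetical sketch in Python, not the actual NameNode implementation; the function shape and time handling are invented for illustration.

```python
# dfs.namenode.accesstime.precision default: 1 hour, in milliseconds
ACCESSTIME_PRECISION_MS = 3600 * 1000

class ObserverRetryOnActiveException(Exception):
    """Signals the client proxy to re-issue this call against the Active NN."""

def get_block_locations(inode_atime_ms, now_ms, is_observer,
                        precision_ms=ACCESSTIME_PRECISION_MS):
    # An aTime update is due once more than `precision_ms` has elapsed
    # since the last recorded access (precision 0 disables updates).
    needs_atime_update = precision_ms > 0 and now_ms > inode_atime_ms + precision_ms
    if needs_atime_update and is_observer:
        # Observers cannot write; tell the client to retry on the Active.
        raise ObserverRetryOnActiveException("aTime update must go to Active")
    return "block locations"
```

This also explains the flaky test above: with a short precision, whether a given open lands inside or outside the window depends on machine speed, so driving the clock explicitly via {{fs.setTimes()}} makes the outcome deterministic.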
[jira] [Updated] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode
[ https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15099: -- Attachment: HDFS-15099-branch-2.10.002.patch > [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on > an attempt to change aTime on ObserverNode > > > Key: HDFS-15099 > URL: https://issues.apache.org/jira/browse/HDFS-15099 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15099-branch-2.10.001.patch, > HDFS-15099-branch-2.10.002.patch > > > The precision of updating an INode's aTime while executing > {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by > ObserverNode, so the call should be redirected to the Active NameNode. In order > to redirect to the active, the ObserverNode should throw > {{ObserverRetryOnActiveException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode
[ https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15099: -- Status: Patch Available (was: Open) > [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on > an attempt to change aTime on ObserverNode > > > Key: HDFS-15099 > URL: https://issues.apache.org/jira/browse/HDFS-15099 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15099-branch-2.10.001.patch > > > The precision of updating an INode's aTime while executing > {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by > ObserverNode, so the call should be redirected to the Active NameNode. In order > to redirect to the active, the ObserverNode should throw > {{ObserverRetryOnActiveException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode
[ https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15099: -- Attachment: HDFS-15099-branch-2.10.001.patch > [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on > an attempt to change aTime on ObserverNode > > > Key: HDFS-15099 > URL: https://issues.apache.org/jira/browse/HDFS-15099 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Attachments: HDFS-15099-branch-2.10.001.patch > > > The precision of updating an INode's aTime while executing > {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by > ObserverNode, so the call should be redirected to the Active NameNode. In order > to redirect to the Active, the ObserverNode should throw > {{ObserverRetryOnActiveException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
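The HDFS-15099 description above can be sketched as a small decision: if serving {{getBlockLocations()}} would require updating the INode's aTime (the precision window, 1 hour by default, has expired), an Observer cannot apply that write and should make the client retry on the Active. This is a hypothetical simplification for illustration only, not the NameNode code; {{ObserverRetryOnActiveException}} and the 1-hour default come from the issue, everything else (class and method names) is invented:

```java
// Hypothetical sketch of the redirect described in HDFS-15099.
public class ATimeRedirectSketch {
    // dfs.namenode.accesstime.precision defaults to 1 hour.
    static final long PRECISION_MS = 60L * 60L * 1000L;

    /** True when the aTime is stale enough that the read implies a write. */
    static boolean needsATimeUpdate(long lastAccessMs, long nowMs) {
        return nowMs > lastAccessMs + PRECISION_MS;
    }

    /** Stand-in for the real ObserverRetryOnActiveException. */
    static class ObserverRetryOnActiveException extends Exception {}

    static String getBlockLocations(boolean isObserver, long aTimeMs, long nowMs)
            throws ObserverRetryOnActiveException {
        if (isObserver && needsATimeUpdate(aTimeMs, nowMs)) {
            // The Observer must not mutate namespace state; signal the
            // client to fail this call over to the Active NameNode.
            throw new ObserverRetryOnActiveException();
        }
        return "block locations"; // normal read path
    }
}
```

The point of the exception (rather than silently skipping the aTime update) is that the client-side proxy provider can transparently retry the same call against the Active, so aTime semantics stay intact under Consistent Reads from Standby.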
[jira] [Comment Edited] (HDFS-14655) [SBN Read] Namenode crashes if one of The JN is down
[ https://issues.apache.org/jira/browse/HDFS-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007828#comment-17007828 ] Chen Liang edited comment on HDFS-14655 at 1/3/20 11:25 PM: Although it's a different message, checked again, does look like HDFS-14934 should fix this too. Thanks for the pointer [~ayushtkn]! was (Author: vagarychen): HDFS-14934 does look like the fix. Thanks for the pointer [~ayushtkn]! > [SBN Read] Namenode crashes if one of The JN is down > > > Key: HDFS-14655 > URL: https://issues.apache.org/jira/browse/HDFS-14655 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Harshakiran Reddy >Assignee: Ayush Saxena >Priority: Critical > Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14655-01.patch, HDFS-14655-02.patch, > HDFS-14655-03.patch, HDFS-14655-04.patch, HDFS-14655-05.patch, > HDFS-14655-06.patch, HDFS-14655-07.patch, HDFS-14655-08.patch, > HDFS-14655-branch-2-01.patch, HDFS-14655-branch-2-02.patch, > HDFS-14655.poc.patch > > > {noformat} > 2019-07-04 17:35:54,064 | INFO | Logger channel (from parallel executor) to > XXX/XXX | Retrying connect to server: XXX/XXX. Already tried > 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) | Client.java:975 > 2019-07-04 17:35:54,087 | FATAL | Edit log tailer | Unknown error encountered > while tailing edits. Shutting down standby NN. 
| EditLogTailer.java:474 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) > at > com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:440) > at > com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:56) > at > org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getJournaledEdits(IPCLoggerChannel.java:565) > at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournaledEdits(AsyncLoggerSet.java:272) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectRpcInputStreams(QuorumJournalManager.java:533) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:508) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:275) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1681) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1714) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:307) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:360) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) > at > 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:483) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) > 2019-07-04 17:35:54,112 | INFO | Edit log tailer | Exiting with status 1: > java.lang.OutOfMemoryError: unable to create new native thread | > ExitUtil.java:210 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14655) [SBN Read] Namenode crashes if one of The JN is down
[ https://issues.apache.org/jira/browse/HDFS-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007828#comment-17007828 ] Chen Liang commented on HDFS-14655: --- HDFS-14934 does look like the fix. Thanks for the pointer [~ayushtkn]! > [SBN Read] Namenode crashes if one of The JN is down > > > Key: HDFS-14655 > URL: https://issues.apache.org/jira/browse/HDFS-14655 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Harshakiran Reddy >Assignee: Ayush Saxena >Priority: Critical > Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14655-01.patch, HDFS-14655-02.patch, > HDFS-14655-03.patch, HDFS-14655-04.patch, HDFS-14655-05.patch, > HDFS-14655-06.patch, HDFS-14655-07.patch, HDFS-14655-08.patch, > HDFS-14655-branch-2-01.patch, HDFS-14655-branch-2-02.patch, > HDFS-14655.poc.patch > > > {noformat} > 2019-07-04 17:35:54,064 | INFO | Logger channel (from parallel executor) to > XXX/XXX | Retrying connect to server: XXX/XXX. Already tried > 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) | Client.java:975 > 2019-07-04 17:35:54,087 | FATAL | Edit log tailer | Unknown error encountered > while tailing edits. Shutting down standby NN. 
| EditLogTailer.java:474 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) > at > com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:440) > at > com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:56) > at > org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getJournaledEdits(IPCLoggerChannel.java:565) > at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournaledEdits(AsyncLoggerSet.java:272) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectRpcInputStreams(QuorumJournalManager.java:533) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:508) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:275) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1681) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1714) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:307) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:360) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) > at > 
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:483) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) > 2019-07-04 17:35:54,112 | INFO | Edit log tailer | Exiting with status 1: > java.lang.OutOfMemoryError: unable to create new native thread | > ExitUtil.java:210 > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14655) [SBN Read] Namenode crashes if one of The JN is down
[ https://issues.apache.org/jira/browse/HDFS-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007662#comment-17007662 ] Chen Liang commented on HDFS-14655: --- [~ayushtkn] shared below, it may not help too much though, as it seems to be thrown from the thread being cancelled {code:java} 2020-01-03 17:50:10,887 WARN org.apache.hadoop.util.concurrent.ExecutorHelper: Caught exception in thread Logger channel (from parallel executor) to [...some JN hostname:port...]: 2020-01-03 17:50:10,887 WARN org.apache.hadoop.util.concurrent.ExecutorHelper: Caught exception in thread Logger channel (from parallel executor) to [...some JN hostname:port...]:java.util.concurrent.CancellationException at java.util.concurrent.FutureTask.report(FutureTask.java:121) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.hadoop.util.concurrent.ExecutorHelper.logThrowableFromAfterExecute(ExecutorHelper.java:47) at org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor.afterExecute(HadoopThreadPoolExecutor.java:90) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} > [SBN Read] Namenode crashes if one of The JN is down > > > Key: HDFS-14655 > URL: https://issues.apache.org/jira/browse/HDFS-14655 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Harshakiran Reddy >Assignee: Ayush Saxena >Priority: Critical > Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14655-01.patch, HDFS-14655-02.patch, > HDFS-14655-03.patch, HDFS-14655-04.patch, HDFS-14655-05.patch, > HDFS-14655-06.patch, HDFS-14655-07.patch, HDFS-14655-08.patch, > HDFS-14655-branch-2-01.patch, HDFS-14655-branch-2-02.patch, > HDFS-14655.poc.patch > > > {noformat} > 2019-07-04 17:35:54,064 | INFO | Logger channel (from parallel executor) to > XXX/XXX | Retrying connect to server: XXX/XXX. 
Already tried > 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, > sleepTime=1000 MILLISECONDS) | Client.java:975 > 2019-07-04 17:35:54,087 | FATAL | Edit log tailer | Unknown error encountered > while tailing edits. Shutting down standby NN. | EditLogTailer.java:474 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378) > at > com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:440) > at > com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:56) > at > org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getJournaledEdits(IPCLoggerChannel.java:565) > at > org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournaledEdits(AsyncLoggerSet.java:272) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectRpcInputStreams(QuorumJournalManager.java:533) > at > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:508) > at > org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:275) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1681) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1714) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:307) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410) > at > 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:360) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:483) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423) > 2019-07-04 17:35:54,112 | INFO | Edit log tailer | Exiting with status 1: > java.lang.OutOfMemoryError: unable to create new native thread | >
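The {{OutOfMemoryError: unable to create new native thread}} in the trace above is the classic failure mode of an executor that grows without bound while its tasks block: with one JournalNode down, each {{getJournaledEdits}} call sits in a ~10 x 1000 ms retry loop, and every edit-tail iteration submits more work, so the pool keeps creating threads until the OS refuses. The sketch below is illustrative only, not Hadoop's code or the HDFS-14934 fix; it just contrasts an unbounded cached pool with a bounded pool plus queue:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class UnboundedExecutorSketch {
    static String runOnBoundedPool() throws Exception {
        // Anti-pattern (shown, not exercised): a cached pool spawns a new
        // thread whenever all workers are busy. If every task blocks on a
        // dead JournalNode, thread count grows with each tailing iteration
        // until native thread creation fails.
        ExecutorService risky = Executors.newCachedThreadPool();
        risky.shutdown();

        // A bounded pool caps thread usage: at most 4 workers, overflow
        // waits in the queue (or runs in the caller) instead of exhausting
        // native threads.
        ThreadPoolExecutor bounded = new ThreadPoolExecutor(
                1, 4, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(16),
                new ThreadPoolExecutor.CallerRunsPolicy());
        Future<String> f = bounded.submit(() -> "ok");
        String result = f.get();
        bounded.shutdown();
        return result;
    }
}
```

Bounding the pool only caps the damage; the underlying issue of tasks piling up against a down JN still needs the timeout/cancellation handling referenced via HDFS-14934 above.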
[jira] [Commented] (HDFS-15036) Active NameNode should not silently fail the image transfer
[ https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998561#comment-16998561 ] Chen Liang commented on HDFS-15036: --- [~Jim_Brennan] I filed https://issues.apache.org/jira/browse/INFRA-19581, but haven't got update from Infra folks yet. > Active NameNode should not silently fail the image transfer > --- > > Key: HDFS-15036 > URL: https://issues.apache.org/jira/browse/HDFS-15036 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-15036.001.patch, HDFS-15036.002.patch, > HDFS-15036.003.patch > > > Image transfer from Standby NameNode to Active silently fails on Active, > without any logging and not notifying the receiver side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15036) Active NameNode should not silently fail the image transfer
[ https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16996000#comment-16996000 ] Chen Liang commented on HDFS-15036: --- Oops! Did not realize it's already deleted, guess I missed the messages... will work on deleting it again... > Active NameNode should not silently fail the image transfer > --- > > Key: HDFS-15036 > URL: https://issues.apache.org/jira/browse/HDFS-15036 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-15036.001.patch, HDFS-15036.002.patch, > HDFS-15036.003.patch > > > Image transfer from Standby NameNode to Active silently fails on Active, > without any logging and not notifying the receiver side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15036) Active NameNode should not silently fail the image transfer
[ https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Liang updated HDFS-15036: -- Fix Version/s: 3.2.2 3.1.4 > Active NameNode should not silently fail the image transfer > --- > > Key: HDFS-15036 > URL: https://issues.apache.org/jira/browse/HDFS-15036 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Chen Liang >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1 > > Attachments: HDFS-15036.001.patch, HDFS-15036.002.patch, > HDFS-15036.003.patch > > > Image transfer from Standby NameNode to Active silently fails on Active, > without any logging and not notifying the receiver side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
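The HDFS-15036 failure mode above, an error path that neither logs nor reports back to the sender, can be illustrated with a toy handler. This is a hypothetical sketch, not the actual patch or the NameNode's image-transfer servlet; the method names and the in-memory log are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class ImageTransferSketch {
    static final List<String> LOG = new ArrayList<>();

    /** Anti-pattern: swallow the failure; the sender sees HTTP 200. */
    static int handleSilently(boolean transferOk) {
        // Nothing logged, nothing reported -- the Standby believes the
        // checkpoint upload succeeded even when it did not.
        return 200;
    }

    /** Log the failure and surface an error status to the sender. */
    static int handleLoudly(boolean transferOk) {
        if (!transferOk) {
            LOG.add("ERROR: image transfer from Standby failed on Active");
            return 500; // receiver side is notified and can retry/alert
        }
        return 200;
    }
}
```

The substance of the fix is the second shape: on failure, write a log entry on the Active and return a non-2xx status so the uploading Standby is actually notified.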