[jira] [Updated] (HDFS-14272) [SBN read] ObserverReadProxyProvider should sync with active txnID on startup

2020-12-15 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-14272:
--
Fix Version/s: 2.10.0
   3.2.1
   3.1.3

> [SBN read] ObserverReadProxyProvider should sync with active txnID on startup
> -
>
> Key: HDFS-14272
> URL: https://issues.apache.org/jira/browse/HDFS-14272
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: tools
> Environment: CDH6.1 (Hadoop 3.0.x) + Consistency Reads from Standby + 
> SSL + Kerberos + RPC encryption
>Reporter: Wei-Chiu Chuang
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14272.000.patch, HDFS-14272.001.patch, 
> HDFS-14272.002.patch
>
>
> It is typical for integration tests to create some files and then check their 
> existence. For example, consider the following simple bash script:
> {code:java}
> # hdfs dfs -touchz /tmp/abc
> # hdfs dfs -ls /tmp/abc
> {code}
> The test executes the HDFS bash commands sequentially, but with Consistent 
> Standby Read it may fail because the -ls does not find the file.
> Analysis: the second bash command, although launched sequentially after the 
> first one, is not aware of the state ID returned to the first bash command. 
> So the ObserverNode does not wait for the edits to be propagated, and the 
> -ls fails.
> I've got a cluster where the Observer has tens of seconds of RPC latency, and 
> this becomes very annoying. (I am still trying to figure out why this 
> Observer has such a long RPC latency. But that's another story.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15665) Balancer logging improvement

2020-11-03 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225636#comment-17225636
 ] 

Chen Liang commented on HDFS-15665:
---

Thanks for the clarification [~shv] , +1 to v002 patch 

> Balancer logging improvement
> 
>
> Key: HDFS-15665
> URL: https://issues.apache.org/jira/browse/HDFS-15665
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15665.001.patch, HDFS-15665.002.patch
>
>
> It would be good to have Balancer log all relevant configuration parameters 
> on each iteration along with some data, which reflects its progress and the 
> amount of resources it involves.






[jira] [Commented] (HDFS-15665) Balancer logging improvement

2020-11-02 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17225011#comment-17225011
 ] 

Chen Liang commented on HDFS-15665:
---

Thanks for working on this [~shv]! v001 patch looks good to me. Just two minor 
comments:

1. The {{getInt}} call at Balancer.java:L#286 seems redundant, since no variable 
takes its value.
2. Would it be better to merge the two LOG.info lines at Balancer.java:L#663 and 
L#665 into one?

> Balancer logging improvement
> 
>
> Key: HDFS-15665
> URL: https://issues.apache.org/jira/browse/HDFS-15665
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15665.001.patch
>
>
> It would be good to have Balancer log all relevant configuration parameters 
> on each iteration along with some data, which reflects its progress and the 
> amount of resources it involves.






[jira] [Commented] (HDFS-15567) [SBN Read] HDFS should expose msync() API to allow downstream applications call it explicetly.

2020-10-12 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212685#comment-17212685
 ] 

Chen Liang commented on HDFS-15567:
---

Thanks [~shv]. Makes sense, +1 on v002 patch.

> [SBN Read] HDFS should expose msync() API to allow downstream applications 
> call it explicetly.
> --
>
> Key: HDFS-15567
> URL: https://issues.apache.org/jira/browse/HDFS-15567
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15567.001.patch, HDFS-15567.002.patch
>
>
> Consistent reads from Standby introduced the {{msync()}} API (HDFS-13688), 
> which updates the client's state ID with the current state of the Active 
> NameNode to guarantee consistency of subsequent calls to an ObserverNode. 
> Currently this API is exposed via {{DFSClient}} only, which makes it hard for 
> applications to access {{msync()}}. One way is to use something like this:
> {code}
> if(fs instanceof DistributedFileSystem) {
>   ((DistributedFileSystem)fs).getClient().msync();
> }
> {code}
> This should be exposed both for {{FileSystem}} and {{FileContext}}.






[jira] [Commented] (HDFS-15567) [SBN Read] HDFS should expose msync() API to allow downstream applications call it explicetly.

2020-10-08 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210499#comment-17210499
 ] 

Chen Liang commented on HDFS-15567:
---

Thanks for working on this [~shv]! Some comments:

1. Currently, calling {{AbstractFileSystem.java}}'s msync throws 
UnsupportedOperationException. I was wondering whether it should throw 
UnsupportedOperationException or just be a no-op (similarly for 
{{FileSystem.java}}'s msync). I think making it a no-op might be better. Any 
thoughts?
2. Is the change in {{MiniDFSCluster.java}} really needed?
3. testMsyncFileContext has a LOG info call, which seems unnecessary. Also, it 
looks like it only tests FileContext; should we also test FileSystem?
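
The trade-off in the first comment can be made concrete with a small sketch. The classes below are hypothetical and are not the Hadoop `FileSystem` or `AbstractFileSystem` classes. They only contrast a base class whose default msync throws with one whose default is a no-op, as seen from generic caller code.

```java
/** Hypothetical sketch; these are not the Hadoop FileSystem classes. */
final class MsyncDefaultSketch {

  /** Option A: the base class rejects msync unless a subclass overrides it. */
  static class ThrowingBase {
    void msync() {
      throw new UnsupportedOperationException(
          getClass().getSimpleName() + " does not support msync");
    }
  }

  /** Option B: the base class silently does nothing. */
  static class NoopBase {
    void msync() {
      // Intentionally empty: nothing to sync for this file system.
    }
  }

  /** A file system with no Active/Observer notion, under each option. */
  static final class PlainFsA extends ThrowingBase { }

  static final class PlainFsB extends NoopBase { }

  /** Under option A, generic caller code has to guard every call. */
  static boolean msyncIfSupported(ThrowingBase fs) {
    try {
      fs.msync();
      return true;
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }
}
```

With the no-op default, callers can invoke msync unconditionally on any file system. With the throwing default, a caller learns explicitly that no synchronization took place.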

> [SBN Read] HDFS should expose msync() API to allow downstream applications 
> call it explicetly.
> --
>
> Key: HDFS-15567
> URL: https://issues.apache.org/jira/browse/HDFS-15567
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15567.001.patch, HDFS-15567.002.patch
>
>
> Consistent reads from Standby introduced the {{msync()}} API (HDFS-13688), 
> which updates the client's state ID with the current state of the Active 
> NameNode to guarantee consistency of subsequent calls to an ObserverNode. 
> Currently this API is exposed via {{DFSClient}} only, which makes it hard for 
> applications to access {{msync()}}. One way is to use something like this:
> {code}
> if(fs instanceof DistributedFileSystem) {
>   ((DistributedFileSystem)fs).getClient().msync();
> }
> {code}
> This should be exposed both for {{FileSystem}} and {{FileContext}}.






[jira] [Commented] (HDFS-15545) (S)Webhdfs will not use updated delegation tokens available in the ugi after the old ones expire

2020-09-02 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189568#comment-17189568
 ] 

Chen Liang commented on HDFS-15545:
---

Thanks [~ibuenros], I agree that HDFS-6222 looks to be about a different 
scenario. I'm +1 on the patch. I will take another pass on the failed tests; if 
they look good, I will commit the change, provided there are no other 
concerns/objections. 

> (S)Webhdfs will not use updated delegation tokens available in the ugi after 
> the old ones expire
> 
>
> Key: HDFS-15545
> URL: https://issues.apache.org/jira/browse/HDFS-15545
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Issac Buenrostro
>Assignee: Issac Buenrostro
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-15545.001.patch, HDFS-15545.002.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> WebHdfsFileSystem can select a delegation token to use from the current user 
> UGI. The token selection is sticky, and WebHdfsFileSystem will re-use it 
> every time without searching the UGI again.
> If the previous token expires, WebHdfsFileSystem will catch the exception and 
> attempt to get a new token. However, the mechanism to get a new token 
> bypasses searching for one on the UGI, so even if there is external logic 
> that has retrieved a new token, it is not possible to make the FileSystem use 
> the new, valid token, rendering the FileSystem object unusable.
> A simple fix would allow WebHdfsFileSystem to re-search the UGI, and if it 
> finds a different token than the cached one try to use it.
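
The proposed fix can be sketched in a toy model. This is not the WebHdfsFileSystem code: `Token`, `UserTokens`, `Client`, and `reselectAfterExpiry` are hypothetical names chosen for illustration, and expiry is reduced to a boolean flag. The sketch shows sticky selection plus a second search of the UGI that switches only when a different, usable token is present.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of sticky token selection; not the WebHdfsFileSystem code. */
final class TokenReselectSketch {

  static final class Token {
    final String id;
    boolean expired;

    Token(String id) {
      this.id = id;
    }
  }

  /** Stand-in for the tokens held by the current user's UGI. */
  static final class UserTokens {
    final List<Token> tokens = new ArrayList<>();
  }

  static final class Client {
    private final UserTokens ugi;
    private Token cached;

    Client(UserTokens ugi) {
      this.ugi = ugi;
    }

    /** Sticky selection: search the UGI once, then keep reusing the result. */
    Token currentToken() {
      if (cached == null && !ugi.tokens.isEmpty()) {
        cached = ugi.tokens.get(0);
      }
      return cached;
    }

    /**
     * Proposed behavior after an expiry failure: search the UGI again and
     * switch only if it holds a usable token other than the cached one.
     */
    boolean reselectAfterExpiry() {
      for (Token t : ugi.tokens) {
        if (t != cached && !t.expired) {
          cached = t;
          return true;
        }
      }
      return false; // nothing new in the UGI; the caller must obtain a token itself
    }
  }
}
```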






[jira] [Commented] (HDFS-15545) (S)Webhdfs will not use updated delegation tokens available in the ugi after the old ones expire

2020-09-01 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188775#comment-17188775
 ] 

Chen Liang commented on HDFS-15545:
---

Thanks for working on this [~ibuenros]! The change makes sense to me. But I 
noticed that in HDFS-6222 there seem to be concerns about how Webhdfs should 
renew the token. That seems to me to be a different scenario, so we should be 
fine, and TestWebHdfsTokens was passing here. [~daryn], do you have any 
thoughts on this change?

> (S)Webhdfs will not use updated delegation tokens available in the ugi after 
> the old ones expire
> 
>
> Key: HDFS-15545
> URL: https://issues.apache.org/jira/browse/HDFS-15545
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Issac Buenrostro
>Assignee: Issac Buenrostro
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-15545.001.patch, HDFS-15545.002.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> WebHdfsFileSystem can select a delegation token to use from the current user 
> UGI. The token selection is sticky, and WebHdfsFileSystem will re-use it 
> every time without searching the UGI again.
> If the previous token expires, WebHdfsFileSystem will catch the exception and 
> attempt to get a new token. However, the mechanism to get a new token 
> bypasses searching for one on the UGI, so even if there is external logic 
> that has retrieved a new token, it is not possible to make the FileSystem use 
> the new, valid token, rendering the FileSystem object unusable.
> A simple fix would allow WebHdfsFileSystem to re-search the UGI, and if it 
> finds a different token than the cached one try to use it.






[jira] [Commented] (HDFS-15290) NPE in HttpServer during NameNode startup

2020-08-20 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181494#comment-17181494
 ] 

Chen Liang commented on HDFS-15290:
---

I have committed v03 patch to trunk, branch-3.x. Thanks for the contribution 
[~simbadzina]! 

There is a conflict when backporting to branch-2.10 though, due to log4j usage. 
Mind providing a version for branch-2.10?

> NPE in HttpServer during NameNode startup
> -
>
> Key: HDFS-15290
> URL: https://issues.apache.org/jira/browse/HDFS-15290
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0, 2.7.8, 3.3.0
>Reporter: Konstantin Shvachko
>Assignee: Simbarashe Dzinamarira
>Priority: Major
> Attachments: HDFS-15290.001.patch, HDFS-15290.002.patch, 
> HDFS-15290.003.patch
>
>
> When NameNode starts, it first starts HttpServer, then starts loading fsImage 
> and edits. While loading, the namesystem field in NameNode is null. I saw that 
> a StandbyNode sends a checkpoint request, which fails with NPE because 
> NNStorage is not instantiated yet.
> We should check the NameNode startup status before accepting checkpoint 
> requests.
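
The suggested check can be sketched as a guard in front of the request handler. The names below (`NameNodeStub`, `handleCheckpoint`, `Result`) are hypothetical and are not taken from the patch. The sketch only shows the idea of testing startup status before touching state that is still null during loading.

```java
/** Hypothetical guard; class and method names are not from the patch. */
final class CheckpointGuardSketch {

  enum Result { REJECTED_NOT_STARTED, ACCEPTED }

  /** Stand-in for a NameNode whose image storage appears only after loading. */
  static final class NameNodeStub {
    private volatile Object storage; // null while fsImage and edits are loading

    void finishLoading() {
      storage = new Object();
    }

    boolean isStarted() {
      return storage != null;
    }
  }

  /** Check startup status first, so the handler never dereferences null. */
  static Result handleCheckpoint(NameNodeStub nn) {
    if (!nn.isStarted()) {
      return Result.REJECTED_NOT_STARTED;
    }
    return Result.ACCEPTED;
  }
}
```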






[jira] [Commented] (HDFS-15290) NPE in HttpServer during NameNode startup

2020-08-14 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178072#comment-17178072
 ] 

Chen Liang commented on HDFS-15290:
---

Thanks for working on this [~simbadzina]! The v002 patch seems to be missing 
something: the method {{getAndSetFSImageInHttpServer}} is not called anywhere.
Also two nits:
 1. One extra space on the import line (NameNodeAdapter.java L#59)
 2. One extra newline in the test (TestStandbyCheckpoints.java L#311/312)

> NPE in HttpServer during NameNode startup
> -
>
> Key: HDFS-15290
> URL: https://issues.apache.org/jira/browse/HDFS-15290
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0, 2.7.8, 3.3.0
>Reporter: Konstantin Shvachko
>Assignee: Simbarashe Dzinamarira
>Priority: Major
> Attachments: HDFS-15290.001.patch, HDFS-15290.002.patch
>
>
> When NameNode starts, it first starts HttpServer, then starts loading fsImage 
> and edits. While loading, the namesystem field in NameNode is null. I saw that 
> a StandbyNode sends a checkpoint request, which fails with NPE because 
> NNStorage is not instantiated yet.
> We should check the NameNode startup status before accepting checkpoint 
> requests.






[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source

2020-07-20 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15404:
--
Fix Version/s: 3.1.5
   3.4.0
   3.3.1
   2.10.1
   3.2.2
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.2, 2.10.1, 3.3.1, 3.4.0, 3.1.5
>
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch, HDFS-15404.005.patch, 
> HDFS-15404.006.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.
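
One way to picture the request is as two environment maps built from the same node attributes under different prefixes. The sketch below is not the ShellCommandFencer code. `roleEnv` and the attribute keys passed to it are hypothetical, and the `source_` prefix is assumed to mirror the existing `target_` family.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not the ShellCommandFencer implementation. */
final class FencingEnvSketch {

  /** Copy node attributes into an environment map under "<role>_<key>" names. */
  static Map<String, String> roleEnv(String role, Map<String, String> nodeInfo) {
    Map<String, String> env = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : nodeInfo.entrySet()) {
      env.put(role + "_" + e.getKey(), e.getValue());
    }
    return env;
  }
}
```

A fencing script run with such an environment could check which of the two variable families is set and branch to a different command for each role.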






[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source

2020-07-20 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161585#comment-17161585
 ] 

Chen Liang commented on HDFS-15404:
---

I have committed v006 patch to trunk, branch-3.3/3.2/3.1 and branch-2.10. 
Thanks Konstantin for the review!

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch, HDFS-15404.005.patch, 
> HDFS-15404.006.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source

2020-07-17 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160291#comment-17160291
 ] 

Chen Liang commented on HDFS-15404:
---

Uploaded v006 patch to address the remaining checkstyle issues. There is one 
that I didn't change, in order to stay consistent in style with other lines in 
the class.

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch, HDFS-15404.005.patch, 
> HDFS-15404.006.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source

2020-07-17 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15404:
--
Attachment: HDFS-15404.006.patch

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch, HDFS-15404.005.patch, 
> HDFS-15404.006.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source

2020-07-16 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159408#comment-17159408
 ] 

Chen Liang commented on HDFS-15404:
---

Thanks for taking a look [~shv]! In general, I agree that it should be fine with 
tooling. I don't have a specific example; I was mainly brainstorming potential 
concerns. 

Uploaded v005 patch.

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch, HDFS-15404.005.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source

2020-07-16 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15404:
--
Attachment: HDFS-15404.005.patch

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch, HDFS-15404.005.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-29 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148160#comment-17148160
 ] 

Chen Liang commented on HDFS-15404:
---

Uploaded v004 patch to fix {{TestDFSHAAdminMiniCluster}}, which was expecting to 
see {{target_*}} environment variables, while with this change the fencing on 
the source will use {{source_*}} variables. This raises another question: there 
could be existing tooling that relies on {{target_*}} on both the source and 
the destination of the fencing. This might not be the right use of the 
variables, but if such tooling exists, it may break. The question is whether we 
plan to support such use cases. Any thoughts [~shv]? 

{{TestRollingUpgrade}} passes in my local run. The other test failures are 
known flaky tests AFAIK.

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Comment Edited] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-29 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148160#comment-17148160
 ] 

Chen Liang edited comment on HDFS-15404 at 6/29/20, 9:53 PM:
-

Uploaded v004 patch to fix {{TestDFSHAAdminMiniCluster}}, which was expecting to 
see {{target_}} environment variables, while with this change the fencing on 
the source will use {{source_}} variables. This raises another question: there 
could be existing tooling that relies on {{target_*}} on both the source and 
the destination of the fencing. This might not be the right use of the 
variables, but if such tooling exists, it may break. The question is whether we 
plan to support such use cases. Any thoughts [~shv]? 

{{TestRollingUpgrade}} passes in my local run. The other test failures are 
known flaky tests AFAIK.


was (Author: vagarychen):
Upload v004 patch to fix {{TestDFSHAAdminMiniCluster}} which was expecting to 
see {{target_*}} environment variables, while with this change, the fencing on 
source will be using {{source_*}} variables. This brings in another question to 
my mind, which is that there could be existing tooling that relies on 
{{target_*}} on both source and dst of the fencing. This might not be the right 
use of the variables, but if they do exist, they may break. Do we plan to 
support such use cases is the question, any thoughts [~shv]? 

{{TestRollingUpgrade}} is passing in my local run. The other test fails are 
known flaky tests AFAIK.

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-29 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15404:
--
Attachment: HDFS-15404.004.patch

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch, HDFS-15404.004.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-24 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144249#comment-17144249
 ] 

Chen Liang commented on HDFS-15404:
---

Thanks for checking [~shv]! These three tests might have slipped through my 
previous local testing somehow. Uploaded v03 patch to fix these tests. At a 
high level, the fixes are:
1. Some cases mock fencing with a null target HA state, which this new change 
treated as an illegal state.
2. In the new fencing logic, a successful failover calls tryFence twice rather 
than once; for a failed failover, if the failure happens while fencing the 
target, fencing on the source is skipped. TestFailoverController needs to 
change to reflect this new logic. 
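
The call pattern in the second point can be sketched as follows. This is not the FailoverController code: `fence` and its predicate argument are hypothetical, and the target-then-source order is an assumption read off the comment above. The sketch records which fencing attempts are made so a test can check both the two-call success path and the skipped source path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of the two-step fencing order; not FailoverController. */
final class FenceOrderSketch {

  /**
   * Run fencing for the target, then for the source.
   * tryFence receives the role name and returns whether fencing succeeded.
   * Returns the roles for which fencing was attempted, in order.
   */
  static List<String> fence(Predicate<String> tryFence) {
    List<String> attempted = new ArrayList<>();
    attempted.add("target");
    if (!tryFence.test("target")) {
      return attempted; // target fencing failed, so source fencing is skipped
    }
    attempted.add("source");
    tryFence.test("source");
    return attempted;
  }
}
```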

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables only about the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> Sometimes it is useful to also expose info about the fencing source node. 
> One use case is to allow the source and target nodes to identify themselves 
> separately and run different commands/scripts.






[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-24 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15404:
--
Attachment: HDFS-15404.003.patch

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch, 
> HDFS-15404.003.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables about only the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> But here only the fencing target variables are exposed. Sometimes it 
> is useful to expose info about the fencing source node. One use case would be to 
> allow the source and target nodes to identify themselves separately and run 
> different commands/scripts.






[jira] [Commented] (HDFS-15421) IBR leak causes standby NN to be stuck in safe mode

2020-06-23 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143504#comment-17143504
 ] 

Chen Liang commented on HDFS-15421:
---

Thanks for reporting [~kihwal] and thanks [~aajisaka] for working on this! Good 
catch on the missing updates; the change looks good to me.

> IBR leak causes standby NN to be stuck in safe mode
> ---
>
> Key: HDFS-15421
> URL: https://issues.apache.org/jira/browse/HDFS-15421
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Kihwal Lee
>Assignee: Akira Ajisaka
>Priority: Blocker
>  Labels: release-blocker
> Attachments: HDFS-15421-000.patch, HDFS-15421-001.patch, 
> HDFS-15421.002.patch, HDFS-15421.003.patch
>
>
> After HDFS-14941, update of the global gen stamp is delayed in certain 
> situations.  This makes the last set of incremental block reports from append 
> "from future", which causes it to be simply re-queued to the pending DN 
> message queue, rather than processed to complete the block.  The last set of 
> IBRs will leak and never be cleaned up until it transitions to active.  The 
> size of {{pendingDNMessages}} constantly grows until then.
> If a leak happens while in a startup safe mode, the namenode will never be 
> able to come out of safe mode on its own.






[jira] [Commented] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-18 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140052#comment-17140052
 ] 

Chen Liang commented on HDFS-15404:
---

Uploaded the v002 patch to fix the bug that caused the failed tests. The bug is 
that parseArgs should allow cmd to contain only a command, in which case both 
src and dst will execute the same command/script.

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables about only the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> But here only the fencing target variables are exposed. Sometimes it 
> is useful to expose info about the fencing source node. One use case would be to 
> allow the source and target nodes to identify themselves separately and run 
> different commands/scripts.






[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-18 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15404:
--
Attachment: HDFS-15404.002.patch

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch, HDFS-15404.002.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables about only the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> But here only the fencing target variables are exposed. Sometimes it 
> is useful to expose info about the fencing source node. One use case would be to 
> allow the source and target nodes to identify themselves separately and run 
> different commands/scripts.






[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-13 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15404:
--
Status: Patch Available  (was: Open)

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables about only the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> But here only the fencing target variables are exposed. Sometimes it 
> is useful to expose info about the fencing source node. One use case would be to 
> allow the source and target nodes to identify themselves separately and run 
> different commands/scripts.






[jira] [Updated] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-13 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15404:
--
Attachment: HDFS-15404.001.patch

> ShellCommandFencer should expose info about source
> --
>
> Key: HDFS-15404
> URL: https://issues.apache.org/jira/browse/HDFS-15404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15404.001.patch
>
>
> Currently the HA fencing logic in ShellCommandFencer exposes environment 
> variables about only the fencing target, i.e. the $target_* variables 
> mentioned in this [document 
> page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].
>  
> But here only the fencing target variables are exposed. Sometimes it 
> is useful to expose info about the fencing source node. One use case would be to 
> allow the source and target nodes to identify themselves separately and run 
> different commands/scripts.






[jira] [Created] (HDFS-15404) ShellCommandFencer should expose info about source

2020-06-09 Thread Chen Liang (Jira)
Chen Liang created HDFS-15404:
-

 Summary: ShellCommandFencer should expose info about source
 Key: HDFS-15404
 URL: https://issues.apache.org/jira/browse/HDFS-15404
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chen Liang
Assignee: Chen Liang


Currently the HA fencing logic in ShellCommandFencer exposes environment 
variables about only the fencing target, i.e. the $target_* variables 
mentioned in this [document 
page|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html].

But here only the fencing target variables are exposed. Sometimes it is 
useful to expose info about the fencing source node. One use case would be to 
allow the source and target nodes to identify themselves separately and run 
different commands/scripts.
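
As a rough illustration of the request, the sketch below builds an environment map that carries both sets of variables. It is only a sketch under assumptions: the class `FencingEnvSketch`, the method `buildEnv`, and in particular the `source_` prefix are hypothetical names chosen to mirror the existing `$target_*` convention, not something confirmed by the issue text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; the "source_" prefix is an assumption for illustration. */
final class FencingEnvSketch {
    /**
     * Builds the environment passed to a fencing script, prefixing each
     * property of the fencing target with "target_" and each property of
     * the fencing source with "source_".
     */
    static Map<String, String> buildEnv(Map<String, String> targetInfo,
                                        Map<String, String> sourceInfo) {
        Map<String, String> env = new LinkedHashMap<>();
        targetInfo.forEach((key, value) -> env.put("target_" + key, value));
        sourceInfo.forEach((key, value) -> env.put("source_" + key, value));
        return env;
    }
}
```

With both prefixes present, a single script could tell from its environment which node is the fencing source and which is the target, and branch accordingly.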






[jira] [Commented] (HDFS-15368) TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally

2020-05-22 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114247#comment-17114247
 ] 

Chen Liang commented on HDFS-15368:
---

Thanks [~hexiaoqiao]! Will look into it, but one quick question, was the run 
based on trunk? Because the line number in the trace does not seem to match the 
trunk code.

> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally
> 
>
> Key: HDFS-15368
> URL: https://issues.apache.org/jira/browse/HDFS-15368
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
>  Labels: balancer, test
> Attachments: HDFS-15368.001.patch, 
> TestBalancerWithHANameNodes.testBalancerObserver.log
>
>
> While working on HDFS-13183, I found that 
> TestBalancerWithHANameNodes#testBalancerWithObserver fails occasionally 
> because of the following code segment. Consider 1 ANN + 1 SBN + 2 ONN: 
> when getBlocks is invoked with the Observer Read feature enabled, the request 
> could go to either of the two ObserverNNs, based on my observation. So 
> verifying only the first ObserverNN and checking the number of #getBlocks 
> invocations on it does not behave as expected.
> {code:java}
>   for (int i = 0; i < cluster.getNumNameNodes(); i++) {
>     // First observer node is at idx 2, or 3 if 2 has been shut down
>     // It should get both getBlocks calls, all other NNs should see 0 calls
>     int expectedObserverIdx = withObserverFailure ? 3 : 2;
>     int expectedCount = (i == expectedObserverIdx) ? 2 : 0;
>     verify(namesystemSpies.get(i), times(expectedCount))
>         .getBlocks(any(), anyLong(), anyLong());
>   }
> {code}
> cc [~xkrogen], [~weichiu]. I am not very familiar with the Observer Read 
> feature; could you give some suggestions? 






[jira] [Commented] (HDFS-15368) TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally

2020-05-21 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113424#comment-17113424
 ] 

Chen Liang commented on HDFS-15368:
---

[~hexiaoqiao] thanks for reporting and looking into this!

It is actually expected to always hit the idx=2 observer as long as it's 
running. The reason is that, without NameNode randomization, the client will 
always try the first Observer (idx 2 in this case) before the second (idx 3 
here), unless the first observer fails to respond. So in the case of 
withObserverFailure = false, it should be the Observer with idx=2 responding 
all the time. 
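
The selection order described here can be sketched as a simple in-order scan. This is an illustrative model only, not the proxy provider's code: `ObserverOrderSketch`, `pickObserver`, and the `responds` predicate are hypothetical names.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical model of in-order observer selection; not Hadoop code. */
final class ObserverOrderSketch {
    /**
     * Tries observers in their fixed list order (no randomization) and
     * returns the index of the first one that responds, or -1 if none does.
     */
    static int pickObserver(List<Integer> observerIdxs, Predicate<Integer> responds) {
        for (int idx : observerIdxs) {
            if (responds.test(idx)) {
                return idx;
            }
        }
        return -1;
    }
}
```

In this model, with observers at idx 2 and 3 both healthy, every call lands on idx 2; idx 3 is only reached when idx 2 fails to respond, which matches what the test asserts.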

I will need to look into this. It would be helpful if you have an error stack 
trace.

> TestBalancerWithHANameNodes#testBalancerWithObserver failed occasionally
> 
>
> Key: HDFS-15368
> URL: https://issues.apache.org/jira/browse/HDFS-15368
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
>  Labels: balancer, test
> Attachments: HDFS-15368.001.patch
>
>
> While working on HDFS-13183, I found that 
> TestBalancerWithHANameNodes#testBalancerWithObserver fails occasionally 
> because of the following code segment. Consider 1 ANN + 1 SBN + 2 ONN: 
> when getBlocks is invoked with the Observer Read feature enabled, the request 
> could go to either of the two ObserverNNs, based on my observation. So 
> verifying only the first ObserverNN and checking the number of #getBlocks 
> invocations on it does not behave as expected.
> {code:java}
>   for (int i = 0; i < cluster.getNumNameNodes(); i++) {
>     // First observer node is at idx 2, or 3 if 2 has been shut down
>     // It should get both getBlocks calls, all other NNs should see 0 calls
>     int expectedObserverIdx = withObserverFailure ? 3 : 2;
>     int expectedCount = (i == expectedObserverIdx) ? 2 : 0;
>     verify(namesystemSpies.get(i), times(expectedCount))
>         .getBlocks(any(), anyLong(), anyLong());
>   }
> {code}
> cc [~xkrogen], [~weichiu]. I am not very familiar with the Observer Read 
> feature; could you give some suggestions? 






[jira] [Updated] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-05-18 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15293:
--
Fix Version/s: 3.1.5
   3.3.1
   2.10.1
   3.2.2
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Critical
>  Labels: multi-sbnn, release-blocker
> Fix For: 3.2.2, 2.10.1, 3.3.1, 3.1.5
>
> Attachments: HDFS-15293.001.patch, HDFS-15293.002.patch
>
>
> HDFS-12979 introduced logic such that, if the ANN sees consecutive fsImage 
> uploads from a Standby with a small delta compared to the previous fsImage, 
> the ANN rejects the image. This is to avoid overly frequent fsImage uploads 
> when there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-05-18 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110561#comment-17110561
 ] 

Chen Liang commented on HDFS-15293:
---

I have committed to trunk, branch-3.2, branch-3.1 and branch-2.10. Thanks to 
the reviewers!

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Critical
>  Labels: multi-sbnn, release-blocker
> Attachments: HDFS-15293.001.patch, HDFS-15293.002.patch
>
>
> HDFS-12979 introduced logic such that, if the ANN sees consecutive fsImage 
> uploads from a Standby with a small delta compared to the previous fsImage, 
> the ANN rejects the image. This is to avoid overly frequent fsImage uploads 
> when there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-05-15 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108491#comment-17108491
 ] 

Chen Liang commented on HDFS-15293:
---

Updated v002 patch to address Akira's comments.

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Critical
>  Labels: multi-sbnn, release-blocker
> Attachments: HDFS-15293.001.patch, HDFS-15293.002.patch
>
>
> HDFS-12979 introduced logic such that, if the ANN sees consecutive fsImage 
> uploads from a Standby with a small delta compared to the previous fsImage, 
> the ANN rejects the image. This is to avoid overly frequent fsImage uploads 
> when there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Updated] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-05-15 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15293:
--
Attachment: HDFS-15293.002.patch

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Critical
>  Labels: multi-sbnn, release-blocker
> Attachments: HDFS-15293.001.patch, HDFS-15293.002.patch
>
>
> HDFS-12979 introduced logic such that, if the ANN sees consecutive fsImage 
> uploads from a Standby with a small delta compared to the previous fsImage, 
> the ANN rejects the image. This is to avoid overly frequent fsImage uploads 
> when there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-05-15 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108470#comment-17108470
 ] 

Chen Liang commented on HDFS-15293:
---

Hi [~aajisaka], sorry, I have been busy dealing with some internal work. Will 
update the patch later today.

Also, [~shv], I would like to get your thoughts on this, as you have been 
looking into our internal version of this fix.

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Critical
>  Labels: multi-sbnn, release-blocker
> Attachments: HDFS-15293.001.patch
>
>
> HDFS-12979 introduced logic such that, if the ANN sees consecutive fsImage 
> uploads from a Standby with a small delta compared to the previous fsImage, 
> the ANN rejects the image. This is to avoid overly frequent fsImage uploads 
> when there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Updated] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-05-06 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15293:
--
Status: Patch Available  (was: Open)

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: multi-sbnn
> Attachments: HDFS-15293.001.patch
>
>
> HDFS-12979 introduced logic such that, if the ANN sees consecutive fsImage 
> uploads from a Standby with a small delta compared to the previous fsImage, 
> the ANN rejects the image. This is to avoid overly frequent fsImage uploads 
> when there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-05-06 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101136#comment-17101136
 ] 

Chen Liang commented on HDFS-15293:
---

Had some offline discussion with [~shv]; the txnid check does not actually 
seem to be relevant here. Posted the v001 patch. This is based on our internal 
version of this fix, with some additional logging added to capture this behavior.

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: multi-sbnn
> Attachments: HDFS-15293.001.patch
>
>
> HDFS-12979 introduced logic such that, if the ANN sees consecutive fsImage 
> uploads from a Standby with a small delta compared to the previous fsImage, 
> the ANN rejects the image. This is to avoid overly frequent fsImage uploads 
> when there are multiple Standby nodes. However, this check could be too stringent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-05-06 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15293:
--
Attachment: HDFS-15293.001.patch

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: multi-sbnn
> Attachments: HDFS-15293.001.patch
>
>
> HDFS-12979 introduced logic such that, if the ANN sees consecutive fsImage 
> uploads from a Standby with a small delta compared to the previous fsImage, 
> the ANN rejects the image. This is to avoid overly frequent fsImage uploads 
> when there are multiple Standby nodes. However, this check could be too stringent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15323) StandbyNode fails transition to active due to insufficient transaction tailing

2020-05-01 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097821#comment-17097821
 ] 

Chen Liang commented on HDFS-15323:
---

Thanks for the finding [~shv]! This is a tricky issue. In HDFS-14806, I 
encountered a similar issue during bootstrap standby, where one call limited by 
QJM_RPC_MAX_TXNS is not sufficient to catch up. In HDFS-14806, the approach we 
took was to just disable in-progress tailing during bootstrap standby, and fall 
back to the HTTP-based edit tailing. The reasoning there was that in-progress 
tailing was not meant to handle rare cases such as startup/failover, and this 
way we can avoid having multiple RPC calls. Do you think the same idea can 
apply here?
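
To make the "one call is not sufficient" point concrete, here is a small arithmetic model of capped tailing. It is a sketch, not the EditLogTailer or QJM code: `TailingCatchUpSketch`, `tailOnce`, and `callsToCatchUp` are hypothetical names, and `maxTxnsPerCall` merely plays the role of the QJM_RPC_MAX_TXNS cap mentioned above.

```java
/** Hypothetical model of tailing with a per-call transaction cap; not Hadoop code. */
final class TailingCatchUpSketch {
    /** Number of transactions one tailing call can apply, capped at maxTxnsPerCall. */
    static long tailOnce(long lastApplied, long lastWritten, long maxTxnsPerCall) {
        return Math.min(lastWritten - lastApplied, maxTxnsPerCall);
    }

    /** Number of capped calls needed before lastApplied reaches lastWritten. */
    static int callsToCatchUp(long lastApplied, long lastWritten, long maxTxnsPerCall) {
        if (maxTxnsPerCall <= 0) {
            throw new IllegalArgumentException("maxTxnsPerCall must be positive");
        }
        int calls = 0;
        while (lastApplied < lastWritten) {
            lastApplied += tailOnce(lastApplied, lastWritten, maxTxnsPerCall);
            calls++;
        }
        return calls;
    }
}
```

If a node is 12000 transactions behind and each call is capped at 5000 (illustrative numbers), a single call leaves it 7000 behind, so any step that assumes it is fully caught up after one call would be working from stale state.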

Apart from this, one minor comment. Can we add some logging around this logic? 
So we can more easily identify issues like this in the future.


> StandbyNode fails transition to active due to insufficient transaction tailing
> --
>
> Key: HDFS-15323
> URL: https://issues.apache.org/jira/browse/HDFS-15323
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, qjm
>Affects Versions: 2.7.7
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15323.000.unitTest.patch, HDFS-15323.001.patch
>
>
> StandbyNode is asked to {{transitionToActive()}}. If it fell too far behind 
> in tailing journal transactions (from QJM) it can crash with 
> {{IllegalStateException}}.






[jira] [Commented] (HDFS-14647) NPE during secure namenode startup

2020-05-01 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097589#comment-17097589
 ] 

Chen Liang commented on HDFS-14647:
---

Thanks Konstantin, I have committed the branch-2 patch to branch-2.10.

> NPE during secure namenode startup
> --
>
> Key: HDFS-14647
> URL: https://issues.apache.org/jira/browse/HDFS-14647
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14647-2.002.patch, HDFS-14647-trunk.001.patch, 
> HDFS-14647-trunk.002.patch, HDFS-14647-trunk.003.patch, 
> HDFS-14647-trunk.004.patch, HDFS-14647.001.patch
>
>
> In secure HDFS, while the Namenode is loading the fsimage, hitting the 
> Namenode through the REST API throws the exception below. (This is in 
> version 2.8.2)
> {quote}org.apache.hadoop.hdfs.web.resources.ExceptionHandler: 
> INTERNAL_SERVER_ERROR
>  java.lang.NullPointerException
>  at 
> org.apache.hadoop.hdfs.server.common.JspHelper.getTokenUGI(JspHelper.java:283)
>  at org.apache.hadoop.hdfs.server.common.JspHelper.getUGI(JspHelper.java:226)
>  at 
> org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:54)
>  at 
> org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:42)
>  at 
> com.sun.jersey.server.impl.inject.InjectableValuesProvider.getInjectableValues(InjectableValuesProvider.java:46)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$EntityParamInInvoker.getParams(AbstractResourceMethodDispatchProvider.java:153)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:203)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>  at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:87)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1353)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>  at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>  at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>  at org.mortbay.jetty.Server.handle(Server.java:326)
>  

[jira] [Commented] (HDFS-15293) Relax the condition for accepting a fsimage when receiving a checkpoint

2020-04-28 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094792#comment-17094792
 ] 

Chen Liang commented on HDFS-15293:
---

[~shv] I don't think the issue you mentioned will actually happen currently, 
because the check only skips an image if BOTH conditions are met: 1. time delta 
too small AND 2. txnid delta too small. It's an AND, not an OR.

So in the case you mentioned, it is true that the time delta will always be 
considered too small due to the ridiculously large interval, but if configured 
with a small txnid threshold, it is easy to accumulate enough transactions, so 
the txnid delta won't be considered too small. A small time delta alone does 
not lead to rejecting an image.

But indeed, it is possible that in a cluster with a ridiculously large interval 
plus an extremely light load (so txnid barely makes progress), both conditions 
will always be true. In this case the checkpoints will all be rejected. Although 
realistically I don't think there is much value in doing checkpoints in such a 
situation anyway, it is probably not a good idea to change the behavior of the 
system by effectively rejecting all images.
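
To make the AND condition concrete, here is a minimal sketch in Java. It is 
not the actual checkpoint code: the class name, method name and threshold 
parameters are all hypothetical and only illustrate the logic described above.

```java
// Hypothetical sketch of the "skip only if BOTH deltas are too small" rule.
// None of these names come from the HDFS code base.
final class CheckpointAcceptSketch {

    /** Returns true when the uploaded image would be rejected. */
    static boolean shouldReject(long timeDeltaSec, long minTimeDeltaSec,
                                long txnDelta, long minTxnDelta) {
        boolean timeTooSmall = timeDeltaSec < minTimeDeltaSec;
        boolean txnTooSmall = txnDelta < minTxnDelta;
        // AND, not OR: a small time delta alone does not reject the image.
        return timeTooSmall && txnTooSmall;
    }

    public static void main(String[] args) {
        long hugeIntervalSec = 1_000_000L; // "ridiculously large" interval
        // Busy cluster: enough transactions, so the image is accepted.
        System.out.println(shouldReject(60, hugeIntervalSec, 500, 100));
        // Very light load: both deltas too small, so the image is rejected.
        System.out.println(shouldReject(60, hugeIntervalSec, 5, 100));
    }
}
```

With a huge interval and a very light load, the second call is the case where 
every upload keeps getting rejected.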

Because of this, I'm thinking of removing the txnid condition altogether, so 
the check only looks at the time delta and allows any txnid delta. It seems 
tricky to justify blocking all the use cases with slow txnid increase. (Time 
always proceeds, but txnid does not necessarily.) I think we were mainly 
targeting the time condition originally.

> Relax the condition for accepting a fsimage when receiving a checkpoint 
> 
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>  Labels: multi-sbnn
>
> HDFS-12979 introduced the logic that, if ANN sees consecutive fsImage uploads 
> from Standby with a small delta compared to the previous fsImage, ANN would 
> reject the image. This is to avoid overly frequent fsImage uploads when 
> there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Comment Edited] (HDFS-15287) HDFS rollingupgrade prepare never finishes

2020-04-28 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094722#comment-17094722
 ] 

Chen Liang edited comment on HDFS-15287 at 4/28/20, 5:42 PM:
-

Thanks for the update [~kihwal]. Will follow up on HDFS-15293; the issue to 
resolve there should be relatively minor though. The issue mentioned in 
HDFS-15293 does not happen consistently and can lead to missing at most one 
periodic image upload.

And just to clarify, the improvement from HDFS-15036 is not specific to 
Observer. It was for multiple SBNs in general. Even without Observer, as long as 
there are multiple SBNs, there can be frequent image uploads. Conversely, even 
with Observer, if there is only one SBN, frequent uploads would not be an issue.

Regarding making this configurable, I would like to have [~shv]'s thoughts here, 
as Konstantin was opposed to adding this new config.


was (Author: vagarychen):
Thanks for the update [~kihwal]. Will follow up on HDFS-15293; the issue to 
resolve there should be relatively minor though. The issue mentioned in 
HDFS-15293 does not happen consistently and can lead to missing at most one 
periodic image upload.

And just to clarify, the improvement from HDFS-15036 is not specific to 
Observer. It was for multiple SBNs in general. Even without Observer, as long as 
there are multiple SBNs, there can be frequent image uploads. Conversely, even 
with Observer, if there is only one SBN, frequent uploads would not be an issue.

> HDFS rollingupgrade prepare never finishes
> --
>
> Key: HDFS-15287
> URL: https://issues.apache.org/jira/browse/HDFS-15287
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0, 3.3.0
>Reporter: Kihwal Lee
>Priority: Blocker
>
> After HDFS-12979, the prepare step of rolling upgrade does not work. This is 
> because it added additional check for sufficient time passing since last 
> checkpoint. Since RU rollback image creation and upload can happen any time, 
> uploading of rollback image never succeeds. For a new cluster deployed for 
> testing, it might work since it never checkpointed before.
> It was found that this check is disabled for unit tests, defeating the very 
> purpose of testing.






[jira] [Commented] (HDFS-15287) HDFS rollingupgrade prepare never finishes

2020-04-28 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094722#comment-17094722
 ] 

Chen Liang commented on HDFS-15287:
---

Thanks for the update [~kihwal]. Will follow up on HDFS-15293; the issue to 
resolve there should be relatively minor though. The issue mentioned in 
HDFS-15293 does not happen consistently and can lead to missing at most one 
periodic image upload.

And just to clarify, the improvement from HDFS-15036 is not specific to 
Observer. It was for multiple SBNs in general. Even without Observer, as long as 
there are multiple SBNs, there can be frequent image uploads. Conversely, even 
with Observer, if there is only one SBN, frequent uploads would not be an issue.

> HDFS rollingupgrade prepare never finishes
> --
>
> Key: HDFS-15287
> URL: https://issues.apache.org/jira/browse/HDFS-15287
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0, 3.3.0
>Reporter: Kihwal Lee
>Priority: Blocker
>
> After HDFS-12979, the prepare step of rolling upgrade does not work. This is 
> because it added additional check for sufficient time passing since last 
> checkpoint. Since RU rollback image creation and upload can happen any time, 
> uploading of rollback image never succeeds. For a new cluster deployed for 
> testing, it might work since it never checkpointed before.
> It was found that this check is disabled for unit tests, defeating the very 
> purpose of testing.






[jira] [Commented] (HDFS-14647) NPE during secure namenode startup

2020-04-27 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093983#comment-17093983
 ] 

Chen Liang commented on HDFS-14647:
---

[~ayushtkn] [~fengnanli], thanks for working on this issue. I see there is a 
patch that applies to branch-2, but I didn't see this fix in branch-2.10. Is 
the branch-2 version ready to be committed to branch-2.10?

> NPE during secure namenode startup
> --
>
> Key: HDFS-14647
> URL: https://issues.apache.org/jira/browse/HDFS-14647
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14647-2.002.patch, HDFS-14647-trunk.001.patch, 
> HDFS-14647-trunk.002.patch, HDFS-14647-trunk.003.patch, 
> HDFS-14647-trunk.004.patch, HDFS-14647.001.patch
>
>
> In secure HDFS, hitting the Namenode through the REST API while it is loading 
> the fsimage throws the exception below. (This is in version 2.8.2)
> {quote}org.apache.hadoop.hdfs.web.resources.ExceptionHandler: 
> INTERNAL_SERVER_ERROR
>  java.lang.NullPointerException
>  at 
> org.apache.hadoop.hdfs.server.common.JspHelper.getTokenUGI(JspHelper.java:283)
>  at org.apache.hadoop.hdfs.server.common.JspHelper.getUGI(JspHelper.java:226)
>  at 
> org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:54)
>  at 
> org.apache.hadoop.hdfs.web.resources.UserProvider.getValue(UserProvider.java:42)
>  at 
> com.sun.jersey.server.impl.inject.InjectableValuesProvider.getInjectableValues(InjectableValuesProvider.java:46)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$EntityParamInInvoker.getParams(AbstractResourceMethodDispatchProvider.java:153)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:203)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>  at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:87)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1353)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>  at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>  at 
> 

[jira] [Commented] (HDFS-15287) HDFS rollingupgrade prepare never finishes

2020-04-24 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091752#comment-17091752
 ] 

Chen Liang commented on HDFS-15287:
---

Thanks for the elaboration [~kihwal]. I did some testing in our 2.10 cluster a 
couple of times; the rollingUpgrade prepare then finalize worked fine (i.e. the 
rollback image did get uploaded to ANN successfully). So I'm still not able to 
reproduce this issue... But one quick question: any chance you were running a 
2.10 version without HDFS-15036? That Jira added the exclusion of the rollback 
image. A quick way to check whether it is present is that there should be a 
log message on SBN like "Image upload rejected by the other NameNode: " after 
HDFS-15036.

That being said, I don't have a strong preference on making it (not) 
configurable/default false, considering this feature is supposed to be an 
improvement only for the multiple-SBN scenario, which may not be the case for 
most deployments. In my initial patch under HDFS-12979 it was configurable. 
Back then [~shv] had concerns about HDFS having more and more configurations. 
So [~shv], it would be good to have your thoughts here.

> HDFS rollingupgrade prepare never finishes
> --
>
> Key: HDFS-15287
> URL: https://issues.apache.org/jira/browse/HDFS-15287
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0, 3.3.0
>Reporter: Kihwal Lee
>Priority: Blocker
>
> After HDFS-12979, the prepare step of rolling upgrade does not work. This is 
> because it added additional check for sufficient time passing since last 
> checkpoint. Since RU rollback image creation and upload can happen any time, 
> uploading of rollback image never succeeds. For a new cluster deployed for 
> testing, it might work since it never checkpointed before.
> It was found that this check is disabled for unit tests, defeating the very 
> purpose of testing.






[jira] [Comment Edited] (HDFS-15293) Relax FSImage upload time delta check restriction

2020-04-21 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088924#comment-17088924
 ] 

Chen Liang edited comment on HDFS-15293 at 4/21/20, 6:09 PM:
-

This check might be too stringent. For example, say we configure the fsImage 
interval to 6 hours, and SBN uploads image A at time 00:00, but there is a 
minor time skew before ANN actually sees this fsImage, so ANN sees it at 
00:00.010. SBN then uploads the next image at 06:00, and ANN sees this one with 
a smaller skew, at 06:00.005. ANN would consider the time delta smaller than 
the configured delta of 6 hours and would reject this image, even though the 
difference is only 5ms and should be acceptable. Essentially, the current check 
against the exact timestamp is too susceptible to random timing conditions.

The consequence of this issue is that ANN might miss one image once in a 
while. Even if ANN rejects the image at 06:00, the next time SBN uploads, at 
12:00, ANN will not reject it, because by then the delta is guaranteed to be > 
6 hours. This means there will never be more than one consecutive missing image.
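
The arithmetic of this example can be sketched in Java as follows. This is 
only an illustration under assumptions: the class and method names are 
hypothetical, timestamps are modelled as offsets from midnight, and a rejected 
image is assumed not to update the "last accepted" timestamp.

```java
import java.time.Duration;

// Hypothetical sketch of a strict time-delta check; not the HDFS code.
final class UploadSkewSketch {

    /** True when the image seen at seenAt is rejected for arriving too soon. */
    static boolean rejectedByTimeCheck(Duration lastAccepted, Duration seenAt,
                                       Duration interval) {
        return seenAt.minus(lastAccepted).compareTo(interval) < 0;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofHours(6);
        Duration first = Duration.ofMillis(10);               // 00:00 upload, 10ms skew
        Duration second = Duration.ofHours(6).plusMillis(5);  // 06:00 upload, 5ms skew
        Duration third = Duration.ofHours(12);                // 12:00 upload

        // Delta is 6h minus 5ms, so the 06:00 image is rejected.
        System.out.println(rejectedByTimeCheck(first, second, interval));
        // Delta from the last accepted image is almost 12h, so it is accepted.
        System.out.println(rejectedByTimeCheck(first, third, interval));
    }
}
```

The second call is why at most one consecutive image can be missed: the delta 
is measured from the last accepted image, not from the rejected one.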


was (Author: vagarychen):
This check might be too stringent. For example, say we configure the fsImage 
interval to 6 hours, and SBN uploads image A at time 00:00, but there is a 
minor time skew before ANN actually sees this fsImage, so ANN sees it at 
00:00.010. SBN then uploads the next image at 06:00, and ANN sees this one with 
a smaller skew, at 06:00.005. ANN would consider the time delta smaller than 
the configured delta of 6 hours and would reject this image, even though the 
difference is only 5ms and should be acceptable. Essentially, the current check 
against the exact timestamp is too susceptible to random timing conditions.

The consequence of this issue is that ANN might miss one image once in a 
while. Even if ANN rejects the image at 06:00, the next time SBN uploads, at 
12:00, ANN will not reject it. So there will never be more than one consecutive 
missing image.

> Relax FSImage upload time delta check restriction
> -
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>
> HDFS-12979 introduced the logic that, if ANN sees consecutive fsImage uploads 
> from Standby with a small delta compared to the previous fsImage, ANN would 
> reject the image. This is to avoid overly frequent fsImage uploads when 
> there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Comment Edited] (HDFS-15293) Relax FSImage upload time delta check restriction

2020-04-21 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088924#comment-17088924
 ] 

Chen Liang edited comment on HDFS-15293 at 4/21/20, 6:08 PM:
-

This check might be too stringent. For example, say we configure the fsImage 
interval to 6 hours, and SBN uploads image A at time 00:00, but there is a 
minor time skew before ANN actually sees this fsImage, so ANN sees it at 
00:00.010. SBN then uploads the next image at 06:00, and ANN sees this one with 
a smaller skew, at 06:00.005. ANN would consider the time delta smaller than 
the configured delta of 6 hours and would reject this image, even though the 
difference is only 5ms and should be acceptable. Essentially, the current check 
against the exact timestamp is too susceptible to random timing conditions.

The consequence of this issue is that ANN might miss one image once in a 
while. Even if ANN rejects the image at 06:00, the next time SBN uploads, at 
12:00, ANN will not reject it. So there will never be more than one consecutive 
missing image.


was (Author: vagarychen):
This check might be too stringent. For example, say we configure the fsImage 
interval to 6 hours, and SBN uploads image A at time 00:00, but there is a 
minor time skew before ANN actually sees this fsImage, so ANN sees it at 
00:00.010. SBN then uploads the next image at 06:00, and ANN sees this one with 
a smaller skew, at 06:00.005. ANN would consider the time delta smaller than 
the configured delta of 6 hours and would reject this image, even though the 
difference is only 5ms and should be acceptable. Essentially, the current check 
against the exact timestamp is too susceptible to random timing conditions.

> Relax FSImage upload time delta check restriction
> -
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>
> HDFS-12979 introduced the logic that, if ANN sees consecutive fsImage uploads 
> from Standby with a small delta compared to the previous fsImage, ANN would 
> reject the image. This is to avoid overly frequent fsImage uploads when 
> there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Commented] (HDFS-15287) HDFS rollingupgrade prepare never finishes

2020-04-21 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088926#comment-17088926
 ] 

Chen Liang commented on HDFS-15287:
---

I filed HDFS-15293 to relax the time interval condition. But again, it is not 
related to RU.

> HDFS rollingupgrade prepare never finishes
> --
>
> Key: HDFS-15287
> URL: https://issues.apache.org/jira/browse/HDFS-15287
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0, 3.3.0
>Reporter: Kihwal Lee
>Priority: Blocker
>
> After HDFS-12979, the prepare step of rolling upgrade does not work. This is 
> because it added additional check for sufficient time passing since last 
> checkpoint. Since RU rollback image creation and upload can happen any time, 
> uploading of rollback image never succeeds. For a new cluster deployed for 
> testing, it might work since it never checkpointed before.
> It was found that this check is disabled for unit tests, defeating the very 
> purpose of testing.






[jira] [Commented] (HDFS-15293) Relax FSImage upload time delta check restriction

2020-04-21 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088924#comment-17088924
 ] 

Chen Liang commented on HDFS-15293:
---

This check might be too stringent. For example, say we configure the fsImage 
interval to 6 hours, and SBN uploads image A at time 00:00, but there is a 
minor time skew before ANN actually sees this fsImage, so ANN sees it at 
00:00.010. SBN then uploads the next image at 06:00, and ANN sees this one with 
a smaller skew, at 06:00.005. ANN would consider the time delta smaller than 
the configured delta of 6 hours and would reject this image, even though the 
difference is only 5ms and should be acceptable. Essentially, the current check 
against the exact timestamp is too susceptible to random timing conditions.

> Relax FSImage upload time delta check restriction
> -
>
> Key: HDFS-15293
> URL: https://issues.apache.org/jira/browse/HDFS-15293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>
> HDFS-12979 introduced the logic that, if ANN sees consecutive fsImage uploads 
> from Standby with a small delta compared to the previous fsImage, ANN would 
> reject the image. This is to avoid overly frequent fsImage uploads when 
> there are multiple Standby nodes. However, this check could be too stringent.






[jira] [Created] (HDFS-15293) Relax FSImage upload time delta check restriction

2020-04-21 Thread Chen Liang (Jira)
Chen Liang created HDFS-15293:
-

 Summary: Relax FSImage upload time delta check restriction
 Key: HDFS-15293
 URL: https://issues.apache.org/jira/browse/HDFS-15293
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Chen Liang
Assignee: Chen Liang


HDFS-12979 introduced the logic that, if ANN sees consecutive fsImage uploads 
from Standby with a small delta compared to the previous fsImage, ANN would 
reject the image. This is to avoid overly frequent fsImage uploads when there 
are multiple Standby nodes. However, this check could be too stringent.






[jira] [Commented] (HDFS-12979) StandbyNode should upload FsImage to ObserverNode after checkpointing.

2020-04-20 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088127#comment-17088127
 ] 

Chen Liang commented on HDFS-12979:
---

[~kihwal], it is disabled in MiniDFSCluster because there are existing tests 
that rely on having small-delta fsimages. The newly added tests for this 
feature have this flag explicitly enabled to override MiniDFSCluster's 
disabling (such as in TestRollingUpgrade, as introduced in HDFS-15036). And I 
just did some quick testing in our 2.10 testing cluster; the RU image upload 
worked fine for me. Still, I acknowledge that this would be a blocker if 
it breaks RU, so I am continuing the investigation.

> StandbyNode should upload FsImage to ObserverNode after checkpointing.
> --
>
> Key: HDFS-12979
> URL: https://issues.apache.org/jira/browse/HDFS-12979
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-12979-branch-2.001.patch, HDFS-12979.001.patch, 
> HDFS-12979.002.patch, HDFS-12979.003.patch, HDFS-12979.004.patch, 
> HDFS-12979.005.patch, HDFS-12979.006.patch, HDFS-12979.007.patch, 
> HDFS-12979.008.patch, HDFS-12979.009.patch, HDFS-12979.010.patch, 
> HDFS-12979.011.patch, HDFS-12979.012.patch, HDFS-12979.013.patch, 
> HDFS-12979.014.patch, HDFS-12979.015.patch
>
>
> ObserverNode does not create checkpoints. So its fsimage file can get very 
> old, making bootstrap of ObserverNode too long. A StandbyNode should copy 
> latest fsimage to ObserverNode(s) along with ANN.






[jira] [Commented] (HDFS-15287) HDFS rollingupgrade prepare never finishes

2020-04-20 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088079#comment-17088079
 ] 

Chen Liang commented on HDFS-15287:
---

In the case of a rollback image, this check
{code}
NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile())
{code}
would make sure the rollback image is not subject to the time delta check, 
because in the case of a rollback image, getNameNodeFile would return 
{{NameNodeFile.IMAGE_ROLLBACK}} rather than {{NameNodeFile.IMAGE}}. So the 
time delta check should not apply to the RU case at all.
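
As a rough illustration of why the rollback image bypasses the check, here is 
a self-contained Java sketch. The enum below is a hypothetical stand-in that 
only carries the two values named above, and the method name is made up for 
this example; it is not the real HDFS class or the actual servlet code.

```java
// Hypothetical sketch; the enum only mirrors the two values named above.
final class RollbackExemptionSketch {

    enum NameNodeFile { IMAGE, IMAGE_ROLLBACK }

    /** True when an upload of this file type is subject to the delta check. */
    static boolean subjectToDeltaCheck(NameNodeFile uploaded) {
        // Only regular periodic images are gated; rollback images are not.
        return NameNodeFile.IMAGE.equals(uploaded);
    }

    public static void main(String[] args) {
        System.out.println(subjectToDeltaCheck(NameNodeFile.IMAGE));
        System.out.println(subjectToDeltaCheck(NameNodeFile.IMAGE_ROLLBACK));
    }
}
```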

In the case of regular, periodic image upload though, I agree that the current 
check might be too stringent, as there can be cases where ANN sees an image with 
a potentially very minor time gap difference compared to SBN. We have also 
recently been thinking of relaxing this check. Pinging [~shv]. 

> HDFS rollingupgrade prepare never finishes
> --
>
> Key: HDFS-15287
> URL: https://issues.apache.org/jira/browse/HDFS-15287
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0, 3.3.0
>Reporter: Kihwal Lee
>Priority: Blocker
>
> After HDFS-12979, the prepare step of rolling upgrade does not work. This is 
> because it added additional check for sufficient time passing since last 
> checkpoint. Since RU rollback image creation and upload can happen any time, 
> uploading of rollback image never succeeds. For a new cluster deployed for 
> testing, it might work since it never checkpointed before.
> It was found that this check is disabled for unit tests, defeating the very 
> purpose of testing.






[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-30 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071188#comment-17071188
 ] 

Chen Liang commented on HDFS-15191:
---

[~Steven Rand] I think this patch is already in branch-3.3. The branch was cut 
from trunk after this got committed. Or are you referring to another branch?

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Fix For: 3.2.2, 3.3.1
>
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch, HDFS-15191.004.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-27 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15191:
--
Fix Version/s: 3.3.1
   3.2.2
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-3.2, thanks [~Steven Rand] for the contribution.

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Fix For: 3.2.2, 3.3.1
>
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch, HDFS-15191.004.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.






[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-27 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069016#comment-17069016
 ] 

Chen Liang commented on HDFS-15191:
---

[~Steven Rand] the change makes sense to me, nice catch on the issue! +1 to 
v004 patch, will commit soon.

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch, HDFS-15191.004.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.






[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-03-26 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067897#comment-17067897
 ] 

Chen Liang commented on HDFS-15191:
---

Hey [~Steven Rand], sorry, I did plan to take another look but have been busy 
recently. I will take a look today or tomorrow.

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Major
> Attachments: HDFS-15191-001.patch, HDFS-15191-002.patch, 
> HDFS-15191.003.patch, HDFS-15191.004.patch
>
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.






[jira] [Updated] (HDFS-15197) Change ObserverRetryOnActiveException log to debug

2020-02-27 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15197:
--
Status: Patch Available  (was: Open)

> Change ObserverRetryOnActiveException log to debug
> --
>
> Key: HDFS-15197
> URL: https://issues.apache.org/jira/browse/HDFS-15197
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Minor
> Attachments: HDFS-15197.001.patch
>
>
> Currently in ObserverReadProxyProvider, when an ObserverRetryOnActiveException 
> happens, ObserverReadProxyProvider logs a message at INFO level. This can 
> produce a large volume of logs in some scenarios. For example, when a job tries 
> to access lots of files that haven't been accessed for a long time, all these 
> accesses may trigger atime updates, which lead to 
> ObserverRetryOnActiveException. We should change this log to DEBUG.






[jira] [Updated] (HDFS-15197) Change ObserverRetryOnActiveException log to debug

2020-02-27 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15197:
--
Attachment: HDFS-15197.001.patch

> Change ObserverRetryOnActiveException log to debug
> --
>
> Key: HDFS-15197
> URL: https://issues.apache.org/jira/browse/HDFS-15197
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Minor
> Attachments: HDFS-15197.001.patch
>
>
> Currently in ObserverReadProxyProvider, when an ObserverRetryOnActiveException 
> happens, ObserverReadProxyProvider logs a message at INFO level. This can 
> produce a large volume of logs in some scenarios. For example, when a job tries 
> to access lots of files that haven't been accessed for a long time, all these 
> accesses may trigger atime updates, which lead to 
> ObserverRetryOnActiveException. We should change this log to DEBUG.






[jira] [Created] (HDFS-15197) Change ObserverRetryOnActiveException log to debug

2020-02-27 Thread Chen Liang (Jira)
Chen Liang created HDFS-15197:
-

 Summary: Change ObserverRetryOnActiveException log to debug
 Key: HDFS-15197
 URL: https://issues.apache.org/jira/browse/HDFS-15197
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Reporter: Chen Liang
Assignee: Chen Liang


Currently in ObserverReadProxyProvider, when an ObserverRetryOnActiveException 
happens, ObserverReadProxyProvider logs a message at INFO level. This can 
produce a large volume of logs in some scenarios. For example, when a job tries 
to access lots of files that haven't been accessed for a long time, all these 
accesses may trigger atime updates, which lead to 
ObserverRetryOnActiveException. We should change this log to DEBUG.
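
To illustrate the effect of the proposed level change, here is a minimal, 
self-contained sketch. It is not the actual patch: it uses java.util.logging 
as a stand-in for the logger in ObserverReadProxyProvider, with Level.FINE 
standing in for DEBUG, and the class name, method name and message text are 
made up for the example. It counts how many records reach a handler when the 
logger threshold is INFO.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical demo class; not part of Hadoop.
final class RetryLogLevelDemo {

  /**
   * Logs `events` messages at `level` on a logger whose threshold is INFO
   * and returns how many records actually reached the handler.
   */
  static int publishedRecords(Level level, int events) {
    Logger log = Logger.getAnonymousLogger();
    log.setUseParentHandlers(false); // keep the console quiet
    log.setLevel(Level.INFO);        // typical production threshold

    AtomicInteger published = new AtomicInteger();
    Handler counter = new Handler() {
      @Override public void publish(LogRecord record) { published.incrementAndGet(); }
      @Override public void flush() { }
      @Override public void close() { }
    };
    counter.setLevel(Level.ALL);
    log.addHandler(counter);

    for (int i = 0; i < events; i++) {
      // One message per simulated retry-on-active event.
      log.log(level, "Retrying call {0} on the active NameNode", i);
    }
    return published.get();
  }

  public static void main(String[] args) {
    System.out.println(publishedRecords(Level.INFO, 1000)); // prints 1000
    System.out.println(publishedRecords(Level.FINE, 1000)); // prints 0
  }
}
```

With the threshold at INFO, every simulated event logged at INFO produces a 
record, while the same events logged at the lower level are dropped before 
they reach any handler.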






[jira] [Comment Edited] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-02-24 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043864#comment-17043864
 ] 

Chen Liang edited comment on HDFS-15191 at 2/24/20 9:01 PM:


 There could be a token compatibility issue though, if you only have HDFS-13617 
but not HDFS-14611. If both changes are there, this should be fine. But even if 
HDFS-14611 is missing, I would expect a different error, because it seems the 
error happened at the very first call of {{readVLong}} when parsing the token. 
Those two Jiras only change the behavior of the tail of the block token. Also, 
even if we hit a compatibility issue, I expect it to only affect the selective 
SASL feature. Will be watching this issue.


was (Author: vagarychen):
 There could be a token compatibility issue though, if you only have HDFS-13617 
but not HDFS-14611. If both changes are there, this should be fine. But even if 
HDFS-14611 is missing, I would expect a different error, because it seems the 
error happened at the very first call of {{readVLong}} when parsing the token. 
Those two Jiras only change the behavior of the tail of the block token.

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Priority: Major
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem 

[jira] [Commented] (HDFS-15191) EOF when reading legacy buffer in BlockTokenIdentifier

2020-02-24 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043864#comment-17043864
 ] 

Chen Liang commented on HDFS-15191:
---

 There could be a token compatibility issue though, if you only have HDFS-13617 
but not HDFS-14611. If both changes are there, this should be fine. But even if 
HDFS-14611 is missing, I would expect a different error, because it seems the 
error happened at the very first call of {{readVLong}} when parsing the token. 
Those two Jiras only change the behavior of the tail of the block token.
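
As a side note on the stack trace, the sketch below shows only the generic 
behaviour of java.io.DataInputStream that the top frame relies on: readByte 
throws EOFException when no bytes are left to read. It does not reproduce or 
diagnose this issue, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of Hadoop.
final class EmptyBufferDemo {

  /** Returns true if reading the first byte of the buffer fails with EOFException. */
  static boolean firstReadHitsEof(byte[] buffer) {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer));
    try {
      in.readByte(); // same JDK call as the top frame of the trace
      return false;
    } catch (EOFException e) {
      return true;   // nothing was left to read
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(firstReadHitsEof(new byte[0]));     // prints true
    System.out.println(firstReadHitsEof(new byte[] {42})); // prints false
  }
}
```

An EOFException on a read therefore means the stream was already exhausted at 
that point, regardless of what the remaining bytes were expected to contain.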

> EOF when reading legacy buffer in BlockTokenIdentifier
> --
>
> Key: HDFS-15191
> URL: https://issues.apache.org/jira/browse/HDFS-15191
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Priority: Major
>
> We have an HDFS client application which recently upgraded from 3.2.0 to 
> 3.2.1. After this upgrade (but not before), we sometimes see these errors 
> when this application is used with clusters still running Hadoop 2.x (more 
> specifically CDH 5.12.1):
> {code}
> WARN  [2020-02-24T00:54:32.856Z] 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing 
> remote block reader. (_sampled: true)
> java.io.EOFException:
> at java.io.DataInputStream.readByte(DataInputStream.java:272)
> at 
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFieldsLegacy(BlockTokenIdentifier.java:240)
> at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenIdentifier.readFields(BlockTokenIdentifier.java:221)
> at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:200)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:530)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
> at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:730)
> at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2942)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:644)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:575)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2314)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2270)
> at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2291)
> at org.apache.commons.io.IOUtils.copy(IOUtils.java:2246)
> at org.apache.commons.io.IOUtils.toByteArray(IOUtils.java:765)
> {code}
> We get this warning for all DataNodes with a copy of the block, so the read 
> fails.
> I haven't been able to figure out what changed between 3.2.0 and 3.2.1 to 
> cause this, but HDFS-13617 and HDFS-14611 seem related, so tagging 
> [~vagarychen] in case you have any ideas.






[jira] [Commented] (HDFS-15185) StartupProgress reports edits segments until the entire startup completes

2020-02-20 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041290#comment-17041290
 ] 

Chen Liang commented on HDFS-15185:
---

I have tested this fix on a real cluster, and the patch did get rid of the 
excessive ByteString displays. +1 with the Jenkins warnings addressed.

> StartupProgress reports edits segments until the entire startup completes
> -
>
> Key: HDFS-15185
> URL: https://issues.apache.org/jira/browse/HDFS-15185
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15185.001.patch
>
>
> The Startup Progress page keeps reporting edits segments after the {{LOAD_EDITS}} 
> stage is complete. New steps keep being added to StartupProgress during journal 
> tailing until all startup phases are completed. This adds a lot of edits 
> steps, since the {{SAFEMODE}} phase can take a long time on a large cluster.
> With fast tailing the segments are small, but the number of them is large - 
> 160K. This makes the page load forever.






[jira] [Assigned] (HDFS-15168) ABFS driver enhancement - Translate AAD Service Principal and Security Group To Linux user and group

2020-02-13 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang reassigned HDFS-15168:
-

Assignee: Karthik Amarnath

> ABFS driver enhancement - Translate AAD Service Principal and Security Group 
> To Linux user and group
> 
>
> Key: HDFS-15168
> URL: https://issues.apache.org/jira/browse/HDFS-15168
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Karthik Amarnath
>Assignee: Karthik Amarnath
>Priority: Major
>
> The ABFS driver does not support translating an AAD Service Principal (SPI) 
> to Linux identities, which causes metadata operations to fail. The Hadoop 
> MapReduce client 
> [[JobSubmissionFiles|https://github.com/apache/hadoop/blob/d842dfffa53c8b565f3d65af44ccd7e1cc706733/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmissionFiles.java#L138]]
>  expects the file owner to be the Linux identity, but the 
> underlying ABFS driver returns the AAD Object identity. Hence the ABFS 
> driver needs an enhancement.






[jira] [Commented] (HDFS-15153) TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails intermittently

2020-02-12 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035573#comment-17035573
 ] 

Chen Liang commented on HDFS-15153:
---

Closing this ticket as it is a duplicate of HDFS-15164.

> TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails 
> intermittently
> ---
>
> Key: HDFS-15153
> URL: https://issues.apache.org/jira/browse/HDFS-15153
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>
> The unit test TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT is 
> failing consistently. This seems to be due to a log message change. We should 
> fix it.






[jira] [Resolved] (HDFS-15153) TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails intermittently

2020-02-12 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang resolved HDFS-15153.
---
Resolution: Duplicate

> TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails 
> intermittently
> ---
>
> Key: HDFS-15153
> URL: https://issues.apache.org/jira/browse/HDFS-15153
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
>
> The unit test TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT is 
> failing consistently. This seems to be due to a log message change. We should 
> fix it.






[jira] [Commented] (HDFS-15164) Fix TestDelegationTokensWithHA

2020-02-12 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035560#comment-17035560
 ] 

Chen Liang commented on HDFS-15164:
---

Initially, based on a quick check, I thought this was due to the change in 
HDFS-15099 to return RetryOnActiveException, which changed the error message; 
since the test asserts on captured log output, it got a different message 
and failed the assertion. On a second look, it seems the actual issue is that, 
due to the probing period, the client was not connecting to the Standby, so the 
expected Standby exception never happened. In this case, I think disabling 
the probing period for this test is the right fix.

So +1 on 01 patch, pending Jenkins. And thanks [~ayushtkn] again for taking a 
look!

> Fix TestDelegationTokensWithHA
> --
>
> Key: HDFS-15164
> URL: https://issues.apache.org/jira/browse/HDFS-15164
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15164-01.patch
>
>
> {noformat}
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertTrue(Assert.java:52)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT(TestDelegationTokensWithHA.java:156){noformat}






[jira] [Commented] (HDFS-15086) Block scheduled counter never get decremet if the block got deleted before replication.

2020-02-12 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035548#comment-17035548
 ] 

Chen Liang commented on HDFS-15086:
---

The TestDelegationTokensWithHA failure is tracked under HDFS-15153; I think it 
broke due to a log message change.

> Block scheduled counter never get decremet if the block got deleted before 
> replication.
> ---
>
> Key: HDFS-15086
> URL: https://issues.apache.org/jira/browse/HDFS-15086
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch, 
> HDFS-15086.003.patch, HDFS-15086.004.patch, HDFS-15086.005.patch
>
>
> If a block is scheduled for replication and its file is then deleted, the DN 
> reports the block as a bad block. 
> For this failed replication work, the scheduled block counter is never decremented.






[jira] [Comment Edited] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-02-12 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035537#comment-17035537
 ] 

Chen Liang edited comment on HDFS-15118 at 2/12/20 5:37 PM:


[~ayushtkn] I already filed HDFS-15153 for fixing this test, just haven't got 
the bandwidth to work on it. I have taken a quick look and I think it was 
caused by a previous fix, not this one.

UPDATE: Looks like you have started working on this in HDFS-15164 and already 
have a patch there, so I guess we can just follow up under HDFS-15164. Thanks for 
picking this up!


was (Author: vagarychen):
[~ayushtkn] I already filed HDFS-15153 for fixing this test, just haven't got 
the bandwidth to work on it. I have taken a quick look and I think it was 
caused by a previous fix, not this one.

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Commented] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-02-12 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035537#comment-17035537
 ] 

Chen Liang commented on HDFS-15118:
---

[~ayushtkn] I already filed HDFS-15153 for fixing this test, just haven't got 
the bandwidth to work on it. I have taken a quick look and I think it was 
caused by a previous fix, not this one.

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-02-04 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15148:
--
Fix Version/s: 3.3.1
   2.10.1
   3.2.2
   3.1.4
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.1.4, 3.2.2, 2.10.1, 3.3.1
>
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, 
> HDFS-15148.003.patch, HDFS-15148.004.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-02-04 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030085#comment-17030085
 ] 

Chen Liang commented on HDFS-15148:
---

Thanks [~shv]! I have filed HDFS-15146 to fix the test. Will commit v04 patch 
shortly.

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, 
> HDFS-15148.003.patch, HDFS-15148.004.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Created] (HDFS-15153) TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails intermittently

2020-02-04 Thread Chen Liang (Jira)
Chen Liang created HDFS-15153:
-

 Summary: 
TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT fails 
intermittently
 Key: HDFS-15153
 URL: https://issues.apache.org/jira/browse/HDFS-15153
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Reporter: Chen Liang
Assignee: Chen Liang


The unit test TestDelegationTokensWithHA.testObserverReadProxyProviderWithDT is 
failing consistently. This seems to be due to a log message change. We should fix 
it.






[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-02-03 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029185#comment-17029185
 ] 

Chen Liang commented on HDFS-15148:
---

The {{testObserverReadProxyProviderWithDT}} failure is unrelated; the test fails 
even without this patch. We should look into fixing it, but that should be done 
in another jira. [~shv] mind taking a look at the v004 patch? 

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, 
> HDFS-15148.003.patch, HDFS-15148.004.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-02-02 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15148:
--
Attachment: HDFS-15148.004.patch

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, 
> HDFS-15148.003.patch, HDFS-15148.004.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-02-02 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028536#comment-17028536
 ] 

Chen Liang commented on HDFS-15148:
---

Thanks for taking a look [~shv]! Posted the v004 patch.

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, 
> HDFS-15148.003.patch, HDFS-15148.004.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-01-29 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026282#comment-17026282
 ] 

Chen Liang commented on HDFS-15148:
---

The failed test TestMultipleNNPortQOP seems unrelated to the change in this 
jira, and has been passing in my local runs. I think it failed because the 
hard-coded 100ms sleep may not be long enough for a Jenkins run, so this is a 
test that may fail randomly when unlucky. Although I updated the patch here with 
a fix, it is a separate issue, so maybe this test fix should go in another Jira. 
[~shv] please let me know if you have a preference.
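
For illustration only, a fixed sleep such as the hard-coded 100ms one mentioned above can be replaced by polling a condition until a timeout. The sketch below is not taken from the patch; the class name PollingWait, its waitFor method and the stand-in condition in main are hypothetical, and only the JDK is assumed.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not part of the Hadoop test code.
final class PollingWait {
  private PollingWait() {}

  /**
   * Polls the condition every intervalMs until it holds or timeoutMs elapses.
   * Returns true if the condition held before the timeout, false otherwise
   * (including when the waiting thread is interrupted).
   */
  static boolean waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs) {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        // Preserve the interrupt status and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    final long start = System.nanoTime();
    // Stand-in condition: becomes true roughly 250 ms after start.
    boolean ready = waitFor(() -> System.nanoTime() - start >= 250_000_000L, 20, 5_000);
    System.out.println("condition met: " + ready);
  }
}
```

With this shape the test only waits as long as it has to on a fast machine, while a slow Jenkins executor gets the full timeout before the assertion fails.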

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, 
> HDFS-15148.003.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-01-29 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15148:
--
Attachment: HDFS-15148.003.patch

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch, 
> HDFS-15148.003.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-01-29 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15118:
--
Fix Version/s: 2.10.1
   3.2.2
   3.1.4
   3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Commented] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-01-29 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026199#comment-17026199
 ] 

Chen Liang commented on HDFS-15118:
---

The failed tests are unrelated. I've committed to trunk, branch-3.2, branch-3.1 
and branch-2.10, with the checkstyle issue fixed at commit time. Thanks for the 
review [~shv]

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-01-28 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15148:
--
Attachment: HDFS-15148.002.patch

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Commented] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-01-28 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025434#comment-17025434
 ] 

Chen Liang commented on HDFS-15148:
---

The {{TestBlockTokenWrappingQOP}} test failure is actually related; updated with 
the v02 patch to fix it.

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch, HDFS-15148.002.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Commented] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-01-27 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024808#comment-17024808
 ] 

Chen Liang commented on HDFS-15118:
---

Thanks for the catch [~shv]! Updated in v02 patch

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-01-27 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15118:
--
Attachment: HDFS-15118.002.patch

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15118.001.patch, HDFS-15118.002.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-01-27 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15148:
--
Status: Patch Available  (was: Open)

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Updated] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-01-27 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15148:
--
Attachment: HDFS-15148.001.patch

> dfs.namenode.send.qop.enabled should not apply to primary NN port
> -
>
> Key: HDFS-15148
> URL: https://issues.apache.org/jira/browse/HDFS-15148
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15148.001.patch
>
>
> In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
> the block access token as an encrypted message. Later on, the DataNode uses this 
> message to create the SASL connection. But this new behavior should only apply to 
> new auxiliary NameNode ports, not the primary port (the one configured in 
> fs.defaultFS), as it may conflict with other existing SASL-related 
> configuration (e.g. dfs.data.transfer.protection). Since this setting was 
> introduced for auxiliary ports only, we should restrict the new behavior so 
> that it does not apply to the primary port.






[jira] [Created] (HDFS-15148) dfs.namenode.send.qop.enabled should not apply to primary NN port

2020-01-27 Thread Chen Liang (Jira)
Chen Liang created HDFS-15148:
-

 Summary: dfs.namenode.send.qop.enabled should not apply to primary 
NN port
 Key: HDFS-15148
 URL: https://issues.apache.org/jira/browse/HDFS-15148
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.10.1, 3.3.1
Reporter: Chen Liang
Assignee: Chen Liang


In HDFS-13617, the NameNode can be configured to wrap its established QOP into 
the block access token as an encrypted message. Later on, the DataNode uses this 
message to create the SASL connection. But this new behavior should only apply to 
new auxiliary NameNode ports, not the primary port (the one configured in 
fs.defaultFS), as it may conflict with other existing SASL-related 
configuration (e.g. dfs.data.transfer.protection). Since this setting was 
introduced for auxiliary ports only, we should restrict the new behavior so 
that it does not apply to the primary port.
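
The intended rule can be pictured with a small stand-alone model. This is only a sketch under assumptions: the class QopWrapPolicy, its method shouldWrapQop and the port numbers are hypothetical and are not names from HDFS or from the patch. It simply encodes "wrap the QOP only when the setting is on and the request came in on an auxiliary port".

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the rule described above; illustrative names only.
final class QopWrapPolicy {
  // Stands in for the value of dfs.namenode.send.qop.enabled.
  private final boolean sendQopEnabled;
  // Stands in for the configured auxiliary NameNode ports.
  private final Set<Integer> auxiliaryPorts = new HashSet<>();

  QopWrapPolicy(boolean sendQopEnabled, int... auxiliaryPorts) {
    this.sendQopEnabled = sendQopEnabled;
    for (int port : auxiliaryPorts) {
      this.auxiliaryPorts.add(port);
    }
  }

  /** True only when the setting is on and the request arrived on an auxiliary port. */
  boolean shouldWrapQop(int ingressPort) {
    return sendQopEnabled && auxiliaryPorts.contains(ingressPort);
  }

  public static void main(String[] args) {
    // Assumed layout: 8020 is the primary port, 8021 and 8022 are auxiliary.
    QopWrapPolicy policy = new QopWrapPolicy(true, 8021, 8022);
    System.out.println(policy.shouldWrapQop(8020)); // primary port: false
    System.out.println(policy.shouldWrapQop(8021)); // auxiliary port: true
  }
}
```

Keying the decision on the ingress port means the primary port keeps following the existing SASL settings regardless of the new flag.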






[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-01-16 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15118:
--
Status: Patch Available  (was: Open)

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15118.001.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Updated] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-01-16 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15118:
--
Attachment: HDFS-15118.001.patch

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15118.001.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Assigned] (HDFS-15118) [SBN Read] Slow clients when Observer reads are enabled but there are no Observers on the cluster.

2020-01-16 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang reassigned HDFS-15118:
-

Assignee: Chen Liang

> [SBN Read] Slow clients when Observer reads are enabled but there are no 
> Observers on the cluster.
> --
>
> Key: HDFS-15118
> URL: https://issues.apache.org/jira/browse/HDFS-15118
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15118.001.patch
>
>
> We see substantial degradation in performance of HDFS clients, when Observer 
> reads are enabled via {{ObserverReadProxyProvider}}, but there are no 
> ObserverNodes on the cluster.






[jira] [Commented] (HDFS-12943) Consistent Reads from Standby Node

2020-01-14 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015234#comment-17015234
 ] 

Chen Liang commented on HDFS-12943:
---

[~lindy_hopper] an access time update is a write call, so it cannot be 
processed by the Observer. Access time should be turned off on the Observer, 
as mentioned in HDFS-14959.

> Consistent Reads from Standby Node
> --
>
> Key: HDFS-12943
> URL: https://issues.apache.org/jira/browse/HDFS-12943
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: ConsistentReadsFromStandbyNode.pdf, 
> ConsistentReadsFromStandbyNode.pdf, HDFS-12943-001.patch, 
> HDFS-12943-002.patch, HDFS-12943-003.patch, HDFS-12943-004.patch, 
> TestPlan-ConsistentReadsFromStandbyNode.pdf
>
>
> StandbyNode in HDFS is a replica of the active NameNode. The states of the 
> NameNodes are coordinated via the journal. It is natural to consider 
> StandbyNode as a read-only replica. As with any replicated distributed system 
> the problem of stale reads should be resolved. Our main goal is to provide 
> reads from standby in a consistent way in order to enable a wide range of 
> existing applications running on top of HDFS.






[jira] [Updated] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode

2020-01-10 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15099:
--
Attachment: HDFS-15099-branch-2.10.003.patch

> [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on 
> an attempt to change aTime on ObserverNode
> 
>
> Key: HDFS-15099
> URL: https://issues.apache.org/jira/browse/HDFS-15099
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15099-branch-2.10.001.patch, 
> HDFS-15099-branch-2.10.002.patch, HDFS-15099-branch-2.10.003.patch
>
>
> The precision of updating an INode's aTime while executing 
> {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by 
> ObserverNode, so the call should be redirected to Active NameNode. In order 
> to redirect to active the ObserverNode should throw 
> {{ObserverRetryOnActiveException}}.
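
To make the redirect condition in the description concrete, here is a minimal Java sketch. It is not the actual NameNode code. The names RetryOnActiveException, AccessTimeGuard, needsATimeUpdate and checkRead are hypothetical stand-ins, and the comparison against the precision is an assumption about how such a check could be expressed. Only the 1-hour default comes from the description above.

```java
// Hypothetical sketch; none of these classes are the real Hadoop ones.
class RetryOnActiveException extends RuntimeException {
    RetryOnActiveException(String msg) {
        super(msg);
    }
}

final class AccessTimeGuard {
    // Mirrors the 1-hour default precision mentioned in the issue description.
    static final long DEFAULT_PRECISION_MS = 60L * 60L * 1000L;

    private AccessTimeGuard() {}

    // True when serving the read would also require writing a new aTime.
    // Assumption: a precision of 0 means access times are disabled.
    static boolean needsATimeUpdate(long nowMs, long lastATimeMs, long precisionMs) {
        return precisionMs > 0 && nowMs > lastATimeMs + precisionMs;
    }

    // An observer cannot apply the update, so it signals the client to retry on the active.
    static void checkRead(boolean isObserver, long nowMs, long lastATimeMs, long precisionMs) {
        if (isObserver && needsATimeUpdate(nowMs, lastATimeMs, precisionMs)) {
            throw new RetryOnActiveException("aTime update required; retry on active");
        }
    }
}
```

With the default precision, a read arriving more than one hour after the stored aTime would be rejected on the observer in this sketch, while the same read on the active would proceed.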






[jira] [Commented] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode

2020-01-10 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013159#comment-17013159
 ] 

Chen Liang commented on HDFS-15099:
---

Checked with Konstantin offline: a better approach for the test is to not rely 
on an unpredictable time difference, but rather to set the access time 
explicitly via {{fs.setTimes()}}. Posted the v003 patch.
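
As a rough illustration of that idea (not the code in the v003 patch), the sketch below sets the access time explicitly instead of waiting for wall-clock time to pass. FakeFileTimes and ATimeTestSketch are hypothetical names, and the staleness check is an assumed form of the condition. The only detail borrowed from the real API is the FileSystem.setTimes convention that -1 leaves a value unchanged.

```java
// Hypothetical in-memory stand-in for a file's timestamps; not a Hadoop class.
final class FakeFileTimes {
    private long mtimeMs;
    private long atimeMs;

    // Follows the FileSystem.setTimes convention that -1 leaves a value unchanged.
    void setTimes(long mtime, long atime) {
        if (mtime != -1) {
            mtimeMs = mtime;
        }
        if (atime != -1) {
            atimeMs = atime;
        }
    }

    long accessTime() {
        return atimeMs;
    }

    long modificationTime() {
        return mtimeMs;
    }
}

final class ATimeTestSketch {
    private ATimeTestSketch() {}

    // Assumed check: an update is due once the stored aTime is older than the precision.
    static boolean aTimeUpdateDue(FakeFileTimes file, long nowMs, long precisionMs) {
        return nowMs > file.accessTime() + precisionMs;
    }
}
```

A test built this way can put the file into the "fresh" or "stale" state on demand, so the outcome does not depend on how fast the test machine is.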

> [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on 
> an attempt to change aTime on ObserverNode
> 
>
> Key: HDFS-15099
> URL: https://issues.apache.org/jira/browse/HDFS-15099
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15099-branch-2.10.001.patch, 
> HDFS-15099-branch-2.10.002.patch, HDFS-15099-branch-2.10.003.patch
>
>
> The precision of updating an INode's aTime while executing 
> {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by 
> ObserverNode, so the call should be redirected to Active NameNode. In order 
> to redirect to active the ObserverNode should throw 
> {{ObserverRetryOnActiveException}}.






[jira] [Commented] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode

2020-01-10 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013146#comment-17013146
 ] 

Chen Liang commented on HDFS-15099:
---

Thanks for the great suggestions [~shv]! Posted the v002 patch. The only 
difference is that I removed 
{code:java}
+dfs.open(testPath).close();
+assertSentTo(2); {code}
because in my testing, if the machine is slow enough, the 200ms access time 
interval may already have passed at this point, so this call goes to the 
active and fails the assertion. I removed this check completely: I think it 
only verifies that an open can go to the observer, which is already covered by 
other tests, so there is no need to have it here.

> [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on 
> an attempt to change aTime on ObserverNode
> 
>
> Key: HDFS-15099
> URL: https://issues.apache.org/jira/browse/HDFS-15099
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15099-branch-2.10.001.patch, 
> HDFS-15099-branch-2.10.002.patch
>
>
> The precision of updating an INode's aTime while executing 
> {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by 
> ObserverNode, so the call should be redirected to Active NameNode. In order 
> to redirect to active the ObserverNode should throw 
> {{ObserverRetryOnActiveException}}.






[jira] [Updated] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode

2020-01-10 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15099:
--
Attachment: HDFS-15099-branch-2.10.002.patch

> [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on 
> an attempt to change aTime on ObserverNode
> 
>
> Key: HDFS-15099
> URL: https://issues.apache.org/jira/browse/HDFS-15099
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15099-branch-2.10.001.patch, 
> HDFS-15099-branch-2.10.002.patch
>
>
> The precision of updating an INode's aTime while executing 
> {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by 
> ObserverNode, so the call should be redirected to Active NameNode. In order 
> to redirect to active the ObserverNode should throw 
> {{ObserverRetryOnActiveException}}.






[jira] [Updated] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode

2020-01-09 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15099:
--
Status: Patch Available  (was: Open)

> [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on 
> an attempt to change aTime on ObserverNode
> 
>
> Key: HDFS-15099
> URL: https://issues.apache.org/jira/browse/HDFS-15099
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15099-branch-2.10.001.patch
>
>
> The precision of updating an INode's aTime while executing 
> {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by 
> ObserverNode, so the call should be redirected to Active NameNode. In order 
> to redirect to active the ObserverNode should throw 
> {{ObserverRetryOnActiveException}}.






[jira] [Updated] (HDFS-15099) [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on an attempt to change aTime on ObserverNode

2020-01-09 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15099:
--
Attachment: HDFS-15099-branch-2.10.001.patch

> [SBN Read] getBlockLocations() should throw ObserverRetryOnActiveException on 
> an attempt to change aTime on ObserverNode
> 
>
> Key: HDFS-15099
> URL: https://issues.apache.org/jira/browse/HDFS-15099
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Attachments: HDFS-15099-branch-2.10.001.patch
>
>
> The precision of updating an INode's aTime while executing 
> {{getBlockLocations()}} is 1 hour by default. Updates cannot be handled by 
> ObserverNode, so the call should be redirected to Active NameNode. In order 
> to redirect to active the ObserverNode should throw 
> {{ObserverRetryOnActiveException}}.






[jira] [Comment Edited] (HDFS-14655) [SBN Read] Namenode crashes if one of The JN is down

2020-01-03 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007828#comment-17007828
 ] 

Chen Liang edited comment on HDFS-14655 at 1/3/20 11:25 PM:


Although the message is different, I checked again and it does look like 
HDFS-14934 should fix this too. Thanks for the pointer [~ayushtkn]!


was (Author: vagarychen):
HDFS-14934 does look like the fix. Thanks for the pointer [~ayushtkn]!

> [SBN Read] Namenode crashes if one of The JN is down
> 
>
> Key: HDFS-14655
> URL: https://issues.apache.org/jira/browse/HDFS-14655
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Harshakiran Reddy
>Assignee: Ayush Saxena
>Priority: Critical
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14655-01.patch, HDFS-14655-02.patch, 
> HDFS-14655-03.patch, HDFS-14655-04.patch, HDFS-14655-05.patch, 
> HDFS-14655-06.patch, HDFS-14655-07.patch, HDFS-14655-08.patch, 
> HDFS-14655-branch-2-01.patch, HDFS-14655-branch-2-02.patch, 
> HDFS-14655.poc.patch
>
>
> {noformat}
> 2019-07-04 17:35:54,064 | INFO  | Logger channel (from parallel executor) to 
> XXX/XXX | Retrying connect to server: XXX/XXX. Already tried 
> 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
> sleepTime=1000 MILLISECONDS) | Client.java:975
> 2019-07-04 17:35:54,087 | FATAL | Edit log tailer | Unknown error encountered 
> while tailing edits. Shutting down standby NN. | EditLogTailer.java:474
> java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:717)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:440)
>   at 
> com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:56)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getJournaledEdits(IPCLoggerChannel.java:565)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournaledEdits(AsyncLoggerSet.java:272)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectRpcInputStreams(QuorumJournalManager.java:533)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:508)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:275)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1681)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1714)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:307)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:360)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:483)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> 2019-07-04 17:35:54,112 | INFO  | Edit log tailer | Exiting with status 1: 
> java.lang.OutOfMemoryError: unable to create new native thread | 
> ExitUtil.java:210
> {noformat}






[jira] [Commented] (HDFS-14655) [SBN Read] Namenode crashes if one of The JN is down

2020-01-03 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007828#comment-17007828
 ] 

Chen Liang commented on HDFS-14655:
---

HDFS-14934 does look like the fix. Thanks for the pointer [~ayushtkn]!

> [SBN Read] Namenode crashes if one of The JN is down
> 
>
> Key: HDFS-14655
> URL: https://issues.apache.org/jira/browse/HDFS-14655
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Harshakiran Reddy
>Assignee: Ayush Saxena
>Priority: Critical
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14655-01.patch, HDFS-14655-02.patch, 
> HDFS-14655-03.patch, HDFS-14655-04.patch, HDFS-14655-05.patch, 
> HDFS-14655-06.patch, HDFS-14655-07.patch, HDFS-14655-08.patch, 
> HDFS-14655-branch-2-01.patch, HDFS-14655-branch-2-02.patch, 
> HDFS-14655.poc.patch
>
>
> {noformat}
> 2019-07-04 17:35:54,064 | INFO  | Logger channel (from parallel executor) to 
> XXX/XXX | Retrying connect to server: XXX/XXX. Already tried 
> 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
> sleepTime=1000 MILLISECONDS) | Client.java:975
> 2019-07-04 17:35:54,087 | FATAL | Edit log tailer | Unknown error encountered 
> while tailing edits. Shutting down standby NN. | EditLogTailer.java:474
> java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:717)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:440)
>   at 
> com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:56)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getJournaledEdits(IPCLoggerChannel.java:565)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournaledEdits(AsyncLoggerSet.java:272)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectRpcInputStreams(QuorumJournalManager.java:533)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:508)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:275)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1681)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1714)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:307)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:360)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:483)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> 2019-07-04 17:35:54,112 | INFO  | Edit log tailer | Exiting with status 1: 
> java.lang.OutOfMemoryError: unable to create new native thread | 
> ExitUtil.java:210
> {noformat}






[jira] [Commented] (HDFS-14655) [SBN Read] Namenode crashes if one of The JN is down

2020-01-03 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007662#comment-17007662
 ] 

Chen Liang commented on HDFS-14655:
---

[~ayushtkn] the log is shared below. It may not help much though, as the 
exception seems to be thrown from the thread being cancelled:
{code:java}
2020-01-03 17:50:10,887 WARN org.apache.hadoop.util.concurrent.ExecutorHelper: Caught exception in thread Logger channel (from parallel executor) to [...some JN hostname:port...]:
2020-01-03 17:50:10,887 WARN org.apache.hadoop.util.concurrent.ExecutorHelper: Caught exception in thread Logger channel (from parallel executor) to [...some JN hostname:port...]:
java.util.concurrent.CancellationException
  at java.util.concurrent.FutureTask.report(FutureTask.java:121)
  at java.util.concurrent.FutureTask.get(FutureTask.java:192)
  at org.apache.hadoop.util.concurrent.ExecutorHelper.logThrowableFromAfterExecute(ExecutorHelper.java:47)
  at org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor.afterExecute(HadoopThreadPoolExecutor.java:90)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
{code}
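
For readers unfamiliar with why a cancelled task surfaces this way, the following small Java sketch reproduces the same exception type using only java.util.concurrent. It is an illustration, not Hadoop code: CancelledFutureDemo and getThrowsAfterCancel are made-up names, and the sketch only shows that calling get() on an already cancelled FutureTask throws CancellationException, which matches the FutureTask.get frame in the trace above.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

// Hypothetical demo class; it only exercises standard JDK behaviour.
final class CancelledFutureDemo {
    private CancelledFutureDemo() {}

    // Returns true if get() on a cancelled task throws CancellationException.
    static boolean getThrowsAfterCancel() {
        FutureTask<String> task = new FutureTask<>(() -> "edits");
        // Cancel before the task ever runs, as a caller might once it no longer needs the result.
        task.cancel(true);
        try {
            task.get();
            return false;
        } catch (CancellationException e) {
            return true;
        } catch (InterruptedException | ExecutionException e) {
            // Not expected for a task that was cancelled before it started.
            return false;
        }
    }
}
```

Any hook that inspects a finished future by calling get() will therefore see a CancellationException for cancelled tasks, even though nothing went wrong in the task body itself.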

> [SBN Read] Namenode crashes if one of The JN is down
> 
>
> Key: HDFS-14655
> URL: https://issues.apache.org/jira/browse/HDFS-14655
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Harshakiran Reddy
>Assignee: Ayush Saxena
>Priority: Critical
> Fix For: 2.10.0, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14655-01.patch, HDFS-14655-02.patch, 
> HDFS-14655-03.patch, HDFS-14655-04.patch, HDFS-14655-05.patch, 
> HDFS-14655-06.patch, HDFS-14655-07.patch, HDFS-14655-08.patch, 
> HDFS-14655-branch-2-01.patch, HDFS-14655-branch-2-02.patch, 
> HDFS-14655.poc.patch
>
>
> {noformat}
> 2019-07-04 17:35:54,064 | INFO  | Logger channel (from parallel executor) to 
> XXX/XXX | Retrying connect to server: XXX/XXX. Already tried 
> 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, 
> sleepTime=1000 MILLISECONDS) | Client.java:975
> 2019-07-04 17:35:54,087 | FATAL | Edit log tailer | Unknown error encountered 
> while tailing edits. Shutting down standby NN. | EditLogTailer.java:474
> java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:717)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1378)
>   at 
> com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java:440)
>   at 
> com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:56)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getJournaledEdits(IPCLoggerChannel.java:565)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournaledEdits(AsyncLoggerSet.java:272)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectRpcInputStreams(QuorumJournalManager.java:533)
>   at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:508)
>   at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:275)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1681)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1714)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:307)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:460)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:410)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:427)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:360)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:483)
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:423)
> 2019-07-04 17:35:54,112 | INFO  | Edit log tailer | Exiting with status 1: 
> java.lang.OutOfMemoryError: unable to create new native thread | 
> 

[jira] [Commented] (HDFS-15036) Active NameNode should not silently fail the image transfer

2019-12-17 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998561#comment-16998561
 ] 

Chen Liang commented on HDFS-15036:
---

[~Jim_Brennan] I filed https://issues.apache.org/jira/browse/INFRA-19581, but 
haven't gotten an update from the Infra folks yet.

> Active NameNode should not silently fail the image transfer
> ---
>
> Key: HDFS-15036
> URL: https://issues.apache.org/jira/browse/HDFS-15036
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15036.001.patch, HDFS-15036.002.patch, 
> HDFS-15036.003.patch
>
>
> Image transfer from Standby NameNode to  Active silently fails on Active, 
> without any logging and not notifying the receiver side.






[jira] [Commented] (HDFS-15036) Active NameNode should not silently fail the image transfer

2019-12-13 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16996000#comment-16996000
 ] 

Chen Liang commented on HDFS-15036:
---

Oops! Did not realize it's already deleted, guess I missed the messages... will 
work on deleting it again...

> Active NameNode should not silently fail the image transfer
> ---
>
> Key: HDFS-15036
> URL: https://issues.apache.org/jira/browse/HDFS-15036
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15036.001.patch, HDFS-15036.002.patch, 
> HDFS-15036.003.patch
>
>
> Image transfer from Standby NameNode to  Active silently fails on Active, 
> without any logging and not notifying the receiver side.






[jira] [Updated] (HDFS-15036) Active NameNode should not silently fail the image transfer

2019-12-12 Thread Chen Liang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Liang updated HDFS-15036:
--
Fix Version/s: 3.2.2
   3.1.4

> Active NameNode should not silently fail the image transfer
> ---
>
> Key: HDFS-15036
> URL: https://issues.apache.org/jira/browse/HDFS-15036
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-15036.001.patch, HDFS-15036.002.patch, 
> HDFS-15036.003.patch
>
>
> Image transfer from Standby NameNode to  Active silently fails on Active, 
> without any logging and not notifying the receiver side.





