[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-08 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970352#comment-16970352
 ] 

Chen Zhang commented on HDFS-12288:
---

Update patch v7 to fix the failed test.

> Fix DataNode's xceiver count calculation
> 
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, 
> HDFS-12288.003.patch, HDFS-12288.004.patch, HDFS-12288.005.patch, 
> HDFS-12288.006.patch, HDFS-12288.007.patch
>
>
> The problem with the ThreadGroup.activeCount() method is that it is only a 
> very rough estimate and in reality returns the total number of threads in 
> the thread group, as opposed to the threads actually running.
> In some DNs, we saw it return ~50 for a long time, even though the actual 
> number of DataXceiver threads was next to none.
> This is a big issue, as the NN uses the xceiverCount to choose the 
> replication source DN and to return DNs to clients for reads/writes.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value, 
> which only accounts for the actual number of DataXceiver threads currently 
> running and thus represents the load on the DN much better.
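A self-contained sketch of the over-counting described above; the thread names and the toy counter are illustrative only, not the actual patch:
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class XceiverCountSketch {
  public static void main(String[] args) throws InterruptedException {
    ThreadGroup group = new ThreadGroup("dataXceiverServer");
    AtomicInteger activeXceivers = new AtomicInteger();
    CountDownLatch parked = new CountDownLatch(3);

    // Long-lived threads that sit in the group but do no transfer work
    // (e.g. an acceptor thread); they still show up in activeCount().
    for (int i = 0; i < 3; i++) {
      new Thread(group, () -> {
        parked.countDown();
        try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
      }, "idle-" + i).start();
    }
    parked.await();

    // ThreadGroup.activeCount() counts every live thread in the group ...
    System.out.println("activeCount()        = " + group.activeCount());   // 3
    // ... while a counter bumped only around actual transfers stays at 0.
    System.out.println("active xceiver count = " + activeXceivers.get());  // 0
  }
}
{code}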



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-08 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12288:
--
Attachment: HDFS-12288.007.patch




[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969862#comment-16969862
 ] 

Chen Zhang commented on HDFS-12288:
---

Thanks [~elgoiri] for the comments; updated the patch to v6 according to 
comments 1 and 2.
{quote} * What's the deal with TestNamenodeCapacityReport?{quote}
Before the patch, every live DataNode reported at least 1 xceiver to the 
NameNode, but it was not actually a real xceiver: it was just the 
{{DataXceiverServer}} thread, which is added to the {{threadGroup}} during 
DataNode initialization. After the patch, the xceiver count is the *real* 
number of data-transfer threads, so an idle DataNode reports an xceiver count 
of 0 to the NameNode.

{{TestNamenodeCapacityReport}} assumes there is at least 1 xceiver for each 
DataNode and leverages this to check the cluster's live-node count, so the 
test needs to be updated along with this patch.
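A rough sketch of the explicit counting described above; the class and method names are invented, not the real DataNodeMetrics API:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative counter in the spirit of dataNodeActiveXceiversCount (not the real API). */
class ActiveXceiverCounter {
  private final AtomicInteger active = new AtomicInteger();

  int get() { return active.get(); }

  /** Wrap each data-transfer task so only running transfers are counted. */
  Runnable wrap(Runnable transfer) {
    return () -> {
      active.incrementAndGet();
      try {
        transfer.run();
      } finally {
        active.decrementAndGet();   // back to 0 once the DN is idle again
      }
    };
  }
}
{code}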




[jira] [Updated] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-07 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12288:
--
Attachment: HDFS-12288.006.patch




[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969641#comment-16969641
 ] 

Chen Zhang commented on HDFS-12288:
---

The failed tests are unrelated. [~elgoiri] do you have time to help review? 
Thanks.




[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969370#comment-16969370
 ] 

Chen Zhang commented on HDFS-12288:
---

Update patch v5 to fix the checkstyle error and the failed UT.




[jira] [Updated] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-07 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12288:
--
Attachment: HDFS-12288.005.patch




[jira] [Updated] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-07 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12288:
--
Attachment: HDFS-12288.004.patch




[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-11-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969153#comment-16969153
 ] 

Chen Zhang commented on HDFS-12288:
---

Update patch v4, adding unit tests.




[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-11-04 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967181#comment-16967181
 ] 

Chen Zhang commented on HDFS-14775:
---

Hi [~xkrogen] [~elgoiri] [~hexiaoqiao], any further comments?

> Add Timestamp for longest FSN write/read lock held log
> --
>
> Key: HDFS-14775
> URL: https://issues.apache.org/jira/browse/HDFS-14775
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14775.001.patch, HDFS-14775.002.patch, 
> HDFS-14775.003.patch, HDFS-14775.004.patch, HDFS-14775.005.patch
>
>
> HDFS-13946 improved the log for the longest read/write lock held time; it's 
> a very useful improvement.
> In some conditions, we need to locate the detailed call information (user, 
> IP, path, etc.) for the longest lock holder, but the default throttle 
> interval (10s) is too long to find the corresponding audit log. I think we 
> should add a timestamp to the {{longestWriteLockHeldStackTrace}}.
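A minimal sketch of the proposal, assuming a simple holder object; the names are illustrative, not the actual FSNamesystemLock fields:
{code:java}
import java.time.Instant;

/** Minimal sketch of the idea (names are illustrative, not the actual code). */
class LockHeldInfo {
  final long startTimeMs;     // when the lock was acquired
  final long heldIntervalMs;  // how long it was held
  final String stackTrace;

  LockHeldInfo(long startTimeMs, long heldIntervalMs, String stackTrace) {
    this.startTimeMs = startTimeMs;
    this.heldIntervalMs = heldIntervalMs;
    this.stackTrace = stackTrace;
  }

  /** Including the acquisition time makes it easy to find the matching audit-log entries. */
  String report() {
    return "Longest write lock held at " + Instant.ofEpochMilli(startTimeMs)
        + " for " + heldIntervalMs + " ms:\n" + stackTrace;
  }
}
{code}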






[jira] [Commented] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-10-28 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961616#comment-16961616
 ] 

Chen Zhang commented on HDFS-14730:
---

Thanks [~eyang]

> Remove unused configuration dfs.web.authentication.filter 
> --
>
> Key: HDFS-14730
> URL: https://issues.apache.org/jira/browse/HDFS-14730
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14730.001.patch, HDFS-14730.002.patch
>
>
> After HADOOP-16314, this configuration is not used anywhere, so I propose to 
> deprecate it to avoid misuse.
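A hypothetical guard illustrating the point (not the actual patch); it only assumes the standard {{Configuration.get}} lookup:
{code:java}
import org.apache.hadoop.conf.Configuration;

// Hypothetical check: flag the obsolete key if an old hdfs-site.xml still sets
// it, since nothing reads it after HADOOP-16314.
public class ObsoleteFilterKeyCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    if (conf.get("dfs.web.authentication.filter") != null) {
      System.err.println("dfs.web.authentication.filter is no longer used; "
          + "remove it from your configuration.");
    }
  }
}
{code}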






[jira] [Commented] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-10-25 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959458#comment-16959458
 ] 

Chen Zhang commented on HDFS-14730:
---

[~ayushtkn] [~crh] [~eyang], this config key is no longer used after 
HADOOP-16314, and some UTs failed due to misuse of this config key (fixed in 
HDFS-14609), so I propose to remove it. Could you help review the patch?




[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-10-25 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959450#comment-16959450
 ] 

Chen Zhang commented on HDFS-14775:
---

Thanks [~xkrogen] for your comments; updated patch v5.




[jira] [Updated] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-10-25 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14775:
--
Attachment: HDFS-14775.005.patch




[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-10-24 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14730:
--
Attachment: (was: HDFS-14730.002.patch)




[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-10-24 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14730:
--
Attachment: HDFS-14730.002.patch




[jira] [Updated] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-10-24 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14775:
--
Attachment: (was: HDFS-14775.004.patch)




[jira] [Updated] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-10-24 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14775:
--
Attachment: HDFS-14775.004.patch




[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-10-24 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958657#comment-16958657
 ] 

Chen Zhang commented on HDFS-14775:
---

Uploaded patch v4; removed the override of the {{hashCode()}} and {{equals()}} 
methods.




[jira] [Updated] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-10-24 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14775:
--
Attachment: HDFS-14775.004.patch




[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-10-24 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958641#comment-16958641
 ] 

Chen Zhang commented on HDFS-14775:
---

I agree that {{HashCodeBuilder}} and {{EqualsBuilder}} are a better choice 
when overriding the {{hashCode()}} and {{equals()}} methods. But in this case, 
it seems unnecessary to override these two methods at all.
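For reference, this is roughly what the {{HashCodeBuilder}}/{{EqualsBuilder}} pattern looks like (the field names are made up); the point above is that the class does not need it at all:
{code:java}
import org.apache.commons.lang3.builder.EqualsBuilder;
import org.apache.commons.lang3.builder.HashCodeBuilder;

class LockHeldInfo {
  private long startTimeMs;
  private String stackTrace;

  @Override
  public int hashCode() {
    return new HashCodeBuilder(17, 37).append(startTimeMs).append(stackTrace).toHashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) { return true; }
    if (!(o instanceof LockHeldInfo)) { return false; }
    LockHeldInfo other = (LockHeldInfo) o;
    return new EqualsBuilder()
        .append(startTimeMs, other.startTimeMs)
        .append(stackTrace, other.stackTrace)
        .isEquals();
  }
}
{code}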




[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-10-23 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958486#comment-16958486
 ] 

Chen Zhang commented on HDFS-14775:
---

Hi [~hexiaoqiao] and [~elgoiri], thanks for your comments, and sorry for the 
late response.
{quote}I guess that you remote AtomicReference for `longestWriteLockHeldInfo` 
based on only one thread could hold WriteLock, right? maybe we could keep 
AtomicReference safety and it will not take any performance overhead in my 
opinion
{quote}
Yep, this critical section is already protected by the write lock, so we don't 
need to consider any race condition, which simplifies the code.
{quote}const in FSNamesystemLock#hashCode is just a little confused. is it 
better with {{Objects.hashCode}}?
{quote}
For the {{hashCode}} and {{equals}} methods, I think [~hexiaoqiao] is right: 
we can just use the default implementation, since we need neither to compare 
by value nor to store the object in a container. What's your opinion, 
[~elgoiri]?
{quote}I wonder if by moving from ReadLockHeldInfo to LockHeldInfo we are 
losing some of the semantic here.
{quote}
I think it's OK, because ReadLockHeldInfo is not designed specifically for the 
read lock; it just saves some general information to track when and where we 
hold a lock.
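A sketch of the point about the write lock already protecting the update; the names are illustrative, not the actual FSNamesystemLock code:
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// The longest-held record is only updated while the write lock is still held,
// so plain fields are safe and no AtomicReference is needed.
class NamesystemLockSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private long longestHeldMs;       // guarded by the write lock
  private long longestHeldStartMs;  // guarded by the write lock

  void lockForWrite() {
    lock.writeLock().lock();
  }

  void unlockAfterWrite(long startMs, long heldMs) {
    // Still inside the critical section: no other writer can race on this update.
    if (heldMs > longestHeldMs) {
      longestHeldMs = heldMs;
      longestHeldStartMs = startMs;
    }
    lock.writeLock().unlock();
  }
}
{code}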




[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-09-27 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14730:
--
Status: Patch Available  (was: Open)




[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-09-26 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938662#comment-16938662
 ] 

Chen Zhang commented on HDFS-14775:
---

Hi [~xkrogen], you've worked on the related code; do you have time to help 
review this patch? Thanks.




[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-09-26 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14730:
--
Attachment: HDFS-14730.002.patch




[jira] [Commented] (HDFS-14461) RBF: Fix intermittently failing kerberos related unit test

2019-09-26 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938486#comment-16938486
 ] 

Chen Zhang commented on HDFS-14461:
---

Patch v5 LGTM, +1, thanks [~hexiaoqiao] for the patch

> RBF: Fix intermittently failing kerberos related unit test
> --
>
> Key: HDFS-14461
> URL: https://issues.apache.org/jira/browse/HDFS-14461
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: CR Hota
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-14461.001.patch, HDFS-14461.002.patch, 
> HDFS-14461.003.patch, HDFS-14461.004.patch, HDFS-14461.005.patch
>
>
> TestRouterHttpDelegationToken#testGetDelegationToken fails intermittently. It 
> may be due to some race condition before using the keytab that's created for 
> testing.
>  
> {code:java}
>  Failed
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.testGetDelegationToken
>  Failing for the past 1 build (Since 
> [!https://builds.apache.org/static/1e9ab9cc/images/16x16/red.png! 
> #26721|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/] )
>  [Took 89 
> ms.|https://builds.apache.org/job/PreCommit-HDFS-Build/26721/testReport/org.apache.hadoop.hdfs.server.federation.security/TestRouterHttpDelegationToken/testGetDelegationToken/history]
>   
>  Error Message
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED
> h3. Stacktrace
> org.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.security.KerberosAuthException: failure to login: for 
> principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  javax.security.auth.login.LoginException: Integrity check on decrypted field 
> failed (31) - PREAUTH_FAILED at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173) 
> at 
> org.apache.hadoop.hdfs.server.federation.security.TestRouterHttpDelegationToken.setup(TestRouterHttpDelegationToken.java:99)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: router/localh...@example.com from keytab 
> /testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs-rbf/target/test/data/SecurityConfUtil/test.keytab
>  

[jira] [Commented] (HDFS-10648) Expose Balancer metrics through Metrics2

2019-09-25 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938168#comment-16938168
 ] 

Chen Zhang commented on HDFS-10648:
---

Hi [~LeonG], yes, I have a draft patch locally, but it has been delayed by 
work on other Jiras. I'll refine the draft and hopefully upload an initial 
patch for review within a week. Thanks for the ping.

> Expose Balancer metrics through Metrics2
> 
>
> Key: HDFS-10648
> URL: https://issues.apache.org/jira/browse/HDFS-10648
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: balancer & mover, metrics
>Reporter: Mark Wagner
>Assignee: Chen Zhang
>Priority: Major
>  Labels: metrics
>
> The Balancer currently prints progress information to the console. For 
> deployments that run the balancer frequently, it would be helpful to collect 
> those metrics for publishing to the available sinks. 
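A rough sketch of what a Metrics2 source for the Balancer could look like; the metric names are invented and this is not the eventual patch:
{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

// Sketch of a Metrics2 source for the Balancer (names invented for illustration).
@Metrics(about = "Balancer metrics", context = "dfs")
class BalancerMetricsSketch {
  @Metric("Bytes already moved in the current iteration")
  MutableCounterLong bytesMoved;
  @Metric("Bytes left to move to reach the threshold")
  MutableGaugeLong bytesLeftToMove;

  static BalancerMetricsSketch create() {
    return DefaultMetricsSystem.instance()
        .register("Balancer", "Balancer metrics", new BalancerMetricsSketch());
  }
}
{code}
Call sites in the Balancer would then update {{bytesMoved.incr(...)}} and {{bytesLeftToMove.set(...)}} as each iteration progresses, and any configured sink would pick the values up.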






[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-17 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931120#comment-16931120
 ] 

Chen Zhang commented on HDFS-14609:
---

Thanks [~crh] [~tasanuma] [~eyang] [~elgoiri] for your review and comments. Is 
this patch now ready for commit? Thanks.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch, 
> HDFS-14609.006.patch
>
>
> We worked on router-based federation security as part of HDFS-13532, keeping 
> it compatible with the way the namenode works. However, with HADOOP-16314 
> and HDFS-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Changes are needed in RBF accordingly, mainly fixing the broken tests.






[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-11 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Attachment: HDFS-14609.006.patch




[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928138#comment-16928138
 ] 

Chen Zhang commented on HDFS-14609:
---

Thanks [~eyang] for your response; uploaded patch v6 to fix the checkstyle error.




[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928132#comment-16928132
 ] 

Chen Zhang commented on HDFS-14811:
---

Actually, I'm considering another improvement: adding a configurable threshold 
for counting a node as overloaded (e.g. 50). If the xceiverCount is lower than 
that threshold, the DN will not be counted as an overloaded node. In any case, 
treating a DN with an xceiverCount of 2 as overloaded is not reasonable.

Do you think that's a better solution? [~elgoiri] [~ayushtkn]
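A sketch of the proposed check; the method name, the floor of 50, and the factor of 2.0 are hypothetical:
{code:java}
final class LoadCheckSketch {
  private LoadCheckSketch() { }

  /** A DN only counts as overloaded if it is above both the configured floor
   *  and factor * average cluster load. */
  static boolean isOverloaded(int xceiverCount, double avgLoad,
                              int minThreshold /* e.g. 50 */, double factor /* e.g. 2.0 */) {
    return xceiverCount >= minThreshold && xceiverCount > factor * avgLoad;
  }
}
{code}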

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch, HDFS-14811.002.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> 

[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928128#comment-16928128
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/12/19 1:39 AM:


{quote}Let's see if we can fix the count of active threads.
{quote}
[~elgoiri], HDFS-12288 won't help in this case. When a client is writing data 
to 1 DN, it starts 2 threads ({{DatanodeXceiver}} and {{PacketResponder}}); 
after the patch of HDFS-12288, the {{DatanodeXceiverServer}} thread will not 
be included in the {{xceiverCount}} of the heartbeat. In this case, there will 
be 5 DNs with an {{xceiverCount}} equal to 0 and 1 DN equal to 2 (which is 
overloaded).


was (Author: zhangchen):
{quote}Let's see if we can fix the count of active threads.
{quote}
[~elgoiri],HDFS-12288 won't help in this case. when some client writing data on 
1 DN, it will start 2 threads(\{{DatanodeXceiver}} and {{PacketResponder}}), 
after the patch of HDFS-12288, {{DatanodeXceiverServer}} will not included in 
{{xceiverCount}} of heartbeat. In this case, there will be 5 DN with 
{{xceiverCount}} equals to 0 and 1 DN equals to 2(which is overloaded).
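Working through the numbers of this scenario; the 2.0 load-comparison factor is an assumption inferred from log lines like "load: 3 > 2.6665":
{code:java}
public class LoadMathSketch {
  public static void main(String[] args) {
    double totalLoad = 2;             // one writing DN with 2 transfer threads, five idle DNs
    double avgLoad = totalLoad / 6;   // ~0.33
    double limit = 2.0 * avgLoad;     // ~0.67 (assumed factor of 2.0 times the average load)
    System.out.println(2 > limit);    // true: the only busy DN is rejected as "too busy"
  }
}
{code}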


[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928128#comment-16928128
 ] 

Chen Zhang commented on HDFS-14811:
---

{quote}Let's see if we can fix the count of active threads.
{quote}
[~elgoiri],HDFS-12288 won't help in this case. when some client writing data on 
1 DN, it will start 2 threads(\{{DatanodeXceiver}} and {{PacketResponder}}), 
after the patch of HDFS-12288, {{DatanodeXceiverServer}} will not included in 
{{xceiverCount}} of heartbeat. In this case, there will be 5 DN with 
{{xceiverCount}} equals to 0 and 1 DN equals to 2(which is overloaded).

>   at 

[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928125#comment-16928125
 ] 

Chen Zhang commented on HDFS-12288:
---

Thanks [~elgoiri] for your response. I'll add more tests to make the patch complete.
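
As a rough sketch of the kind of test I have in mind (the class name, the assertions, and the assumption that an open write pipeline keeps exactly one DataXceiver alive are illustrative, not the final patch):
{code:java}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.junit.Test;

/** Sketch only: checks the intended post-patch xceiver count, not the old activeCount() value. */
public class TestXceiverCountSketch {

  @Test
  public void testXceiverCountDuringWrite() throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    try {
      cluster.waitActive();
      DataNode dn = cluster.getDataNodes().get(0);
      // After the fix, an idle DN should report 0: the DataXceiverServer thread
      // that lives in the same ThreadGroup must not be counted any more.
      assertEquals(0, dn.getXceiverCount());

      DistributedFileSystem fs = cluster.getFileSystem();
      try (FSDataOutputStream out = fs.create(new Path("/xceiver-count-test"))) {
        out.write(new byte[1024]);
        out.hflush(); // make sure the write pipeline (and its DataXceiver) is up
        // Only the single DataXceiver serving the pipeline should be counted,
        // not its PacketResponder companion thread.
        assertEquals(1, dn.getXceiverCount());
      }
    } finally {
      cluster.shutdown();
    }
  }
}
{code}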

> Fix DataNode's xceiver count calculation
> 
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, 
> HDFS-12288.003.patch
>
>
> The problem with the ThreadGroup.activeCount() method is that the method is 
> only a very rough estimate, and in reality returns the total number of 
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this to return 50~ for a long time, even though the 
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN 
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value 
> which only accounts for actual number of DataXcevier threads currently 
> running and thus represents the load on the DN much better.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-09-11 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang reassigned HDFS-12288:
-

Assignee: Chen Zhang  (was: Lukas Majercak)

> Fix DataNode's xceiver count calculation
> 
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, 
> HDFS-12288.003.patch
>
>
> The problem with the ThreadGroup.activeCount() method is that the method is 
> only a very rough estimate, and in reality returns the total number of 
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this to return 50~ for a long time, even though the 
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN 
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value 
> which only accounts for actual number of DataXcevier threads currently 
> running and thus represents the load on the DN much better.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927314#comment-16927314
 ] 

Chen Zhang edited comment on HDFS-12288 at 9/11/19 7:26 AM:


Hi [~shahrs87] [~elgoiri], do you have time to take a look? I changed the code according to the previous discussion and uploaded patch v3; it's not a complete patch, only a draft without tests.
{quote}The method {{DataNode#getActiveNumberOfThreads()}} will be return the 
sum of {{new DataNode#getXceiverCount() * 2}} + {{Num of Block recovery 
threads}}.
We just need to have another metric or member variable to track currently 
running {{Block recovery threads}}.
The reason we have multiplier of 2 is for every {{Dataxceiver}} thread, we also 
create {{Packet Responder thread}}
{quote}
Actually, not every DataXceiver thread creates a PacketResponder thread; only the xceivers processing a WRITE_BLOCK operation do, so I added 2 additional metrics: {{dataNodePacketResponderCount}} and {{dataNodeBlockRecoveryWorkerCount}}.

I think the variable name {{xceiverCount}} in the {{HeartbeatRequestProto}} also needs to be changed to another name (e.g. transferThreadCount), but changing this variable name affects a lot of logic on the NameNode side, including {{FSNameSystem}}, {{DataNodeManager}}, {{BlockManager}}, {{HeartBeatManager}}, {{DataNodeStats}} and so on; this variable name is used everywhere. Do you think it's necessary to change it?
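
For reference, tracking the two extra counts with the metrics2 library could look roughly like the sketch below; the class, field placement and method names are illustrative, not necessarily what the patch does:
{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

/** Sketch of the two extra gauges; names and placement are illustrative, not the patch. */
@Metrics(about = "DataNode transfer-thread metrics sketch", context = "dfs")
class DataNodeThreadMetricsSketch {

  @Metric MutableGaugeInt dataNodePacketResponderCount;
  @Metric MutableGaugeInt dataNodeBlockRecoveryWorkerCount;

  // A PacketResponder would incr() when it starts and decr() in a finally block
  // when its run() method exits; the BlockRecoveryWorker threads do the same.
  void incrPacketResponderCount()     { dataNodePacketResponderCount.incr(); }
  void decrPacketResponderCount()     { dataNodePacketResponderCount.decr(); }
  void incrBlockRecoveryWorkerCount() { dataNodeBlockRecoveryWorkerCount.incr(); }
  void decrBlockRecoveryWorkerCount() { dataNodeBlockRecoveryWorkerCount.decr(); }

  /** One possible definition of the overall transfer-thread load for the heartbeat. */
  int getTransferThreadCount(int activeXceivers) {
    return activeXceivers
        + dataNodePacketResponderCount.value()
        + dataNodeBlockRecoveryWorkerCount.value();
  }
}
{code}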


was (Author: zhangchen):
Hi [~shahrs87] [~elgoiri], do you have time to take a look? I changed the code 
according previous discussion, and uploaded patch v3, it's not a complete 
patch, only a draft without tests.
{quote}The method {{DataNode#getActiveNumberOfThreads()}} will be return the 
sum of {{new DataNode#getXceiverCount() * 2}} + {{Num of Block recovery 
threads}}.
We just need to have another metric or member variable to track currently 
running {{Block recovery threads}}.
The reason we have multiplier of 2 is for every {{Dataxceiver}} thread, we also 
create {{Packet Responder thread}}
{quote}
Actually not all the DataXceiver thread creates PacketResponder thread, only 
the xceiver processing WRITE_BLOCK operation will create a PacketResponder 
thread, so I added 2 additional metrics: {{dataNodePacketResponderCount}} and 
{{dataNodeBlockRecoveryWorkerCount}}

> Fix DataNode's xceiver count calculation
> 
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Lukas Majercak
>Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, 
> HDFS-12288.003.patch
>
>
> The problem with the ThreadGroup.activeCount() method is that the method is 
> only a very rough estimate, and in reality returns the total number of 
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this to return 50~ for a long time, even though the 
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN 
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value 
> which only accounts for actual number of DataXcevier threads currently 
> running and thus represents the load on the DN much better.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927314#comment-16927314
 ] 

Chen Zhang commented on HDFS-12288:
---

Hi [~shahrs87] [~elgoiri], do you have time to take a look? I changed the code according to the previous discussion and uploaded patch v3; it's not a complete patch, only a draft without tests.
{quote}The method {{DataNode#getActiveNumberOfThreads()}} will be return the 
sum of {{new DataNode#getXceiverCount() * 2}} + {{Num of Block recovery 
threads}}.
We just need to have another metric or member variable to track currently 
running {{Block recovery threads}}.
The reason we have multiplier of 2 is for every {{Dataxceiver}} thread, we also 
create {{Packet Responder thread}}
{quote}
Actually, not every DataXceiver thread creates a PacketResponder thread; only the xceivers processing a WRITE_BLOCK operation do, so I added 2 additional metrics: {{dataNodePacketResponderCount}} and {{dataNodeBlockRecoveryWorkerCount}}.

> Fix DataNode's xceiver count calculation
> 
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Lukas Majercak
>Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, 
> HDFS-12288.003.patch
>
>
> The problem with the ThreadGroup.activeCount() method is that the method is 
> only a very rough estimate, and in reality returns the total number of 
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this to return 50~ for a long time, even though the 
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN 
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value 
> which only accounts for actual number of DataXcevier threads currently 
> running and thus represents the load on the DN much better.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927307#comment-16927307
 ] 

Chen Zhang commented on HDFS-12288:
---

Thanks [~lukmajercak], I uploaded patch v3.

> Fix DataNode's xceiver count calculation
> 
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Lukas Majercak
>Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, 
> HDFS-12288.003.patch
>
>
> The problem with the ThreadGroup.activeCount() method is that the method is 
> only a very rough estimate, and in reality returns the total number of 
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this to return 50~ for a long time, even though the 
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN 
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value 
> which only accounts for actual number of DataXcevier threads currently 
> running and thus represents the load on the DN much better.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-09-11 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12288:
--
Attachment: HDFS-12288.003.patch

> Fix DataNode's xceiver count calculation
> 
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Lukas Majercak
>Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, 
> HDFS-12288.003.patch
>
>
> The problem with the ThreadGroup.activeCount() method is that the method is 
> only a very rough estimate, and in reality returns the total number of 
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this to return 50~ for a long time, even though the 
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN 
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value 
> which only accounts for actual number of DataXcevier threads currently 
> running and thus represents the load on the DN much better.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927286#comment-16927286
 ] 

Chen Zhang edited comment on HDFS-14609 at 9/11/19 6:47 AM:


Hi [~crh], thanks for your response. I have some explanation in this [comment|https://issues.apache.org/jira/browse/HDFS-14609?focusedCommentId=16907486=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16907486] about the change in the test's behavior, FYI.

For your question, the answer is:
{quote} # For the case {{testGetDelegationToken()}}, the server address is set by WebHdfsFileSystem after it gets the response, and the original address is the address of the RouterRpcServer. Since we now send the request via an HTTP connection directly, it's unnecessary to reset the address, so I removed this assert.
 # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception is also generated by WebHdfsFileSystem and the logic is very complex; I think it's also unnecessary to keep this assert, so I use the 403 detection instead (see the sketch below).{quote}
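
The 403 detection itself is just a plain HTTP status check; a minimal sketch (the request handling is simplified and the operation shown is only an example):
{code:java}
import java.net.HttpURLConnection;
import java.net.URL;

/** Sketch of the 403 check; the op and URL handling are simplified for illustration. */
final class ForbiddenCheckSketch {

  static boolean isForbidden(String webHdfsUrl) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(webHdfsUrl).openConnection();
    conn.setRequestMethod("PUT"); // e.g. op=CANCELDELEGATIONTOKEN is a PUT operation
    try {
      // A request carrying an invalid/cancelled token should be rejected with 403.
      return conn.getResponseCode() == HttpURLConnection.HTTP_FORBIDDEN;
    } finally {
      conn.disconnect();
    }
  }
}
{code}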


was (Author: zhangchen):
Hi [~crh], thanks for your response, I've some explaination in this comment 
about the change of test's behavior, FYI.

For your question, the answer is :
{quote} # For case {{testGetDelegationToken()}}, the server address is set by 
WebHdfsFileSystem after it get the response, the original address is the 
address of RouterRpcServer. Since we now send request by http connection 
directly, it's unnecessary to reset the address, so I removed this assert
 # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception 
is also generated by WebHdfsFileSystem and the logic is very complex, I think 
it's also unnecessary to keep this assert, I use the 403 detection 
instead.{quote}

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-11 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927286#comment-16927286
 ] 

Chen Zhang commented on HDFS-14609:
---

Hi [~crh], thanks for your response. I have some explanation in this comment about the change in the test's behavior, FYI.

For your question, the answer is:
{quote} # For the case {{testGetDelegationToken()}}, the server address is set by WebHdfsFileSystem after it gets the response, and the original address is the address of the RouterRpcServer. Since we now send the request via an HTTP connection directly, it's unnecessary to reset the address, so I removed this assert.
 # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception is also generated by WebHdfsFileSystem and the logic is very complex; I think it's also unnecessary to keep this assert, so I use the 403 detection instead.{quote}

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-10 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927187#comment-16927187
 ] 

Chen Zhang commented on HDFS-14811:
---

[~ayushtkn] [~elgoiri], any comments? Thanks

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch, HDFS-14811.002.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
> ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
> port 53197, call Call#1268 Retry#0 
> 

[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-10 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927185#comment-16927185
 ] 

Chen Zhang commented on HDFS-14609:
---

Ping [~crh] [~eyang]...
Could you help take a look? Thanks!

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-09 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925202#comment-16925202
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/9/19 7:27 AM:
---

Hi [~ayushtkn], I've gone through the discussion in HDFS-12288. The latest conclusion is to modify the getXceiverCount() method to return the real number of DataXceiver threads (the current value is much higher than the real number), but the load of each DN is still unchanged (it still uses activeNumberOfThread), so when a DN starts writing a block, its load would still be 3, which makes it overloaded.

My initial idea is quite the same as what [~lukmajercak] mentioned in HDFS-12288: do not count the PacketResponder thread when calculating the DN's load. But this solution does not look like a good choice.


was (Author: zhangchen):
Hi [~ayushtkn], I've gone through the discussion in HDFS-12288, the latest 
conclusion is to modify getXceiverCount() method to return real number of 
DataXceiver threads (current is much more than the real number), but the load 
of each DN is still not changed (using the activeNumberOfThread instead), so 
when a DN start writing a block, the load would still be 3, which makes it 
overloaded.

My initially idea is quite same as [~lukmajercak] mentioned at HDFS-12288: do 
not consider packetResponder thread when calculating DN's load. But this 
solution looks not a good choice.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch, HDFS-14811.002.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> 

[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-08 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925206#comment-16925206
 ] 

Chen Zhang commented on HDFS-14609:
---

Updated patch v5.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-08 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Attachment: HDFS-14609.005.patch

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch, HDFS-14609.005.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-08 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925202#comment-16925202
 ] 

Chen Zhang commented on HDFS-14811:
---

Hi [~ayushtkn], I've gone through the discussion in HDFS-12288. The latest conclusion is to modify the getXceiverCount() method to return the real number of DataXceiver threads (the current value is much higher than the real number), but the load of each DN is still unchanged (it still uses activeNumberOfThread), so when a DN starts writing a block, its load would still be 3, which makes it overloaded.

My initial idea is quite the same as what [~lukmajercak] mentioned in HDFS-12288: do not count the PacketResponder thread when calculating the DN's load. But this solution does not look like a good choice.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch, HDFS-14811.002.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at 

[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-08 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924765#comment-16924765
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/8/19 2:32 PM:
---

Uploaded patch v2 to disable the considerLoad option. I've run the whole test class (using {{mvn -Dtest=TestRouterRpc test}}) 50 times locally, and all of them passed after the patch.

I've filed another Jira, HDFS-14830, to track the xceiverCount problem.
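
For reference, the fix boils down to flipping one NameNode setting in the test cluster setup; a sketch (exactly where it lands in TestRouterRpc's setup may differ):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;

final class ConsiderLoadOffSketch {
  /** Fragment of the test cluster setup: keep briefly-busy DNs eligible as EC targets. */
  static void disableConsiderLoad(Configuration conf) {
    conf.setBoolean(DFSConfigKeys.DFS_NAMENODE_REDUNDANCY_CONSIDERLOAD_KEY, false);
  }
}
{code}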


was (Author: zhangchen):
Uploaded patch v2 to disable considerLoad option. I've run the whole class 
test(using {{mvn -Dtest=TestRouterRpc test}}) 50 times in local, all of them 
passed after patch.

I've filed another Jira HDFS-14803 to track the xceiverCount problem.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch, HDFS-14811.002.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at 

[jira] [Resolved] (HDFS-14830) The calculation of DataXceiver count is not accurate

2019-09-07 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang resolved HDFS-14830.
---
Resolution: Duplicate

> The calculation of DataXceiver count is not accurate
> 
>
> Key: HDFS-14830
> URL: https://issues.apache.org/jira/browse/HDFS-14830
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> DataNode use threadGroup.activeCount() as the number of DataXceiver, it's not 
> accurate since threadGroup includes not only DataXceiver thread and 
> DataXceiverServer thread, PacketResponder thread and BlockRecoveryWorker 
> thread is also in the same threadGroup.
> In the worst case, the reported DataXceiver count maybe double of actual 
> count(e.g. all DataXceiver process write block operation, they create same 
> number of PacketResponder thread at the same time).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-09-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924885#comment-16924885
 ] 

Chen Zhang commented on HDFS-14775:
---

Just realized that {{longestWriteLockHeldInfo}} doesn't need to be an AtomicReference type; updated the patch.
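
The shape of the change is roughly the sketch below; the field names, message format, and the plain synchronized bookkeeping are illustrative rather than the actual patch:
{code:java}
/** Illustrative sketch for HDFS-14775: keep the timestamp of the longest lock hold
 *  so the periodic report can be lined up with the audit log. Not the actual patch. */
class LongestLockHeldInfoSketch {

  private long longestHeldIntervalMs;
  private long longestHeldTimestampMs; // wall-clock time when the record was set
  private String longestHeldStackTrace;

  synchronized void recordLockHold(long heldIntervalMs, String stackTrace) {
    if (heldIntervalMs > longestHeldIntervalMs) {
      longestHeldIntervalMs = heldIntervalMs;
      longestHeldTimestampMs = System.currentTimeMillis();
      longestHeldStackTrace = stackTrace;
    }
  }

  synchronized String report() {
    return String.format("Longest write-lock held interval: %d ms at %s%n%s",
        longestHeldIntervalMs,
        new java.util.Date(longestHeldTimestampMs),
        longestHeldStackTrace);
  }
}
{code}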

> Add Timestamp for longest FSN write/read lock held log
> --
>
> Key: HDFS-14775
> URL: https://issues.apache.org/jira/browse/HDFS-14775
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14775.001.patch, HDFS-14775.002.patch, 
> HDFS-14775.003.patch
>
>
> HDFS-13946 improved the log for longest read/write lock held time, it's very 
> useful improvement.
> In some condition, we need to locate the detailed call information(user, ip, 
> path, etc.) for longest lock holder, but the default throttle interval(10s) 
> is too long to find the corresponding audit log. I think we should add the 
> timestamp for the {{longestWriteLockHeldStackTrace}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-09-07 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14775:
--
Attachment: HDFS-14775.003.patch

> Add Timestamp for longest FSN write/read lock held log
> --
>
> Key: HDFS-14775
> URL: https://issues.apache.org/jira/browse/HDFS-14775
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14775.001.patch, HDFS-14775.002.patch, 
> HDFS-14775.003.patch
>
>
> HDFS-13946 improved the log for longest read/write lock held time, it's very 
> useful improvement.
> In some condition, we need to locate the detailed call information(user, ip, 
> path, etc.) for longest lock holder, but the default throttle interval(10s) 
> is too long to find the corresponding audit log. I think we should add the 
> timestamp for the {{longestWriteLockHeldStackTrace}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14775) Add Timestamp for longest FSN write/read lock held log

2019-09-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924848#comment-16924848
 ] 

Chen Zhang commented on HDFS-14775:
---

Hi [~xkrogen] and [~linyiqun], do you have time to help review this patch? 
Thanks a lot.

> Add Timestamp for longest FSN write/read lock held log
> --
>
> Key: HDFS-14775
> URL: https://issues.apache.org/jira/browse/HDFS-14775
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14775.001.patch, HDFS-14775.002.patch
>
>
> HDFS-13946 improved the log for longest read/write lock held time, it's very 
> useful improvement.
> In some condition, we need to locate the detailed call information(user, ip, 
> path, etc.) for longest lock holder, but the default throttle interval(10s) 
> is too long to find the corresponding audit log. I think we should add the 
> timestamp for the {{longestWriteLockHeldStackTrace}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924776#comment-16924776
 ] 

Chen Zhang edited comment on HDFS-14609 at 9/7/19 8:42 AM:
---

Thanks [~elgoiri] for your review and comments.
{quote}TestRouterWithSecureStartup#HTTP_KERBEROS_PRINCIPAL_CONF_KEY, don't we 
have this defined somewhere else?{quote}
The conf key is formed by joining {{HttpServer2#authFilterConfigurationPrefix}} with the ".type" string, so I thought there might not be a complete key already defined somewhere else. But I was wrong: {{CommonConfigurationKeysPublic#HADOOP_HTTP_AUTHENTICATION_TYPE}} defines it. I'll update the patch to use it instead.
{quote}BTW, I've seen the use of equivalents to NoAuthFilterInitializer, don't we have this already in the common test lib?{quote}Can you point out where the equivalent of NoAuthFilterInitializer is? I tried to find it by going through all the subclasses of {{FilterInitializer}}, but found nothing.
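
In other words, the test can set the common key directly; a short sketch (the helper and the "kerberos" value are only illustrative):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;

final class AuthTypeConfSketch {
  /** Sketch: reuse the common "hadoop.http.authentication.type" key instead of
   *  rebuilding it from HttpServer2's auth-filter config prefix + ".type". */
  static void setHttpAuthType(Configuration conf, String type) {
    conf.set(CommonConfigurationKeysPublic.HADOOP_HTTP_AUTHENTICATION_TYPE, type);
  }
}
{code}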


was (Author: zhangchen):
Thanks [~elgoiri] for your review and comments.
{quote}TestRouterWithSecureStartup#HTTP_KERBEROS_PRINCIPAL_CONF_KEY, don't we 
have this defined somewhere else?{quote}
The conf key is joined by the {{HttpServer2#authFilterConfigurationPrefix}} and 
".type" string, so I thought that there might not be a complete key already 
defined in somewhere else. But I'm wrong, 
{{CommonConfigurationKeysPublic#HADOOP_HTTP_AUTHENTICATION_TYPE}} defines it. 
I'll update the patch to use this instead.
{quote}BTW, I've seen the use of equivalents to NoAuthFilterInitializer, don't 
we have this already in the common test lib?{quote}Can you point out where is 
the equivalents of NoAuthFilterInitializer? I try to find it by going through 
all the sub class of {{FilterInitializer}}, but not found.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924776#comment-16924776
 ] 

Chen Zhang commented on HDFS-14609:
---

Thanks [~elgoiri] for your review and comments.
{quote}TestRouterWithSecureStartup#HTTP_KERBEROS_PRINCIPAL_CONF_KEY, don't we 
have this defined somewhere else?{quote}
The conf key is formed by joining {{HttpServer2#authFilterConfigurationPrefix}} with the ".type" string, so I thought there might not be a complete key already defined somewhere else. But I was wrong: {{CommonConfigurationKeysPublic#HADOOP_HTTP_AUTHENTICATION_TYPE}} defines it. I'll update the patch to use this instead.
{quote}BTW, I've seen the use of equivalents to NoAuthFilterInitializer, don't we have this already in the common test lib?{quote}Can you point out where the equivalent of NoAuthFilterInitializer is? I tried to find it by going through all the subclasses of {{FilterInitializer}}, but found nothing.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation

2019-09-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924767#comment-16924767
 ] 

Chen Zhang commented on HDFS-12288:
---

Hi [~lukmajercak], are you still working on this Jira? I hit the same issue in another Jira, HDFS-14830.

> Fix DataNode's xceiver count calculation
> 
>
> Key: HDFS-12288
> URL: https://issues.apache.org/jira/browse/HDFS-12288
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Lukas Majercak
>Priority: Major
> Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch
>
>
> The problem with the ThreadGroup.activeCount() method is that the method is 
> only a very rough estimate, and in reality returns the total number of 
> threads in the thread group as opposed to the threads actually running.
> In some DNs, we saw this to return 50~ for a long time, even though the 
> actual number of DataXceiver threads was next to none.
> This is a big issue as we use the xceiverCount to make decisions on the NN 
> for choosing replication source DN or returning DNs to clients for R/W.
> The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value 
> which only accounts for actual number of DataXcevier threads currently 
> running and thus represents the load on the DN much better.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14830) The calculation of DataXceiver count is not accurate

2019-09-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924766#comment-16924766
 ] 

Chen Zhang commented on HDFS-14830:
---

Thanks [~elgoiri], they look like the same issue. I'll try to follow HDFS-12288 first.

> The calculation of DataXceiver count is not accurate
> 
>
> Key: HDFS-14830
> URL: https://issues.apache.org/jira/browse/HDFS-14830
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> DataNode use threadGroup.activeCount() as the number of DataXceiver, it's not 
> accurate since threadGroup includes not only DataXceiver thread and 
> DataXceiverServer thread, PacketResponder thread and BlockRecoveryWorker 
> thread is also in the same threadGroup.
> In the worst case, the reported DataXceiver count maybe double of actual 
> count(e.g. all DataXceiver process write block operation, they create same 
> number of PacketResponder thread at the same time).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-07 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924765#comment-16924765
 ] 

Chen Zhang commented on HDFS-14811:
---

Uploaded patch v2 to disable the considerLoad option. I've run the whole test class (using {{mvn -Dtest=TestRouterRpc test}}) 50 times locally; all runs passed after the patch.

I've filed another Jira, HDFS-14830, to track the xceiverCount problem.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch, HDFS-14811.002.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 

[jira] [Updated] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-07 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14811:
--
Attachment: HDFS-14811.002.patch

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch, HDFS-14811.002.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
> ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
> port 53197, call Call#1268 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 

[jira] [Assigned] (HDFS-14830) The calculation of DataXceiver count is not accurate

2019-09-06 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang reassigned HDFS-14830:
-

Assignee: Chen Zhang

> The calculation of DataXceiver count is not accurate
> 
>
> Key: HDFS-14830
> URL: https://issues.apache.org/jira/browse/HDFS-14830
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> DataNode use threadGroup.activeCount() as the number of DataXceiver, it's not 
> accurate since threadGroup includes not only DataXceiver thread and 
> DataXceiverServer thread, PacketResponder thread and BlockRecoveryWorker 
> thread is also in the same threadGroup.
> In the worst case, the reported DataXceiver count maybe double of actual 
> count(e.g. all DataXceiver process write block operation, they create same 
> number of PacketResponder thread at the same time).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14830) The calculation of DataXceiver count is not accurate

2019-09-06 Thread Chen Zhang (Jira)
Chen Zhang created HDFS-14830:
-

 Summary: The calculation of DataXceiver count is not accurate
 Key: HDFS-14830
 URL: https://issues.apache.org/jira/browse/HDFS-14830
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chen Zhang


DataNode uses threadGroup.activeCount() as the number of DataXceivers. This is not accurate, because the threadGroup does not contain only DataXceiver threads: the DataXceiverServer thread, PacketResponder threads, and BlockRecoveryWorker threads are also in the same threadGroup.

In the worst case, the reported DataXceiver count may be double the actual count (e.g. when all DataXceivers are processing write-block operations, they each create a PacketResponder thread at the same time).
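A small self-contained demo of the overcount (purely illustrative, it only mimics the thread naming, not the real DataNode code): one in-flight write keeps two live threads in the group, so activeCount() reports 2 where the true xceiver count is 1.

{code:java}
public class XceiverOvercountDemo {
  public static void main(String[] args) throws InterruptedException {
    ThreadGroup group = new ThreadGroup("dataXceiverServer");
    Runnable blockedWork = () -> {
      try {
        Thread.sleep(10_000L);   // stand-in for an in-flight block transfer
      } catch (InterruptedException ignored) {
      }
    };
    // A single incoming write creates both of these threads in the same group.
    new Thread(group, blockedWork, "DataXceiver").start();
    new Thread(group, blockedWork, "PacketResponder").start();
    Thread.sleep(100L);          // give the threads time to start
    // Prints 2 (a real DN would also count DataXceiverServer, BlockRecoveryWorker, ...),
    // even though only one DataXceiver is actually running.
    System.out.println(group.activeCount());
  }
}
{code}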



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-06 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924717#comment-16924717
 ] 

Chen Zhang commented on HDFS-14609:
---

[~crh] [~elgoiri], do you have time to help review the latest patch?

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-06 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924386#comment-16924386
 ] 

Chen Zhang commented on HDFS-14609:
---

Submitted patch v4 to fix the checkstyle error.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-06 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Attachment: (was: HDFS-14609.003.patch)

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-06 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Attachment: HDFS-14609.004.patch

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.004.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-06 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Attachment: HDFS-14609.003.patch

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.003.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-06 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924229#comment-16924229
 ] 

Chen Zhang commented on HDFS-14609:
---

Sorry, I uploaded the wrong file for patch v3; re-submitting it again.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch, HDFS-14609.003.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-06 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923952#comment-16923952
 ] 

Chen Zhang commented on HDFS-14609:
---

HDFS-14784 has been resolved and committed to trunk. I rebased the patch onto the latest trunk code and uploaded patch v3.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-09-06 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Attachment: HDFS-14609.003.patch

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch, 
> HDFS-14609.003.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HDFS-16354 in trunk, auth filters seems to have been changed causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14784) Add more methods to WebHdfsTestUtil to support tests outside of package

2019-09-05 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923945#comment-16923945
 ] 

Chen Zhang commented on HDFS-14784:
---

Thanks [~crh] for the review and [~elgoiri] for pushing this Jira forward. I'll 
update HDFS-14609 soon.

> Add more methods to WebHdfsTestUtil to support tests outside of package
> ---
>
> Key: HDFS-14784
> URL: https://issues.apache.org/jira/browse/HDFS-14784
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14784.001.patch, HDFS-14784.002.patch
>
>
> Before HDFS-14434, we can access a secure cluster by WebHDFS using user.name 
> parameter and {{PseudoAuthenticationHandler}} without kerberos 
> authentication, it's quite useful for some test situation.
> HDFS-14434 ignores user.name query parameter in secure WebHDFS when we using 
> WebHdfsFileSystem, so the only way to use user.name parameter is to access by 
> URL.
> This Jira try to add more methods to WebHdfsTestUtil to support UT out of 
> package to test WebHDFS in customize way.
> More background and discuss, see HDFS-14609.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-05 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923661#comment-16923661
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/6/19 12:44 AM:


Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count number is not accurate.
 ## When the BPServiceActor sends a heartbeat to the NN, it uses {{DataNode#getXceiverCount()}} to get the xceiver count.
 ## {{DataNode#getXceiverCount()}} actually uses {{ThreadGroup#activeCount()}} as the xceiver count.
 ## But the threadGroup does not hold only DataXceiver threads; the DataXceiverServer and PacketResponder threads are also added to the same threadGroup.
 ## So if a DN sends a heartbeat while it's receiving 1 block, the reported xceiver count will be 3. If the DN is idle, the reported xceiver count will be 1 (only the DataXceiverServer thread).
 ## Here are some tracing logs taken before sending a heartbeat; I dumped all the thread information in the threadGroup and the xceiverCount from {{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
   
Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002, 
type=LAST_IN_PIPELINE,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56108, xceiverCount: 3{code}

 # When allocating new blocks, the xceiver count on the NN may not be updated yet.
 ## In this test, the createFile method sets the replication factor to 1 for each file, so every file writes only 1 replica.
 ## So the NN may receive heartbeats like this: 5 DNs have 1 xceiver each and 1 DN has 3 xceivers.
 ## If we complete a regular file and then start an EC file, the NN may not have received the latest heartbeat yet, so the DN with 3 xceivers will be considered overloaded (the limit is (5*1+3)/6*2 = 2.666...).
 ## 
{code:java}
Datanode 127.0.0.1:56108 is not chosen since the node is too busy (load: 3 > 
2.6665){code}
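For reference, this is roughly what the "too busy" rule in the log above amounts to (a simplified sketch with my own identifiers, not the exact BlockPlacementPolicyDefault code):

{code:java}
public class TooBusySketch {
  // A node is excluded when its xceiver count exceeds twice the cluster average.
  static boolean tooBusy(int nodeXceivers, int totalXceivers, int datanodes) {
    double maxLoad = 2.0 * totalXceivers / datanodes;
    return nodeXceivers > maxLoad;
  }

  public static void main(String[] args) {
    // 5 DNs report 1 xceiver each and 1 DN reports 3: total 8 over 6 DNs, limit ~2.666.
    System.out.println(tooBusy(3, 8, 6));   // true -> "load: 3 > 2.666..."
  }
}
{code}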

If we want to work around this, we can either simply stop considering load, or trigger and wait for heartbeats before creating a new EC file. I think both are OK; what's your opinion?
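A sketch of how the first option could look in the test setup (assuming the test passes a custom conf to the mini cluster; the actual patch may wire it differently):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ConsiderLoadOffConf {
  // Build a conf that stops the NN from excluding a momentarily "busy" DN
  // when choosing block placement targets.
  static Configuration newTestConf() {
    Configuration conf = new HdfsConfiguration();
    conf.setBoolean("dfs.namenode.redundancy.considerLoad", false);
    return conf;
  }
}
{code}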

Furthermore, should we fix the inaccuracy of the xceiver count itself? In the worst case, the reported xceiver count may be double the actual xceiver count (every xceiver processing a WRITE_BLOCK op creates an extra PacketResponder thread).


was (Author: zhangchen):
Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count number is not accurate.
 ## When bpServiceActor send heartbeat to NN, it use 
\{{DataNode#getXceiverCount()}} to get xceiver count
 ## {\{DataNode#getXceiverCount()}} actually use the 
\{{ThreadGroup#activeCount()}} as the xceiver count
 ## But threadGroup is not only related with DataXceiver threads, the 
DataXceiverServer and PacketResponder is also added to the same threadGroup.
 ## So if a DN sends heartbeat when it's receiving 1 block, the reported 
xceiver count would be 3. If DN is free, the reported xceiver count would be 1 
(only includes DataXceiverServer thread)
 ## Here is some tracing logs before sending heartbeat, I dumped all the treads 
information in threadGroup and the xcevierCount from 
\{{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
   
Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002, 

[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-05 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923661#comment-16923661
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/5/19 6:02 PM:
---

Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count number is not accurate.
 ## When bpServiceActor send heartbeat to NN, it use 
\{{DataNode#getXceiverCount()}} to get xceiver count
 ## {\{DataNode#getXceiverCount()}} actually use the 
\{{ThreadGroup#activeCount()}} as the xceiver count
 ## But threadGroup is not only related with DataXceiver threads, the 
DataXceiverServer and PacketResponder is also added to the same threadGroup.
 ## So if a DN sends heartbeat when it's receiving 1 block, the reported 
xceiver count would be 3. If DN is free, the reported xceiver count would be 1 
(only includes DataXceiverServer thread)
 ## Here is some tracing logs before sending heartbeat, I dumped all the treads 
information in threadGroup and the xcevierCount from 
\{{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]
   
Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002, 
type=LAST_IN_PIPELINE,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56108, xceiverCount: 3{code}

 # When allocating new blocks, the xceiver count in NN may not updated.
 ## In this test, createFile method sets replication factor to 1 for 
each file, so every file will write only 1 replica
 ## So the NN may receive heartbeat like this: 5 DN have 1 xceiver and 1 DN 
have 3 xceivers
 ## If we complete a common file and start another EC file, the NN may not 
receive the latest heartbeat, so the DN with 3 xceiver will be considered as 
overloaded(average is (5*1+3)/6*2=2.666...
 ## 
{code:java}
Datanode 127.0.0.1:56108 is not chosen since the node is too busy (load: 3 > 
2.6665){code}

If we want to work around, we can simply don't consider load, or trigger and 
wait for heartbeat before creating a new EC file, I think both is ok, what's 
your opinion?

Further more, should we fix the inaccurate problem of xceiver count? In the 
worst case, the reported xceiver count may be  the double of actual xceiver 
count.


was (Author: zhangchen):
Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count number is not accurate.
 ## When bpServiceActor send heartbeat to NN, it use 
\{{DataNode#getXceiverCount()}} to get xceiver count
 ## {\{DataNode#getXceiverCount()}} actually use the 
\{{ThreadGroup#activeCount()}} as the xceiver count
 ## But threadGroup is not only related with DataXceiver threads, the 
DataXceiverServer and PacketResponder is also added to the same threadGroup.
 ## So if a DN sends heartbeat when it's receiving 1 block, the reported 
xceiver count would be 3. If DN is free, the reported xceiver count would be 1 
(only includes DataXceiverServer thread)
 ## Here is some tracing logs before sending heartbeat, I dumped all the treads 
information in threadGroup and the xcevierCount from 
\{{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 

[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-05 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923661#comment-16923661
 ] 

Chen Zhang commented on HDFS-14811:
---

Thanks [~ayushtkn] for your comments, I think I've found the root cause now.
 # The xceiver count number is not accurate.
 ## When bpServiceActor send heartbeat to NN, it use 
\{{DataNode#getXceiverCount()}} to get xceiver count
 ## {\{DataNode#getXceiverCount()}} actually use the 
\{{ThreadGroup#activeCount()}} as the xceiver count
 ## But threadGroup is not only related with DataXceiver threads, the 
DataXceiverServer and PacketResponder is also added to the same threadGroup.
 ## So if a DN sends heartbeat when it's receiving 1 block, the reported 
xceiver count would be 3. If DN is free, the reported xceiver count would be 1 
(only includes DataXceiverServer thread)
 ## Here is some tracing logs before sending heartbeat, I dumped all the treads 
information in threadGroup and the xcevierCount from 
\{{ThreadGroup#activeCount()}}:
 ## 
{code:java}
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56120, xceiverCount: 1


java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@ea9b7c6,5,dataXceiverServer]
java.lang.ThreadGroup[name=dataXceiverServer,maxpri=10]

Thread[org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@3b220bcb,5,dataXceiverServer]
Thread[DataXceiver for client DFSClient_NONMAPREDUCE_-2116461392_1 at 
/127.0.0.1:56218 [Receiving block 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002],5,dataXcei
verServer]
Thread[PacketResponder: 
BP-829366685-192.168.1.3-1567701683406:blk_1073741826_1002, 
type=LAST_IN_PIPELINE,5,dataXceiverServer]
2019-09-06 00:41:49,182 [BP-664190069-192.168.1.3-1567701681845 heartbeating to 
localhost/127.0.0.1:56096] INFO  datanode.DataNode 
(BPServiceActor.java:sendHeartBeat(537)) - send
HeartBeat from 127.0.0.1:56108, xceiverCount: 3{code}

 # When allocating new blocks, the xceiver count in NN may not updated.
 ## In this test, createFile method sets replication factor to 1 for 
each file, so every file will write only 1 replica
 ## So the NN may receive heartbeat like this: 5 DN have 1 xceiver and 1 DN 
have 3 xceivers
 ## If we complete a common file and start another EC file, the NN may not 
receive the latest heartbeat, so the DN with 3 xceiver will be considered as 
overloaded(average is (5*1+3)/6*2=2.666...
 ## 
{code:java}
Datanode 127.0.0.1:56108 is not chosen since the node is too busy (load: 3 > 
2.6665){code}

If we want to work around, we can simply don't consider load, or trigger and 
wait for heartbeat before creating a new EC file, I think both is ok, what's 
your opinion?

Further more, should we fix the inaccurate problem of xceiver count? In the 
worst case, the reported xceiver count may be  the double of actual xceiver 
count.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy 

[jira] [Commented] (HDFS-14784) Add more methods to WebHdfsTestUtil to support tests outside of package

2019-09-04 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923001#comment-16923001
 ] 

Chen Zhang commented on HDFS-14784:
---

The Jenkins build was not triggered; re-submitting patch v2 to trigger it.

> Add more methods to WebHdfsTestUtil to support tests outside of package
> ---
>
> Key: HDFS-14784
> URL: https://issues.apache.org/jira/browse/HDFS-14784
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14784.001.patch, HDFS-14784.002.patch, 
> HDFS-14784.002.patch
>
>
> Before HDFS-14434, we can access a secure cluster by WebHDFS using user.name 
> parameter and {{PseudoAuthenticationHandler}} without kerberos 
> authentication, it's quite useful for some test situation.
> HDFS-14434 ignores user.name query parameter in secure WebHDFS when we using 
> WebHdfsFileSystem, so the only way to use user.name parameter is to access by 
> URL.
> This Jira try to add more methods to WebHdfsTestUtil to support UT out of 
> package to test WebHDFS in customize way.
> More background and discuss, see HDFS-14609.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14784) Add more methods to WebHdfsTestUtil to support tests outside of package

2019-09-04 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14784:
--
Attachment: HDFS-14784.002.patch

> Add more methods to WebHdfsTestUtil to support tests outside of package
> ---
>
> Key: HDFS-14784
> URL: https://issues.apache.org/jira/browse/HDFS-14784
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14784.001.patch, HDFS-14784.002.patch, 
> HDFS-14784.002.patch
>
>
> Before HDFS-14434, we can access a secure cluster by WebHDFS using user.name 
> parameter and {{PseudoAuthenticationHandler}} without kerberos 
> authentication, it's quite useful for some test situation.
> HDFS-14434 ignores user.name query parameter in secure WebHDFS when we using 
> WebHdfsFileSystem, so the only way to use user.name parameter is to access by 
> URL.
> This Jira try to add more methods to WebHdfsTestUtil to support UT out of 
> package to test WebHDFS in customize way.
> More background and discuss, see HDFS-14609.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-04 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922646#comment-16922646
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/5/19 2:37 AM:
---

From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can deduce that there are 8 xceivers running (2.666.../2 * 6 = 8): 1 node has 3 xceivers and the other 5 nodes have 5 xceivers in total. I think this easily happens when 2 clients are writing and 2 clients are reading: since the write targets and read targets are chosen randomly, it is possible that 2 read clients hit the same DN while that DN is writing a block at the same time.
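The arithmetic behind that deduction, spelled out (plain illustration, not Hadoop code):

{code:java}
public class BusyLimitArithmetic {
  public static void main(String[] args) {
    double loggedLimit = 2.6665;          // "load: 3 > 2.6665" from the log above
    int datanodes = 6;
    double avgLoad = loggedLimit / 2.0;   // the limit is 2x the average load per DN
    double totalXceivers = avgLoad * datanodes;
    System.out.printf("avg=%.3f, total=%.1f%n", avgLoad, totalXceivers);   // ~1.333, ~8.0
  }
}
{code}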

In a normal cluster, the NN would choose other DNs, but in this special case there is no other choice once any DN is overloaded. As I described above, we can't prevent some read clients from concentrating on 1 DN, so we can't avoid the allocation failure if we consider load.

You are right, marking {{dfs.namenode.redundancy.considerLoad}} as false is the 
right solution.


was (Author: zhangchen):
From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can conduct that there is 8 xceiver running (2./2*6 = 8), and 1 node 
have 3 xceivers and other 5 node have 5 xceiver in total. I think it's easy to 
happen when 2 client is writing and 2 client is reading, since the write target 
and read target is chosen randomly, it is possible that 2 read clients read on 
the same DN and that DN is writing a block at the same time.

In normal cluster, NN will chose other DN, but in this special case, there is 
no other choice when any DN overloaded. As I described above, we can't avoid 
some read clients concentrate on 1 DN, then we can't avoid the allocation 
failure if we consider load.

You are right, marking {{dfs.namenode.redundancy.considerLoad}} as false is the 
right solution.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File 

[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-04 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922646#comment-16922646
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/4/19 4:33 PM:
---

From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can deduce that there are 8 xceivers running (2.666.../2 * 6 = 8): 1 node has 3 xceivers and the other 5 nodes have 5 xceivers in total. I think this easily happens when 2 clients are writing and 2 clients are reading: since the write targets and read targets are chosen randomly, it is possible that 2 read clients hit the same DN while that DN is writing a block at the same time.

In normal cluster, NN will chose other DN, but in this special case, there is 
no other choice when any DN overloaded. As I described above, we can't avoid 
some read clients concentrate on 1 DN, then we can't avoid the allocation 
failure if we consider load.

You are right, marking {{dfs.namenode.redundancy.considerLoad}} as false is the 
right solution.


was (Author: zhangchen):
From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can conduct that there is 8 xceiver running (2./2*6 = 8), and 1 node 
have 3 xceivers and other 5 node have 5 xceiver in total. I think it's easy to 
happen when 2 client is writing and 2 client is reading, since the write target 
and read target is chosen randomly, it is possible that 2 read clients read on 
the same DN and that DN is writing a block at the same time.

In normal cluster, NN will chose other DN, but in this special case, there is 
no other choice when any DN overloaded. As I described above, we can't avoid 
some read clients concentrate on 1 DN, then we can't avoid the allocation 
failure if we consider load.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) 

[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-04 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922646#comment-16922646
 ] 

Chen Zhang commented on HDFS-14811:
---

From the error log:
{quote}Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 
3 > 2.6665)
{quote}
We can deduce that there are 8 xceivers running (2.666.../2 * 6 = 8): 1 node has 3 xceivers and the other 5 nodes have 5 xceivers in total. I think this easily happens when 2 clients are writing and 2 clients are reading: since the write targets and read targets are chosen randomly, it is possible that 2 read clients hit the same DN while that DN is writing a block at the same time.

In normal cluster, NN will chose other DN, but in this special case, there is 
no other choice when any DN overloaded. As I described above, we can't avoid 
some read clients concentrate on 1 DN, then we can't avoid the allocation 
failure if we consider load.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>  

[jira] [Comment Edited] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-03 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921450#comment-16921450
 ] 

Chen Zhang edited comment on HDFS-14811 at 9/3/19 2:38 PM:
---

I've verified the patch: I ran the test via {{mvn -Dtest=TestRouterRpc test}} 50 times using a shell script. Before the patch, the test failed 8 times; after the patch, all runs succeeded.


was (Author: zhangchen):
I've verify the patch, run the test by call {{mvn -Dtest=TestRouterRpc test}} 
50 times using shell script. Before the patch, the test failed 8 times, after 
the patch they all succeed.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 

[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-03 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921450#comment-16921450
 ] 

Chen Zhang commented on HDFS-14811:
---

I've verified the patch by running the test with {{mvn -Dtest=TestRouterRpc test}} 
50 times from a shell script. Before the patch the test failed 8 times; after 
the patch it passed every time.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
> ipc.Server 

[jira] [Updated] (HDFS-14784) Add more methods to WebHdfsTestUtil to support tests outside of package

2019-09-03 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14784:
--
Status: Open  (was: Patch Available)

> Add more methods to WebHdfsTestUtil to support tests outside of package
> ---
>
> Key: HDFS-14784
> URL: https://issues.apache.org/jira/browse/HDFS-14784
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14784.001.patch, HDFS-14784.002.patch
>
>
> Before HDFS-14434, we could access a secure cluster over WebHDFS using the 
> user.name parameter and {{PseudoAuthenticationHandler}} without Kerberos 
> authentication, which is quite useful in some test situations.
> HDFS-14434 ignores the user.name query parameter in secure WebHDFS when using 
> WebHdfsFileSystem, so the only way to use the user.name parameter is to access 
> WebHDFS by URL directly.
> This Jira adds more methods to WebHdfsTestUtil so that unit tests outside the 
> package can test WebHDFS in a customized way.
> For more background and discussion, see HDFS-14609.
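
For illustration, here is a minimal, hypothetical sketch of the URL-based access 
described above: a plain HTTP GET against the WebHDFS REST endpoint with the 
user.name query parameter. The host, port, path and user name are placeholders, 
it assumes an endpoint that accepts pseudo authentication, and it is not part of 
the attached patches.
{code:java}
// Hypothetical sketch of accessing WebHDFS by URL with the user.name query
// parameter. Host, port, path and user are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WebHdfsUserNameExample {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://namenode.example.com:9870/webhdfs/v1/tmp/testfile"
        + "?op=GETFILESTATUS&user.name=alice");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line); // JSON FileStatus response
      }
    } finally {
      conn.disconnect();
    }
  }
}
{code}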



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14784) Add more methods to WebHdfsTestUtil to support tests outside of package

2019-09-03 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14784:
--
Status: Patch Available  (was: Open)

Trying to trigger the Jenkins build again.

> Add more methods to WebHdfsTestUtil to support tests outside of package
> ---
>
> Key: HDFS-14784
> URL: https://issues.apache.org/jira/browse/HDFS-14784
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14784.001.patch, HDFS-14784.002.patch
>
>
> Before HDFS-14434, we could access a secure cluster over WebHDFS using the 
> user.name parameter and {{PseudoAuthenticationHandler}} without Kerberos 
> authentication, which is quite useful in some test situations.
> HDFS-14434 ignores the user.name query parameter in secure WebHDFS when using 
> WebHdfsFileSystem, so the only way to use the user.name parameter is to access 
> WebHDFS by URL directly.
> This Jira adds more methods to WebHdfsTestUtil so that unit tests outside the 
> package can test WebHDFS in a customized way.
> For more background and discussion, see HDFS-14609.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-03 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14811:
--
Status: Patch Available  (was: Open)

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
> ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
> port 53197, call Call#1268 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
> 

[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-03 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921279#comment-16921279
 ] 

Chen Zhang commented on HDFS-14811:
---

Uploaded patch v1; I'll verify locally whether it works.

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
> ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
> port 53197, call Call#1268 Retry#0 
> 

[jira] [Commented] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-03 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921278#comment-16921278
 ] 

Chen Zhang commented on HDFS-14811:
---

Hi [~ayushtkn], since the test failed because a DN was overloaded, and we should 
not change the DN count, I think we can work around it with a bigger 
considerLoadFactor (the default is 2).

Since it's just a test and the load can't get very high, a bigger 
considerLoadFactor won't have any bad impact.
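
To make that concrete, below is a minimal sketch of raising the consider-load 
factor in a test configuration. It assumes the 
{{dfs.namenode.redundancy.considerLoad.factor}} key documented in 
hdfs-default.xml; the value 4.0 is an arbitrary example and not necessarily what 
the attached patch uses.
{code:java}
// Hypothetical sketch: raise the consider-load factor so a DataNode in a busy
// mini cluster is not rejected as "too busy" during block placement.
// The key name is the one documented in hdfs-default.xml (default is 2.0).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ConsiderLoadFactorExample {
  public static Configuration buildTestConf() {
    Configuration conf = new HdfsConfiguration();
    conf.setDouble("dfs.namenode.redundancy.considerLoad.factor", 4.0);
    return conf;
  }
}
{code}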

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at 

[jira] [Updated] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-03 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14811:
--
Attachment: HDFS-14811.001.patch

> RBF: TestRouterRpc#testErasureCoding is flaky
> -
>
> Key: HDFS-14811
> URL: https://issues.apache.org/jira/browse/HDFS-14811
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14811.001.patch
>
>
> The Failed reason:
> {code:java}
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
> Node /default-rack/127.0.0.1:53148 [
> ]
> Node /default-rack/127.0.0.1:53161 [
> ]
> Node /default-rack/127.0.0.1:53157 [
>   Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 
> > 2.6665).
> Node /default-rack/127.0.0.1:53143 [
> ]
> Node /default-rack/127.0.0.1:53165 [
> ]
> 2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas 
> was chosen. Reason: {NODE_TOO_BUSY=1}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) 
> - Failed to place enough replicas: expected size is 1 but only 0 storage 
> types can be selected (replication=6, selected=[], unavailable=[DISK], 
> removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
> blockmanagement.BlockPlacementPolicy 
> (BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
> replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
> required storage types are unavailable:  unavailableStorages=[DISK], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> 2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
> ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
> port 53140, call Call#1270 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
> java.io.IOException: File /testec/testfile2 could only be written to 5 of the 
> 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 
> node(s) are excluded in this operation.
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
> 2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
> ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
> port 53197, call Call#1268 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
> 

[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-03 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921211#comment-16921211
 ] 

Chen Zhang commented on HDFS-13157:
---

Thanks [~sodonnell] for your detailed analysis. Since the comments are very 
long, I'll try to outline the conclusions we have reached so far:
 # When decommissioning a DN, its blocks are added to the replication queue one 
disk after another.
 # When scheduling replication work, the redundancyMonitor takes blocks from the 
highest-priority queue one by one, so for each DN the scheduled blocks 
concentrate on a single disk.
 # In some cases, the decommission may in effect happen on only one DN (e.g. 
when the first liveNode*2 blocks all come from one DN and those blocks have no 
overlap with the other decommissioning nodes).
 ## *My thoughts here*: in most cases all DNs do decommission at the same time, 
but the 1st node finishes much faster, which is also not what we want.
 # [~sodonnell] observed that the NN lock may be held for a very long time while 
processing a node for decommission.

Actually we've encountered all of these problems on our production cluster; I'd 
like to share our solutions here (a sketch of the first idea follows after this 
comment):
 # Add a new replication queue implementation that randomizes the order in 
which blocks are taken from it.
 # Add a configuration that makes the NN release the lock after every 1 
(configurable) blocks processed.

*I think randomizing the replication queue is better than randomizing the 
blockIterator,* because randomizing the blockIterator won't resolve issue 3 
above and makes the optimization for issue 4 harder.

*But randomizing the replication queue has a drawback we should consider*: 
usually the blocks at the head should be scheduled before the blocks at the 
tail, because the longer a block waits for replication, the higher its 
probability of data loss; randomizing the order makes that ordering impossible.
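
As referenced above, here is a minimal, hypothetical sketch of the randomized 
replication queue idea. It is not the actual LowRedundancyBlocks code or any 
real patch; it only illustrates handing blocks out in shuffled order, and it 
also makes the trade-off visible: shuffling discards the head-of-queue 
(longest-waiting-first) ordering.
{code:java}
// Hypothetical sketch of the "randomized replication queue" idea described
// above. Blocks are added as usual but polled in shuffled order, so replication
// work spreads across disks/DataNodes instead of draining one disk at a time.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomizedReplicationQueue<B> {
  private final List<B> blocks = new ArrayList<>();
  private final Random random = new Random();

  /** Add a block that needs re-replication, exactly like a FIFO queue. */
  public synchronized void add(B block) {
    blocks.add(block);
  }

  /** Take up to n blocks in random order instead of insertion order. */
  public synchronized List<B> poll(int n) {
    Collections.shuffle(blocks, random);
    int take = Math.min(n, blocks.size());
    List<B> batch = new ArrayList<>(blocks.subList(0, take));
    blocks.subList(0, take).clear();
    return batch;
  }
}
{code}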

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order._. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> affect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc#testNamenodeMetrics is flaky

2019-09-02 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920648#comment-16920648
 ] 

Chen Zhang commented on HDFS-14654:
---

Thanks [~ayushtkn]. I've filed another Jira to track how to fix 
{{TestRouterRpc#testErasureCoding}}.

> RBF: TestRouterRpc#testNamenodeMetrics is flaky
> ---
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, 
> HDFS-14654.003.patch, HDFS-14654.004.patch, HDFS-14654.005.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14811) RBF: TestRouterRpc#testErasureCoding is flaky

2019-09-02 Thread Chen Zhang (Jira)
Chen Zhang created HDFS-14811:
-

 Summary: RBF: TestRouterRpc#testErasureCoding is flaky
 Key: HDFS-14811
 URL: https://issues.apache.org/jira/browse/HDFS-14811
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chen Zhang
Assignee: Chen Zhang


The Failed reason:

{code:java}
2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
Node /default-rack/127.0.0.1:53148 [
]
Node /default-rack/127.0.0.1:53161 [
]
Node /default-rack/127.0.0.1:53157 [
  Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 > 
2.6665).
Node /default-rack/127.0.0.1:53143 [
]
Node /default-rack/127.0.0.1:53165 [
]
2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas was 
chosen. Reason: {NODE_TOO_BUSY=1}
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) - 
Failed to place enough replicas: expected size is 1 but only 0 storage types 
can be selected (replication=6, selected=[], unavailable=[DISK], 
removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
required storage types are unavailable:  unavailableStorages=[DISK], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
port 53140, call Call#1270 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
java.io.IOException: File /testec/testfile2 could only be written to 5 of the 6 
required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 node(s) 
are excluded in this operation.
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
port 53197, call Call#1268 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
192.168.1.112:53201: java.io.IOException: File /testec/testfile2 could only be 
written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) 
running and 6 node(s) are excluded in this operation.
{code}

More discussion, see: 

[jira] [Comment Edited] (HDFS-14654) RBF: TestRouterRpc#testNamenodeMetrics is flaky

2019-09-01 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920438#comment-16920438
 ] 

Chen Zhang edited comment on HDFS-14654 at 9/1/19 4:27 PM:
---

BTW, the test {{testErasureCoding}} happened to fail again on my machine; we 
also encountered this failure in the penultimate build. The failure reason is 
that a node was too busy when allocating a block:
{code:java}
2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(838)) - [
Node /default-rack/127.0.0.1:53148 [
]
Node /default-rack/127.0.0.1:53161 [
]
Node /default-rack/127.0.0.1:53157 [
  Datanode 127.0.0.1:53157 is not chosen since the node is too busy (load: 3 > 
2.6665).
Node /default-rack/127.0.0.1:53143 [
]
Node /default-rack/127.0.0.1:53165 [
]
2019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas was 
chosen. Reason: {NODE_TOO_BUSY=1}
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) - 
Failed to place enough replicas: expected size is 1 but only 0 storage types 
can be selected (replication=6, selected=[], unavailable=[DISK], 
removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
required storage types are unavailable:  unavailableStorages=[DISK], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] INFO  
ipc.Server (Server.java:logException(2982)) - IPC Server handler 5 on default 
port 53140, call Call#1270 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 127.0.0.1:53202
java.io.IOException: File /testec/testfile2 could only be written to 5 of the 6 
required nodes for RS-6-3-1024k. There are 6 datanode(s) running and 6 node(s) 
are excluded in this operation.
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
2019-09-01 18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  
ipc.Server (Server.java:logException(2975)) - IPC Server handler 6 on default 
port 53197, call Call#1268 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
192.168.1.112:53201: java.io.IOException: File /testec/testfile2 could only be 
written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) 
running and 6 node(s) are excluded in this operation.
{code}
When we create an EC file with the 6+3 policy, it requires at least 6 blocks to 
succeed, but the num of 

[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc#testNamenodeMetrics is flaky

2019-09-01 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920438#comment-16920438
 ] 

Chen Zhang commented on HDFS-14654:
---

BTW, the test {{testErasureCoding}} happened to fail again on my machine; we 
also encountered this failure in the penultimate build. The failure reason is 
that a node was too busy when allocating a block:
{code:java}
019-09-01 18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(838)) - [019-09-01 18:19:20,940 
[IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(838)) - [Node 
/default-rack/127.0.0.1:53148 []Node /default-rack/127.0.0.1:53161 []Node 
/default-rack/127.0.0.1:53157 [  Datanode 127.0.0.1:53157 is not chosen since 
the node is too busy (load: 3 > 2.6665).Node 
/default-rack/127.0.0.1:53143 []Node /default-rack/127.0.0.1:53165 []2019-09-01 
18:19:20,940 [IPC Server handler 5 on default port 53140] INFO  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseRandom(846)) - Not enough replicas was 
chosen. Reason: {NODE_TOO_BUSY=1}2019-09-01 18:19:20,941 [IPC Server handler 5 
on default port 53140] WARN  blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
replicas, still in need of 1 to reach 6 (unavailableStorages=[], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) 
2019-09-01 18:19:20,941 [IPC Server handler 5 on default port 53140] WARN  
protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) - 
Failed to place enough replicas: expected size is 1 but only 0 storage types 
can be selected (replication=6, selected=[], unavailable=[DISK], 
removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]})2019-09-01 18:19:20,941 
[IPC Server handler 5 on default port 53140] WARN  
blockmanagement.BlockPlacementPolicy 
(BlockPlacementPolicyDefault.java:chooseTarget(449)) - Failed to place enough 
replicas, still in need of 1 to reach 6 (unavailableStorages=[DISK], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All 
required storage types are unavailable:  unavailableStorages=[DISK], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}2019-09-01 18:19:20,941 
[IPC Server handler 5 on default port 53140] INFO  ipc.Server 
(Server.java:logException(2982)) - IPC Server handler 5 on default port 53140, 
call Call#1270 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock 
from 127.0.0.1:53202java.io.IOException: File /testec/testfile2 could only be 
written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 datanode(s) 
running and 6 node(s) are excluded in this operation. at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:)
 at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2815)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:893)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:574)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)2019-09-01 
18:19:20,942 [IPC Server handler 6 on default port 53197] INFO  ipc.Server 
(Server.java:logException(2975)) - IPC Server handler 6 on default port 53197, 
call Call#1268 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock 
from 192.168.1.112:53201: java.io.IOException: File /testec/testfile2 could 
only be written to 5 of the 6 required nodes for RS-6-3-1024k. There are 6 
datanode(s) running and 6 node(s) are excluded in this operation.
{code}
When we create an EC file with the 6+3 policy, it requires at least 6 blocks to 
succeed, 

[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc#testNamenodeMetrics is flaky

2019-09-01 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920426#comment-16920426
 ] 

Chen Zhang commented on HDFS-14654:
---

Good catch, [~ayushtkn]. I added a finally block and uploaded patch v5.
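
For illustration only, here is a tiny, hypothetical sketch of the try/finally 
pattern being referred to; the real patch content is not shown here and all 
names below are made up.
{code:java}
// Hypothetical illustration of the try/finally cleanup pattern mentioned above.
// State changed by the test is restored even when an assertion in the middle
// throws, so a failure cannot leak state into later test cases.
public class FinallyCleanupExample {
  private int sharedSetting = 1; // stand-in for some piece of shared state

  public void testWithCleanup() {
    int original = sharedSetting;
    sharedSetting = original + 1; // the test mutates shared state
    try {
      // ... assertions that may throw AssertionError ...
    } finally {
      sharedSetting = original; // always restored, even on failure
    }
  }
}
{code}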

> RBF: TestRouterRpc#testNamenodeMetrics is flaky
> ---
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, 
> HDFS-14654.003.patch, HDFS-14654.004.patch, HDFS-14654.005.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14654) RBF: TestRouterRpc#testNamenodeMetrics is flaky

2019-09-01 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14654:
--
Attachment: HDFS-14654.005.patch

> RBF: TestRouterRpc#testNamenodeMetrics is flaky
> ---
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, 
> HDFS-14654.003.patch, HDFS-14654.004.patch, HDFS-14654.005.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920359#comment-16920359
 ] 

Chen Zhang edited comment on HDFS-13157 at 9/1/19 9:55 AM:
---

We observed the same problem on our production cluster about half a year ago. 
It's very easy to reproduce the issue when decommissioning a node with warm 
data: we use HDFS Raid to convert such data to erasure coding, so every block 
has only 1 replica and therefore only 1 source node to replicate the data from.

We observed the disk I/O utilization rise to 100% one disk at a time on the 
decommissioning node, which makes the decommission progress very slow.


was (Author: zhangchen):
We've observed same problem on our production cluster about half year ago, it's 
very easy to repro the issue when we decommissioning the node with warm data, 
we use HDFS Raid to convert these data to Erasure Coding, so every block have 
only 1 replica, so every block only have 1 source node to replicate the data.

We observed that the disks I/O utilization raise to 100% one by one on the 
decommissioning node, it make the decommission progress very slow.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order._. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> affect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920359#comment-16920359
 ] 

Chen Zhang edited comment on HDFS-13157 at 9/1/19 9:54 AM:
---

We observed the same problem on our production cluster about half a year ago. 
It's very easy to reproduce the issue when decommissioning a node with warm 
data: we use HDFS Raid to convert such data to erasure coding, so every block 
has only 1 replica and therefore only 1 source node to replicate the data from.

We observed the disk I/O utilization rise to 100% one disk at a time on the 
decommissioning node, which makes the decommission progress very slow.


was (Author: zhangchen):
We've observed same problem on our production cluster about half year ago, it's 
very to repro the issue when we decommissioning the node with warm data, we use 
HDFS Raid to convert these data to Erasure Coding, so every block have only 1 
replica, so every block only have 1 source node to replicate the data.

We observed that the disks I/O utilization raise to 100% one by one on the 
decommissioning node, it make the decommission progress very slow.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order._. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> affect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920359#comment-16920359
 ] 

Chen Zhang commented on HDFS-13157:
---

We observed the same problem on our production cluster about half a year ago; 
it's very easy to reproduce when decommissioning a node with warm data. We use 
HDFS Raid to convert this data to Erasure Coding, so every block has only one 
replica and therefore only one source node to replicate it from.

We observed the disk I/O utilization rise to 100% on one disk after another on 
the decommissioning node, which makes the decommission progress very slowly.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, and so on.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14711) RBF: RBFMetrics throws NullPointerException if stateStore disabled

2019-09-01 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920340#comment-16920340
 ] 

Chen Zhang commented on HDFS-14711:
---

Thanks [~ayushtkn] for the commit

> RBF: RBFMetrics throws NullPointerException if stateStore disabled
> --
>
> Key: HDFS-14711
> URL: https://issues.apache.org/jira/browse/HDFS-14711
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14711.001.patch, HDFS-14711.002.patch, 
> HDFS-14711.003.patch, HDFS-14711.004.patch, HDFS-14711.005.patch
>
>
> In the current implementation, if the \{{stateStore}} fails to initialize, we 
> only log an error message, but RBFMetrics actually cannot work normally in 
> that state (a small defensive-check sketch follows the stack trace below).
> {code:java}
> 2019-08-08 22:43:58,024 [qtp812446698-28] ERROR jmx.JMXJsonServlet 
> (JMXJsonServlet.java:writeAttribute(345)) - getting attribute FilesTotal of 
> Hadoop:service=NameNode,name=FSNamesystem-2 threw an exception
> javax.management.RuntimeMBeanException: java.lang.NullPointerException
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651)
> at 
> com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
> at 
> org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338)
> at org.apache.hadoop.jmx.JMXJsonServlet.listBeans(JMXJsonServlet.java:316)
> at org.apache.hadoop.jmx.JMXJsonServlet.doGet(JMXJsonServlet.java:210)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at 
> org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilter.doFilter(ProxyUserAuthenticationFilter.java:104)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:51)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:539)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> 
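As a rough illustration of the defensive check being discussed (the field and 
method shapes here are simplified assumptions, not the actual RBFMetrics 
members), the idea is that a metrics getter degrades to a default value when 
the state store was never initialized, instead of dereferencing a null store 
and surfacing the NullPointerException through JMX:

{code:java}
import java.util.Collections;
import java.util.Set;

/** Simplified sketch of the guard pattern; not the actual RBFMetrics code. */
class StateStoreGuardSketch {

  /** Stand-in for a state-store-backed record store; null when init failed. */
  private final Object membershipStore;

  StateStoreGuardSketch(Object membershipStore) {
    this.membershipStore = membershipStore;
  }

  /** Hypothetical getter: returns an empty result instead of throwing NPE. */
  Set<String> getNamespaceIds() {
    if (membershipStore == null) {
      // State store disabled or failed to initialize: degrade gracefully.
      return Collections.emptySet();
    }
    // ... query the store and build the real result here ...
    return Collections.singleton("ns0");
  }
}
{code}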

[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky

2019-08-31 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920263#comment-16920263
 ] 

Chen Zhang commented on HDFS-14654:
---

Hi [~ayushtkn], I tried switching the Jira status to open and back to patch 
available, but the Jenkins build was not triggered. Could you help? Thanks.

> RBF: TestRouterRpc tests are flaky
> --
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, 
> HDFS-14654.003.patch, HDFS-14654.004.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14711) RBF: RBFMetrics throws NullPointerException if stateStore disabled

2019-08-31 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920262#comment-16920262
 ] 

Chen Zhang commented on HDFS-14711:
---

Thanks [~ayushtkn] for your suggestion. I've gone through all the router tests 
that run with the state store disabled, and \{{TestRouterRpc}} is a good 
candidate.

Uploaded patch v5.
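
For context, a hedged sketch of how a router test configuration might run with 
the state store disabled; the key name is assumed to be 
{{dfs.federation.router.store.enable}} and should be verified against 
RBFConfigKeys in the source tree.

{code:java}
import org.apache.hadoop.conf.Configuration;

/** Sketch only: build a router test Configuration with the state store off. */
class NoStateStoreConfSketch {
  static Configuration routerConfWithoutStateStore() {
    Configuration conf = new Configuration(false);
    // Assumed key name; check RBFConfigKeys for the authoritative constant.
    conf.setBoolean("dfs.federation.router.store.enable", false);
    return conf;
  }
}
{code}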

> RBF: RBFMetrics throws NullPointerException if stateStore disabled
> --
>
> Key: HDFS-14711
> URL: https://issues.apache.org/jira/browse/HDFS-14711
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14711.001.patch, HDFS-14711.002.patch, 
> HDFS-14711.003.patch, HDFS-14711.004.patch, HDFS-14711.005.patch
>
>
> In the current implementation, if the \{{stateStore}} fails to initialize, we 
> only log an error message, but RBFMetrics actually cannot work normally in 
> that state.
> {code:java}
> 2019-08-08 22:43:58,024 [qtp812446698-28] ERROR jmx.JMXJsonServlet 
> (JMXJsonServlet.java:writeAttribute(345)) - getting attribute FilesTotal of 
> Hadoop:service=NameNode,name=FSNamesystem-2 threw an exception
> javax.management.RuntimeMBeanException: java.lang.NullPointerException
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651)
> at 
> com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
> at 
> org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338)
> at org.apache.hadoop.jmx.JMXJsonServlet.listBeans(JMXJsonServlet.java:316)
> at org.apache.hadoop.jmx.JMXJsonServlet.doGet(JMXJsonServlet.java:210)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at 
> org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilter.doFilter(ProxyUserAuthenticationFilter.java:104)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:51)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:539)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> 

[jira] [Updated] (HDFS-14711) RBF: RBFMetrics throws NullPointerException if stateStore disabled

2019-08-31 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14711:
--
Attachment: HDFS-14711.005.patch

> RBF: RBFMetrics throws NullPointerException if stateStore disabled
> --
>
> Key: HDFS-14711
> URL: https://issues.apache.org/jira/browse/HDFS-14711
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14711.001.patch, HDFS-14711.002.patch, 
> HDFS-14711.003.patch, HDFS-14711.004.patch, HDFS-14711.005.patch
>
>
> In the current implementation, if the \{{stateStore}} fails to initialize, we 
> only log an error message, but RBFMetrics actually cannot work normally in 
> that state.
> {code:java}
> 2019-08-08 22:43:58,024 [qtp812446698-28] ERROR jmx.JMXJsonServlet 
> (JMXJsonServlet.java:writeAttribute(345)) - getting attribute FilesTotal of 
> Hadoop:service=NameNode,name=FSNamesystem-2 threw an exception
> javax.management.RuntimeMBeanException: java.lang.NullPointerException
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651)
> at 
> com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
> at 
> org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338)
> at org.apache.hadoop.jmx.JMXJsonServlet.listBeans(JMXJsonServlet.java:316)
> at org.apache.hadoop.jmx.JMXJsonServlet.doGet(JMXJsonServlet.java:210)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at 
> org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilter.doFilter(ProxyUserAuthenticationFilter.java:104)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:51)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:539)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at 
> 

[jira] [Commented] (HDFS-14711) RBF: RBFMetrics throws NullPointerException if stateStore disabled

2019-08-31 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920176#comment-16920176
 ] 

Chen Zhang commented on HDFS-14711:
---

Thanks [~ayushtkn] for the review.
{quote}Here you can directly return rather than extracting into variable and 
then return, as you did in {{getNamespaceInfo}}.
{quote}
It's a little different from the case in {{getNamespaceInfo}}; the variable 
{{resultList}} will be used later.

Also added a UT in patch v4.
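
To make the distinction concrete, a toy illustration of the two shapes (the 
names are invented, not the RBFMetrics methods): a value that is only returned 
can be returned directly, while a {{resultList}}-style value that is consumed 
again afterwards justifies the local variable.

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of the two code shapes discussed above; names invented. */
class ReturnStyleSketch {

  // Shape 1: nothing else uses the value, so return the expression directly.
  static List<String> directReturn() {
    return new ArrayList<>();
  }

  // Shape 2: the value is used again after it is built (like resultList above),
  // so extracting it into a local variable first is justified.
  static int buildAndUse() {
    List<String> resultList = new ArrayList<>();
    resultList.add("entry");
    // resultList is consumed again here, so it cannot simply be returned.
    return resultList.size();
  }
}
{code}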

> RBF: RBFMetrics throws NullPointerException if stateStore disabled
> --
>
> Key: HDFS-14711
> URL: https://issues.apache.org/jira/browse/HDFS-14711
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14711.001.patch, HDFS-14711.002.patch, 
> HDFS-14711.003.patch, HDFS-14711.004.patch
>
>
> In the current implementation, if the \{{stateStore}} fails to initialize, we 
> only log an error message, but RBFMetrics actually cannot work normally in 
> that state.
> {code:java}
> 2019-08-08 22:43:58,024 [qtp812446698-28] ERROR jmx.JMXJsonServlet 
> (JMXJsonServlet.java:writeAttribute(345)) - getting attribute FilesTotal of 
> Hadoop:service=NameNode,name=FSNamesystem-2 threw an exception
> javax.management.RuntimeMBeanException: java.lang.NullPointerException
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651)
> at 
> com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
> at 
> org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338)
> at org.apache.hadoop.jmx.JMXJsonServlet.listBeans(JMXJsonServlet.java:316)
> at org.apache.hadoop.jmx.JMXJsonServlet.doGet(JMXJsonServlet.java:210)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at 
> org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilter.doFilter(ProxyUserAuthenticationFilter.java:104)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:51)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:539)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at 
> 
