[jira] [Updated] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.

2016-02-09 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9311:

Release Note: There is now support for offloading HA health check RPC 
activity to a separate RPC server endpoint running within the NameNode process. 
 This may improve reliability of HA health checks and prevent spurious 
failovers in highly overloaded conditions.  For more details, please refer to 
the hdfs-default.xml documentation for properties 
dfs.namenode.lifeline.rpc-address, dfs.namenode.lifeline.rpc-bind-host and 
dfs.namenode.lifeline.handler.count.

> Support optional offload of NameNode HA service health checks to a separate 
> RPC server.
> ---
>
> Key: HDFS-9311
> URL: https://issues.apache.org/jira/browse/HDFS-9311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: HDFS-9311.001.patch, HDFS-9311.002.patch, 
> HDFS-9311.003.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion 
> of the RPC handler pools (both client-facing and service-facing).  
> Eventually, this blocks the health check RPC issued from ZKFC, which triggers 
> a failover.  Depending on fencing configuration, the former active NameNode 
> may be killed.  In an overloaded situation, the new active NameNode is likely 
> to suffer the same fate, because client load patterns don't change after the 
> failover.  This can degenerate into flapping between the 2 NameNodes without 
> real recovery.  If a NameNode had been killed by fencing, then it would have 
> to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for 
> isolating the HA health checks.  These health checks are lightweight 
> operations that do not suffer from contention issues on the namesystem lock 
> or other shared resources.  Isolating the RPC handlers is sufficient to avoid 
> this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.

2015-10-28 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9311:

Issue Type: Improvement  (was: Bug)

> Support optional offload of NameNode HA service health checks to a separate 
> RPC server.
> ---
>
> Key: HDFS-9311
> URL: https://issues.apache.org/jira/browse/HDFS-9311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-9311.001.patch, HDFS-9311.002.patch, 
> HDFS-9311.003.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion 
> of the RPC handler pools (both client-facing and service-facing).  
> Eventually, this blocks the health check RPC issued from ZKFC, which triggers 
> a failover.  Depending on fencing configuration, the former active NameNode 
> may be killed.  In an overloaded situation, the new active NameNode is likely 
> to suffer the same fate, because client load patterns don't change after the 
> failover.  This can degenerate into flapping between the 2 NameNodes without 
> real recovery.  If a NameNode had been killed by fencing, then it would have 
> to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for 
> isolating the HA health checks.  These health checks are lightweight 
> operations that do not suffer from contention issues on the namesystem lock 
> or other shared resources.  Isolating the RPC handlers is sufficient to avoid 
> this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.

2015-10-28 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9311:

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Jitendra, thank you for the code review.  I cannot repro the test failures from 
the last Jenkins run.  The whitespace check flagged on lines that are unchanged 
in my patch, but I took its advice anyway and used {{git apply 
--whitespace=fix}}.

I have committed this to trunk and branch-2.

> Support optional offload of NameNode HA service health checks to a separate 
> RPC server.
> ---
>
> Key: HDFS-9311
> URL: https://issues.apache.org/jira/browse/HDFS-9311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Fix For: 2.8.0
>
> Attachments: HDFS-9311.001.patch, HDFS-9311.002.patch, 
> HDFS-9311.003.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion 
> of the RPC handler pools (both client-facing and service-facing).  
> Eventually, this blocks the health check RPC issued from ZKFC, which triggers 
> a failover.  Depending on fencing configuration, the former active NameNode 
> may be killed.  In an overloaded situation, the new active NameNode is likely 
> to suffer the same fate, because client load patterns don't change after the 
> failover.  This can degenerate into flapping between the 2 NameNodes without 
> real recovery.  If a NameNode had been killed by fencing, then it would have 
> to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for 
> isolating the HA health checks.  These health checks are lightweight 
> operations that do not suffer from contention issues on the namesystem lock 
> or other shared resources.  Isolating the RPC handlers is sufficient to avoid 
> this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.

2015-10-27 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9311:

Attachment: HDFS-9311.002.patch

[~jnp], thank you for the review.  I agree.  The handler ratio was a holdover 
from splitting away from the HDFS-9239 patch.  For the scope of this issue, we 
don't need that many threads.

I'm uploading patch v002 with these changes:
# Adjusted the threads as per Jitendra's feedback.
# Corrected some mocking that was causing failures in 
{{TestZKFailoverController}} and {{TestZKFailoverControllerStress}} in the last 
Jenkins run.
# Corrected Checkstyle and whitespace warning.

> Support optional offload of NameNode HA service health checks to a separate 
> RPC server.
> ---
>
> Key: HDFS-9311
> URL: https://issues.apache.org/jira/browse/HDFS-9311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-9311.001.patch, HDFS-9311.002.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion 
> of the RPC handler pools (both client-facing and service-facing).  
> Eventually, this blocks the health check RPC issued from ZKFC, which triggers 
> a failover.  Depending on fencing configuration, the former active NameNode 
> may be killed.  In an overloaded situation, the new active NameNode is likely 
> to suffer the same fate, because client load patterns don't change after the 
> failover.  This can degenerate into flapping between the 2 NameNodes without 
> real recovery.  If a NameNode had been killed by fencing, then it would have 
> to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for 
> isolating the HA health checks.  These health checks are lightweight 
> operations that do not suffer from contention issues on the namesystem lock 
> or other shared resources.  Isolating the RPC handlers is sufficient to avoid 
> this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.

2015-10-27 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9311:

Attachment: HDFS-9311.003.patch

I'm attaching patch v003.  The only difference compared to v002 is the log 
message in {{NameNode#setRpcLifelineServerAddress}}.  The message text had been 
talking about DataNodes.  It was another holdover from splitting this out from 
the HDFS-9239 work.  I generalized the text to "Setting lifeline RPC address".

> Support optional offload of NameNode HA service health checks to a separate 
> RPC server.
> ---
>
> Key: HDFS-9311
> URL: https://issues.apache.org/jira/browse/HDFS-9311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-9311.001.patch, HDFS-9311.002.patch, 
> HDFS-9311.003.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion 
> of the RPC handler pools (both client-facing and service-facing).  
> Eventually, this blocks the health check RPC issued from ZKFC, which triggers 
> a failover.  Depending on fencing configuration, the former active NameNode 
> may be killed.  In an overloaded situation, the new active NameNode is likely 
> to suffer the same fate, because client load patterns don't change after the 
> failover.  This can degenerate into flapping between the 2 NameNodes without 
> real recovery.  If a NameNode had been killed by fencing, then it would have 
> to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for 
> isolating the HA health checks.  These health checks are lightweight 
> operations that do not suffer from contention issues on the namesystem lock 
> or other shared resources.  Isolating the RPC handlers is sufficient to avoid 
> this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.

2015-10-26 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9311:

Status: Patch Available  (was: Open)

> Support optional offload of NameNode HA service health checks to a separate 
> RPC server.
> ---
>
> Key: HDFS-9311
> URL: https://issues.apache.org/jira/browse/HDFS-9311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-9311.001.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion 
> of the RPC handler pools (both client-facing and service-facing).  
> Eventually, this blocks the health check RPC issued from ZKFC, which triggers 
> a failover.  Depending on fencing configuration, the former active NameNode 
> may be killed.  In an overloaded situation, the new active NameNode is likely 
> to suffer the same fate, because client load patterns don't change after the 
> failover.  This can degenerate into flapping between the 2 NameNodes without 
> real recovery.  If a NameNode had been killed by fencing, then it would have 
> to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for 
> isolating the HA health checks.  These health checks are lightweight 
> operations that do not suffer from contention issues on the namesystem lock 
> or other shared resources.  Isolating the RPC handlers is sufficient to avoid 
> this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.

2015-10-26 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-9311:

Attachment: HDFS-9311.001.patch

This is similar in spirit to what I proposed for DataNode liveness reporting in 
HDFS-9239.  I decided to tackle this one first, because the implementation is 
simpler and can lay the groundwork for HDFS-9239.  I have attached patch v001.

> Support optional offload of NameNode HA service health checks to a separate 
> RPC server.
> ---
>
> Key: HDFS-9311
> URL: https://issues.apache.org/jira/browse/HDFS-9311
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, namenode
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
> Attachments: HDFS-9311.001.patch
>
>
> When a NameNode is overwhelmed with load, it can lead to resource exhaustion 
> of the RPC handler pools (both client-facing and service-facing).  
> Eventually, this blocks the health check RPC issued from ZKFC, which triggers 
> a failover.  Depending on fencing configuration, the former active NameNode 
> may be killed.  In an overloaded situation, the new active NameNode is likely 
> to suffer the same fate, because client load patterns don't change after the 
> failover.  This can degenerate into flapping between the 2 NameNodes without 
> real recovery.  If a NameNode had been killed by fencing, then it would have 
> to transition through safe mode, further delaying time to recovery.
> This issue proposes a separate, optional RPC server at the NameNode for 
> isolating the HA health checks.  These health checks are lightweight 
> operations that do not suffer from contention issues on the namesystem lock 
> or other shared resources.  Isolating the RPC handlers is sufficient to avoid 
> this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)