[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
    Fix Version/s:     (was: 1.5.0)

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 3.0.0, 2.2.0, 1.4.8, 2.1.1
>
>         Attachments: HBASE-18549-.master.001.patch,
> HBASE-18549-.master.002.patch, HBASE-18549-.master.003.patch,
> HBASE-18549-.master.004.patch, HBASE-18549.branch-1.001.patch,
> HBASE-18549.branch-1.001.patch
>
>
> We have repeatedly come across situations where a ZooKeeper issue causes
> NodeFailoverWorker to silently fail to pick up the replication queues of a
> dead region server. One example is when the znode size for a particular
> queue exceeds the jute.maxBuffer value.
> Other situations may lead to the same outcome and likewise go undetected.
> We need a metric for the number of unclaimed replication queues. This
> will help mitigate the problem by allowing alerts on the metric and by
> helping identify the underlying issues.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
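The metric requested in the description could be sketched roughly as follows. This is a minimal, hypothetical illustration in plain Java, not the code from the attached patches: the class UnclaimedQueueTracker and its methods are invented here and are not part of HBase. The assumed idea is to mark a dead region server's queues as unclaimed as soon as the death is observed and to clear the mark only after a successful claim, so that a failover attempt that fails silently still leaves the gauge above zero.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker for unclaimed replication queues; not an HBase class. */
final class UnclaimedQueueTracker {
    // Dead region servers whose replication queues no live server has claimed yet.
    private final Set<String> unclaimed = ConcurrentHashMap.newKeySet();

    /** Called when a region server is observed to have died. */
    void serverDied(String deadServer) {
        unclaimed.add(deadServer);
    }

    /** Called only after a failover worker has successfully taken over the queues. */
    void claimSucceeded(String deadServer) {
        unclaimed.remove(deadServer);
    }

    /** Gauge value to export; alert when it stays above zero. */
    int unclaimedCount() {
        return unclaimed.size();
    }

    public static void main(String[] args) {
        UnclaimedQueueTracker tracker = new UnclaimedQueueTracker();
        tracker.serverDied("rs-a");
        tracker.serverDied("rs-b");
        // The claim for rs-a succeeds; the claim for rs-b is assumed to fail
        // silently (for example, an oversized znode), so it is never cleared.
        tracker.claimSucceeded("rs-a");
        System.out.println("unclaimed queues: " + tracker.unclaimedCount());
    }
}
```

Because the mark is cleared only on success, any failure path, including ones nobody anticipated, shows up in the count without needing its own error handling.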
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s:     (was: 1.3.3)
                   2.1.1
                   2.2.0
                   3.0.0
           Status: Resolved  (was: Patch Available)

Committed. Thanks for taking this one on [~xucang]

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 3.0.0, 1.5.0, 2.2.0, 1.4.8, 2.1.1
>
>         Attachments: HBASE-18549-.master.001.patch,
> HBASE-18549-.master.002.patch, HBASE-18549-.master.003.patch,
> HBASE-18549-.master.004.patch, HBASE-18549.branch-1.001.patch,
> HBASE-18549.branch-1.001.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment: HBASE-18549.branch-1.001.patch

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch,
> HBASE-18549-.master.002.patch, HBASE-18549-.master.003.patch,
> HBASE-18549-.master.004.patch, HBASE-18549.branch-1.001.patch,
> HBASE-18549.branch-1.001.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment: HBASE-18549.branch-1.001.patch

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch,
> HBASE-18549-.master.002.patch, HBASE-18549-.master.003.patch,
> HBASE-18549-.master.004.patch, HBASE-18549.branch-1.001.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment: HBASE-18549-.master.004.patch

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch,
> HBASE-18549-.master.002.patch, HBASE-18549-.master.003.patch,
> HBASE-18549-.master.004.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment: HBASE-18549-.master.003.patch

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch,
> HBASE-18549-.master.002.patch, HBASE-18549-.master.003.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment: HBASE-18549-.master.002.patch

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch,
> HBASE-18549-.master.002.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment:     (was: HBASE-18549-.master.001.patch)

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment: HBASE-18549-.master.001.patch
        Status: Patch Available  (was: Open)

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment: HBASE-18549-.master.001.patch

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment:     (was: HBASE-18549-.master.wip.patch)

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.001.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xu Cang updated HBASE-18549:
    Attachment: HBASE-18549-.master.wip.patch

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Xu Cang
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
>
>         Attachments: HBASE-18549-.master.wip.patch
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
    Fix Version/s:     (was: 1.4.7)
                   1.4.8

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.8
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
    Fix Version/s:     (was: 1.4.6)
                   1.4.7

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.7
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
    Fix Version/s:     (was: 1.4.4)
                   1.4.5

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.5
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
    Fix Version/s:     (was: 1.4.3)
                   1.4.4

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.4
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francis Liu updated HBASE-18549:
    Fix Version/s:     (was: 1.3.2)
                   1.3.3

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Priority: Critical
>             Fix For: 1.5.0, 1.3.3, 1.4.3
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
    Fix Version/s:     (was: 1.4.2)
                   1.4.3

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Priority: Critical
>             Fix For: 1.3.2, 1.5.0, 1.4.3
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
    Fix Version/s:     (was: 1.4.1)
                   1.4.2

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Priority: Critical
>             Fix For: 1.3.2, 1.5.0, 1.4.2
[jira] [Updated] (HBASE-18549) Unclaimed replication queues can go undetected
[ https://issues.apache.org/jira/browse/HBASE-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell updated HBASE-18549:
-----------------------------------
    Fix Version/s: 1.5.0
                   1.4.1

> Unclaimed replication queues can go undetected
> ----------------------------------------------
>
>                 Key: HBASE-18549
>                 URL: https://issues.apache.org/jira/browse/HBASE-18549
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Priority: Critical
>             Fix For: 1.3.2, 1.4.1, 1.5.0

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)