[jira] [Commented] (GEODE-5385) hang trying to create a bucket
[ https://issues.apache.org/jira/browse/GEODE-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535438#comment-16535438 ]

ASF subversion and git services commented on GEODE-5385:

Commit dbdbd7a83f70340ab3a8b76823b04780c0b29430 in geode's branch refs/heads/feature/GEODE-QueryProvider from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=dbdbd7a ]

GEODE-5385: hang trying to create a bucket

We now look for a ForceReattemptException when destroying a partitioned
region. This prevents a region ID skew that can occur if another node is
still initializing its region and is not yet ready to destroy it.

I've reenabled the PRSanityCheckMessage that watches for skews like this
and reports them. This used to be enabled by default but somehow was
disabled a long time ago.

This closes #2109

> hang trying to create a bucket
> ------------------------------
>
>                 Key: GEODE-5385
>                 URL: https://issues.apache.org/jira/browse/GEODE-5385
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> It's possible for partitioned region bucket allocation to hang even though
> there appears to be plenty of storage available. This can happen if one
> server is creating the partitioned region at the same time the region is
> being destroyed by another server.
> The server creating the partitioned region will send a
> ForceReattemptException back to the server destroying the region and that
> exception is ignored. The server creating the PR will then be stuck with a
> region having a dangling ID that has been removed from the PR metadata
> region. If another server then recreates the PR it will assign a new ID to
> it and the servers will have skewed IDs. The IDs are sent in partitioned
> region messages such as manage-bucket.
> The distribution advisors don't recognize that there is a skew and our logs
> show nothing about it because a safety mechanism was accidentally turned off
> by an engineer in PRSanityCheckMessage. This message performs a check of the
> IDs in the servers to make sure they're consistent.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
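
The ID consistency check described in the issue can be pictured with a small sketch. This is not Geode's PRSanityCheckMessage and it uses no Geode API: the class name RegionIdSkewCheck, the method findSkewedMembers, the member names and the integer IDs are all assumptions made for illustration. It only shows the idea of comparing each server's local region ID with the ID recorded in the metadata and reporting the ones that differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not Geode's PRSanityCheckMessage. */
final class RegionIdSkewCheck {

  /**
   * Returns the members whose local ID for the region differs from the ID
   * recorded in the (assumed) metadata view, keyed by member name.
   */
  static Map<String, Integer> findSkewedMembers(int metadataId,
                                                Map<String, Integer> idByMember) {
    Map<String, Integer> skewed = new TreeMap<>();
    for (Map.Entry<String, Integer> entry : idByMember.entrySet()) {
      if (entry.getValue() != metadataId) {
        skewed.put(entry.getKey(), entry.getValue());
      }
    }
    return skewed;
  }

  public static void main(String[] args) {
    // Assumed scenario: server-3 kept a dangling ID (5) after the region
    // was destroyed and recreated elsewhere with a new ID (7).
    Map<String, Integer> ids = new LinkedHashMap<>();
    ids.put("server-1", 7);
    ids.put("server-2", 7);
    ids.put("server-3", 5);
    System.out.println("skewed members: " + findSkewedMembers(7, ids));
  }
}
```

An empty result means every member agrees with the metadata; any entry in the result is a member whose messages would carry a stale ID.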
[jira] [Commented] (GEODE-5385) hang trying to create a bucket
[ https://issues.apache.org/jira/browse/GEODE-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535167#comment-16535167 ]

ASF subversion and git services commented on GEODE-5385:

Commit dbdbd7a83f70340ab3a8b76823b04780c0b29430 in geode's branch refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=dbdbd7a ]

GEODE-5385: hang trying to create a bucket

We now look for a ForceReattemptException when destroying a partitioned
region. This prevents a region ID skew that can occur if another node is
still initializing its region and is not yet ready to destroy it.

I've reenabled the PRSanityCheckMessage that watches for skews like this
and reports them. This used to be enabled by default but somehow was
disabled a long time ago.

This closes #2109

> hang trying to create a bucket
> ------------------------------
>
>                 Key: GEODE-5385
>                 URL: https://issues.apache.org/jira/browse/GEODE-5385
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> It's possible for partitioned region bucket allocation to hang even though
> there appears to be plenty of storage available. This can happen if one
> server is creating the partitioned region at the same time the region is
> being destroyed by another server.
> The server creating the partitioned region will send a
> ForceReattemptException back to the server destroying the region and that
> exception is ignored. The server creating the PR will then be stuck with a
> region having a dangling ID that has been removed from the PR metadata
> region. If another server then recreates the PR it will assign a new ID to
> it and the servers will have skewed IDs. The IDs are sent in partitioned
> region messages such as manage-bucket.
> The distribution advisors don't recognize that there is a skew and our logs
> show nothing about it because a safety mechanism was accidentally turned off
> by an engineer in PRSanityCheckMessage. This message performs a check of the
> IDs in the servers to make sure they're consistent.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (GEODE-5385) hang trying to create a bucket
[ https://issues.apache.org/jira/browse/GEODE-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534226#comment-16534226 ]

ASF subversion and git services commented on GEODE-5385:

Commit 86b7bd950a21065cab15973261231298b52d1f6f in geode's branch refs/heads/feature/GEODE-5385 from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=86b7bd9 ]

GEODE-5385: hang trying to create a bucket

We now look for a ForceReattemptException when destroying a partitioned
region. This prevents a region ID skew that can occur if another node is
still initializing its region and is not yet ready to destroy it.

I've reenabled the PRSanityCheckMessage that watches for skews like this
and reports them. This used to be enabled by default but somehow was
disabled a long time ago.

> hang trying to create a bucket
> ------------------------------
>
>                 Key: GEODE-5385
>                 URL: https://issues.apache.org/jira/browse/GEODE-5385
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Bruce Schuchardt
>            Priority: Major
>
> It's possible for partitioned region bucket allocation to hang even though
> there appears to be plenty of storage available. This can happen if one
> server is creating the partitioned region at the same time the region is
> being destroyed by another server.
> The server creating the partitioned region will send a
> ForceReattemptException back to the server destroying the region and that
> exception is ignored. The server creating the PR will then be stuck with a
> region having a dangling ID that has been removed from the PR metadata
> region. If another server then recreates the PR it will assign a new ID to
> it and the servers will have skewed IDs. The IDs are sent in partitioned
> region messages such as manage-bucket.
> The distribution advisors don't recognize that there is a skew and our logs
> show nothing about it because a safety mechanism was accidentally turned off
> by an engineer in PRSanityCheckMessage. This message performs a check of the
> IDs in the servers to make sure they're consistent.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
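
The commit message says the destroy path now looks for a ForceReattemptException instead of ignoring it. One plausible way to act on such a reply is to retry the destroy on members that were not ready; whether Geode does exactly this is an assumption, since the message does not say. The sketch below uses no Geode classes: ReattemptException is a hypothetical stand-in for ForceReattemptException, and DestroyCall, destroyEverywhere and the member names are likewise invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not Geode's region destroy code. */
final class DestroyWithReattempt {

  /** Stand-in for a "not ready, try again" reply from a member. */
  static final class ReattemptException extends Exception {
    ReattemptException(String message) {
      super(message);
    }
  }

  /** Assumed per-member destroy operation. */
  interface DestroyCall {
    void destroyOn(String member) throws ReattemptException;
  }

  /**
   * Sends the destroy to every member and retries those that asked for a
   * reattempt, up to maxAttempts rounds. Returns the members that still
   * had not completed the destroy, so the caller can report them rather
   * than silently leaving a dangling region ID behind.
   */
  static List<String> destroyEverywhere(List<String> members,
                                        DestroyCall call,
                                        int maxAttempts) {
    List<String> pending = new ArrayList<>(members);
    for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
      List<String> retry = new ArrayList<>();
      for (String member : pending) {
        try {
          call.destroyOn(member);
        } catch (ReattemptException e) {
          retry.add(member); // do not ignore: remember and try again
        }
      }
      pending = retry;
    }
    return pending;
  }

  public static void main(String[] args) {
    // Assumed scenario: server-2 is still initializing on the first attempt.
    Map<String, Integer> calls = new HashMap<>();
    DestroyCall call = member -> {
      int n = calls.merge(member, 1, Integer::sum);
      if (member.equals("server-2") && n == 1) {
        throw new ReattemptException("region still initializing");
      }
    };
    List<String> left = destroyEverywhere(Arrays.asList("server-1", "server-2"), call, 3);
    System.out.println("members not destroyed: " + left);
  }
}
```

Returning the unfinished members, rather than swallowing the exception, is the design point: the caller always learns when a member may still hold the old region ID.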