[jira] [Commented] (GEODE-5385) hang trying to create a bucket

2018-07-06 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/GEODE-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535438#comment-16535438
 ] 

ASF subversion and git services commented on GEODE-5385:


Commit dbdbd7a83f70340ab3a8b76823b04780c0b29430 in geode's branch 
refs/heads/feature/GEODE-QueryProvider from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=dbdbd7a ]

GEODE-5385: hang trying to create a bucket

We now look for a ForceReattemptException when destroying a partitioned
region.  This prevents a region ID skew that can occur if another node
is still initializing its region and is not yet ready to destroy it.
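
The handling described above can be sketched roughly as follows. This is
not the code from the commit. It is a minimal, self-contained illustration
that assumes the destroying member re-sends the destroy request to any peer
that answers with a reattempt reply. ReattemptException, Peer, SlowPeer and
destroyOnAll are hypothetical names used only for this sketch; they are not
Geode classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

final class DestroyRetrySketch {
  /** Hypothetical stand-in for Geode's ForceReattemptException. */
  static final class ReattemptException extends Exception {
    ReattemptException(String message) {
      super(message);
    }
  }

  /** Hypothetical peer that holds a copy of the partitioned region. */
  interface Peer {
    void destroyRegion(String regionName) throws ReattemptException;
  }

  /** A peer that is still initializing for its first few destroy requests. */
  static final class SlowPeer implements Peer {
    private int notReadyReplies;
    boolean destroyed;

    SlowPeer(int notReadyReplies) {
      this.notReadyReplies = notReadyReplies;
    }

    @Override
    public void destroyRegion(String regionName) throws ReattemptException {
      if (notReadyReplies > 0) {
        notReadyReplies--;
        throw new ReattemptException("still initializing " + regionName);
      }
      destroyed = true;
    }
  }

  /**
   * Sends the destroy request to every peer and re-sends it to peers that
   * replied with a reattempt exception, instead of ignoring that reply.
   * Returns the peers that still had not destroyed the region after
   * maxAttempts rounds.
   */
  static List<Peer> destroyOnAll(List<Peer> peers, String regionName, int maxAttempts) {
    List<Peer> pending = new ArrayList<>(peers);
    for (int attempt = 0; attempt < maxAttempts && !pending.isEmpty(); attempt++) {
      List<Peer> retry = new ArrayList<>();
      for (Peer peer : pending) {
        try {
          peer.destroyRegion(regionName);
        } catch (ReattemptException e) {
          // The peer is not ready yet: remember it rather than dropping the reply.
          retry.add(peer);
        }
      }
      pending = retry;
    }
    return pending;
  }

  public static void main(String[] args) {
    List<Peer> peers = new ArrayList<>();
    peers.add(new SlowPeer(0));
    peers.add(new SlowPeer(2));
    List<Peer> unfinished = destroyOnAll(peers, "exampleRegion", 5);
    System.out.println("peers that never destroyed the region: " + unfinished.size());
  }
}
```

Returning the unfinished peers lets the caller decide whether to keep
waiting or report the problem, rather than leaving a region with a dangling
ID behind silently.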

I've reenabled the PRSanityCheckMessage that watches for skews like this
and reports them.  This used to be enabled by default but somehow was
disabled a long time ago.

This closes #2109


> hang trying to create a bucket
> --
>
> Key: GEODE-5385
> URL: https://issues.apache.org/jira/browse/GEODE-5385
> Project: Geode
> Issue Type: Bug
> Reporter: Bruce Schuchardt
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.7.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> It's possible for partitioned region bucket allocation to hang even though
> there appears to be plenty of storage available.  This can happen if one
> server is creating the partitioned region at the same time the region is
> being destroyed by another server.
> The server creating the partitioned region will send a
> ForceReattemptException back to the server destroying the region, and that
> exception is ignored.  The server creating the PR will then be stuck with a
> region having a dangling ID that has been removed from the PR metadata
> region.  If another server then recreates the PR, it will assign a new ID to
> it and the servers will have skewed IDs.  The IDs are sent in partitioned
> region messages such as manage-bucket.
> The distribution advisors don't recognize that there is a skew, and our logs
> show nothing about it because PRSanityCheckMessage, a safety mechanism, was
> accidentally turned off by an engineer.  This message performs a check of
> the IDs in the servers to make sure they're consistent.
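
The kind of consistency check described in the issue can be illustrated
with a small sketch. It is not the PRSanityCheckMessage implementation. It
only assumes that each server reports the numeric ID it holds for a
partitioned region and that the PR metadata region holds the authoritative
ID. RegionIdSkewSketch, findSkewedServers and the server names are
hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class RegionIdSkewSketch {
  /**
   * Compares the ID each server holds for a partitioned region with the ID
   * recorded in the metadata and returns the servers that disagree, keyed
   * by server name.
   */
  static Map<String, Integer> findSkewedServers(int metadataId, Map<String, Integer> idByServer) {
    Map<String, Integer> skewed = new TreeMap<>();
    for (Map.Entry<String, Integer> entry : idByServer.entrySet()) {
      if (entry.getValue().intValue() != metadataId) {
        skewed.put(entry.getKey(), entry.getValue());
      }
    }
    return skewed;
  }

  public static void main(String[] args) {
    // serverA kept a dangling ID from before the destroy; serverB recreated
    // the region and was assigned a new ID, which the metadata now records.
    Map<String, Integer> idByServer = new LinkedHashMap<>();
    idByServer.put("serverA", 7);
    idByServer.put("serverB", 8);
    System.out.println("skewed servers: " + findSkewedServers(8, idByServer));
  }
}
```

A message such as manage-bucket that carries one ID would not match the
region on a server holding the other, which is why reporting the mismatch
in the logs helps diagnose the hang.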



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (GEODE-5385) hang trying to create a bucket

2018-07-06 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/GEODE-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535167#comment-16535167
 ] 

ASF subversion and git services commented on GEODE-5385:


Commit dbdbd7a83f70340ab3a8b76823b04780c0b29430 in geode's branch 
refs/heads/develop from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=dbdbd7a ]

GEODE-5385: hang trying to create a bucket

We now look for a ForceReattemptException when destroying a partitioned
region.  This prevents a region ID skew that can occur if another node
is still initializing its region and is not yet ready to destroy it.

I've reenabled the PRSanityCheckMessage that watches for skews like this
and reports them.  This used to be enabled by default but somehow was
disabled a long time ago.

This closes #2109




[jira] [Commented] (GEODE-5385) hang trying to create a bucket

2018-07-05 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/GEODE-5385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534226#comment-16534226
 ] 

ASF subversion and git services commented on GEODE-5385:


Commit 86b7bd950a21065cab15973261231298b52d1f6f in geode's branch 
refs/heads/feature/GEODE-5385 from [~bschuchardt]
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=86b7bd9 ]

GEODE-5385: hang trying to create a bucket

We now look for a ForceReattemptException when destroying a partitioned
region.  This prevents a region ID skew that can occur if another node
is still initializing its region and is not yet ready to destroy it.

I've reenabled the PRSanityCheckMessage that watches for skews like this
and reports them.  This used to be enabled by default but somehow was
disabled a long time ago.

