Yeah, this problem is also mentioned in that issue 645.

-Flavio

> On 13 Feb 2018, at 23:03, Kathryn Hogg <kathryn.h...@oati.net> wrote:
> 
> Thanks Flavio,
> 
> I'm debugging this and found another issue in the code in 
> findPrefixInChildren()
> 
>        private void findPrefixInChildren(String prefix, ZooKeeper zookeeper, 
> String dir) 
>            throws KeeperException, InterruptedException {
>            List<String> names = zookeeper.getChildren(dir, false);
>            for (String name : names) {
>                if (name.startsWith(prefix)) {
>                    id = name;  /*   THIS DOES NOT HAVE THE FULL PATH */
>                    if (LOG.isDebugEnabled()) {
>                        LOG.debug("Found id created last time: " + id);
>                    }
>                    break;
>                }
>            }
>            if (id == null) {
>                id = zookeeper.create(dir + "/" + prefix, data, 
>                        getAcl(), EPHEMERAL_SEQUENTIAL);   /* THIS HAS THE 
> FULL PATH */
> 
>                if (LOG.isDebugEnabled()) {
>                    LOG.debug("Created id: " + id);
>                }
>            }
> 
>        }
> 
> If we find the node in the children, we set id to x-$session-$sequence.  If 
> we create the znode, id is set to $dir/x-$session-$sequence
> 
> I believe the first case should be 
>  Id = dir + "/" + name;
> 
> -----Original Message-----
> From: Flavio Junqueira [mailto:f...@apache.org] 
> Sent: Tuesday, February 13, 2018 3:58 PM
> To: user@zookeeper.apache.org
> Subject: Re: Bug in WriteLock recipe
> 
> {External email message: This email is from an external source. Please 
> exercise caution prior to opening attachments, clicking on links, or 
> providing any sensitive information.}
> 
> You are right that there is a race between getting the children and checking 
> whether the predecessor is there. If we fail to set a watcher, then we won't 
> try locking again. I think setting id to null will work because it forces 
> another iteration of the do/while loop, which checks whether there is still 
> some predecessor, setting a watcher accordingly. The number of iterations 
> should be finite because we eventually hit the owned znode.
> 
> It might better to add a different condition for clarity, like `while (Id == 
> null || watcherNotSet)`. In any case, I'd appreciate if you could chime in 
> and contribute your changes to 
> https://issues.apache.org/jira/browse/ZOOKEEPER-645 
> <https://issues.apache.org/jira/browse/ZOOKEEPER-645>.
> 
> -Flavio
> 
>> On 13 Feb 2018, at 22:01, Kathryn Hogg <kathryn.h...@oati.net> wrote:
>> 
>> Hey Flavio,
>> 
>> FYI, I'm on 3.4.11 if that makes a difference.
>> 
>> The problem is
>>     1. In order to get to the exists() call, we've already determined that 
>> id is not null.
>>     2. After stat returns null, we log a message but do not reset stat to 
>> null.
>>     3.  The while loop will terminate because id is not null
>> 
>> I think this can fixed by setting id to null if stat returns null. This is 
>> similar to what is done when lessThanMe is not Empty.
>> 
>> Here's an outline of the code I'm talking about (my suggested change is 
>> after the LOG.warn()):
>> 
>> do {
>>  if (id == null) {
>>      ....
>> }
>> If (id != null) {
>>    if (names.IsEmpty()) {
>>        ...
>>        Id = null;
>>    } else {
>>        ....
>>       If (!lessThanMe.isEmpty())
>>                           Stat stat = zookeeper.exists(lastChildId, new 
>> LockWatcher());
>>                           if (stat != null) {
>>                               return Boolean.FALSE;
>>                           } else {
>>                               LOG.warn("Could not find the" +
>>                                              " stats for less than me: " + 
>> lastChildName.getName());
>>                                /* id = null; */  // id is not null here so 
>> the while loop will terminate 
>>                           }
>>      } else {
>>            if (isOwner()) {
>>                     if (callback != null) {
>>                          callback.lockAcquired();
>>                    }
>>                     return Boolean.TRUE;
>>           }
>>       }
>>   }
>> }
>> }
>> while (id == null);
>> 
>> 
>> -----Original Message-----
>> From: Flavio Junqueira [mailto:f...@apache.org] 
>> Sent: Tuesday, February 13, 2018 2:45 PM
>> To: user@zookeeper.apache.org
>> Subject: Re: Bug in WriteLock recipe
>> 
>> {External email message: This email is from an external source. Please 
>> exercise caution prior to opening attachments, clicking on links, or 
>> providing any sensitive information.}
>> 
>> Hi Kathryn,
>> 
>> Every time that execute method is invoked, it will get children. From your 
>> description, in the case the predecessor node is deleted and stat is null, 
>> the next call will not contain that predecessor znode. Consequently, it 
>> won't happen indefinitely. Makes sense?
>> 
>> There is actually an old issue about WriteLock that you may want to have a 
>> look:
>> 
>> https://issues.apache.org/jira/browse/ZOOKEEPER-645 
>> <https://issues.apache.org/jira/browse/ZOOKEEPER-645>
>> 
>> Thanks,
>> -Flavio
>> 
>> 
>>> On 13 Feb 2018, at 18:49, Kathryn Hogg <kathryn.h...@oati.net> wrote:
>>> 
>>> I'm actually using the WriteLock from the ZookeeperNetEx C# code but I've 
>>> verified that the same issue exists in the Java recipe.  On a busy system, 
>>> I'm fairly frequently seeing WriteLock that is never granted to client and 
>>> gets stuck.
>>> 
>>> What I believe is happening is the lock sets a watch on the request before 
>>> him via this code:
>>> 
>>>                          Stat stat = zookeeper.exists(lastChildId, new 
>>> LockWatcher());
>>>                          if (stat != null) {
>>>                              return Boolean.FALSE;
>>>                          } else {
>>>                              LOG.warn("Could not find the" +
>>>                                                              " stats for 
>>> less than me: " + lastChildName.getName());
>>>                          }
>>> 
>>> The problem (as I see it and I'm still fairly new to Zookeeper) is that if 
>>> the node represented by lastChildId has been deleted before the call to 
>>> exists is made, stat will return null and the watch will only ever be 
>>> invoked when the znode is created.  And of course that will never happen.
>>> 
>>> The message is appearing in my log and my watcher for the lock is never 
>>> invoked.
>>> 
>>> [2018-02-13 16:49:17.905 GMT    WARNING         WriteLock       Could not 
>>> find the stats for less than me: 
>>> /token/SegmentProfileQueueToken/x-72057953399865370-0000000724]
>>> 
>>> I'm not entirely sure of the proper way of fixing this but I think setting
>>>  Id = null;
>>> When stat is null should work.
>>> 
>>> Can someone verify if my analysis is correct?
>>> 
>>> --
>>> Kathryn Hogg
>>> Senior Manager Product Development
>>> Phone: 763.201.2000
>>> Fax: 763.201.5333
>>> Open Access Technology International, Inc.
>>> 3660 Technology Drive NE, Minneapolis, MN 55418
>>> 
>>> CONFIDENTIAL INFORMATION: This email and any attachment(s) contain 
>>> confidential and/or proprietary information of Open Access Technology 
>>> International, Inc.  Do not copy or distribute without the prior written 
>>> consent of OATI.  If you are not named a recipient to the message, please 
>>> notify the sender immediately and do not retain the message in any form, 
>>> printed or electronic.
>>> 
>> 
> 

Reply via email to