You are right that there is a race between getting the children and checking whether the predecessor is there. If we fail to set a watcher, then we won't try locking again. I think setting id to null will work because it forces another iteration of the do/while loop, which checks whether there is still some predecessor, setting a watcher accordingly. The number of iterations should be finite because we eventually hit the owned znode.
It might better to add a different condition for clarity, like `while (Id == null || watcherNotSet)`. In any case, I'd appreciate if you could chime in and contribute your changes to https://issues.apache.org/jira/browse/ZOOKEEPER-645 <https://issues.apache.org/jira/browse/ZOOKEEPER-645>. -Flavio > On 13 Feb 2018, at 22:01, Kathryn Hogg <[email protected]> wrote: > > Hey Flavio, > > FYI, I'm on 3.4.11 if that makes a difference. > > The problem is > 1. In order to get to the exists() call, we've already determined that > id is not null. > 2. After stat returns null, we log a message but do not reset stat to > null. > 3. The while loop will terminate because id is not null > > I think this can fixed by setting id to null if stat returns null. This is > similar to what is done when lessThanMe is not Empty. > > Here's an outline of the code I'm talking about (my suggested change is after > the LOG.warn()): > > do { > if (id == null) { > .... > } > If (id != null) { > if (names.IsEmpty()) { > ... > Id = null; > } else { > .... > If (!lessThanMe.isEmpty()) > Stat stat = zookeeper.exists(lastChildId, new > LockWatcher()); > if (stat != null) { > return Boolean.FALSE; > } else { > LOG.warn("Could not find the" + > " stats for less than me: " + > lastChildName.getName()); > /* id = null; */ // id is not null here so > the while loop will terminate > } > } else { > if (isOwner()) { > if (callback != null) { > callback.lockAcquired(); > } > return Boolean.TRUE; > } > } > } > } > } > while (id == null); > > > -----Original Message----- > From: Flavio Junqueira [mailto:[email protected]] > Sent: Tuesday, February 13, 2018 2:45 PM > To: [email protected] > Subject: Re: Bug in WriteLock recipe > > {External email message: This email is from an external source. Please > exercise caution prior to opening attachments, clicking on links, or > providing any sensitive information.} > > Hi Kathryn, > > Every time that execute method is invoked, it will get children. From your > description, in the case the predecessor node is deleted and stat is null, > the next call will not contain that predecessor znode. Consequently, it won't > happen indefinitely. Makes sense? > > There is actually an old issue about WriteLock that you may want to have a > look: > > https://issues.apache.org/jira/browse/ZOOKEEPER-645 > <https://issues.apache.org/jira/browse/ZOOKEEPER-645> > > Thanks, > -Flavio > > >> On 13 Feb 2018, at 18:49, Kathryn Hogg <[email protected]> wrote: >> >> I'm actually using the WriteLock from the ZookeeperNetEx C# code but I've >> verified that the same issue exists in the Java recipe. On a busy system, >> I'm fairly frequently seeing WriteLock that is never granted to client and >> gets stuck. >> >> What I believe is happening is the lock sets a watch on the request before >> him via this code: >> >> Stat stat = zookeeper.exists(lastChildId, new >> LockWatcher()); >> if (stat != null) { >> return Boolean.FALSE; >> } else { >> LOG.warn("Could not find the" + >> " stats for >> less than me: " + lastChildName.getName()); >> } >> >> The problem (as I see it and I'm still fairly new to Zookeeper) is that if >> the node represented by lastChildId has been deleted before the call to >> exists is made, stat will return null and the watch will only ever be >> invoked when the znode is created. And of course that will never happen. >> >> The message is appearing in my log and my watcher for the lock is never >> invoked. >> >> [2018-02-13 16:49:17.905 GMT WARNING WriteLock Could not >> find the stats for less than me: >> /token/SegmentProfileQueueToken/x-72057953399865370-0000000724] >> >> I'm not entirely sure of the proper way of fixing this but I think setting >> Id = null; >> When stat is null should work. >> >> Can someone verify if my analysis is correct? >> >> -- >> Kathryn Hogg >> Senior Manager Product Development >> Phone: 763.201.2000 >> Fax: 763.201.5333 >> Open Access Technology International, Inc. >> 3660 Technology Drive NE, Minneapolis, MN 55418 >> >> CONFIDENTIAL INFORMATION: This email and any attachment(s) contain >> confidential and/or proprietary information of Open Access Technology >> International, Inc. Do not copy or distribute without the prior written >> consent of OATI. If you are not named a recipient to the message, please >> notify the sender immediately and do not retain the message in any form, >> printed or electronic. >> >
