Thanks Flavio,
I'm debugging this and found another issue in the code in findPrefixInChildren()
private void findPrefixInChildren(String prefix, ZooKeeper zookeeper,
String dir)
throws KeeperException, InterruptedException {
List<String> names = zookeeper.getChildren(dir, false);
for (String name : names) {
if (name.startsWith(prefix)) {
id = name; /* THIS DOES NOT HAVE THE FULL PATH */
if (LOG.isDebugEnabled()) {
LOG.debug("Found id created last time: " + id);
}
break;
}
}
if (id == null) {
id = zookeeper.create(dir + "/" + prefix, data,
getAcl(), EPHEMERAL_SEQUENTIAL); /* THIS HAS THE FULL
PATH */
if (LOG.isDebugEnabled()) {
LOG.debug("Created id: " + id);
}
}
}
If we find the node in the children, we set id to x-$session-$sequence. If we
create the znode, id is set to $dir/x-$session-$sequence
I believe the first case should be
Id = dir + "/" + name;
-----Original Message-----
From: Flavio Junqueira [mailto:[email protected]]
Sent: Tuesday, February 13, 2018 3:58 PM
To: [email protected]
Subject: Re: Bug in WriteLock recipe
{External email message: This email is from an external source. Please exercise
caution prior to opening attachments, clicking on links, or providing any
sensitive information.}
You are right that there is a race between getting the children and checking
whether the predecessor is there. If we fail to set a watcher, then we won't
try locking again. I think setting id to null will work because it forces
another iteration of the do/while loop, which checks whether there is still
some predecessor, setting a watcher accordingly. The number of iterations
should be finite because we eventually hit the owned znode.
It might better to add a different condition for clarity, like `while (Id ==
null || watcherNotSet)`. In any case, I'd appreciate if you could chime in and
contribute your changes to https://issues.apache.org/jira/browse/ZOOKEEPER-645
<https://issues.apache.org/jira/browse/ZOOKEEPER-645>.
-Flavio
> On 13 Feb 2018, at 22:01, Kathryn Hogg <[email protected]> wrote:
>
> Hey Flavio,
>
> FYI, I'm on 3.4.11 if that makes a difference.
>
> The problem is
> 1. In order to get to the exists() call, we've already determined that
> id is not null.
> 2. After stat returns null, we log a message but do not reset stat to
> null.
> 3. The while loop will terminate because id is not null
>
> I think this can fixed by setting id to null if stat returns null. This is
> similar to what is done when lessThanMe is not Empty.
>
> Here's an outline of the code I'm talking about (my suggested change is after
> the LOG.warn()):
>
> do {
> if (id == null) {
> ....
> }
> If (id != null) {
> if (names.IsEmpty()) {
> ...
> Id = null;
> } else {
> ....
> If (!lessThanMe.isEmpty())
> Stat stat = zookeeper.exists(lastChildId, new
> LockWatcher());
> if (stat != null) {
> return Boolean.FALSE;
> } else {
> LOG.warn("Could not find the" +
> " stats for less than me: " +
> lastChildName.getName());
> /* id = null; */ // id is not null here so
> the while loop will terminate
> }
> } else {
> if (isOwner()) {
> if (callback != null) {
> callback.lockAcquired();
> }
> return Boolean.TRUE;
> }
> }
> }
> }
> }
> while (id == null);
>
>
> -----Original Message-----
> From: Flavio Junqueira [mailto:[email protected]]
> Sent: Tuesday, February 13, 2018 2:45 PM
> To: [email protected]
> Subject: Re: Bug in WriteLock recipe
>
> {External email message: This email is from an external source. Please
> exercise caution prior to opening attachments, clicking on links, or
> providing any sensitive information.}
>
> Hi Kathryn,
>
> Every time that execute method is invoked, it will get children. From your
> description, in the case the predecessor node is deleted and stat is null,
> the next call will not contain that predecessor znode. Consequently, it won't
> happen indefinitely. Makes sense?
>
> There is actually an old issue about WriteLock that you may want to have a
> look:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-645
> <https://issues.apache.org/jira/browse/ZOOKEEPER-645>
>
> Thanks,
> -Flavio
>
>
>> On 13 Feb 2018, at 18:49, Kathryn Hogg <[email protected]> wrote:
>>
>> I'm actually using the WriteLock from the ZookeeperNetEx C# code but I've
>> verified that the same issue exists in the Java recipe. On a busy system,
>> I'm fairly frequently seeing WriteLock that is never granted to client and
>> gets stuck.
>>
>> What I believe is happening is the lock sets a watch on the request before
>> him via this code:
>>
>> Stat stat = zookeeper.exists(lastChildId, new
>> LockWatcher());
>> if (stat != null) {
>> return Boolean.FALSE;
>> } else {
>> LOG.warn("Could not find the" +
>> " stats for
>> less than me: " + lastChildName.getName());
>> }
>>
>> The problem (as I see it and I'm still fairly new to Zookeeper) is that if
>> the node represented by lastChildId has been deleted before the call to
>> exists is made, stat will return null and the watch will only ever be
>> invoked when the znode is created. And of course that will never happen.
>>
>> The message is appearing in my log and my watcher for the lock is never
>> invoked.
>>
>> [2018-02-13 16:49:17.905 GMT WARNING WriteLock Could not
>> find the stats for less than me:
>> /token/SegmentProfileQueueToken/x-72057953399865370-0000000724]
>>
>> I'm not entirely sure of the proper way of fixing this but I think setting
>> Id = null;
>> When stat is null should work.
>>
>> Can someone verify if my analysis is correct?
>>
>> --
>> Kathryn Hogg
>> Senior Manager Product Development
>> Phone: 763.201.2000
>> Fax: 763.201.5333
>> Open Access Technology International, Inc.
>> 3660 Technology Drive NE, Minneapolis, MN 55418
>>
>> CONFIDENTIAL INFORMATION: This email and any attachment(s) contain
>> confidential and/or proprietary information of Open Access Technology
>> International, Inc. Do not copy or distribute without the prior written
>> consent of OATI. If you are not named a recipient to the message, please
>> notify the sender immediately and do not retain the message in any form,
>> printed or electronic.
>>
>