But manually deleting the lock node is not normal behavior. It should never 
happen in production. Can you explain the scenario in more detail? 

-JZ



On January 20, 2015 at 10:47:20 AM, John Vines ([email protected]) wrote:

Sounds similar to https://issues.apache.org/jira/browse/CURATOR-171

On Tue, Jan 20, 2015 at 10:23 AM, Michael Peterson <[email protected]> wrote:
Hi,

I am fairly new to Curator and ZK, so apologies if this has been asked 
before.  I haven't found anything yet that addresses it.

My ZK use case is very simple - HA failover.  Two processes get launched - one 
does the work and the other waits to take over in case the first dies or 
otherwise stops working.

The Curator InterProcessMutex fits the bill.  However, without too much effort 
I've found a scenario where Process A and Process B both think they are the 
owner at the same time and start doing the work, causing data corruption.

The scenario is simply to delete the lock file, which I did via the ZK CLI 
(zkCli.sh).  The problem is that the InterProcessMutex currently holding the 
lock doesn't seem to notice that the lock file got deleted, but the 
InterProcessMutex in the waiting (failover) process *does* notice and creates a 
new lock and starts doing work.

Does the InterProcessMutex set a watch on the lock file it creates?  If not, 
why not?


Idea #1:

I tried setting all the Listeners I could figure out how to set to detect the 
NodeDeleted event:

- CuratorListener
- ConnectionStateListener
- UnhandledErrorListener

but none get signaled when I manually delete the lock file.


Idea #2:

Is the solution to set my own watch on the lock file that the IPMutex created?  
If so, I see that one way to get the file name of the lock is to call 
InterProcessMutex#getParticipantNodes().  But the problem is that there can be 
more than one lock file, it seems:

    [zk: localhost:2181(CONNECTED) 7] ls /XXX/masterlock
    [_c_c1dc399d-b6e4-4051-bd5c-2e300e62bc58-lock-0000000003, \
     _c_bf5de8b2-ed33-4f89-a737-4061f2072c3f-lock-0000000000]

    [zk: localhost:2181(CONNECTED) 37] ls /XXX/masterlock
    [_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018, \
     _c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012, \
     _c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017, \
     _c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016]

And it seems that I can't use the one with the smallest sequential lock number, 
because the smallest one might be hanging around from a crashed lockholder and 
it hasn't expired yet - that is the case in the above example: lock-0000000012 
is just waiting to be expired after a crash.

So I don't know how to tell which lock is "mine" to set a watch on using that 
method.
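
To make the ordering problem concrete, here is a minimal sketch that sorts the 
child names from the second listing by their sequence suffix.  `LockNodeOrder` 
is a hypothetical helper, not part of Curator, and it assumes the names end in 
"-lock-" followed by the sequence digits, as they do in the listings above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of Curator): orders lock children by sequence number. */
final class LockNodeOrder {
    // Assumption: every child name ends in "-lock-" plus the ZooKeeper sequence digits.
    private static final String MARKER = "-lock-";

    /** Parses the sequence number that follows the last "-lock-" in a child name. */
    static long sequenceOf(String childName) {
        int idx = childName.lastIndexOf(MARKER);
        if (idx < 0) {
            throw new IllegalArgumentException("not a lock node: " + childName);
        }
        return Long.parseLong(childName.substring(idx + MARKER.length()));
    }

    /** Returns a copy of the children sorted by ascending sequence number. */
    static List<String> sortBySequence(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong(LockNodeOrder::sequenceOf));
        return sorted;
    }

    public static void main(String[] args) {
        // Child names copied from the second `ls /XXX/masterlock` listing.
        List<String> children = Arrays.asList(
            "_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018",
            "_c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012",
            "_c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017",
            "_c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016");
        for (String child : sortBySequence(children)) {
            System.out.println(sequenceOf(child) + "  " + child);
        }
    }
}
```

The sort only shows which node is at the head of the queue.  It cannot say 
which node belongs to the calling process, so the process still has to 
remember the path of the node it created itself.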



Idea #3:

I see that the InterProcessMutex also takes an optional `LockInternalsDriver` 
argument.  I looked into that code and there I see that it has access to the 
lock file name.  In addition, in the `getsTheLock` method it creates a 
PredicateResults object with a `pathToWatch` arg, which sounds promising, but 
in the default impl with my setup that pathToWatch is null. 

So I then created my own CustomLockInternalsDriver and set `pathToWatch` to the 
actual lock path (not sure that would work), but still nothing happens when I 
delete the file.

So then I recorded the path to my lock in the CustomLockInternalsDriver so I 
could get it in my mainline code and set a watch on it myself.  That ends up 
working.  But that's a lot of work, it's not at all clear that it is the right 
solution, and I don't know whether it is dangerous to fiddle with creating my 
own LockInternalsDriver impl.
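
The bookkeeping behind that workaround can be sketched without ZooKeeper at 
all.  `LockOwnershipGuard` below is a hypothetical helper, not part of Curator 
and not the code in the gists: it remembers the path of the lock node this 
process created and flips to "lost" when a deletion is reported for exactly 
that path.  Registering the watch itself is assumed to happen elsewhere, for 
example with Curator's `checkExists().usingWatcher(...).forPath(...)`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical helper (not part of Curator): tracks whether the lock node
 * created by this process still exists, based on reported deletion events.
 */
final class LockOwnershipGuard {
    private final String myLockPath;
    private final AtomicBoolean held = new AtomicBoolean(true);

    LockOwnershipGuard(String myLockPath) {
        this.myLockPath = myLockPath;
    }

    /**
     * Intended to be called from the watch callback when a NodeDeleted event
     * arrives. Returns true only on the transition from held to lost, so the
     * caller can react once; deletions of other nodes are ignored.
     */
    boolean onNodeDeleted(String deletedPath) {
        if (!myLockPath.equals(deletedPath)) {
            return false;
        }
        return held.compareAndSet(true, false);
    }

    /** The work loop checks this before each unit of work. */
    boolean stillHeld() {
        return held.get();
    }

    public static void main(String[] args) {
        // Lock path taken from the reproduction steps in this message.
        String mine =
            "/XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006";
        LockOwnershipGuard guard = new LockOwnershipGuard(mine);

        for (int step = 1; step <= 5; step++) {
            if (step == 3) {
                // Simulates the watch firing after `delete` is run in zkCli.sh.
                guard.onNodeDeleted(mine);
            }
            if (!guard.stillHeld()) {
                System.out.println("lock node gone, stopping before step " + step);
                break;
            }
            System.out.println("working, step " + step);
        }
    }
}
```

Watch notifications are delivered asynchronously, so a guard like this narrows 
the window in which two processes both work but does not close it.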

What is the right way to solve this issue?


--- How to REPRODUCE ---

Here's a link to a gist with my test code:   
https://gist.github.com/quux00/f6be8fe223a7832ef514
Also a gist to my CustomLockInternalsDriver: 
https://gist.github.com/quux00/ab37cedc46cb5368c853

Start up two instances of that code. One will indicate it is "working" and the 
other "waiting". I then use zkCli.sh to delete the file:

    $ ./zkCli.sh
    [zk: localhost:2181(CONNECTED) 111] ls /XXX/masterlock
    [_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006]
    [zk: localhost:2181(CONNECTED) 112] delete /XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006
    [zk: localhost:2181(CONNECTED) 113] ls /XXX/masterlock
    []

The "waiting" process will now create a new lock file and now both processes 
are "working".

Thank you,
Michael

