Sounds similar to https://issues.apache.org/jira/browse/CURATOR-171
On Tue, Jan 20, 2015 at 10:23 AM, Michael Peterson <[email protected]> wrote:

> Hi,
>
> I am fairly new to Curator and ZK, so apologies if this has been asked
> before. I haven't found anything yet that addresses it.
>
> My ZK use case is very simple - HA failover. Two processes get launched -
> one does the work and the other waits to take over in case the other dies
> or otherwise stops working.
>
> The Curator InterProcessMutex fits the bill. However, without too much
> effort I've found a scenario where Process A and Process B both think
> they are the owner at the same time and start doing the work, causing data
> corruption.
>
> The scenario is simply to delete the lock file, which I did via the ZK CLI
> (zkCli.sh). The problem is that the InterProcessMutex currently holding
> the lock doesn't seem to notice that the lock file got deleted, but the
> InterProcessMutex in the waiting (failover) process *does* notice and
> creates a new lock and starts doing work.
>
> Does the InterProcessMutex set a watch on the lock file it creates? If
> not, why not?
>
>
> Idea #1:
>
> I tried setting all the Listeners I could figure out how to set to detect
> the NodeDeleted event:
>
> - CuratorListener
> - ConnectionStateListener
> - UnhandledErrorListener
>
> but none get signaled when I manually delete the lock file.
>
>
> Idea #2:
>
> Is the solution to set my own watch on the lock file that the IPMutex
> created? If so, I see that one way to get the file name of the lock is to
> call InterProcessMutex#getParticipantNodes().
> But the problem is that there can be more than one lock file, it seems:
>
> [zk: localhost:2181(CONNECTED) 7] ls /XXX/masterlock
> [_c_c1dc399d-b6e4-4051-bd5c-2e300e62bc58-lock-0000000003,
> _c_bf5de8b2-ed33-4f89-a737-4061f2072c3f-lock-0000000000]
>
> [zk: localhost:2181(CONNECTED) 37] ls /XXX/masterlock
> [_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018, \
> _c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012, \
> _c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017, \
> _c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016]
>
> And it seems that I can't use the one with the smallest sequential lock
> number, because the smallest one might be hanging around from a crashed
> lockholder and it hasn't expired yet - that is the case in the above
> example: lock-0000000012 is just waiting to be expired after a crash.
>
> So I don't know how to tell which lock is "mine" to set a watch on using
> that method.
>
>
> Idea #3:
>
> I see that the InterProcessMutex also takes an optional
> `LockInternalsDriver` argument. I looked into that code and there I see
> that it has access to the lock file name. In addition, in the
> `getsTheLock` method it creates a PredicateResults object with a
> `pathToWatch` arg, which sounds promising, but in the default impl with my
> setup that pathToWatch is null.
>
> So I then created my own CustomLockInternalsDriver and put the lock-file
> name in pathToWatch (not sure that would work), but when I set
> `pathToWatch` to the actual lock path, still nothing happens when I delete
> the file.
>
> So then I recorded the path to my lock in the CustomLockInternalsDriver so
> I could get it in my mainline code and set a WATCH manually/myself. That
> ends up working. But that's a lot of work and it's not at all clear what
> the right solution is and whether it is dangerous to fiddle with creating
> my own LockInternalsDriver impl.
>
> What is the right way to solve this issue?
>
>
> --- How to REPRODUCE ---
>
> Here's a link to a gist with my test code:
> https://gist.github.com/quux00/f6be8fe223a7832ef514
> Also a gist to my CustomLockInternalsDriver:
> https://gist.github.com/quux00/ab37cedc46cb5368c853
>
> Start up two instances of that code. One will indicate it is "working" and
> the other "waiting". I then use zkCli.sh to delete the file:
>
> $ ./zkCli.sh
> [zk: localhost:2181(CONNECTED) 111] ls /XXX/masterlock
> [_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006]
> [zk: localhost:2181(CONNECTED) 112] delete /XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006
> [zk: localhost:2181(CONNECTED) 113] ls /XXX/masterlock
> []
>
> The "waiting" process will now create a new lock file and now both
> processes are "working".
>
> Thank you,
> Michael
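
On the Idea #2 / Idea #3 questions, here is a rough, self-contained model of the ordering logic, in case it helps. It is not Curator's code, and `LockOrderSketch`, `sequenceOf` and `nodeToWatch` are made-up names for illustration. It assumes the usual ZooKeeper lock recipe: the child with the lowest sequence number owns the lock, and every other participant watches the node immediately ahead of it. Under that assumption the owner has no predecessor, so there is nothing for it to watch, which would be consistent with pathToWatch coming back null for the process that holds the lock.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified model of lock-queue ordering; NOT Curator's implementation.
class LockOrderSketch {
    private static final String LOCK_MARKER = "-lock-";

    // Hypothetical helper: extract the sequence number ZooKeeper appends
    // to a sequential node name such as "_c_<uuid>-lock-0000000012".
    static long sequenceOf(String nodeName) {
        int idx = nodeName.lastIndexOf(LOCK_MARKER);
        if (idx < 0) {
            throw new IllegalArgumentException("not a lock node: " + nodeName);
        }
        return Long.parseLong(nodeName.substring(idx + LOCK_MARKER.length()));
    }

    // Hypothetical helper: returns null when ourNode is first in line
    // (it would own the lock), otherwise the node directly ahead of it.
    static String nodeToWatch(List<String> children, String ourNode) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong(LockOrderSketch::sequenceOf));
        int ourIndex = sorted.indexOf(ourNode);
        if (ourIndex < 0) {
            throw new IllegalStateException("our node is missing: " + ourNode);
        }
        return ourIndex == 0 ? null : sorted.get(ourIndex - 1);
    }

    public static void main(String[] args) {
        // Child names copied from the second "ls /XXX/masterlock" listing.
        List<String> children = Arrays.asList(
            "_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018",
            "_c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012",
            "_c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017",
            "_c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016");
        for (String child : children) {
            String watch = nodeToWatch(children, child);
            System.out.println(child + " -> "
                + (watch == null ? "first in line, nothing to watch" : "watches " + watch));
        }
    }
}
```

In this model the stale lock-0000000012 node from the listing sits at the head of the queue, and the other three each wait on the node directly ahead of them. The model says nothing about whether a holder notices its own node being deleted; that still needs a separate watch on the holder's own path, along the lines of the manual workaround described in Idea #3.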
