Thanks. That bug is for LeaderLatch. Should I open another bug on InterProcessMutex? Or just add commentary to the CURATOR-171 issue?
Can anyone address my workaround option (Idea #3 above), namely implementing my own custom LockInternalsDriver and setting my own WATCH on the lock file? Any ideas whether that will hit problems?

On Tue, Jan 20, 2015 at 10:46 AM, John Vines <[email protected]> wrote:

> Sounds similar to https://issues.apache.org/jira/browse/CURATOR-171
>
> On Tue, Jan 20, 2015 at 10:23 AM, Michael Peterson <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am fairly new to Curator and ZK, so apologies if this has been asked
>> before. I haven't found anything yet that addresses it.
>>
>> My ZK use case is very simple - HA failover. Two processes get launched
>> - one does the work and the other waits to take over in case the other dies
>> or otherwise stops working.
>>
>> The Curator InterProcessMutex fits the bill. However, without too much
>> effort I've found a scenario where Process A and Process B both think
>> they are the owner at the same time and start doing the work, causing data
>> corruption.
>>
>> The scenario is simply to delete the lock file, which I did via the ZK
>> CLI (zkCli.sh). The problem is that the InterProcessMutex currently
>> holding the lock doesn't seem to notice that the lock file got deleted, but
>> the InterProcessMutex in the waiting (failover) process *does* notice and
>> creates a new lock and starts doing work.
>>
>> Does the InterProcessMutex set a watch on the lock file it creates? If
>> not, why not?
>>
>>
>> Idea #1:
>>
>> I tried setting all the Listeners I could figure out how to set to detect
>> the NodeDeleted event:
>>
>> - CuratorListener
>> - ConnectionStateListener
>> - UnhandledErrorListener
>>
>> but none get signaled when I manually delete the lock file.
>>
>>
>> Idea #2:
>>
>> Is the solution to set my own watch on the lock file that the IPMutex
>> created? If so, I see that one way to get the file name of the lock is to
>> call InterProcessMutex#getParticipantNodes(). But the problem is that
>> there can be more than one lock file, it seems:
>>
>> [zk: localhost:2181(CONNECTED) 7] ls /XXX/masterlock
>> [_c_c1dc399d-b6e4-4051-bd5c-2e300e62bc58-lock-0000000003,
>> _c_bf5de8b2-ed33-4f89-a737-4061f2072c3f-lock-0000000000]
>>
>> [zk: localhost:2181(CONNECTED) 37] ls /XXX/masterlock
>> [_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018, \
>> _c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012, \
>> _c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017, \
>> _c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016]
>>
>> And it seems that I can't use the one with the smallest sequential lock
>> number, because the smallest one might be hanging around from a crashed
>> lockholder and it hasn't expired yet - that is the case in the above
>> example: lock-0000000012 is just waiting to be expired after a crash.
>>
>> So I don't know how to tell which lock is "mine" to set a watch on using
>> that method.
>>
>>
>> Idea #3:
>>
>> I see that the InterProcessMutex also takes an optional
>> `LockInternalsDriver` argument. I looked into that code and there I see
>> that it has access to the lock file name. In addition, in the
>> `getsTheLock` method it creates a PredicateResults object with a
>> `pathToWatch` arg, which sounds promising, but in the default impl with my
>> setup that pathToWatch is null.
>>
>> So I then created my own CustomLockInternalsDriver and put the lock-file
>> name in pathToWatch (not sure that would work), but when I set
>> `pathToWatch` to the actual lock path, still nothing happens when I delete
>> the file.
>>
>> So then I recorded the path to my lock in the CustomLockInternalsDriver
>> so I could get it in my mainline code and set a WATCH manually/myself.
>> That ends up working. But that's a lot of work and it's not at all clear
>> what the right solution is and whether it is dangerous to fiddle with
>> creating my own LockInternalsDriver impl.
>>
>> What is the right way to solve this issue?
>>
>>
>> --- How to REPRODUCE ---
>>
>> Here's a link to a gist with my test code:
>> https://gist.github.com/quux00/f6be8fe223a7832ef514
>> Also a gist to my CustomLockInternalsDriver:
>> https://gist.github.com/quux00/ab37cedc46cb5368c853
>>
>> Start up two instances of that code. One will indicate it is "working"
>> and the other "waiting". I then use zkCli.sh to delete the file:
>>
>> $ ./zkCli.sh
>> [zk: localhost:2181(CONNECTED) 111] ls /XXX/masterlock
>> [_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006]
>> [zk: localhost:2181(CONNECTED) 112] delete /XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006
>> [zk: localhost:2181(CONNECTED) 113] ls /XXX/masterlock
>> []
>>
>> The "waiting" process will now create a new lock file and now both
>> processes are "working".
>>
>> Thank you,
>> Michael
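
To illustrate the `pathToWatch` observation in Idea #3: below is a minimal plain-Java sketch of the ordering rule a sequential-node lock typically applies, assuming the default driver sorts the children by sequence suffix, gives the lock to the lowest one, and has every other participant watch its immediate predecessor. The class and method names (`LockOrderSketch`, `sequenceOf`, `pathToWatch`) are hypothetical and this is not Curator's source. It only shows why, under that assumption, the current holder ends up with a null `pathToWatch` and so has no watch that would fire when its own node is deleted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not Curator code: models the "lowest sequence holds the
// lock, everyone else watches the node just before theirs" rule.
final class LockOrderSketch {
    // Lock children in the listings above look like "_c_<uuid>-lock-<sequence>".
    static final String LOCK_NAME = "-lock-";

    // Extracts the numeric sequence suffix, e.g. "...-lock-0000000012" gives 12.
    static long sequenceOf(String child) {
        int idx = child.lastIndexOf(LOCK_NAME);
        if (idx < 0) {
            throw new IllegalArgumentException("not a lock node: " + child);
        }
        return Long.parseLong(child.substring(idx + LOCK_NAME.length()));
    }

    // Returns a copy of the children ordered by sequence number, ascending.
    static List<String> sortedBySequence(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong(LockOrderSketch::sequenceOf));
        return sorted;
    }

    // Returns null when ourNode is first in line (it holds the lock and watches
    // nothing); otherwise returns the predecessor node a waiter would watch.
    static String pathToWatch(List<String> children, String ourNode) {
        List<String> sorted = sortedBySequence(children);
        int ourIndex = sorted.indexOf(ourNode);
        if (ourIndex < 0) {
            throw new IllegalStateException("our node is gone: " + ourNode);
        }
        return ourIndex == 0 ? null : sorted.get(ourIndex - 1);
    }

    public static void main(String[] args) {
        // The four children from the second "ls /XXX/masterlock" listing.
        List<String> children = List.of(
            "_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018",
            "_c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012",
            "_c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017",
            "_c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016");
        for (String child : sortedBySequence(children)) {
            System.out.println(child + " watches " + pathToWatch(children, child));
        }
    }
}
```

With the four-node listing, the node ending in 0000000012 is first in line and gets a null path to watch, while the node ending in 0000000016 would watch it. If the default driver does behave this way, a holder that wants to learn about the deletion of its own node needs a separate watch on that node, which is what the workaround in Idea #3 amounts to.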
