Tim, I thought this was an interesting read:
http://www.time-travellers.org/shane/papers/NFS_considered_harmful.html

Jim

On Sun, Jun 21, 2015 at 09:45 Tim Bain <tb...@alumni.duke.edu> wrote:

> Thanks for the feedback and questions.
>
> I hadn't considered any of the buffering/flushing/synchronization aspects
> of the underlying NFS configuration, but you're absolutely right that the
> guidelines for how to configure this solution need to acknowledge the
> interplay between the two sets of settings and provide guidelines for how
> to select settings for each one in order to achieve the goals. I have no
> experience with configuring NFS, so although I understand the theory that
> writes can be deferred, it would be great to have people who understand
> those aspects be involved when that documentation gets written.
>
> Re: the type of lock to be used in the solution I'm using, I have no idea
> (due to lack of experience with NFS other than as a user) and would
> appreciate suggestions from those who are more knowledgeable about NFS. I
> had assumed we'd use the same type of lock as is currently being used (with
> the current behavior in the face of network failures), but if another type
> would be more appropriate, that would be great to know about.
>
> Re: the master re-reading the data file, I wasn't planning on it but it
> could be done. What I proposed would have a separate thread checking the
> content of the file on the same periodicity (but not necessarily
> synchronized so not necessarily occurring at the same time) as the writing
> thread. If locking is working correctly and network access is reasonably
> stable, if another process has written to the lock file (such that the
> write thread would see a difference just before it wrote), then the write
> would fail if the other process still held the lock, and it would succeed
> (as you'd want it to) if the other process had lost the lock due to losing
> network access.
> So under normal conditions, there wouldn't be a need to check content
> before writing. The only scenario I can think of where it might gain you
> something is if network access is winking in and out rapidly for the
> different processes, so they get writes in and then immediately lose the
> lock and someone else gets a write in and immediately loses the lock. If
> that's happening, and if the reading thread is timed so that it reads the
> file after the write thread writes to the file and before the other
> processes write to the file during this cycle, then multiple processes
> could each think they've got the lock long enough to become master. That
> seems like a very unlikely scenario, but easy enough to guard against by
> doing the read-before-write that you asked about, so I think it's worth
> doing.
>
> Tim
>
> On Fri, Jun 19, 2015 at 10:19 AM, James A. Robinson <j...@highwire.org>
> wrote:
>
> > On Mon, Jun 15, 2015 at 7:08 AM Tim Bain <tb...@alumni.duke.edu> wrote:
> >
> > > It seems pretty clear that the assumption that acquiring a single file
> > > lock without doing any further checks will provide thread-safety in
> > > all cases is not an accurate one.
> > >
> > > As I see it, here are the failures of the existing approach:
> > >
> > >    - Failure #1 is that when an NFS failure occurs, the master broker
> > >    never recognizes that an NFS failure has occurred and never
> > >    recognizes that the slave broker has replaced it as the master. The
> > >    broker needs to identify those things even when it has no other
> > >    reason to write to NFS.
> > >
> > >    - Failure #2 is that the slave broker believes that it can
> > >    immediately become the master. This wouldn't be a problem if the
> > >    master broker instantaneously recognized that it has been
> > >    supplanted and immediately ceded control, but assuming anything
> > >    happens instantaneously (especially across multiple machines) is
> > >    pretty unrealistic.
> > >    This means there will be a period of unresponsiveness when a
> > >    failover occurs.
> > >
> > >    - Failure #3 is that once the master recognizes that it no longer
> > >    is the master, it needs to abort all pending writes (in a way that
> > >    guarantees that the data files will not be corrupted if NFS returns
> > >    when some have been cancelled and others have not).
> >
> > I don't know if this has been called out already, but it will be
> > important for users to coordinate how their NFS is configured. For
> > example, activating asynchronous writes on the NFS side would
> > probably make a mess of any assumptions we can make from the client
> > side. I also wonder how buffering might affect how a heartbeat
> > would have to work, for instance whether mtime won't get propagated
> > until enough data has been written to cause a flush to disk.
> >
> > > I've got a proposed solution that I think will address all of these
> > > failures, but hopefully others will see ways to improve it. Here's
> > > what I've got in mind:
> > >
> > >    1. It is no longer sufficient to hold an NFS lock on the DB lock
> > >    file. In order to maintain master status, you must successfully
> > >    write to the DB lock file within some time period. If you fail to
> > >    do so within that time period, you must close all use of the DB
> > >    files and relinquish master status to another broker.
> > >
> > >    2. When the master shuts down during normal NFS circumstances, it
> > >    will delete the DB lock file. If at any point a slave broker sees
> > >    that there is no DB lock file or that the DB lock file is so stale
> > >    that the master must have shut down (more on that later), it may
> > >    immediately attempt to create one and begin writing to it. If that
> > >    write succeeds, it is the master.
> >
> > Random curiosity, is this a network lock manager (NLM) based lock?
> >
> > >    3. All brokers should determine whether the current master is still
> > >    alive by checking the current content of the DB lock file against
> > >    the content read the last time you checked, rather than simply
> > >    locking the file and assuming that tells you who's got ownership.
> > >    This means that the current master needs to update some content in
> > >    the DB lock file to make it unique on each write; I propose that
> > >    the content of the file be the broker's host, the broker's PID, the
> > >    current local time on the broker at the time it did its write, and
> > >    a UUID that will guarantee uniqueness of the content from write to
> > >    write even in the face of time issues. Note that only the UUID is
> > >    actually required for this algorithm to work, but I think that
> > >    having the other information will make it easier to troubleshoot.
> >
> > That sounds good.
> >
> > >    4. Because time can drift between machines, it is not sufficient to
> > >    compare the write date on the DB lock file with your host's current
> > >    time when determining that a file is stale; you must successfully
> > >    read the file repeatedly over a time period and receive the same
> > >    value each time in order to decide that the DB lock file is stale.
> > >
> > >    5. The master should use the same approach as the slaves to
> > >    determine if it's still in control, by checking for changes to the
> > >    content of the DB lock file. This means the master needs to
> > >    positively confirm that each periodic write to the DB lock file
> > >    succeeded by reading it back (in a separate thread, using a timeout
> > >    on the read operation to identify situations where NFS doesn't
> > >    respond), rather than simply assuming that its call to write()
> > >    worked successfully.
> > >
> > >    6. When a slave determines that the master has failed to write to
> > >    the DB lock file for longer than the timeout, it attempts to
> > >    acquire the write lock on the DB lock file and write to it to
> > >    become the master. If it succeeds (because the master has lost NFS
> > >    connectivity but the slave has not), it is the master. If it fails
> > >    because another slave acquired the lock first or because the slave
> > >    has lost NFS connectivity, it goes back to monitoring for the
> > >    master to fail to write to the DB lock file.
> >
> > Would the master re-read the data file again just before the next
> > write as a guard against some other process somehow snagging the
> > lock out from under it?
> >
> > Jim
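
For anyone following along, here is a rough Python sketch of the content-based heartbeat described in items 1 through 5 of the quoted proposal. It is only an illustration under stated assumptions, not the broker's actual code: every name in it (make_heartbeat, write_heartbeat, read_heartbeat, StalenessMonitor, confirm_write) is hypothetical, the "|" separated file format is made up, and acquiring the file lock itself is left out entirely. Note also that os.fsync() only asks for a flush; whether the data really reaches the NFS server's disk depends on how NFS is configured, which is the open question raised in the thread.

```python
import concurrent.futures
import os
import socket
import time
import uuid


def make_heartbeat(host=None, pid=None):
    """Item 3: host, PID, local time, and a UUID that makes each write unique."""
    host = host or socket.gethostname()
    pid = os.getpid() if pid is None else pid
    return f"{host}|{pid}|{time.time():.3f}|{uuid.uuid4()}"


def write_heartbeat(path, content):
    """Item 1: the master must keep writing; fsync only *requests* a flush."""
    with open(path, "w") as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())


def read_heartbeat(path):
    """Return the lock file content, or None if the file does not exist."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None


class StalenessMonitor:
    """Item 4: judge staleness by unchanged content, never by mtime.

    Only this host's monotonic clock is used, so clock drift between
    machines does not affect the decision.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_content = None
        self.last_change = None

    def observe(self, content):
        """Record one read; return True once the master looks dead."""
        now = self.clock()
        if content is None:
            return True  # item 2: no lock file, a slave may try immediately
        if self.last_change is None or content != self.last_content:
            self.last_content = content
            self.last_change = now
            return False
        return now - self.last_change >= self.timeout_s


def confirm_write(path, expected, timeout_s, pool):
    """Item 5: read back in another thread, giving up if NFS does not answer.

    A hung read keeps occupying its worker thread; this sketch does not
    try to cancel it.
    """
    future = pool.submit(read_heartbeat, path)
    try:
        return future.result(timeout=timeout_s) == expected
    except concurrent.futures.TimeoutError:
        return False
```

A slave would call observe(read_heartbeat(path)) on each polling cycle and only attempt to take the lock (item 6) once it returns True. The master would call confirm_write() after each write_heartbeat() and give up master status if it returns False.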