Tim,

I thought this was an interesting read:

http://www.time-travellers.org/shane/papers/NFS_considered_harmful.html

Jim

On Sun, Jun 21, 2015 at 09:45 Tim Bain <tb...@alumni.duke.edu> wrote:

> Thanks for the feedback and questions.
>
> I hadn't considered any of the buffering/flushing/synchronization aspects
> of the underlying NFS configuration, but you're absolutely right that the
> guidelines for how to configure this solution need to acknowledge the
> interplay between the two sets of settings and provide guidelines for how
> to select settings for each one in order to achieve the goals.  I have no
> experience with configuring NFS, so although I understand the theory that
> writes can be deferred, it would be great to have people who understand
> those aspects be involved when that documentation gets written.
>
> Re: the type of lock to be used in the solution I'm using, I have no idea
> (due to lack of experience with NFS other than as a user) and would
> appreciate suggestions from those who are more knowledgeable about NFS.  I
> had assumed we'd use the same type of lock as is currently being used (with
> the current behavior in the face of network failures), but if another type
> would be more appropriate, that would be great to know about.
>
> Re: the master re-reading the data file, I wasn't planning on it but it
> could be done.  What I proposed would have a separate thread checking the
> content of the file on the same periodicity (but not necessarily
> synchronized so not necessarily occurring at the same time) as the writing
> thread.  If locking is working correctly and network access is reasonably
> stable, and another process has written to the lock file (such that the
> write thread would see a difference just before it wrote), then the write
> would fail if the other process still held the lock, and it would succeed
> (as you'd want it to) if the other process had lost the lock due to losing
> network access.  So under normal conditions, there wouldn't be a need to
> check content before writing.  The only scenario I can think of where it
> might gain you something is if network access is winking in and out rapidly
> for the different processes, so they get writes in and then immediately
> lose the lock and someone else gets a write in and immediately loses the
> lock.  If that's happening, and if the reading thread is timed so that it
> reads the file after the write thread writes to the file and before the
> other processes write to the file during this cycle, then multiple
> processes could each think they've got the lock long enough to become
> master.  That seems like a very unlikely scenario, but easy enough to guard
> against by doing the read-before-write that you asked about, so I think
> it's worth doing.
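>
> A minimal sketch of that read-before-write guard might look something like
> the following (purely illustrative Java, not actual broker code; the class
> and field names are made up and error handling is omitted):
>
>     import java.nio.charset.StandardCharsets;
>     import java.nio.file.Files;
>     import java.nio.file.Path;
>
>     class KeepAliveWriter {
>         private final Path lockFile;
>         private volatile String lastWrittenToken;
>
>         KeepAliveWriter(Path lockFile) {
>             this.lockFile = lockFile;
>         }
>
>         /** Returns false if the caller should step down instead of writing. */
>         boolean writeHeartbeat(String newToken) throws Exception {
>             // If someone else's token is already in the file, mastership has
>             // been lost; don't overwrite their content.
>             String current = new String(Files.readAllBytes(lockFile),
>                                         StandardCharsets.UTF_8);
>             if (lastWrittenToken != null && !current.equals(lastWrittenToken)) {
>                 return false;  // another broker wrote since our last write
>             }
>             Files.write(lockFile, newToken.getBytes(StandardCharsets.UTF_8));
>             lastWrittenToken = newToken;
>             return true;
>         }
>     }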
>
> Tim
>
> On Fri, Jun 19, 2015 at 10:19 AM, James A. Robinson <j...@highwire.org>
> wrote:
>
> > On Mon, Jun 15, 2015 at 7:08 AM Tim Bain <tb...@alumni.duke.edu> wrote:
> > >
> > > It seems pretty clear that the assumption that acquiring a single file
> > > lock without doing any further checks will provide thread-safety in all
> > > cases is not an accurate one.
> > >
> > > As I see it, here are the failures of the existing approach:
> > >
> > >    - Failure #1 is that when an NFS failure occurs, the master broker
> > >    never recognizes that an NFS failure has occurred and never recognizes
> > >    that the slave broker has replaced it as the master.  The broker needs
> > >    to identify those things even when it has no other reason to write to
> > >    NFS.
> > >
> > >    - Failure #2 is that the slave broker believes that it can immediately
> > >    become the master.  This wouldn't be a problem if the master broker
> > >    instantaneously recognized that it has been supplanted and immediately
> > >    ceded control, but assuming anything happens instantaneously
> > >    (especially across multiple machines) is pretty unrealistic.  This
> > >    means there will be a period of unresponsiveness when a failover
> > >    occurs.
> > >
> > >    - Failure #3 is that once the master recognizes that it no longer is
> > >    the master, it needs to abort all pending writes (in a way that
> > >    guarantees that the data files will not be corrupted if NFS returns
> > >    when some have been cancelled and others have not).
> >
> > I don't know if this has been called out already, but it will be
> > important for users to coordinate how their NFS is configured.  For
> > example, activating asynchronous writes on the NFS side would
> > probably make a mess of any assumptions we can make from the client
> > side.  I also wonder how buffering might affect how a heartbeat
> > would have to work, e.g. whether mtime might not get propagated until
> > enough data has been written to cause a flush to disk.
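> >
> > Purely to illustrate the buffering concern: a heartbeat writer might try
> > to force each write through to the server rather than relying on the
> > client-side cache (hypothetical Java sketch; whether force() is actually
> > sufficient still depends on how the NFS mount and export are configured):
> >
> >     import java.io.RandomAccessFile;
> >     import java.nio.ByteBuffer;
> >     import java.nio.channels.FileChannel;
> >     import java.nio.charset.StandardCharsets;
> >
> >     class ForcedHeartbeatWrite {
> >         static void writeAndFlush(String path, String token) throws Exception {
> >             try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
> >                  FileChannel ch = raf.getChannel()) {
> >                 ch.truncate(0);
> >                 ch.write(ByteBuffer.wrap(token.getBytes(StandardCharsets.UTF_8)));
> >                 // Ask for data and metadata (including mtime) to be flushed
> >                 ch.force(true);
> >             }
> >         }
> >     }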
> >
> > > I've got a proposed solution that I think will address all of these
> > > failures, but hopefully others will see ways to improve it. Here's what
> > > I've got in mind:
> > >
> > >    1. It is no longer sufficient to hold an NFS lock on the DB lock
> > >    file.  In order to maintain master status, you must successfully
> > >    write to the DB lock file within some time period. If you fail to do
> > >    so within that time period, you must close all use of the DB files
> > >    and relinquish master status to another broker.
> > >
> > >    2. When the master shuts down during normal NFS circumstances, it
> > >    will delete the DB lock file.  If at any point a slave broker sees
> > >    that there is no DB lock file or that the DB lock file is so stale
> > >    that the master must have shut down (more on that later), it may
> > >    immediately attempt to create one and begin writing to it.  If that
> > >    write succeeds, it is the master.
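> > >
> > >    A rough sketch of the keep-alive rule in 1 and 2 might look like the
> > >    following (hypothetical names; the period and timeout values are
> > >    placeholders, and the single scheduler thread glosses over the
> > >    separate read thread described later):
> > >
> > >        import java.util.concurrent.Executors;
> > >        import java.util.concurrent.ScheduledExecutorService;
> > >        import java.util.concurrent.TimeUnit;
> > >
> > >        class MasterKeepAlive {
> > >            private static final long WRITE_PERIOD_MS = 5_000;
> > >            private static final long MAX_SILENCE_MS = 15_000;
> > >
> > >            private final ScheduledExecutorService scheduler =
> > >                Executors.newSingleThreadScheduledExecutor();
> > >            private volatile long lastGoodWrite = System.currentTimeMillis();
> > >
> > >            void start(Runnable writeLockFile, Runnable relinquishMaster) {
> > >                scheduler.scheduleAtFixedRate(() -> {
> > >                    try {
> > >                        writeLockFile.run();    // periodic write to the DB lock file
> > >                        lastGoodWrite = System.currentTimeMillis();
> > >                    } catch (RuntimeException e) {
> > >                        // write failed; fall through to the staleness check
> > >                    }
> > >                    if (System.currentTimeMillis() - lastGoodWrite > MAX_SILENCE_MS) {
> > >                        relinquishMaster.run(); // close DB files, give up master status
> > >                        scheduler.shutdown();
> > >                    }
> > >                }, 0, WRITE_PERIOD_MS, TimeUnit.MILLISECONDS);
> > >            }
> > >        }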
> >
> > Random curiosity, is this a network lock manager (NLM) based lock?
> >
> > >    3. All brokers should determine whether the current master is still
> > >    alive by checking the current content of the DB lock file against
> > >    the content read the last time you checked, rather than simply
> > >    locking the file and assuming that tells you who's got ownership.
> > >    This means that the current master needs to update some content in
> > >    the DB lock file to make it unique on each write; I propose that the
> > >    content of the file be the broker's host, the broker's PID, the
> > >    current local time on the broker at the time it did its write, and a
> > >    UUID that will guarantee uniqueness of the content from write to
> > >    write even in the face of time issues.  Note that only the UUID is
> > >    actually required for this algorithm to work, but I think that having
> > >    the other information will make it easier to troubleshoot.
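> > >
> > >    Just to make that payload concrete, one possible shape for it
> > >    (illustrative only; the field order and separator are arbitrary, and
> > >    the RuntimeMXBean name is a common but not guaranteed way to get the
> > >    PID):
> > >
> > >        import java.lang.management.ManagementFactory;
> > >        import java.net.InetAddress;
> > >        import java.time.Instant;
> > >        import java.util.UUID;
> > >
> > >        class LockFileToken {
> > >            static String next() throws Exception {
> > >                String host = InetAddress.getLocalHost().getHostName();
> > >                // Typically "pid@hostname" on HotSpot JVMs
> > >                String pid = ManagementFactory.getRuntimeMXBean().getName();
> > >                return String.join("|", host, pid,
> > >                        Instant.now().toString(),
> > >                        UUID.randomUUID().toString());
> > >            }
> > >        }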
> >
> > That sounds good.
> >
> > >    4. Because time can drift between machines, it is not sufficient to
> > >    compare the write date on the DB lock file with your host's current
> > >    time when determining that a file is stale; you must successfully
> > >    read the file repeatedly over a time period and receive the same
> > >    value each time in order to decide that the DB lock file is stale.
> > >
> > >    5. The master should use the same approach as the slaves to
> > >    determine if it's still in control, by checking for changes to the
> > >    content of the DB lock file. This means the master needs to
> > >    positively confirm that each periodic write to the DB lock file
> > >    succeeded by reading it back (in a separate thread, using a timeout
> > >    on the read operation to identify situations where NFS doesn't
> > >    respond), rather than simply assuming that its call to write() worked
> > >    successfully.
> > >
> > >    6. When a slave determines that the master has failed to write to
> > >    the DB lock file for longer than the timeout, it attempts to acquire
> > >    the write lock on the DB lock file and write to it to become the
> > >    master.  If it succeeds (because the master has lost NFS connectivity
> > >    but the slave has not), it is the master.  If it fails because
> > >    another slave acquired the lock first or because the slave has lost
> > >    NFS connectivity, it goes back to monitoring for the master to fail
> > >    to write to the DB lock file.
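> > >
> > >    And a rough sketch of the takeover attempt just described in 6
> > >    (hypothetical names; it assumes the same kind of java.nio file lock
> > >    the store uses today, and cleanup on the error paths is omitted for
> > >    brevity):
> > >
> > >        import java.io.RandomAccessFile;
> > >        import java.nio.ByteBuffer;
> > >        import java.nio.channels.FileChannel;
> > >        import java.nio.channels.FileLock;
> > >        import java.nio.charset.StandardCharsets;
> > >
> > >        class TakeoverAttempt {
> > >            /** Returns the held lock if this broker should act as master, else null. */
> > >            static FileLock tryBecomeMaster(String lockFilePath, String myToken) {
> > >                try {
> > >                    FileChannel ch = new RandomAccessFile(lockFilePath, "rw").getChannel();
> > >                    FileLock lock = ch.tryLock();
> > >                    if (lock == null) {
> > >                        ch.close();
> > >                        return null;   // another slave acquired the lock first
> > >                    }
> > >                    ch.truncate(0);
> > >                    ch.write(ByteBuffer.wrap(myToken.getBytes(StandardCharsets.UTF_8)));
> > >                    ch.force(true);
> > >                    return lock;       // caller keeps the channel open and starts heartbeating
> > >                } catch (Exception e) {
> > >                    return null;       // e.g. this slave has lost NFS; go back to monitoring
> > >                }
> > >            }
> > >        }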
> >
> >
> > Would the master re-read the data file again just before the next
> > write as a guard against some other process somehow snagging the
> > lock out from under it?
> >
> > Jim
> >
>
