Re: NFS file unlocking problem

2003-03-20 Thread Steve Sizemore
On Wed, Mar 19, 2003 at 05:15:02PM -0800, Andrew P. Lentvorski, Jr. wrote:
 Steve,
 
 I actually managed to pull down the dump.  It doesn't have any lock 
 requests in it.  It looks like it is hanging in the rpcinfo call.
 
 If you really want to debug this, it's going to take a chunk of work.
 
 1) set up two brand new machines
 FreeBSD-current for the server
 Solaris whatever for the client
 

Well, I think I found the problem. I had just installed the new
FreeBSD machine, not looking forward to finding another Sun to
install, when I had the idea to try different nfs protocols. By
default, it looks like the NFS mount is version 3 tcp. I specified
udp, and both of the test programs (mine and Terry's) work consistently.
I've reenabled locking on the Xinet software, and we'll see (tomorrow)
whether or not that also works.

If I'm right, that means that there is a problem with nfs over tcp
with a Solaris client and FreeBSD-5 server. All other combinations
of client/server pairs worked with the default (which I assume is
version 3 tcp).

BTW, Terry's testlock program, when run with the problem
configuration, would always return the "There is nothing that would
block your lock" message, but it could take anywhere from  1 second 
to 45 minutes. I'm not so confident that my perl program would always
succeed, because I was never willing to wait longer than overnight
before concluding that it was hung.

I'll report back on the status of xinet tomorrow, but thanks to all of
you for your help and suggestions. If there's any more information
I can provide, or testing that I can do, to help fix the
NFS/tcp/Solaris problem, let me know.

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley

To Unsubscribe: send mail to [EMAIL PROTECTED]
with unsubscribe freebsd-current in the body of the message


Re: NFS file unlocking problem

2003-03-18 Thread Andrew P. Lentvorski, Jr.
On Mon, 17 Mar 2003, Terry Lambert wrote:

 Actually, given this, I don't understand how FreeBSD server side
 proxy locking can actually work at all; it would incorrectly
 coalesce locks with local locks when the l_pid matched, which
 would be *all* locks in the lockd, and then incorrectly release
 them when a local process exited, or any process on any remote
 system unlocked an overlapping range (possibly in error).

Please don't speculate without having reviewed the code.  It works because
I rewrote rpc.lockd so that it does the required housekeeping itself.  
The FreeBSD lockd is the only open-source locking daemon that actually
passes the Connectathon interoperability tests.

When rpc.lockd gets a lock request, it immediately attempts to lock the
*entire* underlying file on the NFS server (or fails).  It then keeps
internal track of the partial lock requests so that they *don't* coalesce
(including l_pid from the target system).  rpc.lockd only releases the
underlying file when all NFS locks go away.
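A minimal sketch of the bookkeeping described above, in Python. This is not the rpc.lockd source: the class name, the method names, and the (host, pid) owner tuple are all hypothetical. It also assumes every lock is exclusive and does not model a zero length meaning "to end of file". It only shows the shape of the idea: take a whole-file lock on the first request, keep the partial ranges in a private table so they never coalesce, and release the file when the table is empty.

```python
import fcntl
import os
import tempfile


class WholeFileLockTable:
    """Toy model: one whole-file lock per file, partial locks tracked in user space."""

    def __init__(self):
        # path -> (open file descriptor, list of (owner, start, length) records)
        self._files = {}

    def lock(self, path, owner, start, length):
        """Record a partial lock for owner; take the whole-file lock on first use."""
        if path not in self._files:
            fd = os.open(path, os.O_RDWR)
            try:
                # Non-blocking whole-file lock: either we get it or we fail now.
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                os.close(fd)
                return False
            self._files[path] = (fd, [])
        _, ranges = self._files[path]
        # Refuse a range that overlaps one held by a different owner.
        for other, o_start, o_len in ranges:
            overlaps = start < o_start + o_len and o_start < start + length
            if other != owner and overlaps:
                return False
        ranges.append((owner, start, length))  # kept separate, never coalesced
        return True

    def unlock(self, path, owner, start, length):
        """Drop one record; release the whole-file lock when none remain."""
        fd, ranges = self._files[path]
        ranges.remove((owner, start, length))
        if not ranges:
            fcntl.lockf(fd, fcntl.LOCK_UN)
            os.close(fd)
            del self._files[path]

    def holds_file(self, path):
        """True while the whole-file lock for path is still held."""
        return path in self._files
```

Because the owner is a (host, pid) pair, two clients that happen to use the same pid are still kept apart, which is the point of tracking l_pid "from the target system".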

Yes, it does create extra contention if someone tries to do partial file
locking via both NFS and the local filesystem simultaneously.  I deemed
this an acceptable compromise in order to cover up for the problems in the 
FreeBSD locking space since most people who truly need locking generally 
have dedicated file servers.

-a








Re: NFS file unlocking problem

2003-03-18 Thread Steve Sizemore
On Mon, Mar 17, 2003 at 11:36:58PM -0800, Terry Lambert wrote:
 Steve Sizemore wrote:
  useful. As it is, it's still interesting. I have no way of judging the
  quality of the code in question, other than the empirical result that
  it works in most cases.
 
 Well, then you are stuck with the code you have that someone else
 wrote.  Hopefully that's not your problem, or you are in trouble.
 8-).

Actually, maybe not, since it's a commercial program. If I could
demonstrate that it's their problem, I could put pressure on them to
fix it. However, at this point, I don't think that's the case.

 OK, then it isn't an intra-program deadlock, which is something.
 
 It could still be inter-program, but if it is, it's not going to
 be easy to find; you will need to find someone who *is* a programmer.
 FWIW, this happens when:
 
   Program 1                        Program 2
   LOCK A
                                    LOCK B
   LOCK B (Waiting for Program 2)
                                    LOCK A (Waiting for Program 1 waiting for me)

I don't see how it could be inter-program, since I've gone to great
lengths to simplify it to a single program failing on a brand new file.

   On the other hand, this is clearly a deadlock that requires an
   existing, conflicting lock -- IFF you are correct about the
   delayed locking behaviour.
  
  Not sure I understand this.
 
 If someone didn't already have it locked, your lock which waits for
 the region to be able to lock it would not need to wait: it would
 just give you the lock, and you wouldn't have the problem.

Oh, so that's what that meant. :-) But (see above) it's pretty clear
to me that nothing else could have it locked.

 
 You need to find out why it's waiting.  If it's waiting, it's
 waiting for somebody.  You need to know who that somebody is.

 Once you know that, you can go hit them over the head with a
 large baseball bat.  8-).
 
Yes. But that somebody is undoubtedly not a real person.

 I have attached the program to run on your Solaris box.  You
 may have to look in /usr/include/sys/fcntl.h to see the right
 name, if it complains about l_rsysid (might be l_sysid, or whatever).
 

 
 I'm attaching a test program to run on the server when the
 lock fails, using information from the trace to know the name
 of the file to enter, and the ethereal decoded packet trace to
 know how to answer the other questions.

I'll try it today.

 But I think it may be as simple as you not telling us that you
 have multiple IP addresses configured on one of your machines?

No, but this might be an important clue. The FreeBSD host has multiple
(2) A Records in the DNS. In fact, I think that when it last worked,
it had only a single A Record. Also, I notice that there are two
rpc.lockd processes running on the FreeBSD server. I hadn't noticed
that before it started failing, but I didn't mention it, since
rpc.lockd does get invoked twice in rc.network. However, rpc.statd
also gets called twice, and there's only a single version of it
running...

root     399  0.0  0.1 263496 1000  ??  Is    9:11AM   0:00.00 /usr/sbin/rpc.sta
root     402  0.0  0.1   1512 1156  ??  Ss    9:11AM   0:00.00 /usr/sbin/rpc.loc
daemon   405  0.0  0.1   1484 1176  ??  I     9:11AM   0:00.00 /usr/sbin/rpc.loc


Does that indicate a problem?

 
 If so, try:
 
   sysctl -w net.inet.ip.check_interface=0

What does this do, just turn off checking? Can I do this on the
running system, or do I need to put it into sysctl.conf and reboot?
(BTW, from the man page -
  The -w option has been deprecated and is silently ignored.)

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley



Re: NFS file unlocking problem

2003-03-18 Thread Terry Lambert
Andrew P. Lentvorski, Jr. wrote:
 On Mon, 17 Mar 2003, Terry Lambert wrote:
 Please don't speculate without having reviewed the code.  It works because
 I rewrote rpc.lockd so that it does the required housekeeping itself.
 The FreeBSD lockd is the only open-source locking daemon that actually
 passes the Connectathon interoperability tests.

I already knew this was the answer, actually.


 When rpc.lockd gets a lock request, it immediately attempts to lock the
 *entire* underlying file on the NFS server (or fails).  It then keeps
 internal track of the partial lock requests so that they *don't* coalesce
 (including l_pid from the target system).  rpc.lockd only releases the
 underlying file when all NFS locks go away.
 
 Yes, it does create extra contention if someone tries to do partial file
 locking via both NFS and the local filesystem simultaneously.  I deemed
 this an acceptable compromise in order to cover up for the problems in the
 FreeBSD locking space since most people who truly need locking generally
 have dedicated file servers.

So then the first question that should have been asked is "are there
local server clients?".

If there are local server clients, then the problem is obviously
there.


Out of curiosity, why didn't you use my F_RSETLK/F_RGETLK and l_rsysid
and delayed coalescing patches for lockmgr, so that you could support
host lock coherency?  Was the issue F_RSETLKW requiring a blocking
context?  It's fairly easy to make this async-but-FIFO-ordered.

If you want, I can update them (the old ones are currently available
on http://www.freebsd.org/~terry ).  They do not break binary
compatibility.  I'm even willing to do the F_RSETLKW queue insertion
management for the async part...

-- Terry



Re: NFS file unlocking problem

2003-03-18 Thread Terry Lambert
Steve Sizemore wrote:
 I don't see how it could be inter-program, since I've gone to great
 lengths to simplify it to a single program failing on a brand new file.

Is the file ever open by a program on the NFS server itself?

If so, this can cause the behaviour you are seeing (if you are
interested in the technical reasons, there's a different posting).


 Oh, so that's what that meant. :-) But (see above) it's pretty clear
 to me that nothing else could have it locked.

Then you aren't getting the error.  8-) 8-) 8-).


  Once you know that, you can go hit them over the head with a
  large baseball bat.  8-).
 
 Yes. But that somebody is undoubtedly not a real person.

kill -9 them, then.


  But I think it may be as simple as you not telling us that you
  have multiple IP addresses configured on one of your machines?
 
 No, but this might be an important clue. The FreeBSD host has multiple
 (2) A Records in the DNS. In fact, I think that when it last worked,
 it had only a single A Record.

Well, try undoing that change.  I don't think that's it, though,
but it gives you a lever to pull.


 Also, I notice that there are two
 rpc.lockd processes running on the FreeBSD server. I hadn't noticed
 that before it started failing, but I didn't mention it, since
 rpc.lockd does get invoked twice in rc.network. However, rpc.statd
 also gets called twice, and there's only a single version of it
 running...

Not the problem, I think.

  If so, try:
 
sysctl -w net.inet.ip.check_interface=0
 
 What does this do, just turn off checking? Can I do this on the
 running system, or do I need to put it into sysctl.conf and reboot?
 (BTW, from the man page -
   The -w option has been deprecated and is silently ignored.)

Use it on a running system.  Ignore the warning.

-- Terry



Re: NFS file unlocking problem

2003-03-18 Thread Andrew P. Lentvorski, Jr.
On Tue, 18 Mar 2003, Steve Sizemore wrote:

 root     399  0.0  0.1 263496 1000  ??  Is    9:11AM   0:00.00 /usr/sbin/rpc.sta
 root     402  0.0  0.1   1512 1156  ??  Ss    9:11AM   0:00.00 /usr/sbin/rpc.loc
 daemon   405  0.0  0.1   1484 1176  ??  I     9:11AM   0:00.00 /usr/sbin/rpc.loc

This might be the culprit.  The way that rpc.lockd works is that it grabs a 
lock on the *entire* underlying file when any lock request comes in.  If 
those requests get misrouted to the wrong daemon, it is likely to cause 
havoc.

-a




Re: NFS file unlocking problem

2003-03-18 Thread Terry Lambert
Andrew P. Lentvorski, Jr. wrote:
 On Tue, 18 Mar 2003, Steve Sizemore wrote:
 
  root     399  0.0  0.1 263496 1000  ??  Is    9:11AM   0:00.00 /usr/sbin/rpc.sta
  root     402  0.0  0.1   1512 1156  ??  Ss    9:11AM   0:00.00 /usr/sbin/rpc.loc
  daemon   405  0.0  0.1   1484 1176  ??  I     9:11AM   0:00.00 /usr/sbin/rpc.loc
 
 This might be the culprit.  The way that rpc.lockd works is that it grabs a
 lock on the *entire* underlying file when any lock request comes in.  If
 those requests get misrouted to the wrong daemon, it is likely to cause
 havoc.

I thought this was a fork on purpose, and the second one was there
for LOCK vs. LOCKW requests?

If not, how the heck is he starting this in the first place?!?

-- Terry



Re: NFS file unlocking problem

2003-03-18 Thread Steve Sizemore
On Tue, Mar 18, 2003 at 05:25:36PM -0800, Andrew P. Lentvorski, Jr. wrote:
 On Tue, 18 Mar 2003, Steve Sizemore wrote:
 
  root     399  0.0  0.1 263496 1000  ??  Is    9:11AM   0:00.00 /usr/sbin/rpc.sta
  root     402  0.0  0.1   1512 1156  ??  Ss    9:11AM   0:00.00 /usr/sbin/rpc.loc
  daemon   405  0.0  0.1   1484 1176  ??  I     9:11AM   0:00.00 /usr/sbin/rpc.loc
 
 This might be the culprit.  The way that rpc.lockd works is that it grabs a 
 lock on the *entire* underlying file when any lock request comes in.  If 
 those requests get misrouted to the wrong daemon, it is likely to cause 
 havoc.
 

OK. It appears that starting rpc.lockd automatically spawns two
copies, so if that's a problem, how can I fix it?

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley



Re: NFS file unlocking problem

2003-03-17 Thread Andrew P. Lentvorski, Jr.
On Sun, 16 Mar 2003, Steve Sizemore wrote:

 Sorry - I was trying to be too helpful. I actually did capture the raw
 dump but appended the decoded output. This time, I've attached a
 real raw dump.

The dump doesn't seem to be attached.  However, I note that the request
being sent is SETLKW which is a blocking wait until lock is granted.  If
the server thinks the file is already locked, it will hang *and* that is 
the proper behavior.

What is the result of running this locally on the NFS server and 
attempting to lock the underlying file?  If rpc.lockd is hanging onto a 
lock, running that perl script locally on the actual file (not an NFS 
mounted image of it) should also hang.

As a side note, you probably want to create a C executable to do this kind
of fcntl fiddling when attempting to test NFS.  That way you can use a
locally mounted binary and you won't wind up with all of the Perl access
calls on the NFS wire.  Or, at least, use a local copy of Perl.
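As a rough illustration of such a stand-alone test (written in Python here rather than C, so the interpreter itself should live on local disk), the sketch below takes a blocking whole-file lock, releases it, and reports how long that took. The function name and the timeout parameter are hypothetical. It relies on fcntl.lockf issuing a blocking F_SETLKW request when LOCK_NB is not set, and on SIGALRM to bail out of a hung request; the alarm only works when called from the main thread.

```python
import fcntl
import os
import signal
import tempfile
import time


class LockTimeout(Exception):
    """Raised when the blocking lock request is not granted in time."""


def _on_alarm(signum, frame):
    # Raising here aborts the blocked lockf() call.
    raise LockTimeout()


def lock_unlock(path, timeout_s=30):
    """Lock the whole file (blocking), unlock it, and return elapsed seconds."""
    fd = os.open(path, os.O_RDWR)
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_s)  # interrupt the wait instead of hanging forever
    started = time.monotonic()
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocking request, no LOCK_NB
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return time.monotonic() - started
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
        os.close(fd)
```

Run once against a file on a local filesystem and once against the same file through the NFS mount; a LockTimeout on the NFS path only would point at the lock request rather than at the test program.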

-a





Re: NFS file unlocking problem

2003-03-17 Thread Steve Sizemore
On Mon, Mar 17, 2003 at 01:21:19PM -0800, Andrew P. Lentvorski, Jr. wrote:
 On Sun, 16 Mar 2003, Steve Sizemore wrote:
 
 The dump doesn't seem to be attached.  However, I note that the request

It appears that there are problems sending the raw dump. I've tried
twice - once 2 minutes after I sent the original message, and once
again when I got this from you. Neither has shown up on the list.
I can find another way to make it available if you need to see it.

 being sent is SETLKW which is a blocking wait until lock is granted.  If
 the server thinks the file is already locked, it will hang *and* that is 
 the proper behavior.
 
 What is the result of running this locally on the NFS server and 
 attempting to lock the underlying file?  If rpc.lockd is hanging onto a 
 lock, running that perl script locally on the actual file (not an NFS 
 mounted image of it) should also hang.

It seems to work as expected (at least as I expect) on the server. If
no other process has a lock, then the program locks the file, unlocks
it, and exits immediately. If the remote client is trying to
lock/unlock the file, then running the same program on the server also
hangs.

One other twist - recently, the behavior is less predictable. A couple
of times in the last 24 hours, the lock/unlock on the client has
actually worked as it should. The first time it happened, I was so
surprised, that I thought I must have locked a local file rather than
an NFS mounted file. On other occasions, the program has succeeded
after very long hangs, e.g.

% time plock xxx
Locking xxx
Unlocking xxx

Done
0.21u 0.05s 55:35.33 0.0%

This makes me wonder whether waiting indefinitely would succeed in all
cases. (Note, however, that I've frequently waited more than an hour
before killing the process or giving up.)

 As a side note, you probably want to create a C executable to do this kind
 of fcntl fiddling when attempting to test NFS.  That way you can use a
 locally mounted binary and you won't wind up with all of the Perl access
 calls on the NFS wire.  Or, at least, use a local copy of Perl.

If I trusted my C skills as much as I trust my perl skills, I would do
that. The perl stuff is all mounted locally, so there shouldn't be any
perl nfs traffic on the wire.

Let me know if you still need to see the dump.

Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley



Re: NFS file unlocking problem

2003-03-17 Thread Terry Lambert
Andrew P. Lentvorski, Jr. wrote:
 The dump doesn't seem to be attached.  However, I note that the request
 being sent is SETLKW which is a blocking wait until lock is granted.  If
 the server thinks the file is already locked, it will hang *and* that is
 the proper behavior.

It is, to ensure FIFO ordering of request grants.  You could also
implement this as a retry.

If you do it the first way, you end up potentially deadlocking the
server when a single client has badly behaved code that locks against
itself.  If you do it the second way, you end up with timing dependent
starvation deadlocks for individual client processes.  Note that the
first deadlock is normal -- it would happen if the file were local, as
well... no help for badly written code -- but I mention it as important
because we are talking about blocking multiple clients.

I don't know what the process is, but a threaded process can cause
a deadlock when it should be a grant/upgrade/downgrade of an existing
lock overlap.  This is because there is no such thing as a thread ID
in the NFS protocol, and if process IDs are different for different
threads, and the requests come from the same system ID, then you can
get a deadlock when none should be present.  To avoid this, either
manage all locks in an apartment or rental model (queue all requests
to a single thread, and have it do the locking by proxy) OR make sure
that all requests from any thread in a given process in fact are given
the same proxy process ID on the wire.

[ ... This last is not likely your problem, but I mention it, in case
  you are using rfork() or Linux threads ... ]
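A sketch of the first option ("queue all requests to a single thread, and have it do the locking by proxy"), in Python. The LockProxy class and its methods are hypothetical names, and this is only meant to show the queueing pattern, not a production design: every thread hands its request to one worker, so the operating system only ever sees lock calls from that one thread.

```python
import fcntl
import os
import queue
import tempfile
import threading
from concurrent.futures import Future


class LockProxy:
    """Funnel every lock request in the process through one worker thread."""

    def __init__(self):
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                return
            fd, op, start, length, done = item
            try:
                # lockf(fd, cmd, len, start, whence)
                fcntl.lockf(fd, op, length, start, os.SEEK_SET)
                done.set_result(True)
            except OSError as exc:
                done.set_exception(exc)

    def request(self, fd, op, start=0, length=0):
        """Queue a lock or unlock; the returned Future resolves when it is done."""
        done = Future()
        self._requests.put((fd, op, start, length, done))
        return done

    def close(self):
        self._requests.put(None)
        self._worker.join()
```

Requests are served strictly in FIFO order, so one blocking request stalls everything queued behind it; that is the price of having a single requester.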


 What is the result of running this locally on the NFS server and
 attempting to lock the underlying file?  If rpc.lockd is hanging onto a
 lock, running that perl script locally on the actual file (not an NFS
 mounted image of it) should also hang.

That was my next question, as well: does it happen on a local FS
as well as an NFS FS?  Personally, I would *NOT* recommend running
it on the server, but mount a local FS on the client instead; the
less variables, the better.

On the other hand, this is clearly a deadlock that requires an
existing, conflicting lock -- IFF you are correct about the
delayed locking behaviour.


 As a side note, you probably want to create a C executable to do this kind
 of fcntl fiddling when attempting to test NFS.  That way you can use a
 locally mounted binary and you won't wind up with all of the Perl access
 calls on the NFS wire.  Or, at least, use a local copy of Perl.

I recommend a pared down test case.  I suspect that the problem is
that something that is expected to have the same ID is locking
against itself.

Does the failure occur with the same values in all cases in the
F_RSETLKW?  If so, I suggest you capture *all* locking packets on
your wire, and then find who is conflicting.  This may be a simple
lock order reversal (deadly embrace deadlock) due to poor application
performance.  You may also find that you have multiple process IDs,
when it should be a single process ID, for the proxy PID for the
conflicting request.  At worst, it would be nice to know the system
that caused it.

Actually, for a lock you know is there, you *can* diagnose the
problem (somewhat) by writing a program on the server, and using
F_GETLK on the range for the hanging lock on the server -- this
will return a struct flock, which will give you range and PID
information.  Do it on the Solaris box, though.

The reason you want to do this on the Solaris box is that the
struct flock on FreeBSD fails to include the l_rsysid -- the
remote system ID.
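For readers unfamiliar with F_GETLK, here is a small sketch of the query in Python. It is not the program attached to this thread. It ASSUMES the struct flock layout of 64-bit Linux (l_type, l_whence, l_start, l_len, l_pid), which has no remote system ID field either; on Solaris or FreeBSD the field order differs and the format string would have to be changed. The function name is hypothetical.

```python
import fcntl
import os
import struct
import tempfile

# ASSUMPTION: struct flock as laid out on 64-bit Linux:
# short l_type, short l_whence, off_t l_start, off_t l_len, pid_t l_pid.
FLOCK_FMT = "hhqqi4x"


def who_blocks(fd, start=0, length=0):
    """Ask which existing lock would block a write lock on the given range.

    length=0 means "to end of file". Returns None when nothing conflicts.
    """
    probe = struct.pack(FLOCK_FMT, fcntl.F_WRLCK, os.SEEK_SET, start, length, 0)
    reply = fcntl.fcntl(fd, fcntl.F_GETLK, probe)
    l_type, _whence, l_start, l_len, l_pid = struct.unpack(FLOCK_FMT, reply)
    if l_type == fcntl.F_UNLCK:
        return None  # nothing would block the request
    return {"type": l_type, "start": l_start, "len": l_len, "pid": l_pid}
```

Note that F_GETLK never reports locks held by the calling process itself, so the query has to come from a different process than the one that is hanging.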

Actually, given this, I don't understand how FreeBSD server side
proxy locking can actually work at all; it would incorrectly
coalesce locks with local locks when the l_pid matched, which
would be *all* locks in the lockd, and then incorrectly release
them when a local process exited, or any process on any remote
system unlocked an overlapping range (possibly in error).

You are using FreeBSD as the NFS client in this case, right?  If
so, that's probably not an issue for you...

-- Terry



Re: NFS file unlocking problem

2003-03-17 Thread Steve Sizemore
Hi, Terry -

On Mon, Mar 17, 2003 at 07:02:31PM -0800, Terry Lambert wrote:
 Andrew P. Lentvorski, Jr. wrote:
  being sent is SETLKW which is a blocking wait until lock is granted.  If
  the server thinks the file is already locked, it will hang *and* that is
  the proper behavior.
 
 It is, to ensure FIFO ordering of request grants.  You could also
 implement this as a retry.
 
 If you do it the first way, you end up potentially deadlocking the
 server when a single client has badly behaved code that locks against
 itself.  If you do it the second way, you end up with timing dependent
 starvation deadlocks for individual client processes.  Note that the
 first deadlock is normal -- it would happen if the file were local, as
 well... no help for badly written code -- but I mention it as important
 because we are talking about blocking multiple clients.
 
 I don't know what the process is, but a threaded process can cause
 a deadlock when it should be a grant/upgrade/downgrade of an existing
 lock overlap.  This is because there is no such thing as a thread ID
 in the NFS protocol, and if process IDs are different for different
 threads, and the requests come from the same system ID, then you can
 get a deadlock when none should be present.  To avoid this, either
 manage all locks in an apartment or rental model (queue all requests
 to a single thread, and have it do the locking by proxy) OR make sure
 that all requests from any thread in a given process in fact are given
 the same proxy process ID on the wire.
 
 [ ... This last is not likely your problem, but I mention it, in case
   you are using rfork() or Linux threads ... ]

Thanks for the explanation. If I were a programmer, it would be very
useful. As it is, it's still interesting. I have no way of judging the
quality of the code in question, other than the empirical result that
it works in most cases.

 
  What is the result of running this locally on the NFS server and
  attempting to lock the underlying file?  If rpc.lockd is hanging onto a
  lock, running that perl script locally on the actual file (not an NFS
  mounted image of it) should also hang.
 
 That was my next question, as well: does it happen on a local FS
 as well as an NFS FS?  Personally, I would *NOT* recommend running
 it on the server, but mount a local FS on the client instead; the
 less variables, the better.

Works fine on the client on a local file system. Works fine on the
server.

 On the other hand, this is clearly a deadlock that requires an
 existing, conflicting lock -- IFF you are correct about the
 delayed locking behaviour.

Not sure I understand this.

 
  As a side note, you probably want to create a C executable to do this kind
  of fcntl fiddling when attempting to test NFS.  That way you can use a
  locally mounted binary and you won't wind up with all of the Perl access
  calls on the NFS wire.  Or, at least, use a local copy of Perl.
 
 I recommend a pared down test case.  I suspect that the problem is
 that something that is expected to have the same ID is locking
 against itself.

I can't pare it down any further using perl. If someone better at C
than I am gives me a sample C program, I'll be happy to try it.

 Does the failure occur with the same values in all cases in the
 F_RSETLKW?  If so, I suggest you capture *all* locking packets on
 your wire, and then find who is conflicting.  This may be a simple
 lock order reversal (deadly embrace deadlock) due to poor application
 performance.  You may also find that you have multiple process IDs,
 when it should be a single process ID, for the proxy PID for the
 conflicting request.  At worst, it would be nice to know the system
 that caused it.

 Actually, for a lock you know is there, you *can* diagnose the
 problem (somewhat) by writing a program on the server, and using
 F_GETLK on the range for the hanging lock on the server -- this
 will return a struct flock, which will give you range and PID
 information.  Do it on the Solaris box, though.
 
 The reason you want to do this on the Solaris box is that the
 struct flock on FreeBSD fails to include the l_rsysid -- the
 remote system ID.

Sorry, but I don't understand any of that.
 
 Actually, given this, I don't understand how FreeBSD server side
 proxy locking can actually work at all; it would incorrectly
 coalesce locks with local locks when the l_pid matched, which
 would be *all* locks in the lockd, and then incorrectly release
 them when a local process exited, or any process on any remote
 system unlocked an overlapping range (possibly in error).

So you're suggesting that when it works, it's just lucky? But others
have said that it works for them, and it seems to work OK between
FreeBSD systems.


 You are using FreeBSD as the NFS client in this case, right?  If
 so, that's probably not an issue for you...

No.

I think that you may be trying to solve a problem I don't have.
First - I'm not a programmer. I'm not trying to write any 

Re: NFS file unlocking problem

2003-03-17 Thread Terry Lambert
Steve Sizemore wrote:
 Thanks for the explanation. If I were a programmer, it would be very
 useful. As it is, it's still interesting. I have no way of judging the
 quality of the code in question, other than the empirical result that
 it works in most cases.

Well, then you are stuck with the code you have that someone else
wrote.  Hopefully that's not your problem, or you are in trouble.
8-).


   What is the result of running this locally on the NFS server and
   attempting to lock the underlying file?  If rpc.lockd is hanging onto a
   lock, running that perl script locally on the actual file (not an NFS
   mounted image of it) should also hang.
 
  That was my next question, as well: does it happen on a local FS
  as well as an NFS FS?  Personally, I would *NOT* recommend running
  it on the server, but mount a local FS on the client instead; the
  less variables, the better.
 
 Works fine on the client on a local file system. Works fine on the
 server.

OK, then it isn't an intra-program deadlock, which is something.

It could still be inter-program, but if it is, it's not going to
be easy to find; you will need to find someone who *is* a programmer.
FWIW, this happens when:

Program 1                        Program 2
LOCK A
                                 LOCK B
LOCK B (Waiting for Program 2)
                                 LOCK A (Waiting for Program 1 waiting for me)
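The same situation can be written down as a "wait-for" graph: each blocked program points at the program holding the lock it wants, and a deadlock is a cycle in that graph. A small Python sketch, assuming each waiter is blocked on exactly one other party (the function name is hypothetical):

```python
def find_deadlock(waits_for):
    """Return a cycle in a wait-for graph as a list of nodes, or None.

    waits_for maps each waiter to the single party it is blocked on.
    """
    for start in waits_for:
        seen = []
        node = start
        # Follow the chain of "blocked on" links until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None
```

For the sequence above the graph is {"Program 1": "Program 2", "Program 2": "Program 1"}, which contains a cycle through both programs; if only Program 1 were waiting, there would be none.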

  On the other hand, this is clearly a deadlock that requires an
  existing, conflicting lock -- IFF you are correct about the
  delayed locking behaviour.
 
 Not sure I understand this.

If someone didn't already have it locked, your lock which waits for
the region to be able to lock it would not need to wait: it would
just give you the lock, and you wouldn't have the problem.


  Does the failure occur with the same values in all cases in the
  F_RSETLKW?  If so, I suggest you capture *all* locking packets on
  your wire, and then find who is conflicting.  This may be a simple
  lock order reversal (deadly embrace deadlock) due to poor application
  performance.  You may also find that you have multiple process IDs,
  when it should be a single process ID, for the proxy PID for the
  conflicting request.  At worst, it would be nice to know the system
  that caused it.
 
  Actually, for a lock you know is there, you *can* diagnose the
  problem (somewhat) by writing a program on the server, and using
  F_GETLK on the range for the hanging lock on the server -- this
  will return a struct flock, which will give you range and PID
  information.  Do it on the Solaris box, though.
 
  The reason you want to do this on the Solaris box is that the
  struct flock on FreeBSD fails to include the l_rsysid -- the
  remote system ID.
 
 Sorry, but I don't understand any of that.

You need to find out why it's waiting.  If it's waiting, it's
waiting for somebody.  You need to know who that somebody is.

Once you know that, you can go hit them over the head with a
large baseball bat.  8-).

I have attached the program to run on your Solaris box.  You
may have to look in /usr/include/sys/fcntl.h to see the right
name, if it complains about l_rsysid (might be l_sysid, or whatever).


  Actually, given this, I don't understand how FreeBSD server side
  proxy locking can actually work at all; it would incorrectly
  coalesce locks with local locks when the l_pid matched, which
  would be *all* locks in the lockd, and then incorrectly release
  them when a local process exited, or any process on any remote
  system unlocked an overlapping range (possibly in error).
 
 So you're suggesting that when it works, it's just lucky? But others
 have said that it works for them, and it seems to work OK between
 FreeBSD systems.

I would have to look at the locking code in FreeBSD for the NFS
case.  I wrote some NFS locking code for FreeBSD in 1995 that was
not used for the implementation.

There are ways around the problem in userspace, but they're very
hard to make efficient or get correct.  They also make it very hard
to debug easily, because you can't get the system ID for systems
that have outstanding locks.  8-(.


  You are using FreeBSD as the NFS client in this case, right?  If
  so, that's probably not an issue for you...
 
 No.
 
 I think that you may be trying to solve a problem I don't have.
 First - I'm not a programmer. I'm not trying to write any program
 at all, except as necessary to diagnose this problem. I'll summarize
 the situation briefly. The issue cropped up in a commercial program
 (Xinet) which was working on Solaris 2.6 client and server. I'm
 replacing the server with a FreeBSD box (RELENG_5_0) and the program
 stopped working. Xinet tech support diagnosed it as nfs locking
 problem, which I've confirmed by my simple perl program.
 
 Client   Server   Result
 =======  =======  ======
 Solaris  Solaris  Works
 FreeBSD  Solaris  Works
 FreeBSD  

Re: NFS file unlocking problem

2003-03-16 Thread Steve Sizemore
On Sat, Mar 15, 2003 at 01:33:11AM -0600, Dan Nelson wrote:
 
 Oops.  You appended a decoded dump again.  I should have told you how
 to generate a raw tcpdump log.  Add -s 1500 -w file.pcap to the
 tcpdump command line.  You won't get any output to the screen, but the
 raw packet contents will get written to the file.  You can replay it
 with tcpdump -r, or load it into ethereal and view the packets in the
 GUI.

Sorry - I was trying to be too helpful. I actually did capture the raw
dump but appended the decoded output. This time, I've attached a
real raw dump.

 
 Runs fine on my Solaris 2.6 and 2.7 machines, so it's not a
 Solaris-FreeBSD specific problem.

I was already pretty sure that's true - I actually did have it working
for some number of days. I just don't know what broke it.

 
 Ok, it definitely dies trying to lock the file.  Check to make sure
 that rpc.lockd is still running on the FreeBSD server.  I have seen it
 coredump on a couple of my 4.5 servers and the result was what you're
 seeing here (lock attempts hang).
 

rpc.lockd and rpc.statd are still running on FreeBSD; lockd and statd
are still running on Solaris.

I hope the tcpdump gives you a clue what might be going wrong...

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley



Re: NFS file unlocking problem

2003-03-14 Thread Steve Sizemore
On Fri, Mar 14, 2003 at 09:58:56AM -0600, Dan Nelson wrote:
 In the last episode (Mar 13), Steve Sizemore said:
  Running RELENG_5_0 as nfs server with a Solaris 2.5 client. rpc.statd
  and rpc.lockd both running on FreeBSD, lockd and statd both running
  on Solaris. Locking a file (flock) works fine, but when an attempt to
  unlock it is made, the client session hangs. The program is typically
  (but not always) uninterruptible, and I have to kill the login
  session.
 
 See if you can get a dump of the packets sent when the client hangs.
 

Thanks for the reply, Dan.

I have captured packets (tcpdump) between the hosts. I'm not facile
enough with tcpdump to capture only the significant packets, so I just
got everything.

I've enclosed the dump as an attachment - physics is the FreeBSD
machine and cfpa11 is the Solaris box. Actually, in this case, the
program only locked the file, no unlock, and it hung anyway. (This
is different from before.)

Is this output helpful?

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley

13:23:10.746188 cfpa11.Berkeley.EDU.2399800991 > physics.berkeley.edu.nfs: 112 fsstat [|nfs] (DF)
13:23:10.746259 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800991: reply ok 172 fsstat [|nfs] (DF)
13:23:10.843671 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 205284758 win 8760 (DF)
13:23:11.934004 cfpa11.Berkeley.EDU.2399800992 > physics.berkeley.edu.nfs: 116 fsstat [|nfs] (DF)
13:23:11.934085 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800992: reply ok 172 fsstat [|nfs] (DF)
13:23:12.033672 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 173 win 8760 (DF)
13:23:15.801623 cfpa11.Berkeley.EDU.2399800998 > physics.berkeley.edu.nfs: 124 lookup [|nfs] (DF)
13:23:15.801707 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800998: reply ok 240 lookup [|nfs] (DF)
13:23:15.802485 cfpa11.Berkeley.EDU.2399800999 > physics.berkeley.edu.nfs: 120 lookup [|nfs] (DF)
13:23:15.802521 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800999: reply ok 240 lookup [|nfs] (DF)
13:23:15.803314 cfpa11.Berkeley.EDU.2399801000 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.803345 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801000: reply ok 116 getattr [|nfs] (DF)
13:23:15.803999 cfpa11.Berkeley.EDU.2399801001 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.804027 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801001: reply ok 116 getattr [|nfs] (DF)
13:23:15.862284 cfpa11.Berkeley.EDU.2399801029 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.862318 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801029: reply ok 116 getattr [|nfs] (DF)
13:23:15.953781 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 1001 win 8760 (DF)
13:23:16.138906 cfpa11.Berkeley.EDU.2399801083 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:16.138970 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801083: reply ok 116 getattr [|nfs] (DF)
13:23:16.139606 cfpa11.Berkeley.EDU.2399801084 > physics.berkeley.edu.nfs: 116 access [|nfs] (DF)
13:23:16.139643 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801084: reply ok 124 access c 8f0a0efd (DF)
13:23:16.141238 cfpa11.Berkeley.EDU.1021 > physics.berkeley.edu.sunrpc: P 2839020119:2839020215(96) ack 3472682147 win 8760 (DF)
13:23:16.143832 physics.berkeley.edu.sunrpc > cfpa11.Berkeley.EDU.1021: P 1:33(32) ack 96 win 65535 (DF)
13:23:16.233777 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 1241 win 8760 (DF)
13:23:16.243744 cfpa11.Berkeley.EDU.1021 > physics.berkeley.edu.sunrpc: . ack 33 win 8760 (DF)
13:23:16.919974 cfpa11.Berkeley.EDU.2399801086 > physics.berkeley.edu.nfs: 116 fsstat [|nfs] (DF)
13:23:16.920036 physics.berkeley.edu.nfs  

Re: NFS file unlocking problem

2003-03-14 Thread Andrew P. Lentvorski, Jr.
On Thu, 13 Mar 2003, Steve Sizemore wrote:

 Running RELENG_5_0 as nfs server with a Solaris 2.5 client.
 rpc.statd and rpc.lockd both running on FreeBSD, lockd and
 statd both running on Solaris. Locking a file (flock) works
 fine, but when an attempt to unlock it is made, the client
 session hangs. The program is typically (but not always)
 uninterruptible, and I have to kill the login session.

That's ... odd.  However, the Solaris rpc.lockd does some strange caching 
that can lead to asymmetric behavior.

In addition, you are running Solaris 2.5 which qualifies as practically
prehistoric in computer time.  That's going to activate some old
mechanisms which FreeBSD may or may not support.

Several areas are suspect:

1) RPC can't agree on a protocol version with Solaris 2.5
2) NFS can't agree on a protocol version with Solaris 2.5
3) The lock attempt itself is broken

The last time a hang like this happened, I believe the cause was a failure
to return the correct rejection notice during an RPC negotiation.

I recommend using ethereal to create a trace file.  This is going to be 
tough to debug as I don't have access to a Solaris 2.5 machine to test the 
interaction and see what is going on.

-a






Re: NFS file unlocking problem

2003-03-14 Thread Dan Nelson
In the last episode (Mar 13), Steve Sizemore said:
 Running RELENG_5_0 as nfs server with a Solaris 2.5 client. rpc.statd
 and rpc.lockd both running on FreeBSD, lockd and statd both running
 on Solaris. Locking a file (flock) works fine, but when an attempt to
 unlock it is made, the client session hangs. The program is typically
 (but not always) uninterruptible, and I have to kill the login
 session.

See if you can get a dump of the packets sent when the client hangs.

-- 
Dan Nelson
[EMAIL PROTECTED]



Re: NFS file unlocking problem

2003-03-14 Thread Steve Sizemore
On Fri, Mar 14, 2003 at 04:47:11PM -0600, Dan Nelson wrote:
 
 Ideally, a truss of the last 10 lines of the failing program plus the
 raw tcpdump log (run with -s 1500 so we get the whole packet) would be
 better.  The truss is so we have proof that file locks are really to
 blame :)
 

On Fri, Mar 14, 2003 at 06:14:59PM -0800, Andrew P. Lentvorski, Jr. wrote:
 On Thu, 13 Mar 2003, Steve Sizemore wrote:
 
  Running RELENG_5_0 as nfs server with a Solaris 2.5 client.
  rpc.statd and rpc.lockd both running on FreeBSD, lockd and
  statd both running on Solaris. Locking a file (flock) works
  fine, but when an attempt to unlock it is made, the client
  session hangs. The program is typically (but not always)
  uninterruptible, and I have to kill the login session.
 
 That's ... odd.  However, the Solaris rpc.lockd does some strange caching 
 that can lead to asymmetric behavior.
 
 In addition, you are running Solaris 2.5 which qualifies as practically
 prehistoric in computer time.  That's going to activate some old
 mechanisms which FreeBSD may or may not support.

OK, that was a typo - it's really 2.6. Not quite so ancient. However,
I also have a Solaris 8 machine that has the same behavior, so I've
used it to generate the requested output.

 Several areas are suspect:
 
 1) RPC can't agree on a protocol version with Solaris 2.5
 2) NFS can't agree on a protocol version with Solaris 2.5
 3) The lock attempt itself is broken
 
 The last time a hang like this happened I believe that it was an issue in 
 not returning the correct rejection notice during an RPC negotiation.
 
 I recommend using ethereal to create a trace file.  This is going to be 
 tough to debug as I don't have access to a Solaris 2.5 machine to test the 
 interaction and see what is going on.

I have installed ethereal, so I could do a trace, if you tell me what
options to use. In the meantime, I'm attaching the output of truss
(Solaris 8) and tcpdump (FreeBSD).

Note that the program now has been simplified to do only the lock,
since it's no longer necessary to unlock the file to get it to hang.
Here's the demo program -

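#!/usr/local/bin/perl -w

use strict;
use File::BasicFlock;

# File to lock, taken from the first command-line argument.
my $filename = shift;

print "Locking $filename\n";
lock($filename);    # does not return until the lock is granted
print "Done\n";
exit;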

Output files are attached.
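
For comparison, here is a minimal Python sketch of the same kind of test that gives up after a deadline rather than waiting indefinitely. It is not the program used in this thread. The function name, timeout, and poll interval are assumptions chosen for illustration, and whether a non-blocking request returns promptly when the remote lock daemon is unresponsive is not something this sketch can guarantee.

```python
import errno
import fcntl
import os
import tempfile
import time


def lock_with_deadline(path, timeout=10.0, poll=0.5):
    """Try to take an exclusive fcntl lock on path, giving up after timeout seconds.

    Returns the open file descriptor on success, or None if the lock
    could not be obtained in time.
    """
    fd = os.open(path, os.O_RDWR)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB asks lockf() to fail immediately instead of blocking.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            # EACCES/EAGAIN mean "held by someone else"; anything else is fatal.
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                os.close(fd)
                raise
        if time.monotonic() >= deadline:
            os.close(fd)
            return None
        time.sleep(poll)


if __name__ == "__main__":
    # Demonstration on a local temporary file; point it at a file on the
    # NFS mount to exercise the lock daemons instead.
    with tempfile.NamedTemporaryFile() as tmp:
        fd = lock_with_deadline(tmp.name, timeout=2.0, poll=0.1)
        print("locked" if fd is not None else "timed out")
        if fd is not None:
            fcntl.lockf(fd, fcntl.LOCK_UN)
            os.close(fd)
```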

Thanks.
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley

execve(plock, 0xFFBEFA54, 0xFFBEFA68)  argc = 4
mmap(0x, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF3A
resolvepath(/usr/lib/ld.so.1, /usr/lib/ld.so.1, 1023) = 16
open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
stat(/usr/local/lib/libsocket.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/opt/SUNWspro/lib/libsocket.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/usr/openwin/lib/libsocket.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/usr/lib/libsocket.so.1, 0xFFBEF17C) = 0
open(/usr/lib/libsocket.so.1, O_RDONLY)   = 3
fstat(3, 0xFFBEF17C)= 0
mmap(0x, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xFF39
mmap(0x, 114688, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xFF37
mmap(0xFF38A000, 4365, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 40960) = 0xFF38A000
munmap(0xFF37A000, 65536)   = 0
memcntl(0xFF37, 14496, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3)= 0
stat(/usr/local/lib/libnsl.so.1, 0xFFBEF17C)  Err#2 ENOENT
stat(/opt/SUNWspro/lib/libnsl.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/usr/openwin/lib/libnsl.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/usr/lib/libnsl.so.1, 0xFFBEF17C)= 0
open(/usr/lib/libnsl.so.1, O_RDONLY)  = 3
fstat(3, 0xFFBEF17C)= 0
mmap(0xFF39, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 

Re: NFS file unlocking problem

2003-03-14 Thread Dan Nelson
In the last episode (Mar 14), Steve Sizemore said:
  That's ... odd.  However, the Solaris rpc.lockd does some strange caching 
  that can lead to asymmetric behavior.
  
  In addition, you are running Solaris 2.5 which qualifies as practically
  prehistoric in computer time.  That's going to activate some old
  mechanisms which FreeBSD may or may not support.
 
 OK, that was a typo - it's really 2.6. Not quite so ancient. However,
 I also have a Solaris 8 machine that has the same behavior, so I've
 used it to generate the requested output.

Oops.  You appended a decoded dump again.  I should have told you how
to generate a raw tcpdump log.  Add -s 1500 -w file.pcap to the
tcpdump command line.  You won't get any output to the screen, but the
raw packet contents will get written to the file.  You can replay it
with tcpdump -r, or load it into ethereal and view the packets in the
GUI.
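
One quick way to tell a raw capture from a decoded text dump before attaching it is to look at the first four bytes of the file. The sketch below checks for the magic numbers of the classic libpcap file format; the function names are hypothetical, and it assumes the capture was written with tcpdump -w in that format.

```python
import struct

# Magic numbers of the classic libpcap file format, as they appear when the
# first four bytes of the file are read as a big-endian integer.
PCAP_MAGICS = {
    0xA1B2C3D4: "pcap, microsecond timestamps, written big-endian",
    0xD4C3B2A1: "pcap, microsecond timestamps, written little-endian",
    0xA1B23C4D: "pcap, nanosecond timestamps, written big-endian",
    0x4D3CB2A1: "pcap, nanosecond timestamps, written little-endian",
}


def classify_capture(data):
    """Return a description if data starts with a pcap magic number, else None."""
    if len(data) < 4:
        return None
    (magic,) = struct.unpack(">I", data[:4])
    return PCAP_MAGICS.get(magic)


def is_raw_capture(path):
    """Read the start of the file at path and report whether it looks like raw pcap."""
    with open(path, "rb") as fh:
        return classify_capture(fh.read(4)) is not None


# A decoded dump starts with a printable timestamp, not a magic number.
print(classify_capture(b"13:23:10.746188 cfpa11"))
print(classify_capture(bytes.fromhex("d4c3b2a1")))
```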
 
  Several areas are suspect:
  
  1) RPC can't agree on a protocol version with Solaris 2.5
  2) NFS can't agree on a protocol version with Solaris 2.5
  3) The lock attempt itself is broken

Judging by the truss, I'd say #3.

 Note that the program now has been simplified to do only the lock,
 since it's no longer necessary to unlock the file to get it to hang.
 Here's the demo program -

Runs fine on my Solaris 2.6 and 2.7 machines, so it's not a
Solaris-FreeBSD specific problem.

 open(/home/cosmology/steve/lock_file, O_RDWR)   = 3
 fstat(3, 0x000C0A2C)  = 0
 fcntl(3, F_SETFD, 0x0001) = 0
 llseek(3, 0, SEEK_CUR)= 0
 Received signal #2, SIGINT, in fcntl() [default]
 fcntl(3, F_SETLKW, 0xFFBEF790)Err#4 EINTR
   *** process killed ***

Ok, it definitely dies trying to lock the file.  Check to make sure
that rpc.lockd is still running on the FreeBSD server.  I have seen it
coredump on a couple of my 4.5 servers and the result was what you're
seeing here (lock attempts hang).
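
Besides looking for the process, it can help to see which transports the lock manager is registered on by running rpcinfo -p against the server. The sketch below parses that style of output. The SAMPLE text is invented for illustration and was not captured from any machine in this thread, and the function name is hypothetical. Note that a daemon can be registered and still be wedged, so this is only a first check.

```python
def registered_transports(rpcinfo_output, service):
    """Collect (version, protocol) pairs for service from `rpcinfo -p` style text."""
    found = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Expected columns: program, version, protocol, port, service name.
        if len(fields) == 5 and fields[0].isdigit() and fields[4] == service:
            found.add((int(fields[1]), fields[2]))
    return found


# Invented sample output (assumption, for illustration only).
SAMPLE = """\
   program vers proto   port  service
    100003    3   udp   2049  nfs
    100003    3   tcp   2049  nfs
    100021    4   udp   1011  nlockmgr
    100024    1   udp   1013  status
"""

lockd = registered_transports(SAMPLE, "nlockmgr")
print(sorted(lockd))
print("tcp registered:", any(proto == "tcp" for _, proto in lockd))
```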

-- 
Dan Nelson
[EMAIL PROTECTED]



Re: NFS file unlocking problem

2003-03-14 Thread Dan Nelson
In the last episode (Mar 14), Steve Sizemore said:
 On Fri, Mar 14, 2003 at 09:58:56AM -0600, Dan Nelson wrote:
  In the last episode (Mar 13), Steve Sizemore said:
   Running RELENG_5_0 as nfs server with a Solaris 2.5 client.
   rpc.statd and rpc.lockd both running on FreeBSD, lockd and statd
   both running on Solaris. Locking a file (flock) works fine, but
   when an attempt to unlock it is made, the client session hangs.
   The program is typically (but not always) uninterruptible, and I
   have to kill the login session.
  
  See if you can get a dump of the packets sent when the client hangs.
 
 I've enclosed the dump as an attachment - physics is the FreeBSD
 machine and cfpa11 is the Solaris box. Actually, in this case, the
 program only locked the file, no unlock, and it hung anyway. (This is
 different from before.)
 
 Is this output helpful?

Not really; tcpdump doesn't decode NFS or RPC packets well enough. 
ethereal does a much better job (but unfortunately doesn't have a good
1-line-per-packet mode so you can't just email the decoded dump).

Ideally, a truss of the last 10 lines of the failing program plus the
raw tcpdump log (run with -s 1500 so we get the whole packet) would be
better.  The truss is so we have proof that file locks are really to
blame :)
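
As a sketch of the capture step, the helper below builds a tcpdump command line that writes whole packets to a file and restricts the capture to one peer with a host filter. The helper name and its defaults are assumptions; the -s, -w, and -i options and the host filter are standard tcpdump usage. Running the resulting command normally requires root.

```python
import shlex


def tcpdump_capture_argv(outfile, peer, snaplen=1500, iface=None):
    """Build a tcpdump argument list for a raw capture limited to one peer."""
    argv = ["tcpdump", "-s", str(snaplen), "-w", outfile]
    if iface is not None:
        argv += ["-i", iface]
    # Filter expression: only traffic to or from the named host.
    argv += ["host", peer]
    return argv


# Example: on the FreeBSD server, capture traffic exchanged with the Solaris client.
argv = tcpdump_capture_argv("file.pcap", "cfpa11")
print(" ".join(shlex.quote(a) for a in argv))
```

The saved file can then be replayed with tcpdump -r or opened in ethereal.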

-- 
Dan Nelson
[EMAIL PROTECTED]
