Re: NFS file unlocking problem
On Wed, Mar 19, 2003 at 05:15:02PM -0800, Andrew P. Lentvorski, Jr. wrote:
> Steve, I actually managed to pull down the dump. It doesn't have any lock requests in it. It looks like it is hanging in the rpcinfo call. If you really want to debug this, it's going to take a chunk of work.
>
> 1) set up two brand new machines
>    FreeBSD-current for the server
>    Solaris whatever for the client

Well, I think I found the problem. I had just installed the new FreeBSD machine, and was not looking forward to finding another Sun to install, when I had the idea to try different NFS protocols.

By default, it looks like the NFS mount is version 3 tcp. I specified udp, and both of the test programs (mine and Terry's) work consistently. I've re-enabled locking on the Xinet software, and we'll see (tomorrow) whether or not that also works.

If I'm right, that means that there is a problem with NFS over tcp with a Solaris client and FreeBSD-5 server. All other combinations of client/server pairs worked with the default (which I assume is version 3 tcp).

BTW, Terry's testlock program, when run with the problem configuration, would always return the "There is nothing that would block your lock" message, but it could take anywhere from 1 second to 45 minutes. I'm not so confident that my perl program would always succeed, because I was never willing to wait longer than overnight before concluding that it was hung.

I'll report back on the status of Xinet tomorrow, but thanks to all of you for your help and suggestions. If there's any more information I can provide, or testing that I can do, to help fix the NFS/tcp/Solaris problem, let me know.

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
Re: NFS file unlocking problem
On Mon, 17 Mar 2003, Terry Lambert wrote:
> Actually, given this, I don't understand how FreeBSD server-side proxy locking can actually work at all; it would incorrectly coalesce locks with local locks when the l_pid matched, which would be *all* locks in the lockd, and then incorrectly release them when a local process exited, or any process on any remote system unlocked an overlapping range (possibly in error).

Please don't speculate without having reviewed the code. It works because I rewrote rpc.lockd so that it does the required housekeeping itself. The FreeBSD lockd is the only open-source locking daemon that actually passes the Connectathon interoperability tests.

When rpc.lockd gets a lock request, it immediately attempts to lock the *entire* underlying file on the NFS server (or fails). It then keeps internal track of the partial lock requests so that they *don't* coalesce (including the l_pid from the target system). rpc.lockd only releases the underlying file when all NFS locks go away.

Yes, it does create extra contention if someone tries to do partial file locking via both NFS and the local filesystem simultaneously. I deemed this an acceptable compromise in order to cover up for the problems in the FreeBSD locking space, since most people who truly need locking generally have dedicated file servers.

-a
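The housekeeping described above can be sketched in a few lines. This is a toy Python illustration, not the actual rpc.lockd code, and every name in it is invented: hold one whole-file lock on the underlying file, keep per-(host, pid) byte ranges in userland so the kernel never coalesces them, and drop the whole-file lock only when the last NFS range goes away.

```python
import fcntl
import os

class ProxyFileLock:
    """Sketch of the strategy described above: hold ONE whole-file
    lock on the server-side file, and do all range bookkeeping
    (per client host + l_pid) in the daemon itself, so the kernel
    never coalesces different remote clients' ranges."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR)
        self.ranges = {}   # (client_host, l_pid) -> list of (start, len)

    def nlm_lock(self, host, pid, start, length):
        if not self.ranges:
            # First NFS lock: grab the whole underlying file.
            # (In fcntl terms, len=0 from start=0 means "entire file".)
            fcntl.lockf(self.fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 0, 0)
        # Refuse ranges that overlap another client's lock; the
        # bookkeeping, not the kernel, decides what conflicts.
        for (h, p), spans in self.ranges.items():
            if (h, p) != (host, pid):
                for (s, l) in spans:
                    if start < s + l and s < start + length:
                        return False    # conflict: caller must wait/retry
        self.ranges.setdefault((host, pid), []).append((start, length))
        return True

    def nlm_unlock(self, host, pid, start, length):
        spans = self.ranges.get((host, pid), [])
        if (start, length) in spans:
            spans.remove((start, length))
        if not spans:
            self.ranges.pop((host, pid), None)
        if not self.ranges:
            # Last NFS lock gone: release the underlying file.
            fcntl.lockf(self.fd, fcntl.LOCK_UN, 0, 0)
```

The whole-file lock is what creates the extra contention mentioned above: a local process doing partial-file locking on the same file collides with the daemon's single big lock even when the byte ranges are disjoint.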
Re: NFS file unlocking problem
On Mon, Mar 17, 2003 at 11:36:58PM -0800, Terry Lambert wrote:
> Steve Sizemore wrote:
>> useful. As it is, it's still interesting. I have no way of judging the quality of the code in question, other than the empirical result that it works in most cases.
>
> Well, then you are stuck with the code you have that someone else wrote. Hopefully that's not your problem, or you are in trouble. 8-).

Actually, maybe not, since it's a commercial program. If I could demonstrate that it's their problem, I could put pressure on them to fix it. However, at this point, I don't think that's the case.

> OK, then it isn't an intra-program deadlock, which is something. It could still be inter-program, but if it is, it's not going to be easy to find; you will need to find someone who *is* a programmer. FWIW, this happens when:
>
>     Program 1                        Program 2
>     LOCK A
>                                      LOCK B
>     LOCK B (waiting for Program 2)
>                                      LOCK A (waiting for Program 1,
>                                              which is waiting for me)

I don't see how it could be inter-program, since I've gone to great lengths to simplify it to a single program failing on a brand new file.

>>> On the other hand, this is clearly a deadlock that requires an existing, conflicting lock -- IFF you are correct about the delayed locking behaviour.
>>
>> Not sure I understand this.
>
> If someone didn't already have it locked, your lock which waits for the region to be able to lock it would not need to wait: it would just give you the lock, and you wouldn't have the problem.

Oh, so that's what that meant. :-) But (see above) it's pretty clear to me that nothing else could have it locked.

> You need to find out why it's waiting. If it's waiting, it's waiting for somebody. You need to know who that somebody is. Once you know that, you can go hit them over the head with a large baseball bat. 8-).

Yes. But that somebody is undoubtedly not a real person.

> I have attached the program to run on your Solaris box. You may have to look in /usr/include/sys/fcntl.h to see the right name, if it complains about l_rsysid (might be l_sysid, or whatever).
>
> I'm attaching a test program to run on the server when the lock fails, using information from the trace to know the name of the file to enter, and the ethereal decoded packet trace to know how to answer the other questions.

I'll try it today.

> But I think it may be as simple as you not telling us that you have multiple IP addresses configured on one of your machines?

No, but this might be an important clue. The FreeBSD host has multiple (2) A records in the DNS. In fact, I think that when it last worked, it had only a single A record.

Also, I notice that there are two rpc.lockd processes running on the FreeBSD server. I hadn't noticed that before it started failing, but I didn't mention it, since rpc.lockd does get invoked twice in rc.network. However, rpc.statd also gets called twice, and there's only a single instance of it running...

    root    399  0.0  0.1  263496  1000  ??  Is   9:11AM  0:00.00 /usr/sbin/rpc.sta
    root    402  0.0  0.1    1512  1156  ??  Ss   9:11AM  0:00.00 /usr/sbin/rpc.loc
    daemon  405  0.0  0.1    1484  1176  ??  I    9:11AM  0:00.00 /usr/sbin/rpc.loc

Does that indicate a problem?

> If so, try:
>
>     sysctl -w net.inet.ip.check_interface=0

What does this do, just turn off checking? Can I do this on the running system, or do I need to put it into sysctl.conf and reboot? (BTW, from the man page: "The -w option has been deprecated and is silently ignored.")

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley
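For what it's worth, the two-program lock-order-reversal scenario quoted above is easy to reproduce locally with ordinary fcntl range locks. The sketch below (Python, written purely for illustration; the file and pipe plumbing are invented) has two processes take byte 0 and byte 1 in opposite orders, using a non-blocking request for the second lock so the would-be deadlock shows up as a visible error on both sides instead of an indefinite hang:

```python
import fcntl
import os
import tempfile

def abba_demo():
    """Program 1 locks byte 0 then wants byte 1; Program 2 locks
    byte 1 then wants byte 0.  With blocking F_SETLKW requests both
    would wait forever (the deadly embrace); LOCK_NB turns the second
    acquisition into a visible failure instead."""
    fdtmp, path = tempfile.mkstemp()
    os.write(fdtmp, b"\0\0")
    os.close(fdtmp)

    r1, w1 = os.pipe()   # parent -> child: "I hold byte 0"
    r2, w2 = os.pipe()   # child -> parent: "I hold byte 1"

    pid = os.fork()
    if pid == 0:                                        # Program 2
        fd = os.open(path, os.O_RDWR)
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, 1)            # LOCK B (byte 1)
        os.write(w2, b"x")
        os.read(r1, 1)                                  # wait for Program 1
        try:
            # LOCK A, non-blocking: with F_SETLKW this would hang forever
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0)
            os._exit(0)                                 # unexpectedly succeeded
        except OSError:
            os._exit(42)                                # would have deadlocked

    fd = os.open(path, os.O_RDWR)                       # Program 1
    fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0)                # LOCK A (byte 0)
    os.write(w1, b"x")
    os.read(r2, 1)                                      # wait for Program 2
    blocked = False
    try:
        # LOCK B, non-blocking: same story in the other direction
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 1)
    except OSError:
        blocked = True                                  # would have deadlocked
    _, status = os.waitpid(pid, 0)
    os.unlink(path)
    return blocked, os.WEXITSTATUS(status)
```

Run on a single machine this only shows the local-deadlock shape; over NFS the same pattern plays out between rpc.lockd and the client-side lockd, which is why it is so much harder to see who is waiting on whom.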
Re: NFS file unlocking problem
Andrew P. Lentvorski, Jr. wrote:
> On Mon, 17 Mar 2003, Terry Lambert wrote:
> Please don't speculate without having reviewed the code. It works because I rewrote rpc.lockd so that it does the required housekeeping itself. The FreeBSD lockd is the only open-source locking daemon that actually passes the Connectathon interoperability tests.

I already knew this was the answer, actually.

> When rpc.lockd gets a lock request, it immediately attempts to lock the *entire* underlying file on the NFS server (or fails). It then keeps internal track of the partial lock requests so that they *don't* coalesce (including the l_pid from the target system). rpc.lockd only releases the underlying file when all NFS locks go away.
>
> Yes, it does create extra contention if someone tries to do partial file locking via both NFS and the local filesystem simultaneously. I deemed this an acceptable compromise in order to cover up for the problems in the FreeBSD locking space since most people who truly need locking generally have dedicated file servers.

So then the first question that should have been asked is "are there local server clients?". If there are local server clients, then the problem is obviously there.

Out of curiosity, why didn't you use my F_RSETLK/F_RGETLK and l_rsysid and delayed-coalescing patches for lockmgr, so that you could support host lock coherency? Was the issue F_RSETLKW requiring a blocking context? It's fairly easy to make this async-but-FIFO-ordered.

If you want, I can update them (the old ones are currently available on http://www.freebsd.org/~terry ). They do not break binary compatibility. I'm even willing to do the F_RSETLKW queue insertion management for the async part...

-- Terry
Re: NFS file unlocking problem
Steve Sizemore wrote:
> I don't see how it could be inter-program, since I've gone to great lengths to simplify it to a single program failing on a brand new file.

Is the file ever open by a program on the NFS server itself? If so, this can cause the behaviour you are seeing (if you are interested in the technical reasons, there's a different posting).

>> Oh, so that's what that meant. :-) But (see above) it's pretty clear to me that nothing else could have it locked.

Then you aren't getting the error. 8-) 8-) 8-).

>> Once you know that, you can go hit them over the head with a large baseball bat. 8-).
>
> Yes. But that somebody is undoubtedly not a real person.

kill -9 them, then.

>> But I think it may be as simple as you not telling us that you have multiple IP addresses configured on one of your machines?
>
> No, but this might be an important clue. The FreeBSD host has multiple (2) A records in the DNS. In fact, I think that when it last worked, it had only a single A record.

Well, try undoing that change. I don't think that's it, though, but it gives you a lever to pull.

> Also, I notice that there are two rpc.lockd processes running on the FreeBSD server. I hadn't noticed that before it started failing, but I didn't mention it, since rpc.lockd does get invoked twice in rc.network. However, rpc.statd also gets called twice, and there's only a single instance of it running...

Not the problem, I think.

>> If so, try:
>>
>>     sysctl -w net.inet.ip.check_interface=0
>
> What does this do, just turn off checking? Can I do this on the running system, or do I need to put it into sysctl.conf and reboot? (BTW, from the man page: "The -w option has been deprecated and is silently ignored.")

Use it on a running system. Ignore the warning.

-- Terry
Re: NFS file unlocking problem
On Tue, 18 Mar 2003, Steve Sizemore wrote:
> root    399  0.0  0.1  263496  1000  ??  Is   9:11AM  0:00.00 /usr/sbin/rpc.sta
> root    402  0.0  0.1    1512  1156  ??  Ss   9:11AM  0:00.00 /usr/sbin/rpc.loc
> daemon  405  0.0  0.1    1484  1176  ??  I    9:11AM  0:00.00 /usr/sbin/rpc.loc

This might be the culprit. The way that rpc.lockd works is that it grabs a lock on the *entire* underlying file when any lock request comes in. If those requests get misrouted to the wrong daemon, it is likely to cause havoc.

-a
Re: NFS file unlocking problem
Andrew P. Lentvorski, Jr. wrote:
> On Tue, 18 Mar 2003, Steve Sizemore wrote:
>> root    399  0.0  0.1  263496  1000  ??  Is   9:11AM  0:00.00 /usr/sbin/rpc.sta
>> root    402  0.0  0.1    1512  1156  ??  Ss   9:11AM  0:00.00 /usr/sbin/rpc.loc
>> daemon  405  0.0  0.1    1484  1176  ??  I    9:11AM  0:00.00 /usr/sbin/rpc.loc
>
> This might be the culprit. The way that rpc.lockd works is that it grabs a lock on the *entire* underlying file when any lock request comes in. If those requests get misrouted to the wrong daemon, it is likely to cause havoc.

I thought this was a fork on purpose, and the second one was there for LOCK vs. LOCKW requests? If not, how the heck is he starting this in the first place?!?

-- Terry
Re: NFS file unlocking problem
On Tue, Mar 18, 2003 at 05:25:36PM -0800, Andrew P. Lentvorski, Jr. wrote:
> On Tue, 18 Mar 2003, Steve Sizemore wrote:
>> root    399  0.0  0.1  263496  1000  ??  Is   9:11AM  0:00.00 /usr/sbin/rpc.sta
>> root    402  0.0  0.1    1512  1156  ??  Ss   9:11AM  0:00.00 /usr/sbin/rpc.loc
>> daemon  405  0.0  0.1    1484  1176  ??  I    9:11AM  0:00.00 /usr/sbin/rpc.loc
>
> This might be the culprit. The way that rpc.lockd works is that it grabs a lock on the *entire* underlying file when any lock request comes in. If those requests get misrouted to the wrong daemon, it is likely to cause havoc.

OK. It appears that starting rpc.lockd automatically spawns two copies, so if that's a problem, how can I fix it?

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley
Re: NFS file unlocking problem
On Sun, 16 Mar 2003, Steve Sizemore wrote:
> Sorry - I was trying to be too helpful. I actually did capture the raw dump but appended the decoded output. This time, I've attached a real raw dump.

The dump doesn't seem to be attached. However, I note that the request being sent is SETLKW, which is a blocking "wait until the lock is granted". If the server thinks the file is already locked, it will hang, *and* that is the proper behavior.

What is the result of running this locally on the NFS server and attempting to lock the underlying file? If rpc.lockd is hanging onto a lock, running that perl script locally on the actual file (not an NFS-mounted image of it) should also hang.

As a side note, you probably want to create a C executable to do this kind of fcntl fiddling when attempting to test NFS. That way you can use a locally mounted binary and you won't wind up with all of the Perl access calls on the NFS wire. Or, at least, use a local copy of Perl.

-a
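A minimal stand-alone tester along the lines suggested above could look like the sketch below. It is written in Python rather than the C being asked for, purely for brevity, and the `plock`-style output mirrors what the thread's Perl script prints; the fcntl sequence (blocking exclusive whole-file lock, then unlock) is the same one under discussion:

```python
import fcntl
import sys

def lock_unlock(path):
    """Mimic the failing test: open the file, take an exclusive
    whole-file lock with a blocking request (F_SETLKW semantics),
    then unlock.  On a healthy mount this returns immediately; in
    the problem configuration the first lockf() is where it hangs."""
    with open(path, "r+b") as f:
        print("Locking", path)
        fcntl.lockf(f, fcntl.LOCK_EX)    # blocks until granted
        print("Unlocking", path)
        fcntl.lockf(f, fcntl.LOCK_UN)
        print("Done")

if __name__ == "__main__":
    lock_unlock(sys.argv[1])
```

Run it against a file on the NFS mount from the client, and against the underlying file on the server, and compare; if the server-side run hangs too, rpc.lockd (or a local process) is still holding the lock.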
Re: NFS file unlocking problem
On Mon, Mar 17, 2003 at 01:21:19PM -0800, Andrew P. Lentvorski, Jr. wrote:
> On Sun, 16 Mar 2003, Steve Sizemore wrote:
> The dump doesn't seem to be attached. However, I note that the request

It appears that there are problems sending the raw dump. I've tried twice - once 2 minutes after I sent the original message, and once again when I got this from you. Neither has shown up on the list. I can find another way to make it available if you need to see it.

> being sent is SETLKW, which is a blocking "wait until the lock is granted". If the server thinks the file is already locked, it will hang *and* that is the proper behavior.
>
> What is the result of running this locally on the NFS server and attempting to lock the underlying file? If rpc.lockd is hanging onto a lock, running that perl script locally on the actual file (not an NFS-mounted image of it) should also hang.

It seems to work as expected (at least as I expect) on the server. If no other process has a lock, then the program locks the file, unlocks it, and exits immediately. If the remote client is trying to lock/unlock the file, then running the same program on the server also hangs.

One other twist - recently, the behavior is less predictable. A couple of times in the last 24 hours, the lock/unlock on the client has actually worked as it should. The first time it happened, I was so surprised that I thought I must have locked a local file rather than an NFS-mounted file. On other occasions, the program has succeeded after very long hangs, e.g.

    % time plock xxx
    Locking xxx
    Unlocking xxx
    Done
    0.21u 0.05s 55:35.33 0.0%

This makes me wonder whether waiting indefinitely would succeed in all cases. (Note, however, that I've frequently waited more than an hour before killing the process or giving up.)

> As a side note, you probably want to create a C executable to do this kind of fcntl fiddling when attempting to test NFS. That way you can use a locally mounted binary and you won't wind up with all of the Perl access calls on the NFS wire. Or, at least, use a local copy of Perl.

If I trusted my C skills as much as I trust my perl skills, I would do that. The perl stuff is all mounted locally, so there shouldn't be any perl NFS traffic on the wire.

Let me know if you still need to see the dump.

Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley
Re: NFS file unlocking problem
Andrew P. Lentvorski, Jr. wrote:
> The dump doesn't seem to be attached. However, I note that the request being sent is SETLKW, which is a blocking "wait until the lock is granted". If the server thinks the file is already locked, it will hang *and* that is the proper behavior.

It is, to ensure FIFO ordering of request grants. You could also implement this as a retry. If you do it the first way, you end up potentially deadlocking the server when a single client has badly behaved code that locks against itself. If you do it the second way, you end up with timing-dependent starvation deadlocks for individual client processes.

Note that the first deadlock is normal -- it would happen if the file were local, as well... no help for badly written code -- but I mention it as important because we are talking about blocking multiple clients.

I don't know what the process is, but a threaded process can cause a deadlock when it should be a grant/upgrade/downgrade of an existing lock overlap. This is because there is no such thing as a thread ID in the NFS protocol, and if process IDs are different for different threads, and the requests come from the same system ID, then you can get a deadlock when none should be present. To avoid this, either manage all locks in an apartment or rental model (queue all requests to a single thread, and have it do the locking by proxy) OR make sure that all requests from any thread in a given process in fact are given the same proxy process ID on the wire.

[ ... This last is not likely your problem, but I mention it, in case you are using rfork() or Linux threads ... ]

> What is the result of running this locally on the NFS server and attempting to lock the underlying file? If rpc.lockd is hanging onto a lock, running that perl script locally on the actual file (not an NFS-mounted image of it) should also hang.

That was my next question, as well: does it happen on a local FS as well as an NFS FS? Personally, I would *NOT* recommend running it on the server, but mount a local FS on the client instead; the fewer variables, the better.

On the other hand, this is clearly a deadlock that requires an existing, conflicting lock -- IFF you are correct about the delayed locking behaviour.

> As a side note, you probably want to create a C executable to do this kind of fcntl fiddling when attempting to test NFS. That way you can use a locally mounted binary and you won't wind up with all of the Perl access calls on the NFS wire. Or, at least, use a local copy of Perl.

I recommend a pared-down test case. I suspect that the problem is that something that is expected to have the same ID is locking against itself.

Does the failure occur with the same values in all cases in the F_RSETLKW? If so, I suggest you capture *all* locking packets on your wire, and then find who is conflicting. This may be a simple lock order reversal (deadly embrace deadlock) due to poor application performance. You may also find that you have multiple process IDs, when it should be a single process ID, for the proxy PID for the conflicting request. At worst, it would be nice to know the system that caused it.

Actually, for a lock you know is there, you *can* diagnose the problem (somewhat) by writing a program on the server, and using F_GETLK on the range for the hanging lock on the server -- this will return a struct flock, which will give you range and PID information. Do it on the Solaris box, though. The reason you want to do this on the Solaris box is that the struct flock on FreeBSD fails to include the l_rsysid -- the remote system ID.

Actually, given this, I don't understand how FreeBSD server-side proxy locking can actually work at all; it would incorrectly coalesce locks with local locks when the l_pid matched, which would be *all* locks in the lockd, and then incorrectly release them when a local process exited, or any process on any remote system unlocked an overlapping range (possibly in error).

You are using FreeBSD as the NFS client in this case, right? If so, that's probably not an issue for you...

-- Terry
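The F_GETLK probe described above can be sketched as follows. This is a Python illustration of the same idea, not the C program from the thread; the struct flock layout it packs (short l_type; short l_whence; off_t l_start; off_t l_len; pid_t l_pid; natural alignment) is an assumption that matches Linux and is NOT portable, which is exactly why the advice is to run the probe on the right machine. And as noted, FreeBSD's struct flock has no l_rsysid field, so a PID is all you can recover there.

```python
import fcntl
import os
import struct

# ASSUMPTION: Linux-style struct flock layout; other systems (Solaris,
# FreeBSD) order and size these fields differently.
FLOCK_FMT = "hhqqi"   # l_type, l_whence, l_start, l_len, l_pid

def who_holds_lock(path, start=0, length=0):
    """F_GETLK probe: ask the kernel whether an exclusive lock on the
    given range would block, and if so which PID holds the conflicting
    lock.  Returns the holder's PID, or None if the range is free."""
    fd = os.open(path, os.O_RDWR)
    try:
        probe = struct.pack(FLOCK_FMT, fcntl.F_WRLCK, os.SEEK_SET,
                            start, length, 0)
        # The kernel rewrites the struct in place: l_type becomes
        # F_UNLCK if nothing conflicts, otherwise it describes the
        # blocking lock, including the holder's PID.
        result = fcntl.fcntl(fd, fcntl.F_GETLK, probe)
        l_type, _, _, _, l_pid = struct.unpack(FLOCK_FMT, result)
        return None if l_type == fcntl.F_UNLCK else l_pid
    finally:
        os.close(fd)
```

Run against the file named in the packet trace while the client is hung, this tells you *which local process* (rpc.lockd itself, or a local application) is holding the conflicting lock.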
Re: NFS file unlocking problem
Hi, Terry -

On Mon, Mar 17, 2003 at 07:02:31PM -0800, Terry Lambert wrote:
> Andrew P. Lentvorski, Jr. wrote:
>> being sent is SETLKW which is a blocking wait until lock is granted. If the server thinks the file is already locked, it will hang *and* that is the proper behavior.
>
> It is, to ensure FIFO ordering of request grants. You could also implement this as a retry. If you do it the first way, you end up potentially deadlocking the server when a single client has badly behaved code that locks against itself. If you do it the second way, you end up with timing-dependent starvation deadlocks for individual client processes.
>
> Note that the first deadlock is normal -- it would happen if the file were local, as well... no help for badly written code -- but I mention it as important because we are talking about blocking multiple clients.
>
> I don't know what the process is, but a threaded process can cause a deadlock when it should be a grant/upgrade/downgrade of an existing lock overlap. This is because there is no such thing as a thread ID in the NFS protocol, and if process IDs are different for different threads, and the requests come from the same system ID, then you can get a deadlock when none should be present. To avoid this, either manage all locks in an apartment or rental model (queue all requests to a single thread, and have it do the locking by proxy) OR make sure that all requests from any thread in a given process in fact are given the same proxy process ID on the wire.
>
> [ ... This last is not likely your problem, but I mention it, in case you are using rfork() or Linux threads ... ]

Thanks for the explanation. If I were a programmer, it would be very useful. As it is, it's still interesting. I have no way of judging the quality of the code in question, other than the empirical result that it works in most cases.

>> What is the result of running this locally on the NFS server and attempting to lock the underlying file? If rpc.lockd is hanging onto a lock, running that perl script locally on the actual file (not an NFS mounted image of it) should also hang.
>
> That was my next question, as well: does it happen on a local FS as well as an NFS FS? Personally, I would *NOT* recommend running it on the server, but mount a local FS on the client instead; the fewer variables, the better.

Works fine on the client on a local file system. Works fine on the server.

> On the other hand, this is clearly a deadlock that requires an existing, conflicting lock -- IFF you are correct about the delayed locking behaviour.

Not sure I understand this.

>> As a side note, you probably want to create a C executable to do this kind of fcntl fiddling when attempting to test NFS. That way you can use a locally mounted binary and you won't wind up with all of the Perl access calls on the NFS wire. Or, at least, use a local copy of Perl.
>
> I recommend a pared-down test case. I suspect that the problem is that something that is expected to have the same ID is locking against itself.

I can't pare it down any further using perl. If someone better at C than I am gives me a sample C program, I'll be happy to try it.

> Does the failure occur with the same values in all cases in the F_RSETLKW? If so, I suggest you capture *all* locking packets on your wire, and then find who is conflicting. This may be a simple lock order reversal (deadly embrace deadlock) due to poor application performance. You may also find that you have multiple process IDs, when it should be a single process ID, for the proxy PID for the conflicting request. At worst, it would be nice to know the system that caused it.
>
> Actually, for a lock you know is there, you *can* diagnose the problem (somewhat) by writing a program on the server, and using F_GETLK on the range for the hanging lock on the server -- this will return a struct flock, which will give you range and PID information. Do it on the Solaris box, though. The reason you want to do this on the Solaris box is that the struct flock on FreeBSD fails to include the l_rsysid -- the remote system ID.

Sorry, but I don't understand any of that.

> Actually, given this, I don't understand how FreeBSD server-side proxy locking can actually work at all; it would incorrectly coalesce locks with local locks when the l_pid matched, which would be *all* locks in the lockd, and then incorrectly release them when a local process exited, or any process on any remote system unlocked an overlapping range (possibly in error).

So you're suggesting that when it works, it's just lucky? But others have said that it works for them, and it seems to work OK between FreeBSD systems.

> You are using FreeBSD as the NFS client in this case, right? If so, that's probably not an issue for you...

No. I think that you may be trying to solve a problem I don't have. First - I'm not a programmer. I'm not trying to write any
Re: NFS file unlocking problem
Steve Sizemore wrote:
> Thanks for the explanation. If I were a programmer, it would be very useful. As it is, it's still interesting. I have no way of judging the quality of the code in question, other than the empirical result that it works in most cases.

Well, then you are stuck with the code you have that someone else wrote. Hopefully that's not your problem, or you are in trouble. 8-).

>>> What is the result of running this locally on the NFS server and attempting to lock the underlying file? If rpc.lockd is hanging onto a lock, running that perl script locally on the actual file (not an NFS mounted image of it) should also hang.
>>
>> That was my next question, as well: does it happen on a local FS as well as an NFS FS? Personally, I would *NOT* recommend running it on the server, but mount a local FS on the client instead; the fewer variables, the better.
>
> Works fine on the client on a local file system. Works fine on the server.

OK, then it isn't an intra-program deadlock, which is something. It could still be inter-program, but if it is, it's not going to be easy to find; you will need to find someone who *is* a programmer. FWIW, this happens when:

    Program 1                        Program 2
    LOCK A
                                     LOCK B
    LOCK B (waiting for Program 2)
                                     LOCK A (waiting for Program 1,
                                             which is waiting for me)

>> On the other hand, this is clearly a deadlock that requires an existing, conflicting lock -- IFF you are correct about the delayed locking behaviour.
>
> Not sure I understand this.

If someone didn't already have it locked, your lock which waits for the region to be able to lock it would not need to wait: it would just give you the lock, and you wouldn't have the problem.

>> Does the failure occur with the same values in all cases in the F_RSETLKW? If so, I suggest you capture *all* locking packets on your wire, and then find who is conflicting. This may be a simple lock order reversal (deadly embrace deadlock) due to poor application performance. You may also find that you have multiple process IDs, when it should be a single process ID, for the proxy PID for the conflicting request. At worst, it would be nice to know the system that caused it.
>>
>> Actually, for a lock you know is there, you *can* diagnose the problem (somewhat) by writing a program on the server, and using F_GETLK on the range for the hanging lock on the server -- this will return a struct flock, which will give you range and PID information. Do it on the Solaris box, though. The reason you want to do this on the Solaris box is that the struct flock on FreeBSD fails to include the l_rsysid -- the remote system ID.
>
> Sorry, but I don't understand any of that.

You need to find out why it's waiting. If it's waiting, it's waiting for somebody. You need to know who that somebody is. Once you know that, you can go hit them over the head with a large baseball bat. 8-).

I have attached the program to run on your Solaris box. You may have to look in /usr/include/sys/fcntl.h to see the right name, if it complains about l_rsysid (might be l_sysid, or whatever).

>> Actually, given this, I don't understand how FreeBSD server-side proxy locking can actually work at all; it would incorrectly coalesce locks with local locks when the l_pid matched, which would be *all* locks in the lockd, and then incorrectly release them when a local process exited, or any process on any remote system unlocked an overlapping range (possibly in error).
>
> So you're suggesting that when it works, it's just lucky? But others have said that it works for them, and it seems to work OK between FreeBSD systems.

I would have to look at the locking code in FreeBSD for the NFS case. I wrote some NFS locking code for FreeBSD in 1995 that was not used for the implementation. There are ways around the problem in userspace, but they're very hard to make efficient or get correct. They also make it very hard to debug easily, because you can't get the system ID for systems that have outstanding locks. 8-(.

>> You are using FreeBSD as the NFS client in this case, right? If so, that's probably not an issue for you...
>
> No. I think that you may be trying to solve a problem I don't have. First - I'm not a programmer. I'm not trying to write any program at all, except as necessary to diagnose this problem. I'll summarize the situation briefly. The issue cropped up in a commercial program (Xinet) which was working on Solaris 2.6 client and server. I'm replacing the server with a FreeBSD box (RELENG_5_0) and the program stopped working. Xinet tech support diagnosed it as nfs locking problem, which I've confirmed by my simple perl program.
>
>     Client     Server     Result
>     ======     ======     ======
>     Solaris    Solaris    Works
>     FreeBSD    Solaris    Works
>     FreeBSD
Re: NFS file unlocking problem
On Sat, Mar 15, 2003 at 01:33:11AM -0600, Dan Nelson wrote:
> Oops. You appended a decoded dump again. I should have told you how to generate a raw tcpdump log. Add "-s 1500 -w file.pcap" to the tcpdump command line. You won't get any output to the screen, but the raw packet contents will get written to the file. You can replay it with "tcpdump -r", or load it into ethereal and view the packets in the GUI.

Sorry - I was trying to be too helpful. I actually did capture the raw dump but appended the decoded output. This time, I've attached a real raw dump.

> Runs fine on my Solaris 2.6 and 2.7 machines, so it's not a Solaris-FreeBSD specific problem.

I was already pretty sure that's true - I actually did have it working for some number of days. I just don't know what broke it.

> Ok, it definitely dies trying to lock the file. Check to make sure that rpc.lockd is still running on the FreeBSD server. I have seen it coredump on a couple of my 4.5 servers and the result was what you're seeing here (lock attempts hang).

rpc.lockd and rpc.statd are still running on FreeBSD, and lockd and statd are still running on Solaris. I hope the tcpdump gives you a clue what might be going wrong...

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley
Re: NFS file unlocking problem
On Fri, Mar 14, 2003 at 09:58:56AM -0600, Dan Nelson wrote:
> In the last episode (Mar 13), Steve Sizemore said:
> > Running RELENG_5_0 as nfs server with a Solaris 2.5 client. rpc.statd
> > and rpc.lockd both running on FreeBSD, lockd and statd both running on
> > Solaris. Locking a file (flock) works fine, but when an attempt to
> > unlock it is made, the client session hangs. The program is typically
> > (but not always) uninterruptible, and I have to kill the login session.
>
> See if you can get a dump of the packets sent when the client hangs.

Thanks for the reply, Dan. I have captured packets (tcpdump) between the hosts, and I'm enclosing that as an attachment. I'm not facile enough with tcpdump to capture only the significant packets, so I just got everything. I've enclosed the dump as an attachment - physics is the FreeBSD machine and cfpa11 is the Solaris box.

Actually, in this case, the program only locked the file, no unlock, and it hung anyway. (This is different from before.) Is this output helpful?

Thanks.
Steve
-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley

[attached tcpdump output:]

13:23:10.746188 cfpa11.Berkeley.EDU.2399800991 > physics.berkeley.edu.nfs: 112 fsstat [|nfs] (DF)
13:23:10.746259 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800991: reply ok 172 fsstat [|nfs] (DF)
13:23:10.843671 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 205284758 win 8760 (DF)
13:23:11.934004 cfpa11.Berkeley.EDU.2399800992 > physics.berkeley.edu.nfs: 116 fsstat [|nfs] (DF)
13:23:11.934085 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800992: reply ok 172 fsstat [|nfs] (DF)
13:23:12.033672 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 173 win 8760 (DF)
13:23:15.801623 cfpa11.Berkeley.EDU.2399800998 > physics.berkeley.edu.nfs: 124 lookup [|nfs] (DF)
13:23:15.801707 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800998: reply ok 240 lookup [|nfs] (DF)
13:23:15.802485 cfpa11.Berkeley.EDU.2399800999 > physics.berkeley.edu.nfs: 120 lookup [|nfs] (DF)
13:23:15.802521 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800999: reply ok 240 lookup [|nfs] (DF)
13:23:15.803314 cfpa11.Berkeley.EDU.2399801000 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.803345 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801000: reply ok 116 getattr [|nfs] (DF)
13:23:15.803999 cfpa11.Berkeley.EDU.2399801001 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.804027 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801001: reply ok 116 getattr [|nfs] (DF)
13:23:15.862284 cfpa11.Berkeley.EDU.2399801029 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.862318 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801029: reply ok 116 getattr [|nfs] (DF)
13:23:15.953781 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 1001 win 8760 (DF)
13:23:16.138906 cfpa11.Berkeley.EDU.2399801083 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:16.138970 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801083: reply ok 116 getattr [|nfs] (DF)
13:23:16.139606 cfpa11.Berkeley.EDU.2399801084 > physics.berkeley.edu.nfs: 116 access [|nfs] (DF)
13:23:16.139643 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801084: reply ok 124 access c 8f0a0efd (DF)
13:23:16.141238 cfpa11.Berkeley.EDU.1021 > physics.berkeley.edu.sunrpc: P 2839020119:2839020215(96) ack 3472682147 win 8760 (DF)
13:23:16.143832 physics.berkeley.edu.sunrpc > cfpa11.Berkeley.EDU.1021: P 1:33(32) ack 96 win 65535 (DF)
13:23:16.233777 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 1241 win 8760 (DF)
13:23:16.243744 cfpa11.Berkeley.EDU.1021 > physics.berkeley.edu.sunrpc: . ack 33 win 8760 (DF)
13:23:16.919974 cfpa11.Berkeley.EDU.2399801086 > physics.berkeley.edu.nfs: 116 fsstat [|nfs] (DF)
13:23:16.920036 physics.berkeley.edu.nfs
Re: NFS file unlocking problem
On Thu, 13 Mar 2003, Steve Sizemore wrote:
> Running RELENG_5_0 as nfs server with a Solaris 2.5 client. rpc.statd and
> rpc.lockd both running on FreeBSD, lockd and statd both running on
> Solaris. Locking a file (flock) works fine, but when an attempt to unlock
> it is made, the client session hangs. The program is typically (but not
> always) uninterruptible, and I have to kill the login session.

That's ... odd. However, the Solaris rpc.lockd does some strange caching that can lead to asymmetric behavior. In addition, you are running Solaris 2.5, which qualifies as practically prehistoric in computer time. That's going to activate some old mechanisms which FreeBSD may or may not support.

Several areas are suspect:

1) RPC can't agree on a protocol version with Solaris 2.5
2) NFS can't agree on a protocol version with Solaris 2.5
3) The lock attempt itself is broken

The last time a hang like this happened, I believe it was an issue with not returning the correct rejection notice during an RPC negotiation. I recommend using ethereal to create a trace file.

This is going to be tough to debug, as I don't have access to a Solaris 2.5 machine to test the interaction and see what is going on.

-a
Re: NFS file unlocking problem
In the last episode (Mar 13), Steve Sizemore said:
> Running RELENG_5_0 as nfs server with a Solaris 2.5 client. rpc.statd and
> rpc.lockd both running on FreeBSD, lockd and statd both running on
> Solaris. Locking a file (flock) works fine, but when an attempt to unlock
> it is made, the client session hangs. The program is typically (but not
> always) uninterruptible, and I have to kill the login session.

See if you can get a dump of the packets sent when the client hangs.

-- 
Dan Nelson [EMAIL PROTECTED]
Re: NFS file unlocking problem
On Fri, Mar 14, 2003 at 04:47:11PM -0600, Dan Nelson wrote:
> Ideally, a truss of the last 10 lines of the failing program plus the raw
> tcpdump log (run with -s 1500 so we get the whole packet) would be
> better. The truss is so we have proof that file locks are really to
> blame :)

On Fri, Mar 14, 2003 at 06:14:59PM -0800, Andrew P. Lentvorski, Jr. wrote:
> On Thu, 13 Mar 2003, Steve Sizemore wrote:
> > Running RELENG_5_0 as nfs server with a Solaris 2.5 client. rpc.statd
> > and rpc.lockd both running on FreeBSD, lockd and statd both running on
> > Solaris. Locking a file (flock) works fine, but when an attempt to
> > unlock it is made, the client session hangs. The program is typically
> > (but not always) uninterruptible, and I have to kill the login session.
>
> That's ... odd. However, the Solaris rpc.lockd does some strange caching
> that can lead to asymmetric behavior. In addition, you are running
> Solaris 2.5 which qualifies as practically prehistoric in computer time.
> That's going to activate some old mechanisms which FreeBSD may or may
> not support.

OK, that was a typo - it's really 2.6. Not quite so ancient. However, I also have a Solaris 8 machine that shows the same behavior, so I've used it to generate the requested output.

> Several areas are suspect:
> 1) RPC can't agree on a protocol version with Solaris 2.5
> 2) NFS can't agree on a protocol version with Solaris 2.5
> 3) The lock attempt itself is broken
>
> The last time a hang like this happened I believe that it was an issue
> in not returning the correct rejection notice during an RPC negotiation.
> I recommend using ethereal to create a trace file. This is going to be
> tough to debug as I don't have access to a Solaris 2.5 machine to test
> the interaction and see what is going on.

I have installed ethereal, so I can do a trace if you tell me what options to use. In the meantime, I'm attaching the output of truss (Solaris 8) and tcpdump (FreeBSD).
Note that the program has now been simplified to do only the lock, since it's no longer necessary to unlock the file to get it to hang. Here's the demo program:

    #!/usr/local/bin/perl -w
    use strict;
    use File::BasicFlock;

    my $filename = shift;
    print "Locking $filename\n";
    lock($filename);
    print "Done\n";
    exit;

Output files are attached. Thanks.

-- 
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley

[attached truss output:]

execve("plock", 0xFFBEFA54, 0xFFBEFA68)  argc = 4
mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF3A0000
resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
open("/var/ld/ld.config", O_RDONLY)  Err#2 ENOENT
stat("/usr/local/lib/libsocket.so.1", 0xFFBEF17C)  Err#2 ENOENT
stat("/opt/SUNWspro/lib/libsocket.so.1", 0xFFBEF17C)  Err#2 ENOENT
stat("/usr/openwin/lib/libsocket.so.1", 0xFFBEF17C)  Err#2 ENOENT
stat("/usr/lib/libsocket.so.1", 0xFFBEF17C) = 0
open("/usr/lib/libsocket.so.1", O_RDONLY) = 3
fstat(3, 0xFFBEF17C) = 0
mmap(0x00000000, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xFF390000
mmap(0x00000000, 114688, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xFF370000
mmap(0xFF38A000, 4365, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 40960) = 0xFF38A000
munmap(0xFF37A000, 65536) = 0
memcntl(0xFF370000, 14496, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3) = 0
stat("/usr/local/lib/libnsl.so.1", 0xFFBEF17C)  Err#2 ENOENT
stat("/opt/SUNWspro/lib/libnsl.so.1", 0xFFBEF17C)  Err#2 ENOENT
stat("/usr/openwin/lib/libnsl.so.1", 0xFFBEF17C)  Err#2 ENOENT
stat("/usr/lib/libnsl.so.1", 0xFFBEF17C) = 0
open("/usr/lib/libnsl.so.1", O_RDONLY) = 3
fstat(3, 0xFFBEF17C) = 0
mmap(0xFF390000, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) =
Re: NFS file unlocking problem
In the last episode (Mar 14), Steve Sizemore said:
> > That's ... odd. However, the Solaris rpc.lockd does some strange
> > caching that can lead to asymmetric behavior. In addition, you are
> > running Solaris 2.5 which qualifies as practically prehistoric in
> > computer time. That's going to activate some old mechanisms which
> > FreeBSD may or may not support.
>
> OK, that was a typo - it's really 2.6. Not quite so ancient. However, I
> also have a Solaris 8 machine that shows the same behavior, so I've used
> it to generate the requested output.

Oops. You appended a decoded dump again. I should have told you how to generate a raw tcpdump log. Add -s 1500 -w file.pcap to the tcpdump commandline. You won't get any output to the screen, but the raw packet contents will get written to the file. You can replay it with tcpdump -r, or load it into ethereal and view the packets in the GUI.

> > Several areas are suspect:
> > 1) RPC can't agree on a protocol version with Solaris 2.5
> > 2) NFS can't agree on a protocol version with Solaris 2.5
> > 3) The lock attempt itself is broken

Judging by the truss, I'd say #3.

> Note that the program has now been simplified to do only the lock, since
> it's no longer necessary to unlock the file to get it to hang. Here's
> the demo program -

Runs fine on my Solaris 2.6 and 2.7 machines, so it's not a Solaris-FreeBSD specific problem.

    open("/home/cosmology/steve/lock_file", O_RDWR) = 3
    fstat(3, 0x000C0A2C) = 0
    fcntl(3, F_SETFD, 0x0001) = 0
    llseek(3, 0, SEEK_CUR) = 0
        Received signal #2, SIGINT, in fcntl() [default]
    fcntl(3, F_SETLKW, 0xFFBEF790)  Err#4 EINTR
        *** process killed ***

Ok, it definitely dies trying to lock the file. Check to make sure that rpc.lockd is still running on the FreeBSD server. I have seen it coredump on a couple of my 4.5 servers, and the result was what you're seeing here (lock attempts hang).

-- 
Dan Nelson [EMAIL PROTECTED]
Re: NFS file unlocking problem
In the last episode (Mar 14), Steve Sizemore said:
> On Fri, Mar 14, 2003 at 09:58:56AM -0600, Dan Nelson wrote:
> > In the last episode (Mar 13), Steve Sizemore said:
> > > Running RELENG_5_0 as nfs server with a Solaris 2.5 client. rpc.statd
> > > and rpc.lockd both running on FreeBSD, lockd and statd both running
> > > on Solaris. Locking a file (flock) works fine, but when an attempt to
> > > unlock it is made, the client session hangs. The program is typically
> > > (but not always) uninterruptible, and I have to kill the login
> > > session.
> >
> > See if you can get a dump of the packets sent when the client hangs.
>
> I've enclosed the dump as an attachment - physics is the FreeBSD machine
> and cfpa11 is the Solaris box. Actually, in this case, the program only
> locked the file, no unlock, and it hung anyway. (This is different from
> before.) Is this output helpful?

Not really; tcpdump doesn't decode NFS or RPC packets well enough. ethereal does a much better job (but unfortunately doesn't have a good one-line-per-packet mode, so you can't just email the decoded dump). Ideally, a truss of the last 10 lines of the failing program plus the raw tcpdump log (run with -s 1500 so we get the whole packet) would be better. The truss is so we have proof that file locks are really to blame :)

-- 
Dan Nelson [EMAIL PROTECTED]