Re: [Re: NFS -current
On Thu, Mar 27, 2003 at 12:09:28AM -0800, Terry Lambert wrote:
> Steve Sizemore wrote:
> > On Wed, Mar 26, 2003 at 12:18:11AM -0800, Terry Lambert wrote:
> > > In fact, the only legitimate argument I have ever heard for UDP
> > > has been "I have an old Linux install that can't talk TCP, as
> > > only UDP was implemented at the time I installed it".
> >
> > Have you already forgotten the locking problem that you were
> > helping me with last week? The only solution was to use UDP.
>
> Working around a screwed up implementation is not a legitimate
> argument. The only legitimate argument to that is unscrewing the
> implementation. 8-).

I agree with that to a degree - at least from the perspective of a
developer. (If I had the knowledge and time to unscrew the
implementation, I would certainly try.) However, for those who are
primarily sysadmins and FreeBSD advocates, using UDP is a legitimate
alternative to switching to Linux.

Steve
--
Steve Sizemore [EMAIL PROTECTED], (510) 642-8570
Unix System Manager
Dept. of Mathematics and College of Letters and Science
University of California, Berkeley
___
[EMAIL PROTECTED] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: [Re: NFS -current
On Wed, Mar 26, 2003 at 12:18:11AM -0800, Terry Lambert wrote:
> In fact, the only legitimate argument I have ever heard for UDP has
> been "I have an old Linux install that can't talk TCP, as only UDP
> was implemented at the time I installed it".

Hi, Terry -

Have you already forgotten the locking problem that you were helping
me with last week? The only solution was to use UDP.

Steve
Re: NFS file unlocking problem
On Wed, Mar 19, 2003 at 05:15:02PM -0800, Andrew P. Lentvorski, Jr. wrote:
> Steve,
>
> I actually managed to pull down the dump. It doesn't have any lock
> requests in it. It looks like it is hanging in the rpcinfo call.
>
> If you really want to debug this, it's going to take a chunk of work.
>
> 1) set up two brand new machines
>    FreeBSD-current for the server
>    Solaris whatever for the client

Well, I think I found the problem. I had just installed the new FreeBSD
machine, not looking forward to finding another Sun to install, when I
had the idea to try different nfs protocols. By default, it looks like
the NFS mount is version 3 tcp. I specified udp, and both of the test
programs (mine and Terry's) work consistently. I've reenabled locking
on the Xinet software, and we'll see (tomorrow) whether or not that
also works.

If I'm right, that means that there is a problem with nfs over tcp
with a Solaris client and FreeBSD-5 server. All other combinations of
client/server pairs worked with the default (which I assume is
version 3 tcp).

BTW, Terry's testlock program, when run with the problem
configuration, would always return the "There is nothing that would
block your lock" message, but it could take anywhere from 1 second to
45 minutes. I'm not so confident that my perl program would always
succeed, because I was never willing to wait longer than overnight
before concluding that it was hung.

I'll report back on the status of Xinet tomorrow, but thanks to all of
you for your help and suggestions. If there's any more information I
can provide, or testing that I can do, to help fix the
NFS/tcp/Solaris problem, let me know.

Thanks.
Steve

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message
Re: NFS file unlocking problem
On Mon, Mar 17, 2003 at 11:36:58PM -0800, Terry Lambert wrote:
> Steve Sizemore wrote:
> > useful. As it is, it's still interesting. I have no way of judging
> > the quality of the code in question, other than the empirical
> > result that it works in most cases.
>
> Well, then you are stuck with the code you have that someone else
> wrote. Hopefully that's not your problem, or you are in trouble. 8-).

Actually, maybe not, since it's a commercial program. If I could
demonstrate that it's their problem, I could put pressure on them to
fix it. However, at this point, I don't think that's the case.

> OK, then it isn't an intra-program deadlock, which is something. It
> could still be inter-program, but if it is, it's not going to be easy
> to find; you will need to find someone who *is* a programmer. FWIW,
> this happens when:
>
>     Program 1                   Program 2
>     LOCK A                      LOCK B
>     LOCK B                      LOCK A
>     (Waiting for Program 2)     (Waiting for Program 1
>                                  waiting for me)

I don't see how it could be inter-program, since I've gone to great
lengths to simplify it to a single program failing on a brand new file.

> > > On the other hand, this is clearly a deadlock that requires an
> > > existing, conflicting lock -- IFF you are correct about the
> > > delayed locking behaviour.
> >
> > Not sure I understand this.
>
> If someone didn't already have it locked, your lock which waits for
> the region to be able to lock it would not need to wait: it would
> just give you the lock, and you wouldn't have the problem.

Oh, so that's what that meant. :-) But (see above) it's pretty clear
to me that nothing else could have it locked.

> You need to find out why it's waiting. If it's waiting, it's waiting
> for somebody. You need to know who that somebody is. Once you know
> that, you can go hit them over the head with a large baseball bat.
> 8-).

Yes. But that somebody is undoubtedly not a real person.

> I have attached the program to run on your Solaris box. You may have
> to look in /usr/include/sys/fcntl.h to see the right name, if it
> complains about l_rsysid (might be l_sysid, or whatever).
>
> I'm attaching a test program to run on the server when the lock
> fails, using information from the trace to know the name of the file
> to enter, and the ethereal decoded packet trace to know how to answer
> the other questions.

I'll try it today.

> But I think it may be as simple as you not telling us that you have
> multiple IP addresses configured on one of your machines?

No, but this might be an important clue. The FreeBSD host has multiple
(2) A Records in the DNS. In fact, I think that when it last worked,
it had only a single A Record.

Also, I notice that there are two rpc.lockd processes running on the
FreeBSD server. I hadn't noticed that before it started failing, but I
didn't mention it, since rpc.lockd does get invoked twice in
rc.network. However, rpc.statd also gets called twice, and there's
only a single version of it running...

root   399 0.0 0.1 263496 1000 ?? Is 9:11AM 0:00.00 /usr/sbin/rpc.sta
root   402 0.0 0.1   1512 1156 ?? Ss 9:11AM 0:00.00 /usr/sbin/rpc.loc
daemon 405 0.0 0.1   1484 1176 ?? I  9:11AM 0:00.00 /usr/sbin/rpc.loc

Does that indicate a problem?

> If so, try:
>
>     sysctl -w net.inet.ip.check_interface=0

What does this do, just turn off checking? Can I do this on the
running system, or do I need to put it into sysctl.conf and reboot?
(BTW, from the man page - "The -w option has been deprecated and is
silently ignored.")

Thanks.
Steve
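The lock-ordering scenario Terry lays out in the message above (Program 1 takes A then B, Program 2 takes B then A) can be sketched as a wait cycle between lock holders. The following is only an illustration, not anything posted in the thread: `find_deadlock`, `holders`, and `wants` are hypothetical names, and the sketch assumes each program is blocked on at most one lock.

```python
def find_deadlock(waits_for):
    """Return the participants of a wait cycle, or None if there is none.

    `waits_for` maps each blocked program to the program holding the
    lock it is waiting on (assumed: at most one per program).
    """
    for start in waits_for:
        seen = []
        node = start
        # Follow the chain of "is waiting for" until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            return seen[seen.index(node):]
    return None


# The two-program case from the message: each holds one lock and wants the other.
holders = {"A": "Program 1", "B": "Program 2"}  # lock -> current owner
wants = {"Program 1": "B", "Program 2": "A"}    # program -> lock it is blocked on
waits_for = {prog: holders[lock] for prog, lock in wants.items()}
print(find_deadlock(waits_for))  # ['Program 1', 'Program 2']
```

With a single program and a brand new file, as Steve describes, nobody else holds a lock, so `waits_for` has no chain to follow and the function returns `None`. That is why the hang here points away from an ordinary lock-ordering deadlock.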
Re: NFS file unlocking problem
On Tue, Mar 18, 2003 at 05:25:36PM -0800, Andrew P. Lentvorski, Jr. wrote:
> On Tue, 18 Mar 2003, Steve Sizemore wrote:
> > root   399 0.0 0.1 263496 1000 ?? Is 9:11AM 0:00.00 /usr/sbin/rpc.sta
> > root   402 0.0 0.1   1512 1156 ?? Ss 9:11AM 0:00.00 /usr/sbin/rpc.loc
> > daemon 405 0.0 0.1   1484 1176 ?? I  9:11AM 0:00.00 /usr/sbin/rpc.loc
>
> This might be the culprit. The way that rpc.lockd works is that it
> grabs a lock on the *entire* underlying file when any lock request
> comes in. If those requests get misrouted to the wrong daemon, it is
> likely to cause havoc.

OK. It appears that starting rpc.lockd automatically spawns two
copies, so if that's a problem, how can I fix it?

Thanks.
Steve
Re: NFS file unlocking problem
On Mon, Mar 17, 2003 at 01:21:19PM -0800, Andrew P. Lentvorski, Jr. wrote:
> On Sun, 16 Mar 2003, Steve Sizemore wrote:
>
> The dump doesn't seem to be attached. However, I note that the request

It appears that there are problems sending the raw dump. I've tried
twice - once 2 minutes after I sent the original message, and once
again when I got this from you. Neither has shown up on the list. I
can find another way to make it available if you need to see it.

> being sent is SETLKW which is a blocking wait until lock is granted.
> If the server thinks the file is already locked, it will hang *and*
> that is the proper behavior.
>
> What is the result of running this locally on the NFS server and
> attempting to lock the underlying file? If rpc.lockd is hanging onto
> a lock, running that perl script locally on the actual file (not an
> NFS mounted image of it) should also hang.

It seems to work as expected (at least as I expect) on the server. If
no other process has a lock, then the program locks the file, unlocks
it, and exits immediately. If the remote client is trying to
lock/unlock the file, then running the same program on the server also
hangs.

One other twist - recently, the behavior is less predictable. A couple
of times in the last 24 hours, the lock/unlock on the client has
actually worked as it should. The first time it happened, I was so
surprised that I thought I must have locked a local file rather than
an NFS mounted file. On other occasions, the program has succeeded
after very long hangs, e.g.

    % time plock xxx
    Locking xxx
    Unlocking xxx
    Done
    0.21u 0.05s 55:35.33 0.0%

This makes me wonder whether waiting indefinitely would succeed in all
cases. (Note, however, that I've frequently waited more than an hour
before killing the process or giving up.)

> As a side note, you probably want to create a C executable to do this
> kind of fcntl fiddling when attempting to test NFS. That way you can
> use a locally mounted binary and you won't wind up with all of the
> Perl access calls on the NFS wire. Or, at least, use a local copy of
> Perl.

If I trusted my C skills as much as I trust my perl skills, I would do
that. The perl stuff is all mounted locally, so there shouldn't be any
perl nfs traffic on the wire.

Let me know if you still need to see the dump.

Steve
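As a rough illustration of the kind of small fcntl test Andrew suggests in the message above, here is a minimal sketch in Python rather than C. It is not the `plock` script or Terry's testlock program; `time_lock_cycle` is a hypothetical helper that times the lock step and the unlock step separately, so a run like the 55-minute one quoted above would show which call actually stalled. It assumes the interpreter itself is installed locally, so only the lock traffic crosses the NFS wire.

```python
import fcntl
import os
import time


def time_lock_cycle(path, blocking=True):
    """Lock then unlock `path` with POSIX record locks.

    Returns the seconds spent in each step as {"lock": ..., "unlock": ...}.
    """
    flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        start = time.monotonic()
        # Blocking request maps to F_SETLKW, non-blocking to F_SETLK.
        fcntl.lockf(fd, flags)
        locked = time.monotonic()
        # Releasing the whole-file lock; closing the fd would also drop it.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        unlocked = time.monotonic()
    finally:
        os.close(fd)
    return {"lock": locked - start, "unlock": unlocked - locked}
```

Calling `time_lock_cycle("/mnt/nfs/xxx")` on a file under the NFS mount (the path is a placeholder) and again on a local file gives a quick comparison. With `blocking=False` a conflicting lock raises `OSError` instead of waiting.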
Re: NFS file unlocking problem
program at all, except as necessary to diagnose this problem.

I'll summarize the situation briefly. The issue cropped up in a
commercial program (Xinet) which was working on Solaris 2.6 client and
server. I'm replacing the server with a FreeBSD box (RELENG_5_0) and
the program stopped working. Xinet tech support diagnosed it as nfs
locking problem, which I've confirmed by my simple perl program.

Client     Server     Result
======     ======     ======
Solaris    Solaris    Works
FreeBSD    Solaris    Works
FreeBSD    FreeBSD    Works
Solaris    FreeBSD    Problems

Actually, when I say "works", all I know is that it doesn't hang.
Whether or not the lock is actually effective, I haven't tested.

Oh, and the nonblocking flock also hangs, just like the blocking one.
The lock call returns; the unlock call doesn't.

Thanks.
Steve
Re: NFS file unlocking problem
On Sat, Mar 15, 2003 at 01:33:11AM -0600, Dan Nelson wrote:
> Oops. You appended a decoded dump again. I should have told you how
> to generate a raw tcpdump log. Add "-s 1500 -w file.pcap" to the
> tcpdump commandline. You won't get any output to the screen, but the
> raw packet contents will get written to the file. You can replay it
> with "tcpdump -r", or load it into ethereal and view the packets in
> the GUI.

Sorry - I was trying to be too helpful. I actually did capture the raw
dump but appended the decoded output. This time, I've attached a real
raw dump.

> Runs fine on my Solaris 2.6 and 2.7 machines, so it's not a
> Solaris-FreeBSD specific problem.

I was already pretty sure that's true - I actually did have it working
for some number of days. I just don't know what broke it.

> Ok, it definitely dies trying to lock the file. Check to make sure
> that rpc.lockd is still running on the FreeBSD server. I have seen it
> coredump on a couple of my 4.5 servers and the result was what you're
> seeing here (lock attempts hang).

rpc.lockd and rpc.statd still running on FreeBSD, lockd and statd
still running on Solaris.

I hope the tcpdump gives you a clue what might be going wrong...

Thanks.
Steve
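Since the raw-versus-decoded mix-up came up twice in this thread, a quick check before attaching a capture may help. The sketch below is an editorial illustration, not something posted on the list: `read_pcap_header` is a hypothetical helper that reads the 24-byte global header of a classic libpcap file (magic number 0xa1b2c3d4) and reports the format version and snapshot length, so you can also confirm that `-s 1500` took effect. It assumes the classic pcap layout and does not handle other capture formats.

```python
import struct

# Magic number 0xa1b2c3d4 as it appears on disk, keyed to the byte order
# needed to decode the remaining header fields.
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # written on a little-endian host
    b"\xa1\xb2\xc3\xd4": ">",  # written on a big-endian host
}


def read_pcap_header(path):
    """Return (version_major, version_minor, snaplen), or None if the
    file does not start with a classic pcap global header."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] not in PCAP_MAGICS:
        return None
    order = PCAP_MAGICS[header[:4]]
    # Fields after the magic: version major/minor, timezone offset,
    # timestamp accuracy, snapshot length, link-layer type.
    major, minor, _zone, _sigfigs, snaplen, _linktype = struct.unpack(
        order + "HHiIII", header[4:]
    )
    return major, minor, snaplen
```

A decoded text dump returns `None` here, while a file written with `-s 1500 -w file.pcap` should report a snapshot length of 1500.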
Re: NFS file unlocking problem
On Fri, Mar 14, 2003 at 09:58:56AM -0600, Dan Nelson wrote:
> In the last episode (Mar 13), Steve Sizemore said:
> > Running RELENG_5_0 as nfs server with a Solaris 2.5 client.
> > rpc.statd and rpc.lockd both running on FreeBSD, lockd and statd
> > both running on Solaris. Locking a file (flock) works fine, but
> > when an attempt to unlock it is made, the client session hangs. The
> > program is typically (but not always) uninterruptible, and I have
> > to kill the login session.
>
> See if you can get a dump of the packets sent when the client hangs.

Thanks for the reply, Dan. I have captured packets (tcpdump) between
the hosts, and I'm enclosing that as an attachment. I'm not facile
enough with tcpdump to capture only the significant packets, so I just
got everything. In the dump, physics is the FreeBSD machine and cfpa11
is the Solaris box.

Actually, in this case, the program only locked the file, no unlock,
and it hung anyway. (This is different from before.)

Is this output helpful?

Thanks.
Steve

From [EMAIL PROTECTED] Fri Mar 14 13:24:54 2003
Return-Path: [EMAIL PROTECTED]
Received: from physics.berkeley.edu (Physics.Berkeley.EDU [128.32.61.77])
    by Math.Berkeley.EDU (8.12.8/8.12.8) with ESMTP id h2ELOseB023171
    for [EMAIL PROTECTED]; Fri, 14 Mar 2003 13:24:54 -0800 (PST)
Received: from physics.berkeley.edu (localhost [127.0.0.1])
    by physics.berkeley.edu (8.12.6/8.12.6) with ESMTP id h2ELOrdu008457
    for [EMAIL PROTECTED]; Fri, 14 Mar 2003 13:24:53 -0800 (PST)
    (envelope-from [EMAIL PROTECTED])
Received: (from [EMAIL PROTECTED])
    by physics.berkeley.edu (8.12.6/8.12.6/Submit) id h2ELOr4l008456
    for [EMAIL PROTECTED]; Fri, 14 Mar 2003 13:24:53 -0800 (PST)
    (envelope-from steve)
Date: Fri, 14 Mar 2003 13:24:53 -0800 (PST)
From: Steve Sizemore [EMAIL PROTECTED]
Message-Id: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
X-Spam-Status: No, hits=1.1 required=5.0
    tests=SPAM_PHRASE_00_01,SUBJ_MISSING version=2.43
X-Spam-Level: *
Status: RO
Content-Length: 4087
Lines: 40

13:23:10.746188 cfpa11.Berkeley.EDU.2399800991 > physics.berkeley.edu.nfs: 112 fsstat [|nfs] (DF)
13:23:10.746259 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800991: reply ok 172 fsstat [|nfs] (DF)
13:23:10.843671 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 205284758 win 8760 (DF)
13:23:11.934004 cfpa11.Berkeley.EDU.2399800992 > physics.berkeley.edu.nfs: 116 fsstat [|nfs] (DF)
13:23:11.934085 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800992: reply ok 172 fsstat [|nfs] (DF)
13:23:12.033672 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 173 win 8760 (DF)
13:23:15.801623 cfpa11.Berkeley.EDU.2399800998 > physics.berkeley.edu.nfs: 124 lookup [|nfs] (DF)
13:23:15.801707 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800998: reply ok 240 lookup [|nfs] (DF)
13:23:15.802485 cfpa11.Berkeley.EDU.2399800999 > physics.berkeley.edu.nfs: 120 lookup [|nfs] (DF)
13:23:15.802521 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399800999: reply ok 240 lookup [|nfs] (DF)
13:23:15.803314 cfpa11.Berkeley.EDU.2399801000 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.803345 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801000: reply ok 116 getattr [|nfs] (DF)
13:23:15.803999 cfpa11.Berkeley.EDU.2399801001 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.804027 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801001: reply ok 116 getattr [|nfs] (DF)
13:23:15.862284 cfpa11.Berkeley.EDU.2399801029 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:15.862318 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801029: reply ok 116 getattr [|nfs] (DF)
13:23:15.953781 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 1001 win 8760 (DF)
13:23:16.138906 cfpa11.Berkeley.EDU.2399801083 > physics.berkeley.edu.nfs: 112 getattr [|nfs] (DF)
13:23:16.138970 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801083: reply ok 116 getattr [|nfs] (DF)
13:23:16.139606 cfpa11.Berkeley.EDU.2399801084 > physics.berkeley.edu.nfs: 116 access [|nfs] (DF)
13:23:16.139643 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU.2399801084: reply ok 124 access c 8f0a0efd (DF)
13:23:16.141238 cfpa11.Berkeley.EDU.1021 > physics.berkeley.edu.sunrpc: P 2839020119:2839020215(96) ack 3472682147 win 8760 (DF)
13:23:16.143832 physics.berkeley.edu.sunrpc > cfpa11.Berkeley.EDU.1021: P 1:33(32) ack 96 win 65535 (DF)
13:23:16.233777 cfpa11.Berkeley.EDU.1005 > physics.berkeley.edu.nfsd: . ack 1241 win 8760 (DF)
13:23:16.243744 cfpa11.Berkeley.EDU.1021 > physics.berkeley.edu.sunrpc: . ack 33 win 8760 (DF)
13:23:16.919974 cfpa11.Berkeley.EDU.2399801086 > physics.berkeley.edu.nfs: 116 fsstat [|nfs] (DF)
13:23:16.920036 physics.berkeley.edu.nfs > cfpa11.Berkeley.EDU
Re: NFS file unlocking problem
On Fri, Mar 14, 2003 at 04:47:11PM -0600, Dan Nelson wrote:
> Ideally, a truss of the last 10 lines of the failing program plus the
> raw tcpdump log (run with -s 1500 so we get the whole packet) would
> be better. The truss is so we have proof that file locks are really
> to blame :)

On Fri, Mar 14, 2003 at 06:14:59PM -0800, Andrew P. Lentvorski, Jr. wrote:
> On Thu, 13 Mar 2003, Steve Sizemore wrote:
> > Running RELENG_5_0 as nfs server with a Solaris 2.5 client.
> > rpc.statd and rpc.lockd both running on FreeBSD, lockd and statd
> > both running on Solaris. Locking a file (flock) works fine, but
> > when an attempt to unlock it is made, the client session hangs. The
> > program is typically (but not always) uninterruptible, and I have
> > to kill the login session.
>
> That's ... odd. However, the Solaris rpc.lockd does some strange
> caching that can lead to asymmetric behavior. In addition, you are
> running Solaris 2.5 which qualifies as practically prehistoric in
> computer time. That's going to activate some old mechanisms which
> FreeBSD may or may not support.

OK, that was a typo - it's really 2.6. Not quite so ancient. However,
I also have a Solaris 8 machine that has the same behavior, so I've
used it to generate the requested output.

> Several areas are suspect:
> 1) RPC can't agree on a protocol version with Solaris 2.5
> 2) NFS can't agree on a protocol version with Solaris 2.5
> 3) The lock attempt itself is broken
>
> The last time a hang like this happened I believe that it was an
> issue in not returning the correct rejection notice during an RPC
> negotiation. I recommend using ethereal to create a trace file. This
> is going to be tough to debug as I don't have access to a Solaris 2.5
> machine to test the interaction and see what is going on.

I have installed ethereal, so I could do a trace, if you tell me what
options to use. In the meantime, I'm attaching the output of truss
(Solaris 8) and tcpdump (FreeBSD). Note that the program now has been
simplified to do only the lock, since it's no longer necessary to
unlock the file to get it to hang.

Here's the demo program -

    #!/usr/local/bin/perl -w
    use strict;
    use File::BasicFlock;
    my $filename = shift;
    print "Locking $filename\n";
    lock($filename);
    print "Done\n";
    exit;

Output files are attached.

Thanks.

execve(plock, 0xFFBEFA54, 0xFFBEFA68) argc = 4
mmap(0x, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF3A
resolvepath(/usr/lib/ld.so.1, /usr/lib/ld.so.1, 1023) = 16
open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
stat(/usr/local/lib/libsocket.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/opt/SUNWspro/lib/libsocket.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/usr/openwin/lib/libsocket.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/usr/lib/libsocket.so.1, 0xFFBEF17C) = 0
open(/usr/lib/libsocket.so.1, O_RDONLY) = 3
fstat(3, 0xFFBEF17C) = 0
mmap(0x, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xFF39
mmap(0x, 114688, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xFF37
mmap(0xFF38A000, 4365, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 40960) = 0xFF38A000
munmap(0xFF37A000, 65536) = 0
memcntl(0xFF37, 14496, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3) = 0
stat(/usr/local/lib/libnsl.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/opt/SUNWspro/lib/libnsl.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/usr/openwin/lib/libnsl.so.1, 0xFFBEF17C) Err#2 ENOENT
stat(/usr/lib/libnsl.so.1, 0xFFBEF17C) = 0
open(/usr/lib/libnsl.so.1, O_RDONLY) = 3
fstat(3, 0xFFBEF17C) = 0
mmap(0xFF39, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0