Re: NFS stop/start problems (related to datagram shutdown bug?)
> There does seem to be a possible problem with sk_inuse not being
> updated atomically, so a race between an increment and a decrement
> could lose one of them.
> svc_sock_release seems to often be called with no more protection than
> the BKL, and it decrements sk_inuse.
>
> svc_sock_enqueue, on the other hand increments sk_inuse, and is
> protected by sv_lock, but not, I think, by the BKL, as it is called by
> a networking layer callback. So there might be a possibility for a
> race here.
>
> The attached patch might fix it, so if you are having reproducible
> problems, it might be worth applying this patch.
>
> NeilBrown

I applied the patch and the problem seems to have gone away, where it was
fairly reproducible beforehand. It waits a little longer (about 4 seconds)
during the NFS daemon shutdown before [ OK ] pops up, but it could be my
imagination because I was doing it on the 166 and I was used to the 866's.

But what matters is that I can stop and restart NFS just fine now whereas
before I couldn't.

Thanks for the patch.

 -Byron

--
Byron Stanoszek                         Ph: (330) 644-3059
Systems Programmer                      Fax: (330) 644-8110
Commercial Timesharing Inc.             Email: [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: NFS stop/start problems (related to datagram shutdown bug?)
> " " == Neil Brown <[EMAIL PROTECTED]> writes:

 > The attached patch might fix it, so if you are having
 > reproducible problems, it might be worth applying this patch.

 > Trond: any comments?

 > +
 > +	spin_lock_bh(&serv->sv_lock);
 >  	if (!--(svsk->sk_inuse) && svsk->sk_dead) {
 > +		spin_unlock_bh(&serv->sv_lock);
 >  		dprintk("svc: releasing dead socket\n");
 >  		sock_release(svsk->sk_sock);
 >  		kfree(svsk);
 >  	}
 > +	else
 > +		spin_unlock_bh(&serv->sv_lock);
 >  }

Looks correct, but there's a similar problem in svc_delete_socket()
(see the setting of sk_dead, and subsequent test for sk_inuse).

Cheers,
  Trond

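Trond's point about svc_delete_socket() can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_sock`, `model_sock_new`, `model_sock_release` and `model_sock_delete` are hypothetical names invented here, a pthread mutex stands in for `sv_lock`, and none of this is the actual kernel code or the eventual fix. It shows the idea that setting the "dead" flag and testing the use count should happen under the same lock as the decrement, so that exactly one of the two paths frees the object.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model; field names only echo the thread. */
struct model_sock {
    pthread_mutex_t lock;   /* stands in for serv->sv_lock */
    int             inuse;  /* stands in for sk_inuse */
    bool            dead;   /* stands in for sk_dead */
};

/* Allocate a model socket that starts with the given number of users. */
static struct model_sock *model_sock_new(int inuse)
{
    struct model_sock *s = malloc(sizeof(*s));
    assert(s != NULL);
    pthread_mutex_init(&s->lock, NULL);
    s->inuse = inuse;
    s->dead = false;
    return s;
}

/* Tear down the model socket once the last path has finished with it. */
static void model_sock_free(struct model_sock *s)
{
    pthread_mutex_destroy(&s->lock);
    free(s);
}

/* Drop one reference. Returns true if this call freed the socket. */
static bool model_sock_release(struct model_sock *s)
{
    pthread_mutex_lock(&s->lock);
    bool last = (--s->inuse == 0) && s->dead;
    pthread_mutex_unlock(&s->lock);
    if (last)
        model_sock_free(s);
    return last;
}

/* Mark the socket dead. Returns true if it was idle and freed here. */
static bool model_sock_delete(struct model_sock *s)
{
    pthread_mutex_lock(&s->lock);
    s->dead = true;                 /* flag and test under one lock */
    bool idle = (s->inuse == 0);
    pthread_mutex_unlock(&s->lock);
    if (idle)
        model_sock_free(s);
    return idle;
}
```

Whichever of the two calls runs second is the one that frees the object; without a common lock, both could see a stale value and either skip the free (the "destroy delayed" leak) or free twice.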
Re: NFS stop/start problems (related to datagram shutdown bug?)
On Monday February 5, [EMAIL PROTECTED] wrote:
> On Tue, 6 Feb 2001, Neil Brown wrote:
>
> > How repeatable is this? Is the server SMP?
>
> I've tested this on two UP Athlons and 2 SMP Pentium 3's and the same problem
> occurred. I have not tested it more than once on the same system (I left the
> NFS servers untouched after the reboot).
>
> The Athlon systems running NFS were 2.4.1-ac3 and the Pentiums were running
> 2.2.19-pre7. All computers exporting the FS had one directory mounted at least
> once.
>
> In one case, only 1 directory was mounted once and then unmounted before
> shutting off the NFS server. When I realized I forgot to copy a directory over,
> I went to restart NFS on the server and found out I was unable to. Probably
> irrelevant, but this had been after transferring 7 gigs of data over 100 Mbps.
>
> I still have the 'broken' server running, so if you would like me to run a
> command or two on it I can show you the results.

I don't think that there is much useful that I could look at, thanks.

>
> > The attached patch might fix it, so if you are having reproducible
> > problems, it might be worth applying this patch.
>
> I can try it tomorrow and see if it fixes the problem, but since this problem
> also occurred on a UP, using spin locks probably will not correct it. Perhaps
> it's something else.

On second thoughts, this doesn't need to be SMP related.
I don't know much about "bottom halves" but I gather that they get run
after an interrupt has been handled and interrupts have been
re-enabled, but before the original process is rescheduled.
If this is the case, then the "_bh" part of the "spin_lock_bh" (which
does a local_bh_disable) could be the bit that is important on a UP
system.

NeilBrown

>
> > [patch snipped]
>
>  -Byron
>
> --
> Byron Stanoszek                         Ph: (330) 644-3059
> Systems Programmer                      Fax: (330) 644-8110
> Commercial Timesharing Inc.             Email: [EMAIL PROTECTED]

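The uniprocessor scenario described above can be walked through with a deterministic userspace simulation. This is a sketch under stated assumptions, not kernel code: `fake_bottom_half`, `unprotected_decrement` and `run_scenario` are hypothetical names, the variable only borrows the name sk_inuse, and the decrement is split by hand into load, modify and store steps (whether a real compiler emits separate instructions depends on the architecture and optimization level).

```c
#include <stdio.h>

/* Hypothetical stand-in for svsk->sk_inuse; not the kernel structure. */
static int sk_inuse;

/* Models the bottom-half side taking a reference (an increment). */
static void fake_bottom_half(void)
{
    sk_inuse++;
}

/*
 * Models an unprotected "--sk_inuse" in process context, written out as
 * load / modify / store. If bh_fires is nonzero, the bottom half runs
 * between the load and the store, as it could after an interrupt.
 */
static void unprotected_decrement(int bh_fires)
{
    int tmp = sk_inuse;         /* load */
    if (bh_fires)
        fake_bottom_half();     /* interrupt, then bottom half, lands here */
    tmp = tmp - 1;              /* modify the stale copy */
    sk_inuse = tmp;             /* store: overwrites the increment */
}

/*
 * Start from a count of 1, apply one decrement and one increment, and
 * return the final count. The increment happens either inside the
 * decrement's window or safely after it.
 */
static int run_scenario(int bh_in_window)
{
    sk_inuse = 1;
    if (bh_in_window) {
        unprotected_decrement(1);
    } else {
        unprotected_decrement(0);
        fake_bottom_half();
    }
    return sk_inuse;
}
```

One increment and one decrement should leave the count at 1. `run_scenario(0)` does, while `run_scenario(1)` ends at 0 because the increment is lost. Disabling bottom halves around the decrement, which is what the "_bh" part of spin_lock_bh does, closes that window even with a single CPU.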
Re: NFS stop/start problems (related to datagram shutdown bug?)
On Tue, 6 Feb 2001, Neil Brown wrote:

> How repeatable is this? Is the server SMP?

I've tested this on two UP Athlons and 2 SMP Pentium 3's and the same problem
occurred. I have not tested it more than once on the same system (I left the
NFS servers untouched after the reboot).

The Athlon systems running NFS were 2.4.1-ac3 and the Pentiums were running
2.2.19-pre7. All computers exporting the FS had one directory mounted at least
once.

In one case, only 1 directory was mounted once and then unmounted before
shutting off the NFS server. When I realized I forgot to copy a directory over,
I went to restart NFS on the server and found out I was unable to. Probably
irrelevant, but this had been after transferring 7 gigs of data over 100 Mbps.

I still have the 'broken' server running, so if you would like me to run a
command or two on it I can show you the results.

> The attached patch might fix it, so if you are having reproducible
> problems, it might be worth applying this patch.

I can try it tomorrow and see if it fixes the problem, but since this problem
also occurred on a UP, using spin locks probably will not correct it. Perhaps
it's something else.

> [patch snipped]

 -Byron

--
Byron Stanoszek                         Ph: (330) 644-3059
Systems Programmer                      Fax: (330) 644-8110
Commercial Timesharing Inc.             Email: [EMAIL PROTECTED]

Re: NFS stop/start problems (related to datagram shutdown bug?)
On Monday February 5, [EMAIL PROTECTED] wrote:
> Seems recently, on both redhat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> ran into this problem:
>
> Stopping NFS says the following in the kernel logs:
>
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> nfsd: terminating on signal 9
> svc: server socket destroy delayed
>
> And restarting NFS has the following error message:
>
> root:~> /etc/rc.d/init.d/nfs start
> Starting NFS services: [ OK ]
> Starting NFS quotas: [ OK ]
> Starting NFS mountd: [ OK ]
> Starting NFS daemon: nfssvc: Address already in use
> [FAILED]

How repeatable is this? Is the server SMP?

There does seem to be a possible problem with sk_inuse not being
updated atomically, so a race between an increment and a decrement
could lose one of them.
svc_sock_release seems to often be called with no more protection than
the BKL, and it decrements sk_inuse.

svc_sock_enqueue, on the other hand increments sk_inuse, and is
protected by sv_lock, but not, I think, by the BKL, as it is called by
a networking layer callback. So there might be a possibility for a
race here.

The attached patch might fix it, so if you are having reproducible
problems, it might be worth applying this patch.

Trond: any comments?

NeilBrown

[ a better fix would be to make sk_inuse atomic_t ]

--- net/sunrpc/svcsock.c	2001/02/05 23:45:54	1.1
+++ net/sunrpc/svcsock.c	2001/02/05 23:48:12
@@ -211,16 +211,22 @@
 svc_sock_release(struct svc_rqst *rqstp)
 {
 	struct svc_sock	*svsk = rqstp->rq_sock;
+	struct svc_serv *serv = svsk->sk_server;
 
 	if (!svsk)
 		return;
 	svc_release_skb(rqstp);
 	rqstp->rq_sock = NULL;
+
+	spin_lock_bh(&serv->sv_lock);
 	if (!--(svsk->sk_inuse) && svsk->sk_dead) {
+		spin_unlock_bh(&serv->sv_lock);
 		dprintk("svc: releasing dead socket\n");
 		sock_release(svsk->sk_sock);
 		kfree(svsk);
 	}
+	else
+		spin_unlock_bh(&serv->sv_lock);
 }
 
 /*

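The bracketed remark about making sk_inuse an atomic_t can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the kernel change: `struct model_ref`, `model_get` and `model_put` are hypothetical names, and `atomic_fetch_sub` is used here to play the role that a decrement-and-test primitive such as the kernel's `atomic_dec_and_test()` would play.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a use count kept in an atomic variable. */
struct model_ref {
    atomic_int inuse;   /* stands in for an atomic sk_inuse */
};

/* Take a reference: an atomic increment cannot be lost to a racing put. */
static void model_get(struct model_ref *r)
{
    atomic_fetch_add(&r->inuse, 1);
}

/*
 * Drop a reference. Returns true only for the caller whose decrement
 * took the count from 1 to 0, so at most one caller sees "last user".
 */
static bool model_put(struct model_ref *r)
{
    int old = atomic_fetch_sub(&r->inuse, 1);
    assert(old > 0);    /* dropping a reference nobody holds is a bug */
    return old == 1;
}
```

An atomic counter only protects the count itself. The combined check of the count and the sk_dead flag would still need care, which is presumably why the posted patch takes sv_lock around both.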
Re: NFS stop/start problems (related to datagram shutdown bug?)
On Mon, 5 Feb 2001, Alan Cox wrote:

> > Seems recently, on both redhat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> > ran into this problem:
>
> Ok seen this in older 2.2 but not 2.4
>
> > nfsd: terminating on signal 9
> > svc: server socket destroy delayed
> >
> > And restarting NFS has the following error message:
> > Starting NFS mountd: [ OK ]
> > Starting NFS daemon: nfssvc: Address already in use
> > [FAILED]
>
> A socket got stuck. That's preventing you restarting it. The bug is whatever
> leak caused the svc: server socket destroy delayed case.
>
> Just for reference what network card ?

Both machines had a 3c905b-tx-nm card in them.

3c59x.c:LK1.1.12 06 Jan 2000 Donald Becker and others. http://www.scyld.com/network/vortex.html $Revision: 1.102.2.46 $
See Documentation/networking/vortex.txt
eth0: 3Com PCI 3c905B Cyclone 100baseTx at 0x6100, 00:50:da:cd:c8:b9, IRQ 11
  product code 'XC' rev 00.13 date 12-29-99
  8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
  MII transceiver found at address 24, status 786d.
  Enabling bus-master transmits and whole-frame receives.

 -Byron

--
Byron Stanoszek                         Ph: (330) 644-3059
Systems Programmer                      Fax: (330) 644-8110
Commercial Timesharing Inc.             Email: [EMAIL PROTECTED]

Re: NFS stop/start problems (related to datagram shutdown bug?)
> Seems recently, on both redhat 6.1 and 7.0 using kernel 2.4.1-ac3, I
> ran into this problem:

Ok seen this in older 2.2 but not 2.4

> nfsd: terminating on signal 9
> svc: server socket destroy delayed
>
> And restarting NFS has the following error message:
> Starting NFS mountd: [ OK ]
> Starting NFS daemon: nfssvc: Address already in use
> [FAILED]

A socket got stuck. That's preventing you restarting it. The bug is whatever
leak caused the svc: server socket destroy delayed case.

Just for reference what network card ?

NFS stop/start problems (related to datagram shutdown bug?)
Seems recently, on both redhat 6.1 and 7.0 using kernel 2.4.1-ac3, I
ran into this problem:

Stopping NFS says the following in the kernel logs:

nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
nfsd: terminating on signal 9
svc: server socket destroy delayed

And restarting NFS has the following error message:

root:~> /etc/rc.d/init.d/nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS mountd: [ OK ]
Starting NFS daemon: nfssvc: Address already in use
[FAILED]

From that moment forward, the NFS server is completely broken until the system
is rebooted, and other machines respond during a 'mount' by saying,

nfs: server xxx not responding, still trying

When I tried this, the remote computer had unmounted this NFS-served partition
prior to shutting NFS down with '/etc/rc.d/init.d/nfs stop'.

I was wondering if this could be related to that datagram shutdown bug, and
maybe if there's a quick solution in the meantime to kill the socket so that I
can restart NFS without rebooting.

Thanks,
 Byron

--
Byron Stanoszek                         Ph: (330) 644-3059
Systems Programmer                      Fax: (330) 644-8110
Commercial Timesharing Inc.             Email: [EMAIL PROTECTED]