Re: 9-STABLE - NFS - NetAPP:
On Friday, February 15, 2013 11:31:11 pm Marc Fournier wrote:
> Trying the patch now … but what do you mean by using 'SIGSTOP'? I generally do a 'kill -HUP' then when that doesn't work 'kill -9' … should I use -STOP instead of 9?

No. This patch only helps if you are using kill -STOP to pause processes and later resume them. If you aren't doing that, then the suspension could be due to a different cause. Please try this patch instead and let me know if you see any of the 'Deferring' messages on the console:

Index: kern_thread.c
===
--- kern_thread.c	(revision 246122)
+++ kern_thread.c	(working copy)
@@ -794,7 +794,30 @@ thread_suspend_check(int return_instead)
 		    (p->p_flag & P_SINGLE_BOUNDARY) && return_instead)
 			return (ERESTART);
 
+#if 0
 		/*
+		 * Ignore suspend requests for stop signals if they
+		 * are deferred.
+		 */
+		if (P_SHOULDSTOP(p) == P_STOPPED_SIG &&
+		    td->td_flags & TDF_SBDRY) {
+			KASSERT(return_instead,
+			    ("TDF_SBDRY set for unsafe thread_suspend_check"));
+			return (0);
+		}
+#else
+		/* Ignore suspend requests if stops are deferred. */
+		if (td->td_flags & TDF_SBDRY) {
+			if (!return_instead)
+				panic("TDF_SBDRY set, but return_instead not");
+			if (P_SHOULDSTOP(p) != P_STOPPED_SIG)
+				printf("Deferring non-STOP suspension: SHOULDSTOP: %x p_flag %x\n",
+				    P_SHOULDSTOP(p), p->p_flag);
+			return (0);
+		}
+#endif
+
+		/*
 		 * If the process is waiting for us to exit,
 		 * this thread should just suicide.
 		 * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE.

-- 
John Baldwin
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: 9-STABLE - NFS - NetAPP:
2 days, 6 hrs since reboot with new kernel, server shows unreachable:

# ssh mercury
ssh_exchange_identification: Connection closed by remote host

although runtime shows it is up:

mercury up 2+06:17, 0 users, load 0.63, 0.69, 0.70

Remote console shows: (attachment: Screen Shot 2013-02-18 at 4.06.02 AM.png)

I could press return, so keyboard was still responsive, and got a new login prompt, but after typing login id, it appears to just hang …

Remotely power cycled server. This is new behaviour for that server since applying patch … will see if it happens again ...

On 2013-02-17, at 7:07 AM, Rick Macklem rmack...@uoguelph.ca wrote:
> Yes, I meant run it like you normally do and see if the hang occurs with the patch (or other problems crop up). I suspect you have some idea of how long it needs to run without a hang before you are convinced the problem is fixed.
>
> I can't do commits until April, so there is no rush from my point of view. (I suspect jhb@ will commit it at some point, if/when it appears to fix the problem and seems correct.)
>
> Thanks for testing it, rick
Re: 9-STABLE - NFS - NetAPP:
According to /var/log/messages, everything seems to have been running (at least against the local file system) up until the reboot:

===
Feb 18 12:00:00 mercury kernel: bce1: promiscuous mode disabled
Feb 18 12:00:00 mercury kernel: bce1: promiscuous mode enabled
Feb 18 12:13:55 mercury syslogd: kernel boot file is /boot/kernel/kernel
Feb 18 12:13:55 mercury kernel: Copyright (c) 1992-2013 The FreeBSD Project.
===

On 2013-02-18, at 4:12 AM, Marc Fournier scra...@hub.org wrote:
> 2days, 6hrs since reboot with new kernel, server shows unreachable:
>
> # ssh mercury
> ssh_exchange_identification: Connection closed by remote host
>
> although runtime shows it is up:
>
> mercury up 2+06:17, 0 users, load 0.63, 0.69, 0.70
>
> Remote console shows: Screen Shot 2013-02-18 at 4.06.02 AM.png
>
> I could press return, so keyboard was still responsive, and got a new login prompt, but after typing login id, it appears to just hang …
>
> Remotely power cycled server. This is new behaviour for that server since applying patch … will see if it happens again ...
Re: 9-STABLE - NFS - NetAPP:
Marc Fournier wrote:
> 'k, not sure what you want me to 'test', but so far, patch has been applied / live for ~21hrs, and no processes in state T …

Yes, I meant run it like you normally do and see if the hang occurs with the patch (or other problems crop up). I suspect you have some idea of how long it needs to run without a hang before you are convinced the problem is fixed.

I can't do commits until April, so there is no rush from my point of view. (I suspect jhb@ will commit it at some point, if/when it appears to fix the problem and seems correct.)

Thanks for testing it, rick
Re: 9-STABLE - NFS - NetAPP:
On 2013-02-15, at 7:21 AM, Rick Macklem rmack...@uoguelph.ca wrote:
> Righto. Thanks jhb and kib for looking at this.
>
> Btw John, PBDRY still gets set for sleeps in the sys/rpc code. However, as far as I can tell, it just sets TDF_SBDRY when it is already set and seems harmless. (Since this code is supposed to be generic and not specific to NFS, maybe it should stay that way?)
>
> Also, since PBDRY on the sleeps sets TDF_SBDRY, I think the above patch is ok for stable/9 without your recent head patch.
>
> Maybe Marc can test the above patch?

'k, not sure what you want me to 'test', but so far, patch has been applied / live for ~21hrs, and no processes in state T …
Re: 9-STABLE - NFS - NetAPP:
On Thursday, February 14, 2013 10:05:56 pm Rick Macklem wrote:
> Well, I've looked at this call path a little closer:
>
> 16693 104135 httpd - mi_switch+0x186 thread_suspend_check+0x19f sleepq_catch_signals+0x1c5 sleepq_timedwait_sig+0x19 _sleep+0x2ca clnt_vc_call+0x763 clnt_reconnect_call+0xfb newnfs_request+0xadb nfscl_request+0x72 nfsrpc_accessrpc+0x1df nfs34_access_otw+0x56 nfs_access+0x306 vn_open_cred+0x5a8 kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7
>
> I am probably way off, since I am not familiar with this stuff, but it seems to me that thread_suspend_check() should just return 0 for the case where stop_allowed == SIG_STOP_NOT_ALLOWED (TDF_SBDRY flag set) instead of sitting in the loop and doing a mi_switch(). I'm not even sure if it should call thread_suspend_check() for this case, but there are cases in thread_suspend_check() that I don't understand.
>
> Although I don't really understand thread_suspend_check(), I've attached a simple patch that might be a starting point for fixing this?
>
> I wouldn't recommend trying the patch until kib and/or jhb weigh in on whether it makes any sense.

I think this is the right idea, but in HEAD with the sigdeferstop() changes it should just check for TDF_SBDRY instead of adding a new parameter. I think checking for TDF_SBDRY will work even in 9 (and will make the patch smaller). Also, I think this is only needed for stop signals. Other suspend requests will eventually resume the thread, it is only stop signals that can cause the thread to get stuck indefinitely (since it depends on the user sending SIGCONT).

Marc, are you using SIGSTOP?

Index: kern_thread.c
===
--- kern_thread.c	(revision 246122)
+++ kern_thread.c	(working copy)
@@ -795,6 +795,17 @@ thread_suspend_check(int return_instead)
 			return (ERESTART);
 
 		/*
+		 * Ignore suspend requests for stop signals if they
+		 * are deferred.
+		 */
+		if (P_SHOULDSTOP(p) == P_STOPPED_SIG &&
+		    td->td_flags & TDF_SBDRY) {
+			KASSERT(return_instead,
+			    ("TDF_SBDRY set for unsafe thread_suspend_check"));
+			return (0);
+		}
+
+		/*
 		 * If the process is waiting for us to exit,
 		 * this thread should just suicide.
 		 * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE.

-- 
John Baldwin
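The reasoning in this message (a deferred stop signal should not park the thread, while other suspend requests still should) can be sketched as a small user-space model. This is not kernel code and not the committed change: the model_* names, the enum values, and the boolean standing in for TDF_SBDRY are all invented here, and every other branch of the real thread_suspend_check() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the stop reasons P_SHOULDSTOP(p) can report. */
enum model_stop {
	MODEL_NOT_STOPPED,
	MODEL_STOPPED_SIG,	/* stopped by a stop signal, e.g. SIGSTOP */
	MODEL_STOPPED_SINGLE	/* stopped for single-threading */
};

/* What the modelled check tells the sleeping thread to do. */
enum model_action {
	MODEL_CONTINUE,		/* corresponds to "return (0)" in the patch */
	MODEL_SUSPEND		/* corresponds to falling through and suspending */
};

struct model_thread {
	enum model_stop	stop_reason;	/* models P_SHOULDSTOP(p) */
	bool		stops_deferred;	/* models TDF_SBDRY in td_flags */
};

/*
 * Model of the single branch added by the patch: a stop-signal suspend
 * request is ignored while stops are deferred for this thread.
 */
enum model_action
model_suspend_check(const struct model_thread *td, bool return_instead)
{
	(void)return_instead;	/* only inspected by the assertion below */

	if (td->stop_reason == MODEL_NOT_STOPPED)
		return (MODEL_CONTINUE);

	if (td->stop_reason == MODEL_STOPPED_SIG && td->stops_deferred) {
		/* Mirrors the KASSERT: only valid for return_instead callers. */
		assert(return_instead);
		return (MODEL_CONTINUE);
	}

	/* All other suspend requests are honoured in this simplified model. */
	return (MODEL_SUSPEND);
}
```

In this model a thread sleeping in an RPC with stops deferred keeps running when the process is stopped by a signal, so it can finish the RPC and release whatever it holds, while a single-threading request still suspends it.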
Re: 9-STABLE - NFS - NetAPP:
On Fri, Feb 15, 2013 at 08:44:43AM -0500, John Baldwin wrote:
> I think this is the right idea, but in HEAD with the sigdeferstop() changes it should just check for TDF_SBDRY instead of adding a new parameter. I think checking for TDF_SBDRY will work even in 9 (and will make the patch smaller). Also, I think this is only needed for stop signals. Other suspend requests will eventually resume the thread, it is only stop signals that can cause the thread to get stuck indefinitely (since it depends on the user sending SIGCONT).
>
> Marc, are you using SIGSTOP?
>
> Index: kern_thread.c
> ===
> --- kern_thread.c	(revision 246122)
> +++ kern_thread.c	(working copy)
> @@ -795,6 +795,17 @@ thread_suspend_check(int return_instead)
>  			return (ERESTART);
>  
>  		/*
> +		 * Ignore suspend requests for stop signals if they
> +		 * are deferred.
> +		 */
> +		if (P_SHOULDSTOP(p) == P_STOPPED_SIG &&
> +		    td->td_flags & TDF_SBDRY) {
> +			KASSERT(return_instead,
> +			    ("TDF_SBDRY set for unsafe thread_suspend_check"));
> +			return (0);
> +		}
> +
> +		/*
>  		 * If the process is waiting for us to exit,
>  		 * this thread should just suicide.
>  		 * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE.

This looks correct.
Re: 9-STABLE - NFS - NetAPP:
Konstantin Belousov wrote:
> On Fri, Feb 15, 2013 at 08:44:43AM -0500, John Baldwin wrote:
> > I think this is the right idea, but in HEAD with the sigdeferstop() changes it should just check for TDF_SBDRY instead of adding a new parameter. I think checking for TDF_SBDRY will work even in 9 (and will make the patch smaller). Also, I think this is only needed for stop signals.
>
> This looks correct.

Righto. Thanks jhb and kib for looking at this.

Btw John, PBDRY still gets set for sleeps in the sys/rpc code. However, as far as I can tell, it just sets TDF_SBDRY when it is already set and seems harmless. (Since this code is supposed to be generic and not specific to NFS, maybe it should stay that way?)

Also, since PBDRY on the sleeps sets TDF_SBDRY, I think the above patch is ok for stable/9 without your recent head patch.

Maybe Marc can test the above patch?

Thanks everyone for your help, rick
Re: 9-STABLE - NFS - NetAPP:
On Friday, February 15, 2013 10:21:11 am Rick Macklem wrote:
> Righto. Thanks jhb and kib for looking at this.
>
> Btw John, PBDRY still gets set for sleeps in the sys/rpc code. However, as far as I can tell, it just sets TDF_SBDRY when it is already set and seems harmless. (Since this code is supposed to be generic and not specific to NFS, maybe it should stay that way?)

In HEAD PBDRY is now a nop and the existing sigdeferstop() stuff should cover the calls in sys/rpc.

> Also, since PBDRY on the sleeps sets TDF_SBDRY, I think the above patch is ok for stable/9 without your recent head patch.

Yep, exactly.

> Thanks everyone for your help, rick

Thanks for your debugging!

-- 
John Baldwin
Re: 9-STABLE - NFS - NetAPP:
Trying the patch now … but what do you mean by using 'SIGSTOP'? I generally do a 'kill -HUP' then when that doesn't work 'kill -9' … should I use -STOP instead of 9?

On 2013-02-15, at 5:44 AM, John Baldwin j...@freebsd.org wrote:
> I think this is the right idea, but in HEAD with the sigdeferstop() changes it should just check for TDF_SBDRY instead of adding a new parameter. I think checking for TDF_SBDRY will work even in 9 (and will make the patch smaller). Also, I think this is only needed for stop signals. Other suspend requests will eventually resume the thread, it is only stop signals that can cause the thread to get stuck indefinitely (since it depends on the user sending SIGCONT).
>
> Marc, are you using SIGSTOP?
>
> Index: kern_thread.c
> ===
> --- kern_thread.c	(revision 246122)
> +++ kern_thread.c	(working copy)
> @@ -795,6 +795,17 @@ thread_suspend_check(int return_instead)
>  			return (ERESTART);
>  
>  		/*
> +		 * Ignore suspend requests for stop signals if they
> +		 * are deferred.
> +		 */
> +		if (P_SHOULDSTOP(p) == P_STOPPED_SIG &&
> +		    td->td_flags & TDF_SBDRY) {
> +			KASSERT(return_instead,
> +			    ("TDF_SBDRY set for unsafe thread_suspend_check"));
> +			return (0);
> +		}
> +
> +		/*
>  		 * If the process is waiting for us to exit,
>  		 * this thread should just suicide.
>  		 * Assumes that P_SINGLE_EXIT implies P_STOPPED_SINGLE.
>
> -- 
> John Baldwin
Re: 9-STABLE - NFS - NetAPP:
Marc Fournier wrote:
> On 2013-02-13, at 3:54 PM, Rick Macklem rmack...@uoguelph.ca wrote:
> > The pid that is in T state for the ps auxlH.
>
> Different server, last kernel update on Jan 22nd, https process this time instead of du last time.
>
> I've attached:
>
> ps auxlH
> ps auxlH of just the processes that are in TJ state (6 httpd servers)
> procstat output for each of the 6 processes
>
> They are included as attachments … if these don't make it through, let me know, just figured I'd try and keep it compact ...

Ok, I took a look and the interesting process seems to be 16693. It is stopped (T state) and several of its threads (22, but not all) have a procstat like this:

16693 104135 httpd - mi_switch+0x186 thread_suspend_check+0x19f sleepq_catch_signals+0x1c5 sleepq_timedwait_sig+0x19 _sleep+0x2ca clnt_vc_call+0x763 clnt_reconnect_call+0xfb newnfs_request+0xadb nfscl_request+0x72 nfsrpc_accessrpc+0x1df nfs34_access_otw+0x56 nfs_access+0x306 vn_open_cred+0x5a8 kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7

The sleep in clnt_vc_call is waiting for an RPC reply (while a vnode lock is held) with PCATCH | PBDRY flags, since it is interruptible.

I can see that thread_suspend_check() is called with an argument of 1 (return_instead == 1), since there is only one call to thread_suspend_check() in sleepq_catch_signals().

When looking at thread_suspend_check(), I basically got lost, although it seems that it can only return_instead if there is a single thread and not multiple threads doing this.

If these threads are stuck here and won't return from msleep(), that would explain the hang. If they would wake up and return from the msleep() when a wakeup occurs, it would suggest that there is a lost reply or similar, so the wakeup isn't occurring. I also don't know if a timeout of the msleep() will still occur and make the msleep() return?

Although it wasn't done to fix this, it looks like jhb@'s recent patch to head (r246417) might fix this, since it reworks how STOP signals are handled for interruptible mounts.

Hopefully kib or jhb can provide more insight.

Btw Marc, if you just want this problem to go away, I suspect getting rid of the intr mount option would do that.

rick
Re: 9-STABLE - NFS - NetAPP:
On 2013-02-14, at 08:41 , Rick Macklem rmack...@uoguelph.ca wrote:
> Btw Marc, if you just want this problem to go away, I suspect getting rid of the intr mount option would do that.

Am more interested in fixing the problem (if possible) than just masking it, but ...

Based on the man page for mount_nfs, wouldn't that have the opposite effect:

     intr    Make the mount interruptible, which implies that file system calls that are delayed due to an unresponsive server will fail with EINTR when a termination signal is posted for the process.

I may be mis-reading, but from the above it sounds like a -9 *should* terminate the process if intr is enabled, while with it disabled, it would ignore it … ?
Re: 9-STABLE - NFS - NetAPP:
Marc Fournier wrote:
> Am more interested in fixing the problem (if possible) then just masking it, but ...
>
> Based on the man page for mount_nfs, wouldn't that have the opposite effect:
>
>      intr    Make the mount interruptible, which implies that file system calls that are delayed due to an unresponsive server will fail with EINTR when a termination signal is posted for the process.
>
> I may be mis-reading, but from the above it sounds like a -9 *should* terminate the process if intr is enabled, while with it disabled, it would ignore it … ?

Yes, you have misread it (or english is a wonderfully ambiguous thing, if you prefer;-).

For hard mounts (which is what you get if you don't specify either soft or intr), the RPCs behave like other I/O subsystems, which means they do non-interruptible sleeps (D stat in ps) waiting for server replies and continue to try and complete the RPC forever. You can't kill off the process/thread with any signal.

If umount -f of the filesystem works, that terminates the thread(s). Unfortunately, umount -f is quite broken again. I have an idea on how to resolve this, but I haven't coded it yet. (The problem is that the process doing umount -f gets stuck before it does the VFS_UNMOUNT(), so the NFS client doesn't see it.)

rick
Re: 9-STABLE - NFS - NetAPP:
On 2013-02-14, at 16:24 , Rick Macklem rmack...@uoguelph.ca wrote:
> Yes, you have misread it (or english is a wonderfully ambiguous thing, if you prefer;-).
>
> For hard mounts (which is what you get if you don't specify either soft nor intr), the RPCs behave like other I/O subsystems, which means they do non-interruptible sleeps (D stat in ps) waiting for server replies and continue to try and complete the RPC forever. You can't kill off the process/thread with any signal.
>
> If umount -f of the filesystem works, that terminates the thread(s). Unfortunately, umount -f is quite broken again. I have an idea on how to resolve this, but I haven't coded it yet. (The problem is that the process doing umount -f gets stuck before it does the VFS_UNMOUNT(), so the NFS client doesn't see it.)

For how infrequently this problem generally manifests itself, is there an overall benefit from a debugging standpoint of my leaving intr on and reporting when it happens, including procstat output, and then upgrading to latest kernel … ?

It's an annoyance, but it isn't like it happens daily, so I don't mind going through the process *towards* having it fixed if there is an overall benefit …
Re: 9-STABLE - NFS - NetAPP:
Marc Fournier wrote:
> On 2013-02-13, at 3:54 PM, Rick Macklem rmack...@uoguelph.ca wrote:
> > The pid that is in T state for the ps auxlH.
>
> Different server, last kernel update on Jan 22nd, https process this time instead of du last time.
>
> I've attached:
>
> ps auxlH
> ps auxlH of just the processes that are in TJ state (6 httpd servers)
> procstat output for each of the 6 process
>
> They are included as attachments … if these don't make it through, let me know, just figured I'd try and keep it compact ...

Well, I've looked at this call path a little closer:

16693 104135 httpd - mi_switch+0x186 thread_suspend_check+0x19f sleepq_catch_signals+0x1c5 sleepq_timedwait_sig+0x19 _sleep+0x2ca clnt_vc_call+0x763 clnt_reconnect_call+0xfb newnfs_request+0xadb nfscl_request+0x72 nfsrpc_accessrpc+0x1df nfs34_access_otw+0x56 nfs_access+0x306 vn_open_cred+0x5a8 kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7

I am probably way off, since I am not familiar with this stuff, but it seems to me that thread_suspend_check() should just return 0 for the case where stop_allowed == SIG_STOP_NOT_ALLOWED (TDF_SBDRY flag set) instead of sitting in the loop and doing a mi_switch(). I'm not even sure if it should call thread_suspend_check() for this case, but there are cases in thread_suspend_check() that I don't understand.

Although I don't really understand thread_suspend_check(), I've attached a simple patch that might be a starting point for fixing this?

I wouldn't recommend trying the patch until kib and/or jhb weigh in on whether it makes any sense.

rick

--- kern/subr_sleepqueue.c.sav	2013-02-14 20:39:47.0 -0500
+++ kern/subr_sleepqueue.c	2013-02-14 21:03:03.0 -0500
@@ -443,7 +443,7 @@ sleepq_catch_signals(void *wchan, int pr
 	sig = cursig(td, stop_allowed);
 	if (sig == 0) {
 		mtx_unlock(&ps->ps_mtx);
-		ret = thread_suspend_check(1);
+		ret = thread_suspend_check(1, stop_allowed);
 		MPASS(ret == 0 || ret == EINTR || ret == ERESTART);
 	} else {
 		if (SIGISMEMBER(ps->ps_sigintr, sig))
--- kern/kern_exit.c.sav	2013-02-14 21:04:21.0 -0500
+++ kern/kern_exit.c	2013-02-14 21:04:50.0 -0500
@@ -159,7 +159,7 @@ exit1(struct thread *td, int rv)
 		 * First check if some other thread got here before us.
 		 * If so, act appropriately: exit or suspend.
 		 */
-		thread_suspend_check(0);
+		thread_suspend_check(0, SIG_STOP_ALLOWED);
 
 		/*
 		 * Kill off the other threads. This requires
--- kern/kern_sig.c.sav	2013-02-14 21:05:06.0 -0500
+++ kern/kern_sig.c	2013-02-14 21:05:40.0 -0500
@@ -1463,7 +1463,7 @@ kern_sigsuspend(struct thread *td, sigse
 		while (msleep(&p->p_sigacts, &p->p_mtx, PPAUSE|PCATCH, "pause",
 			0) == 0)
 			/* void */;
-		thread_suspend_check(0);
+		thread_suspend_check(0, SIG_STOP_ALLOWED);
 		mtx_lock(&p->p_sigacts->ps_mtx);
 		while ((sig = cursig(td, SIG_STOP_ALLOWED)) != 0)
 			has_sig += postsig(sig);
--- kern/kern_thread.c.sav	2013-02-14 21:07:06.0 -0500
+++ kern/kern_thread.c	2013-02-14 21:44:10.0 -0500
@@ -762,7 +762,7 @@ stopme:
  * return_instead is set.
  */
 int
-thread_suspend_check(int return_instead)
+thread_suspend_check(int return_instead, int stop_allowed)
 {
 	struct thread *td;
 	struct proc *p;
@@ -794,6 +794,9 @@ thread_suspend_check(int return_instead)
 		    (p->p_flag & P_SINGLE_BOUNDARY) && return_instead)
 			return (ERESTART);
 
+		if (stop_allowed == SIG_STOP_NOT_ALLOWED && return_instead)
+			return (0);
+
 		/*
 		 * If the process is waiting for us to exit,
 		 * this thread should just suicide.
--- kern/subr_trap.c.sav	2013-02-14 21:09:43.0 -0500
+++ kern/subr_trap.c	2013-02-14 21:10:02.0 -0500
@@ -283,7 +283,7 @@ ast(struct trapframe *framep)
 	 */
 	if (flags & TDF_NEEDSUSPCHK) {
 		PROC_LOCK(p);
-		thread_suspend_check(0);
+		thread_suspend_check(0, SIG_STOP_ALLOWED);
 		PROC_UNLOCK(p);
 	}
 
--- sys/proc.h.sav	2013-02-14 21:10:58.0 -0500
+++ sys/proc.h	2013-02-14 21:12:01.0 -0500
@@ -943,7 +943,7 @@ void thread_stopped(struct proc *p);
 void	childproc_stopped(struct proc *child, int reason);
 void	childproc_continued(struct proc *child);
 void	childproc_exited(struct proc *child);
-int	thread_suspend_check(int how);
+int	thread_suspend_check(int how, int stop_allowed);
 void	thread_suspend_switch(struct thread *);
 void	thread_suspend_one(struct thread *td);
 void	thread_unlink(struct thread *td);
Re: 9-STABLE - NFS - NetAPP:
Marc Fournier wrote: On 2013-02-14, at 16:24 , Rick Macklem rmack...@uoguelph.ca wrote: Marc Fournier wrote: On 2013-02-14, at 08:41 , Rick Macklem rmack...@uoguelph.ca wrote: Btw Marc, if you just want this problem to go away, I suspect getting rid of the intr mount option would do that.

Am more interested in fixing the problem (if possible) than just masking it, but ... Based on the man page for mount_nfs, wouldn't that have the opposite effect: intr Make the mount interruptible, which implies that file system calls that are delayed due to an unresponsive server will fail with EINTR when a termination signal is posted for the process. I may be mis-reading, but from the above it sounds like a -9 *should* terminate the process if intr is enabled, while with it disabled, it would ignore it … ?

Yes, you have misread it (or English is a wonderfully ambiguous thing, if you prefer ;-). For hard mounts (which is what you get if you specify neither soft nor intr), the RPCs behave like other I/O subsystems, which means they do non-interruptible sleeps (D in the STAT column of ps) waiting for server replies and continue to try and complete the RPC forever. You can't kill off the process/thread with any signal. If umount -f of the filesystem works, that terminates the thread(s). Unfortunately, umount -f is quite broken again. I have an idea on how to resolve this, but I haven't coded it yet. (The problem is that the process doing umount -f gets stuck before it does the VFS_UNMOUNT(), so the NFS client doesn't see it.)

For how infrequently this problem generally manifests itself, is there an overall benefit from a debugging standpoint of my leaving intr on and reporting when it happens, including procstat output, and then upgrading to latest kernel … ? It's an annoyance, but it isn't like it happens daily, so I don't mind going through the process *towards* having it fixed if there is an overall benefit …

Well, hopefully kib and/or jhb can make some progress w.r.t. this.
I'll let them weigh in on what to do next, rick
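Since the discussion above turns on whether a mount uses the intr option, a quick way to see which NFS entries in fstab carry it may help. This is only a sketch: it assumes the usual fstab(5) layout (file system type in the third field, comma-separated options in the fourth), as in the /vm entry quoted elsewhere in this thread, and the function name nfs_intr_mounts is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): read fstab-format lines on
# stdin and print the mount point of every NFS entry whose option list
# contains exactly "intr".
nfs_intr_mounts() {
    awk '
        /^[ \t]*#/ { next }                  # skip comment lines
        $3 == "nfs" {
            n = split($4, opts, ",")         # options are comma-separated
            for (i = 1; i <= n; i++)
                if (opts[i] == "intr") { print $2; break }
        }
    '
}

# Usage: nfs_intr_mounts < /etc/fstab
```

Matching the option exactly, rather than grepping for the substring, keeps an entry such as nointr from being reported.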
Re: 9-STABLE - NFS - NetAPP:
On Tue, Feb 12, 2013 at 08:50:39PM -0500, Rick Macklem wrote: Marc Fournier wrote: Just reset server, so any further details will have to be 'next time' … but, just did a csup and am rebuilding … the following three files were modified since last build: grep nfs /tmp/output Edit src/sys/fs/nfs/nfs_commonsubs.c Edit src/sys/fs/nfsclient/nfs_clrpcops.c Edit src/sys/fs/nfsserver/nfs_nfsdserv.c On 2013-02-10, at 4:56 PM, Marc Fournier scra...@hub.org wrote: On 2013-02-10, at 4:31 PM, Rick Macklem rmack...@uoguelph.ca wrote: Marc Fournier wrote: Hi John … Does this help? root@io:~ # ps auxl | grep du root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd It is probably too late, but all the lines (without the | grep du) would be more useful. I also include the H flag, so it lists threads as well as processes. The above just says the du command is waiting for a vnode lock. The interesting process/thread is the one that is holding a vnode lock while waiting for something else. As requested, 'ps auxlH' attached … ps.out.bz2 Well, I took a look at the ps output and I didn't see anything that would identify what the hang is. There are a lot of processes sleeping on newnfs (waiting for a vnode lock) and many sleeping on vofflock (waiting for the f_offset lock). I never got any attachments on the thread. See http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html for the description of what is needed to start debugging. Unfortunately, I can't spot any process/thread that is blocked on something else, where it would seem likely to be holding either an nfs vnode lock or f_offset lock that isn't one of these.
There were changes about 5 months ago which it appears fixed a deadlock race between vnode locks and offset locks for paging (r236321 and friends). No, I do not think that the description of the changes is right. I am wondering if there could be other similar races, possibly specific to paging in over NFS? (I can't see any case where there is a LOR, so I can't think of what it might be?) If you just want the hangs to go away, I'd suggest moving the executable in /usr/local/sbin (httpd maybe) to a local file system on the server, since it does seem to be related to paging this executable in over NFS. rick ps: I've added kib@ to the cc, in case he is aware of other related races? Are you still getting the: nfs_getpages: error 13 vm_fault: pager read error, pid 11355 (https) Fairly quiet: Screen Shot 2013-02-10 at 4.43.55 PM.png And that is it since last reboot ~20 days ago … messages logged? With John's recent patch, the error# would no longer be 13 if it was caused by the intr flag resulting in a Read RPC terminating with EINTR. If you are still getting the above with error 13, it suggests that the server is replying EACCES for the Read RPC. I suggested before that you check to make sure that the executable had read access for everyone on the file server. Since I didn't hear back, I'll assume this is the case. Don't understand this question … I have 34 VPSs running off of this server right now … that 'du process' runs against each of those VPSs every night, and this problem started happening on Friday night's run … ~18 days into uptime … so the same process has run repeatedly, with no issues, 18 times before it hung on Friday … also, the hang, once 'triggered', only seems to recur against the same directory … the same directory doesn't necessarily trigger it, but once it starts, it appears to do it for the same directory … I'm not sure if I've ever seen it happening to two different directories at the same time …
Also, please note that the du command is run from the physical server, as root … rick ps: If it is still up and hasn't been rebooted, you could: sysctl debug.kdb.break_to_debugger=1 - then type Ctrl+Alt+Esc at the console and do the following from the debugger http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html How well this works depends on what options your kernel was built with. My remote console on that one doesn't work very well … I can view, but I can't type …
Re: 9-STABLE - NFS - NetAPP:
Konstantin Belousov wrote: On Tue, Feb 12, 2013 at 08:50:39PM -0500, Rick Macklem wrote: Marc Fournier wrote: Just reset server, so any further details will have to be 'next time' … but, just did a csup and am rebuilding … the following three files were modified since last build: grep nfs /tmp/output Edit src/sys/fs/nfs/nfs_commonsubs.c Edit src/sys/fs/nfsclient/nfs_clrpcops.c Edit src/sys/fs/nfsserver/nfs_nfsdserv.c On 2013-02-10, at 4:56 PM, Marc Fournier scra...@hub.org wrote: On 2013-02-10, at 4:31 PM, Rick Macklem rmack...@uoguelph.ca wrote: Marc Fournier wrote: Hi John … Does this help? root@io:~ # ps auxl | grep du root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd It is probably too late, but all the lines (without the | grep du) would be more useful. I also include the H flag, so it lists threads as well as processes. The above just says the du command is waiting for a vnode lock. The interesting process/thread is the one that is holding a vnode lock while waiting for something else. As requested, 'ps auxlH' attached … ps.out.bz2 Well, I took a look at the ps output and I didn't see anything that would identify what the hang is. There are a lot of processes sleeping on newnfs (waiting for a vnode lock) and many sleeping on vofflock (waiting for the f_offset lock). I never got any attachments on the thread. I got it resent from him. I've attached it to this post, just in case you are interested in taking a look at it. See http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html for the description of what is needed to start debugging.
I already pointed this out (thanks to your previous email thread), but apparently he can't run a console, so I don't know if there is another way to do the same things? Unfortunately, I can't spot any process/thread that is blocked on something else, where it would seem likely to be holding either an nfs vnode lock or f_offset lock that isn't one of these. There were changes about 5 months ago which it appears fixed a deadlock race between vnode locks and offset locks for paging (r236321 and friends). No, I do not think that the description of the changes is right. He does get the odd error reported by nfs_getpages() and I don't think we've isolated why yet. The error is 13 (EACCES), but jhb@ thought it might be because of the bug he fixed where the krpc reported EACCES for the EINTR case. I don't think we've heard back from Marc w.r.t. whether he has gotten any more of these errors logged since applying jhb@'s patch and whether or not the errno has changed to EINTR? I'll admit I don't understand when the VOP_GETPAGES() path gets called vs the vn_io_fault() one. I plan on taking a closer look at the VOP_GETPAGES() call path and see if I can spot any locking issue. I am wondering if there could be other similar races, possibly specific to paging in over NFS? (I can't see any case where there is a LOR, so I can't think of what it might be?) If you just want the hangs to go away, I'd suggest moving the executable in /usr/local/sbin (httpd maybe) to a local file system on the server, since it does seem to be related to paging this executable in over NFS. rick ps: I've added kib@ to the cc, in case he is aware of other related races? Are you still getting the: nfs_getpages: error 13 vm_fault: pager read error, pid 11355 (https) Fairly quiet: Screen Shot 2013-02-10 at 4.43.55 PM.png And that is it since last reboot ~20 days ago … messages logged?
With John's recent patch, the error# would no longer be 13 if it was caused by the intr flag resulting in a Read RPC terminating with EINTR. If you are still getting the above with error 13, it suggests that the server is replying EACCES for the Read RPC. I suggested before that you check to make sure that the executable had read access for everyone on the file server. Since I didn't hear back, I'll assume this is the case.

Don't understand this question …

I was just asking if you have seen any of the nfs_getpages errors logged since applying jhb@'s patch and whether or not the errno in it has changed from 13 to something else?

I have 34 VPSs running off of this server right now … that 'du process' runs against each of those VPSs every night, and
Re: 9-STABLE - NFS - NetAPP:
On 2013-02-13, at 14:50 , Rick Macklem rmack...@uoguelph.ca wrote: He does get the odd error reported by nfs_getpages() and I don't think we've isolated why yet. The error is 13 (EACCES), but jhb@ thought it might be because of the bug he fixed where the krpc reported EACCES for the EINTR case. I don't think we've heard back from Marc w.r.t. whether he has gotten any more of these errors logged since applying jhb@'s patch and whether or not the errno has changed to EINTR?

As mentioned previously, it doesn't happen all that often … this latest one was after 21 days of uptime (or so) … I just upgraded the kernel on that machine to take into consideration changes to nfs *since* the last upgrade, so it might be another 20-30 days before it happens again *if* that last patch didn't fix it … I have several servers that do have fully operational remote consoles though … to save time if/when it happens next, what all do I need to run?

ps auxlH
procstat -kk pid (for which process? … all part of that group, or just one of the apparently hung processes?)
sysctl debug.kdb.break_to_debugger=1 (shell)
Ctrl+Alt+Esc (from console)

Now, is there a way of forcing it to dump core so that I can run the various commands from a shell *after* it's rebooted? Not particularly easy to redirect console output to a file (or is it?), so anything that scrolls off the screen is pretty much lost … I'm using a DRAC card in most cases, no serial consoles or anything like that that I can run within a script session … a 'ps' listing is 500 lines long, just to give an idea ...
Re: 9-STABLE - NFS - NetAPP:
On Wed, Feb 13, 2013 at 05:50:13PM -0500, Rick Macklem wrote: I got it resent from him. I've attached it to this post, just in case you are interested in taking a look at it. I do not find the vofflock wait channels surprising. All of them seem to occur in multithreaded processes. The usual reason for the vofflock blocking is the use of the same file (as in struct file *) to perform operations from several threads in parallel. One thread locked the file offset by using read() or write(), and is sleeping waiting for the vnode lock. All other threads performing read or write on the same file, e.g. by using the same file descriptor, are blocked on the file offset before even trying to lock the vnode. What I find interesting in the output you mailed is pid 93636. Note that several of its threads are in the 'T' state. It means stopped, while other threads obviously do file I/O, judging by the vofflock state. I wonder if some stopped thread owns an nfs vnode lock. It could be some omission in the handling of PBDRY/TDF_SBDRY, or another bug. It is absolutely impossible to say anything definitive without proper diagnostics. At least the procstat -kk output is needed.
Re: 9-STABLE - NFS - NetAPP:
On 2013-02-13, at 15:16 , Konstantin Belousov kostik...@gmail.com wrote: On Wed, Feb 13, 2013 at 05:50:13PM -0500, Rick Macklem wrote: I got it resent from him. I've attached it to this post, just in case you are interested in taking a look at it. I do not find the vofflock wait channels surprising. All of them seem to occur in multithreaded processes. The usual reason for the vofflock blocking is the use of the same file (as in struct file *) to perform operations from several threads in parallel. One thread locked the file offset by using read() or write(), and is sleeping waiting for the vnode lock. All other threads performing read or write on the same file, e.g. by using the same file descriptor, are blocked on the file offset before even trying to lock the vnode. What I find interesting in the output you mailed is pid 93636. Note that several of its threads are in the 'T' state. It means stopped, while other threads obviously do file I/O, judging by the vofflock state. I wonder if some stopped thread owns an nfs vnode lock. It could be some omission in the handling of PBDRY/TDF_SBDRY, or another bug. It is absolutely impossible to say anything definitive without proper diagnostics. At least the procstat -kk output is needed.

I had sent out the output of procstat -kk at the time … for next time, would you need procstat against all of the 'duplicate processes' that aren't killable? For instance, in this case, there were three du commands running doing the same thing, none of which were killable … so procstat -kk for all three of those?
Re: 9-STABLE - NFS - NetAPP:
Marc Fournier wrote: On 2013-02-13, at 14:50 , Rick Macklem rmack...@uoguelph.ca wrote: He does get the odd error reported by nfs_getpages() and I don't think we've isolated why yet. The error is 13 (EACCES), but jhb@ thought it might be because of the bug he fixed where the krpc reported EACCES for the EINTR case. I don't think we've heard back from Marc w.r.t. whether he has gotten any more of these errors logged since applying jhb@'s patch and whether or not the errno has changed to EINTR?

As mentioned previously, it doesn't happen all that often … this latest one was after 21 days of uptime (or so) … I just upgraded the kernel on that machine to take into consideration changes to nfs *since* the last upgrade, so it might be another 20-30 days before it happens again *if* that last patch didn't fix it … I have several servers that do have fully operational remote consoles though … to save time if/when it happens next, what all do I need to run? ps auxlH procstat -kk pid (for which process? … all part of that group, or just one of the apparently hung processes?)

The pid that is in T state for the ps auxlH.

sysctl debug.kdb.break_to_debugger=1 (shell) Ctrl+Alt+Esc (from console)

Then the commands described in: http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html show alllocks and show lockedvnods may be the most useful. I think you can also use show sleepchain pid and show lockchain pid, with the pid that is in T state. If you haven't built your kernel with options WITNESS, this won't work well.

Now, is there a way of forcing it to dump core so that I can run the various commands from a shell *after* it's rebooted?

No idea. Someone familiar with what you can do with a core dump, and how to get your system to make one, will have to answer this.

Not particularly easy to redirect console output to a file (or is it?), so anything that scrolls off the screen is pretty much lost … I'm using a DRAC card in most cases, no serial consoles or anything like that that I can run within a script session … a 'ps' listing is 500 lines long, just to give an idea ...
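The user-space part of the checklist above (ps auxlH, then procstat -kk for the pid in T state) can be scripted so nothing scrolls off a remote console. The following is a minimal sketch, not a tested tool: it assumes the ps auxlH column order seen in the samples in this thread (PID in the second field, STAT in the eighth), and the function names and the output directory are invented for the example.

```shell
#!/bin/sh
# Sketch only. Assumes the ps column order shown in this thread
# (PID in field 2, STAT in field 8); check it against your ps header.

# Read ps output on stdin and print the unique PIDs whose STAT begins
# with "T" (stopped). Hypothetical helper name.
stopped_pids() {
    awk 'NR > 1 && $8 ~ /^T/ { print $2 }' | sort -un
}

# Save a full thread listing, then a kernel stack dump for each stopped
# PID. The default output directory is an arbitrary choice.
collect_hang_info() {
    outdir=${1:-/var/tmp/nfs-hang}
    mkdir -p "$outdir" || return 1
    ps auxlH > "$outdir/ps.out"
    for pid in $(stopped_pids < "$outdir/ps.out"); do
        procstat -kk "$pid" > "$outdir/procstat.$pid" 2>&1
    done
}

# Usage (as root, before rebooting): collect_hang_info /var/tmp/nfs-hang
```

Writing to files rather than the console sidesteps the scrollback problem raised above; the debugger commands (show alllocks, show lockedvnods) still have to be entered at the console.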
Re: 9-STABLE - NFS - NetAPP:
On 2013-02-13, at 3:54 PM, Rick Macklem rmack...@uoguelph.ca wrote: The pid that is in T state for the ps auxlH. Different server, last kernel update on Jan 22nd, https process this time instead of du last time. I've attached: ps auxlH; ps auxlH of just the processes that are in TJ state (6 httpd servers); procstat output for each of the 6 processes. They are included as attachments … if these don't make it through, let me know, just figured I'd try and keep it compact ...
Re: 9-STABLE - NFS - NetAPP:
Note that checking the console, there are no errors pertaining to this on it … On 2013-02-13, at 9:26 PM, Marc Fournier scra...@hub.org wrote: On 2013-02-13, at 3:54 PM, Rick Macklem rmack...@uoguelph.ca wrote: The pid that is in T state for the ps auxlH. Different server, last kernel update on Jan 22nd, https process this time instead of du last time. I've attached: ps auxlH; ps auxlH of just the processes that are in TJ state (6 httpd servers); procstat output for each of the 6 processes (fullps.bz2, procstat.bz2, ps.out.bz2). They are included as attachments … if these don't make it through, let me know, just figured I'd try and keep it compact ...
Re: 9-STABLE - NFS - NetAPP:
I don't know if this provides any benefit, but I just shut down all the VPSs on that server, so that all the 'noise' is removed from the ps listing, which I've attached … On 2013-02-13, at 9:31 PM, Marc Fournier scra...@hub.org wrote: Note that checking the console, there are no errors pertaining to this on it … On 2013-02-13, at 9:26 PM, Marc Fournier scra...@hub.org wrote: On 2013-02-13, at 3:54 PM, Rick Macklem rmack...@uoguelph.ca wrote: The pid that is in T state for the ps auxlH. Different server, last kernel update on Jan 22nd, https process this time instead of du last time. I've attached: ps auxlH; ps auxlH of just the processes that are in TJ state (6 httpd servers); procstat output for each of the 6 processes (fullps.bz2, procstat.bz2, ps.out.bz2). They are included as attachments … if these don't make it through, let me know, just figured I'd try and keep it compact ...
Re: 9-STABLE - NFS - NetAPP:
Marc Fournier wrote: Just reset server, so any further details will have to be 'next time' … but, just did a csup and am rebuilding … the following three files were modified since last build: grep nfs /tmp/output Edit src/sys/fs/nfs/nfs_commonsubs.c Edit src/sys/fs/nfsclient/nfs_clrpcops.c Edit src/sys/fs/nfsserver/nfs_nfsdserv.c On 2013-02-10, at 4:56 PM, Marc Fournier scra...@hub.org wrote: On 2013-02-10, at 4:31 PM, Rick Macklem rmack...@uoguelph.ca wrote: Marc Fournier wrote: Hi John … Does this help? root@io:~ # ps auxl | grep du root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd It is probably too late, but all the lines (without the | grep du) would be more useful. I also include the H flag, so it lists threads as well as processes. The above just says the du command is waiting for a vnode lock. The interesting process/thread is the one that is holding a vnode lock while waiting for something else. As requested, 'ps auxlH' attached … ps.out.bz2 Well, I took a look at the ps output and I didn't see anything that would identify what the hang is. There are a lot of processes sleeping on newnfs (waiting for a vnode lock) and many sleeping on vofflock (waiting for the f_offset lock). Unfortunately, I can't spot any process/thread that is blocked on something else, where it would seem likely to be holding either an nfs vnode lock or f_offset lock that isn't one of these. There were changes about 5 months ago which it appears fixed a deadlock race between vnode locks and offset locks for paging (r236321 and friends). I am wondering if there could be other similar races, possibly specific to paging in over NFS? 
(I can't see any case where there is a LOR, so I can't think of what it might be?) If you just want the hangs to go away, I'd suggest moving the executable in /usr/local/sbin (httpd maybe) to a local file system on the server, since it does seem to be related to paging this executable in over NFS. rick ps: I've added kib@ to the cc, in case he is aware of other related races? Are you still getting the: nfs_getpages: error 13 vm_fault: pager read error, pid 11355 (https) Fairly quiet: Screen Shot 2013-02-10 at 4.43.55 PM.png And that is it since last reboot ~20 days ago … messages logged? With John's recent patch, the error# would no longer be 13 if it was caused by the intr flag resulting in a Read RPC terminating with EINTR. If you are still getting the above with error 13, it suggests that the server is replying EACCES for the Read RPC. I suggested before that you check to make sure that the executable had read access for everyone on the file server. Since I didn't hear back, I'll assume this is the case.
Don't understand this question … I have 34 VPSs running off of this server right now … that 'du process' runs against each of those VPSs every night, and this problem started happening on Friday night's run … ~18 days into uptime … so the same process has run repeatedly, with no issues, 18 times before it hung on Friday … also, the hang, once 'triggered', only seems to recur against the same directory … the same directory doesn't necessarily trigger it, but once it starts, it appears to do it for the same directory … I'm not sure if I've ever seen it happening to two different directories at the same time … Also, please note that the du command is run from the physical server, as root … rick ps: If it is still up and hasn't been rebooted, you could: sysctl debug.kdb.break_to_debugger=1 - then type Ctrl+Alt+Esc at the console and do the following from the debugger http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html How well this works depends on what options your kernel was built with. My remote console on that one doesn't work very well … I can view, but I can't type …
Re: 9-STABLE - NFS - NetAPP:
Marc Fournier wrote: Hi John … Does this help? root@io:~ # ps auxl | grep du root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd It is probably too late, but all the lines (without the | grep du) would be more useful. I also include the H flag, so it lists threads as well as processes. The above just says the du command is waiting for a vnode lock. The interesting process/thread is the one that is holding a vnode lock while waiting for something else. Are you still getting the: nfs_getpages: error 13 vm_fault: pager read error, pid 11355 (https) messages logged? With John's recent patch, the error# would no longer be 13 if it was caused by the intr flag resulting in a Read RPC terminating with EINTR. If you are still getting the above with error 13, it suggests that the server is replying EACCES for the Read RPC. I suggested before that you check to make sure that the executable had read access for everyone one the file server. Since I didn't hear back, I'll assume this is the case. rick ps: If it is still up and hasn't been rebooted, you could: sysctl debug.kdb.break_to_debugger=1 - then type ctrlaltesc at the console and do the following from the debugger http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html How well this work depends on what options your kernel was built with. root@io:~ # grep vm /etc/fstab 192.168.1.254:/vol/basic /vm nfs rw,nolockd,intr 0 0 Haven't rebooted yet … if there is anything I can do / try before … ? 
The kernel is from Jan 21st … On 2013-01-19, at 4:57 AM, John Baldwin j...@freebsd.org wrote: On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote: I'm running a few servers sitting on top of a NetAPP file server … everything runs great, but periodically I'm getting: nfs_getpages: error 13 vm_fault: pager read error, pid 11355 (https) Are you using interruptible mounts (intr mount option)? Also, can you get ps output that includes the 'l' flag to show what the processes are stuck on? -- John Baldwin
Re: 9-STABLE - NFS - NetAPP:
On 2013-02-10, at 4:31 PM, Rick Macklem rmack...@uoguelph.ca wrote: Marc Fournier wrote: Hi John … Does this help? root@io:~ # ps auxl | grep du root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd It is probably too late, but all the lines (without the | grep du) would be more useful. I also include the H flag, so it lists threads as well as processes. The above just says the du command is waiting for a vnode lock. The interesting process/thread is the one that is holding a vnode lock while waiting for something else. As requested, 'ps auxlH' attached … Are you still getting the: nfs_getpages: error 13 vm_fault: pager read error, pid 11355 (https) Fairly quiet: And that is it since last reboot ~20 days ago … messages logged? With John's recent patch, the error# would no longer be 13 if it was caused by the intr flag resulting in a Read RPC terminating with EINTR. If you are still getting the above with error 13, it suggests that the server is replying EACCES for the Read RPC. I suggested before that you check to make sure that the executable had read access for everyone one the file server. Since I didn't hear back, I'll assume this is the case. 
Don't understand this question … I have 34 VPSs running off of this server right now … that 'du process' runs against each of those VPSs every night, and this problem started happening on Friday night's run … ~18 days into uptime … so the same process has run repeatedly, with no issues, 18 times before it hung on Friday … also, the hang, once 'triggered', only seems to recur against the same directory … the same directory doesn't necessarily trigger it, but once it starts, it appears to do it for the same directory … I'm not sure if I've ever seen it happening to two different directories at the same time … Also, please note that the du command is run from the physical server, as root … rick ps: If it is still up and hasn't been rebooted, you could: sysctl debug.kdb.break_to_debugger=1 - then type Ctrl+Alt+Esc at the console and do the following from the debugger http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html How well this works depends on what options your kernel was built with. My remote console on that one doesn't work very well … I can view, but I can't type …
Re: 9-STABLE - NFS - NetAPP:
Just reset server, so any further details will have to be 'next time' … but, just did a csup and am rebuilding … the following three files were modified since last build:

grep nfs /tmp/output
 Edit src/sys/fs/nfs/nfs_commonsubs.c
 Edit src/sys/fs/nfsclient/nfs_clrpcops.c
 Edit src/sys/fs/nfsserver/nfs_nfsdserv.c

On 2013-02-10, at 4:56 PM, Marc Fournier scra...@hub.org wrote:

> On 2013-02-10, at 4:31 PM, Rick Macklem rmack...@uoguelph.ca wrote:
>
>> Marc Fournier wrote:
>>> Hi John … Does this help?
>>>
>>> root@io:~ # ps auxl | grep du
>>> root  1054 0.0 0.1 16176 6600 ?? D   3:15AM  0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
>>> root 12353 0.0 0.1 16176 5104 ?? D  Sat03AM  0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
>>> root 64529 0.0 0.1 16176 5164 ?? D  Fri03AM  0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
>>> root 12855 0.0 0.0 16308 1988  0 S+  5:26AM  0:00.00 grep du          0 12847 0 20 0 piperd
>>
>> It is probably too late, but all the lines (without the | grep du) would be more useful. I also include the H flag, so it lists threads as well as processes. The above just says the du command is waiting for a vnode lock. The interesting process/thread is the one that is holding a vnode lock while waiting for something else.
>
> As requested, 'ps auxlH' attached …
>
> ps.out.bz2
>
>> Are you still getting the:
>>
>> nfs_getpages: error 13
>> vm_fault: pager read error, pid 11355 (https)
>
> Fairly quiet:
>
> Screen Shot 2013-02-10 at 4.43.55 PM.png
>
> And that is it since last reboot ~20 days ago …
>
>> messages logged? With John's recent patch, the error# would no longer be 13 if it was caused by the intr flag resulting in a Read RPC terminating with EINTR. If you are still getting the above with error 13, it suggests that the server is replying EACCES for the Read RPC. I suggested before that you check to make sure that the executable had read access for everyone on the file server. Since I didn't hear back, I'll assume this is the case.
> Don't understand this question … I have 34 VPSs running off of this server right now … that 'du process' runs against each of those VPSs every night, and this problem started happening on Friday night's run … ~18 days into uptime … so the same process has run repeatedly, with no issues, 18 times before it hung on Friday … also, the hang, once 'triggered', only seems to recur against the same directory … the same directory doesn't necessarily trigger it, but once it starts, it appears to do it for the same directory … I'm not sure if I've ever seen it happening to two different directories at the same time …
>
> Also, please note that the du command is run from the physical server, as root …
>
>> rick
>> ps: If it is still up and hasn't been rebooted, you could:
>>   sysctl debug.kdb.break_to_debugger=1
>>   - then type ctrl-alt-esc at the console and do the following from the debugger
>>   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
>> How well this works depends on what options your kernel was built with.
>
> My remote console on that one doesn't work very well … I can view, but I can't type …
Re: 9-STABLE - NFS - NetAPP:
Hi John … Does this help?

root@io:~ # ps auxl | grep du
root  1054 0.0 0.1 16176 6600 ?? D   3:15AM  0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
root 12353 0.0 0.1 16176 5104 ?? D  Sat03AM  0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
root 64529 0.0 0.1 16176 5164 ?? D  Fri03AM  0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
root 12855 0.0 0.0 16308 1988  0 S+  5:26AM  0:00.00 grep du          0 12847 0 20 0 piperd

root@io:~ # grep vm /etc/fstab
192.168.1.254:/vol/basic /vm nfs rw,nolockd,intr 0 0

Haven't rebooted yet … if there is anything I can do / try before … ?

The kernel is from Jan 21st …

On 2013-01-19, at 4:57 AM, John Baldwin j...@freebsd.org wrote:

> On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
>> I'm running a few servers sitting on top of a NetAPP file server … everything runs great, but periodically I'm getting:
>>
>> nfs_getpages: error 13
>> vm_fault: pager read error, pid 11355 (https)
>
> Are you using interruptible mounts (intr mount option)? Also, can you get ps output that includes the 'l' flag to show what the processes are stuck on?
>
> --
> John Baldwin
Re: 9-STABLE - NFS - NetAPP:
Thanks …

# procstat -kk 64529
  PID    TID COMM TDNAME KSTACK
64529 100963 du   -      mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0x5cb nfs_lock1+0x4a VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x54f nfs_lookup+0x17e lookup+0x42f namei+0x4ac vn_open_cred+0x3bd kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7

On 2013-02-09, at 9:58 PM, Jeremy Chadwick j...@koitsu.org wrote:

> Off-list:
>
> Marc,
>
> You may want to also provide output from procstat -kk 64529, as this will give a full thread calling stack. The -kk (double-kay) is not a typo. :-)
>
> --
> | Jeremy Chadwick                           j...@koitsu.org |
> | UNIX Systems Administrator        http://jdc.koitsu.org/ |
> | Mountain View, CA, US                                    |
> | Making life hard for others since 1977.     PGP 4BD6C0CB |
>
> On Sat, Feb 09, 2013 at 09:29:30PM -0800, Marc Fournier wrote:
>> Hi John … Does this help?
>>
>> root@io:~ # ps auxl | grep du
>> root  1054 0.0 0.1 16176 6600 ?? D   3:15AM  0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
>> root 12353 0.0 0.1 16176 5104 ?? D  Sat03AM  0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
>> root 64529 0.0 0.1 16176 5164 ?? D  Fri03AM  0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
>> root 12855 0.0 0.0 16308 1988  0 S+  5:26AM  0:00.00 grep du          0 12847 0 20 0 piperd
>>
>> root@io:~ # grep vm /etc/fstab
>> 192.168.1.254:/vol/basic /vm nfs rw,nolockd,intr 0 0
>>
>> Haven't rebooted yet … if there is anything I can do / try before … ?
>>
>> The kernel is from Jan 21st …
>>
>> On 2013-01-19, at 4:57 AM, John Baldwin j...@freebsd.org wrote:
>>> On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
>>>> I'm running a few servers sitting on top of a NetAPP file server … everything runs great, but periodically I'm getting:
>>>>
>>>> nfs_getpages: error 13
>>>> vm_fault: pager read error, pid 11355 (https)
>>>
>>> Are you using interruptible mounts (intr mount option)? Also, can you get ps output that includes the 'l' flag to show what the processes are stuck on?
>>>
>>> --
>>> John Baldwin
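The KSTACK column from procstat -kk is easier to read one frame per line. The following is a minimal sketch, not part of procstat itself: `kstack_frames` is a hypothetical helper name, and it assumes the frames are space-separated `symbol+0xoffset` tokens as in the output posted in this message.

```shell
#!/bin/sh
# Print one kernel stack frame per line, innermost first,
# dropping the +0x... offset that follows each symbol.
kstack_frames() {
    tr -s ' ' '\n' | sed -e '/^$/d' -e 's/+0x[0-9a-f]*$//'
}

# The KSTACK field for pid 64529 as posted in this thread:
kstack_frames <<'EOF'
mi_switch+0x186 sleepq_wait+0x42 __lockmgr_args+0x5cb nfs_lock1+0x4a VOP_LOCK1_APV+0x46 _vn_lock+0x47 vget+0x70 cache_lookup_times+0x54f nfs_lookup+0x17e lookup+0x42f namei+0x4ac vn_open_cred+0x3bd kern_openat+0x20a amd64_syscall+0x540 Xfast_syscall+0xf7
EOF
```

Read top to bottom, the thread is asleep in the lock manager (`__lockmgr_args`, reached through `nfs_lock1` and `vget`) while doing a name lookup for an open, which matches the reading that du is waiting for a vnode lock rather than holding one.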
Re: 9-STABLE - NFS - NetAPP:
On Sunday, January 20, 2013 01:10:29 AM Hub- Marketing wrote:
> On 2013-01-19, at 4:57 AM, John Baldwin j...@freebsd.org wrote:
>> On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
>>> I'm running a few servers sitting on top of a NetAPP file server … everything runs great, but periodically I'm getting:
>>>
>>> nfs_getpages: error 13
>>> vm_fault: pager read error, pid 11355 (https)
>>
>> Are you using interruptible mounts (intr mount option)?
>
> 192.168.1.253:/vol/vol1 /vm nfs rw,intr,soft,nolockd 0 0
>
> I just added the 'soft' option to the mix … nolockd is enabled since I know for a fact that it's not possible for two processes to access the same file on both mounts at the same time …

Ah, ok. I just fixed a bug with interruptible mounts in HEAD where having a signal interrupt an NFS request returns EACCES (13) rather than EINTR. You should retest with that fix applied.

--
John Baldwin
Re: 9-STABLE - NFS - NetAPP:
Yup, saw those commits … am going through the servers and doing upgrades on them … will report on any issues post-upgrade … thx

On 2013-01-20, at 6:47 AM, John Baldwin j...@freebsd.org wrote:

> On Sunday, January 20, 2013 01:10:29 AM Hub- Marketing wrote:
>> On 2013-01-19, at 4:57 AM, John Baldwin j...@freebsd.org wrote:
>>> On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
>>>> I'm running a few servers sitting on top of a NetAPP file server … everything runs great, but periodically I'm getting:
>>>>
>>>> nfs_getpages: error 13
>>>> vm_fault: pager read error, pid 11355 (https)
>>>
>>> Are you using interruptible mounts (intr mount option)?
>>
>> 192.168.1.253:/vol/vol1 /vm nfs rw,intr,soft,nolockd 0 0
>>
>> I just added the 'soft' option to the mix … nolockd is enabled since I know for a fact that it's not possible for two processes to access the same file on both mounts at the same time …
>
> Ah, ok. I just fixed a bug with interruptible mounts in HEAD where having a signal interrupt an NFS request returns EACCES (13) rather than EINTR. You should retest with that fix applied.
>
> --
> John Baldwin
Re: 9-STABLE - NFS - NetAPP:
On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
> I'm running a few servers sitting on top of a NetAPP file server … everything runs great, but periodically I'm getting:
>
> nfs_getpages: error 13
> vm_fault: pager read error, pid 11355 (https)

Are you using interruptible mounts (intr mount option)? Also, can you get ps output that includes the 'l' flag to show what the processes are stuck on?

--
John Baldwin
Re: 9-STABLE - NFS - NetAPP:
On 2013-01-19, at 4:57 AM, John Baldwin j...@freebsd.org wrote:

> On Tuesday, December 18, 2012 11:58:36 PM Hub- Marketing wrote:
>> I'm running a few servers sitting on top of a NetAPP file server … everything runs great, but periodically I'm getting:
>>
>> nfs_getpages: error 13
>> vm_fault: pager read error, pid 11355 (https)
>
> Are you using interruptible mounts (intr mount option)?

192.168.1.253:/vol/vol1 /vm nfs rw,intr,soft,nolockd 0 0

I just added the 'soft' option to the mix … nolockd is enabled since I know for a fact that it's not possible for two processes to access the same file on both mounts at the same time …

> Also, can you get ps output that includes the 'l' flag to show what the processes are stuck on?

I will send a follow-up the next time this happens, so it may be a few days …
Re: 9-STABLE - NFS - NetAPP:
Hub-Marketing wrote:
> I'm running a few servers sitting on top of a NetAPP file server … everything runs great, but periodically I'm getting:
>
> nfs_getpages: error 13
> vm_fault: pager read error, pid 11355 (https)

13 is EACCES. This message means that the Netapp server is replying EACCES to a read for a pagein. I notice that both root and www are running the executable. (Also, root is often mapped to something like nobody in the NFS server.)

You could try making sure the httpd executable file has r_x permissions for all users (chmod 555 httpd). If it still keeps happening once you've done that, you'd need to capture packets when this happens and take a look at the NFS RPCs via wireshark to see when the EACCES is returned and what uid, gids are sent in the credentials for that Read.

rick

> errors on my screen … not always same pid … the annoying part is that it seems to always affect the same jail that is running … if I shutdown all jails on that physical server, everything shuts down except for that *one* jail, with a ps listing looking like:
>
> USER   PID %CPU %MEM    VSZ   RSS TT STAT STARTED    TIME COMMAND
> root  6670  0.0  0.0   9936  1372 ?? DsJ   3:00AM 0:00.01 newsyslog
> root  6815  0.0  0.0   9936  1288 ?? DsJ   3:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root  8361  0.0  0.1 220740 11400 ?? DsJ   7:33PM 0:01.25 /usr/local/sbin/httpd -DNOHTTPACCEPT
> www   8364  0.0  0.0      0     0 ?? ZJ    7:33PM 0:00.00 <defunct>
> www  11866  0.0  0.1 318444 16792 ?? TJ    7:36PM 0:00.03 /usr/local/sbin/httpd -DNOHTTPACCEPT
> www  11872  0.0  0.1 297964 14008 ?? TJ    7:36PM 0:00.01 /usr/local/sbin/httpd -DNOHTTPACCEPT
> www  11873  0.0  0.1 306156 15028 ?? DEJ   7:36PM 0:00.02 /usr/local/sbin/httpd -DNOHTTPACCEPT
> root 17190  0.0  0.0   9936  1240 ?? DsJ   8:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 24864  0.0  0.0   9936  1392 ?? DsJ   4:00AM 0:00.01 newsyslog
> root 24910  0.0  0.0   9936  1336 ?? DsJ   4:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 29972  0.0  0.0   9936  1240 ?? DsJ   9:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 34221  0.0  0.0  51480  4332 ?? DsJ   4:47AM 0:00.02 sshd: root@pts/1 (sshd)
> root 42452  0.0  0.0   9936  1296 ?? DsJ  10:00PM 0:00.01 newsyslog
> root 42522  0.0  0.0   9936  1240 ?? DsJ  10:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 55179  0.0  0.0   9936  1296 ?? DsJ  11:00PM 0:00.01 newsyslog
> root 55244  0.0  0.0   9936  1240 ?? DsJ  11:00PM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 67592  0.0  0.0   9936  1336 ?? DsJ  12:00AM 0:00.01 newsyslog
> root 67762  0.0  0.0   9936  1288 ?? DsJ  12:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 81603  0.0  0.0   9936  1340 ?? DsJ   1:00AM 0:00.01 newsyslog
> root 81640  0.0  0.0   9936  1284 ?? DsJ   1:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 93792  0.0  0.0   9936  1344 ?? DsJ   2:00AM 0:00.01 newsyslog
> root 93815  0.0  0.0   9936  1288 ?? DsJ   2:00AM 0:00.01 /usr/sbin/newsyslog -f /usr/local/etc/rotate_logs.cfg
> root 34228  0.0  0.0  67960  4464 1  Ds+J  4:47AM 0:00.00 sshd: root@pts/1 (sshd)
> root 38473  0.0  0.0  17556  3272 3  SJ    4:53AM 0:00.02 /bin/tcsh
> root 38475  0.0  0.0  14212  1512 3  R+J   4:53AM 0:00.00 ps aux
>
> I can do a 'jexec JID /bin/tcsh' to get into the jail, I can perform ps commands, etc … I just can't get those processes to shutdown … everything within the jail is 'up to date' … updates the userland and ports … I've checked over the NetApp, but everything appears fine, and it only seems to repeatedly affect that one jail, on that same physical server …
>
> I have no ideas on what / how to debug this … thoughts? help?
>
> thx
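The packet capture Rick describes could be set up roughly as follows. This is a sketch rather than a tested procedure: the function name `capture_nfs`, the interface name `em0` and the output path are placeholders, and it assumes the mount reaches the filer on the standard NFS port 2049. The filer address in the usage comment is the one from the fstab line quoted elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical helper: write full-size packets exchanged with the NFS
# server to a pcap file for later inspection in wireshark.
#   $1 = network interface (assumption, e.g. em0)
#   $2 = NFS server address
#   $3 = output pcap file
capture_nfs() {
    tcpdump -i "$1" -s 0 -w "$3" host "$2" and port 2049
}

# Example (run as root; stop with Ctrl-C after the next
# "nfs_getpages: error 13" message has been logged):
#   capture_nfs em0 192.168.1.253 /var/tmp/nfs-eacces.pcap
```

In the resulting capture, the Read reply that carries EACCES can be paired with its call to see which uid and gids were sent in the credentials.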