Re: [PATCH 0/2] audit: fix soft lockups and udevd errors when audit is overrun
On Mon, 02 Sep 2013 20:45:14 -0700 Chuck Anderson wrote: > The two patches that follow in separate emails resolve soft lockups and > udevd reported errors that prevented a large memory 3.8 system from booting. > > The patches are based on 3.11-rc7. > > I believe it is the same issue recently posted as: > >[RFC] audit: avoid soft lockup in audit_log_start() >https://lkml.org/lkml/2013/8/28/626 Nice to see someone else looking into this! And Thanks for CC'ing me. I've a couple of news to you. First, I've tried to apply your series but got this: [lcapitulino@volcano linux-2.6]$ git am ~/audit-fix.mbox Applying: audit: fix soft lockups due to loop in audit_log_start() wh,en audit_backlog_limit exceeded fatal: corrupt patch at line 23 Patch failed at 0001 audit: fix soft lockups due to loop in audit_log_start() wh,en audit_backlog_limit exceeded The copy of the patch that failed is found in: /home/lcapitulino/work/src/upstream/linux-2.6/.git/rebase-apply/patch When you have resolved this problem, run "git am --resolved". If you prefer to skip this patch, run "git am --skip" instead. To restore the original branch and stop patching, run "git am --abort". [lcapitulino@volcano linux-2.6]$ Now, I was a few minutes a away before sending a different fix I cooked this evening when I got your series in my inbox. So I really wanted to give this a try and applied the first patch manually (resulting version is attached). The softlockup is gone, but I still get a hang for several seconds just like I did with my first rfc. I found a very easy way to reproduce the problem and our analysis is similar, but our solutions differs. I'm going to send my solution right now, sorry for any mistakes it's almost 1h AM here but I really wanted to give your version a try before sending my version (and before going to bed). If you send a v2 I'll try it again and we can discuss our approaches. diff --git a/kernel/audit.c b/kernel/audit.c index 91e53d0..8255d9b 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -103,9 +103,11 @@ static int audit_rate_limit; /* Number of outstanding audit_buffers allowed. */ static int audit_backlog_limit = 64; -static int audit_backlog_wait_time = 60 * HZ; static int audit_backlog_wait_overflow = 0; +#define AUDIT_BACKLOG_WAIT_TIME (60 * HZ) +static int audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME; + /* The identity of the user shutting down the audit system. */ kuid_t audit_sig_uid = INVALID_UID; pid_t audit_sig_pid = -1; @@ -1053,14 +1055,14 @@ static inline void audit_get_stamp(struct audit_context *ctx, /* * Wait for auditd to drain the queue a little */ -static void wait_for_auditd(unsigned long sleep_time) +static void wait_for_auditd(unsigned long sleep_time, int limit) { DECLARE_WAITQUEUE(wait, current); set_current_state(TASK_UNINTERRUPTIBLE); add_wait_queue(_backlog_wait, ); if (audit_backlog_limit && - skb_queue_len(_skb_queue) > audit_backlog_limit) + skb_queue_len(_skb_queue) > limit) schedule_timeout(sleep_time); __set_current_state(TASK_RUNNING); @@ -1095,8 +1097,8 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, struct audit_buffer *ab = NULL; struct timespec t; unsigned int uninitialized_var(serial); - int reserve; unsigned long timeout_start = jiffies; + int limit; if (audit_initialized != AUDIT_INITIALIZED) return NULL; @@ -1104,22 +1106,21 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, if (unlikely(audit_filter_type(type))) return NULL; - if (gfp_mask & __GFP_WAIT) - reserve = 0; - else - reserve = 5; /* Allow atomic callers to go up to five -entries over the normal backlog limit */ + limit = audit_backlog_limit; + if (!(gfp_mask & __GFP_WAIT)) + limit += 5; while (audit_backlog_limit - && skb_queue_len(_skb_queue) > audit_backlog_limit + reserve) { + && skb_queue_len(_skb_queue) > limit) { if (gfp_mask & __GFP_WAIT && audit_backlog_wait_time) { unsigned long sleep_time; sleep_time = timeout_start + audit_backlog_wait_time - jiffies; - if ((long)sleep_time > 0) -wait_for_auditd(sleep_time); - continue; + if ((long)sleep_time > 0) { +wait_for_auditd(sleep_time, limit); +continue; + } } if (audit_rate_check() && printk_ratelimit()) printk(KERN_WARNING @@ -1133,6 +1134,8 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, return NULL; } + audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME; + ab = audit_buffer_alloc(ctx, gfp_mask, type); if (!ab) { audit_log_lost("out of memory in audit_log_start");
[PATCH 0/2] audit: fix soft lockups and udevd errors when audit is overrun
The two patches that follow in separate emails resolve soft lockups and udevd reported errors that prevented a large memory 3.8 system from booting. The patches are based on 3.11-rc7. I believe it is the same issue recently posted as: [RFC] audit: avoid soft lockup in audit_log_start() https://lkml.org/lkml/2013/8/28/626 The first patch: audit: fix soft lockups due to loop in audit_log_start() when audit_backlog_limit exceeded fixes a bug in kernel/audit that caused many soft lockups during boot: BUG: soft lockup - CPU#66 stuck for 22s! [udevd:9559] RIP: 0010:[] [] audit_log_start+0xe6/0x350 Call Trace: [] ? try_to_wake_up+0x2d0/0x2d0 [] audit_log_exit+0x3f/0x590 [] __audit_syscall_exit+0x28d/0x2c0 [] sysret_audit+0x17/0x21 The second patch: audit: Two efficiency fixes for audit mechanism prevents these and similar error messages repeated often during boot: udevd[876]: worker [887] unexpectedly returned with status 0x0100 udevd[876]: worker [887] failed while handling '/devices/pci:00/:00:03.0/:40:00.0' udevd[876]: worker [880] unexpectedly returned with status 0x0100 udevd[876]: worker [880] failed while handling '/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1' udevadm settle - timeout of 180 seconds reached, the event queue contains: /sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1 (3995) /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/INT3F0D:00 (4034) audit: audit_backlog=258 > audit_backlog_limit=256 audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=256 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/2] audit: fix soft lockups and udevd errors when audit is overrun
The two patches that follow in separate emails resolve soft lockups and udevd reported errors that prevented a large memory 3.8 system from booting. The patches are based on 3.11-rc7. I believe it is the same issue recently posted as: [RFC] audit: avoid soft lockup in audit_log_start() https://lkml.org/lkml/2013/8/28/626 The first patch: audit: fix soft lockups due to loop in audit_log_start() when audit_backlog_limit exceeded fixes a bug in kernel/audit that caused many soft lockups during boot: BUG: soft lockup - CPU#66 stuck for 22s! [udevd:9559] RIP: 0010:[810d1d06] [810d1d06] audit_log_start+0xe6/0x350 Call Trace: [8108ea30] ? try_to_wake_up+0x2d0/0x2d0 [810d8d6f] audit_log_exit+0x3f/0x590 [810d975d] __audit_syscall_exit+0x28d/0x2c0 [815e0440] sysret_audit+0x17/0x21 The second patch: audit: Two efficiency fixes for audit mechanism prevents these and similar error messages repeated often during boot: udevd[876]: worker [887] unexpectedly returned with status 0x0100 udevd[876]: worker [887] failed while handling '/devices/pci:00/:00:03.0/:40:00.0' udevd[876]: worker [880] unexpectedly returned with status 0x0100 udevd[876]: worker [880] failed while handling '/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1' udevadm settle - timeout of 180 seconds reached, the event queue contains: /sys/devices/LNXSYSTM:00/LNXPWRBN:00/input/input1/event1 (3995) /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/INT3F0D:00 (4034) audit: audit_backlog=258 audit_backlog_limit=256 audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=256 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] audit: fix soft lockups and udevd errors when audit is overrun
On Mon, 02 Sep 2013 20:45:14 -0700 Chuck Anderson chuck.ander...@oracle.com wrote: The two patches that follow in separate emails resolve soft lockups and udevd reported errors that prevented a large memory 3.8 system from booting. The patches are based on 3.11-rc7. I believe it is the same issue recently posted as: [RFC] audit: avoid soft lockup in audit_log_start() https://lkml.org/lkml/2013/8/28/626 Nice to see someone else looking into this! And Thanks for CC'ing me. I've a couple of news to you. First, I've tried to apply your series but got this: [lcapitulino@volcano linux-2.6]$ git am ~/audit-fix.mbox Applying: audit: fix soft lockups due to loop in audit_log_start() wh,en audit_backlog_limit exceeded fatal: corrupt patch at line 23 Patch failed at 0001 audit: fix soft lockups due to loop in audit_log_start() wh,en audit_backlog_limit exceeded The copy of the patch that failed is found in: /home/lcapitulino/work/src/upstream/linux-2.6/.git/rebase-apply/patch When you have resolved this problem, run git am --resolved. If you prefer to skip this patch, run git am --skip instead. To restore the original branch and stop patching, run git am --abort. [lcapitulino@volcano linux-2.6]$ Now, I was a few minutes a away before sending a different fix I cooked this evening when I got your series in my inbox. So I really wanted to give this a try and applied the first patch manually (resulting version is attached). The softlockup is gone, but I still get a hang for several seconds just like I did with my first rfc. I found a very easy way to reproduce the problem and our analysis is similar, but our solutions differs. I'm going to send my solution right now, sorry for any mistakes it's almost 1h AM here but I really wanted to give your version a try before sending my version (and before going to bed). If you send a v2 I'll try it again and we can discuss our approaches. diff --git a/kernel/audit.c b/kernel/audit.c index 91e53d0..8255d9b 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -103,9 +103,11 @@ static int audit_rate_limit; /* Number of outstanding audit_buffers allowed. */ static int audit_backlog_limit = 64; -static int audit_backlog_wait_time = 60 * HZ; static int audit_backlog_wait_overflow = 0; +#define AUDIT_BACKLOG_WAIT_TIME (60 * HZ) +static int audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME; + /* The identity of the user shutting down the audit system. */ kuid_t audit_sig_uid = INVALID_UID; pid_t audit_sig_pid = -1; @@ -1053,14 +1055,14 @@ static inline void audit_get_stamp(struct audit_context *ctx, /* * Wait for auditd to drain the queue a little */ -static void wait_for_auditd(unsigned long sleep_time) +static void wait_for_auditd(unsigned long sleep_time, int limit) { DECLARE_WAITQUEUE(wait, current); set_current_state(TASK_UNINTERRUPTIBLE); add_wait_queue(audit_backlog_wait, wait); if (audit_backlog_limit - skb_queue_len(audit_skb_queue) audit_backlog_limit) + skb_queue_len(audit_skb_queue) limit) schedule_timeout(sleep_time); __set_current_state(TASK_RUNNING); @@ -1095,8 +1097,8 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, struct audit_buffer *ab = NULL; struct timespec t; unsigned int uninitialized_var(serial); - int reserve; unsigned long timeout_start = jiffies; + int limit; if (audit_initialized != AUDIT_INITIALIZED) return NULL; @@ -1104,22 +1106,21 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, if (unlikely(audit_filter_type(type))) return NULL; - if (gfp_mask __GFP_WAIT) - reserve = 0; - else - reserve = 5; /* Allow atomic callers to go up to five -entries over the normal backlog limit */ + limit = audit_backlog_limit; + if (!(gfp_mask __GFP_WAIT)) + limit += 5; while (audit_backlog_limit - skb_queue_len(audit_skb_queue) audit_backlog_limit + reserve) { + skb_queue_len(audit_skb_queue) limit) { if (gfp_mask __GFP_WAIT audit_backlog_wait_time) { unsigned long sleep_time; sleep_time = timeout_start + audit_backlog_wait_time - jiffies; - if ((long)sleep_time 0) -wait_for_auditd(sleep_time); - continue; + if ((long)sleep_time 0) { +wait_for_auditd(sleep_time, limit); +continue; + } } if (audit_rate_check() printk_ratelimit()) printk(KERN_WARNING @@ -1133,6 +1134,8 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, return NULL; } + audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME; + ab = audit_buffer_alloc(ctx, gfp_mask, type); if (!ab) { audit_log_lost(out of memory in audit_log_start);