Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

On Wed, 2014-10-29 at 21:28 -0700, Darren Hart wrote:
> On Thu, Oct 23, 2014 at 03:28:07PM -0400, Brian Silverman wrote:
> > Here's the test code:
>
> I want to say "Thanks!" and pull it into futextest... but destroying
> filesystems and BIOS errors?!? may not be ideal failure detection modes.

Yup, suse now has a test farm box with a slightly bent up btrfs root.
I'll have to remember to poke the install button when I return it :)

	-Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

On Thu, Oct 23, 2014 at 03:28:07PM -0400, Brian Silverman wrote:
> Here's the test code:

I want to say "Thanks!" and pull it into futextest... but destroying
filesystems and BIOS errors?!? may not be ideal failure detection modes.

(Apologies for being so late to this particular party).

--
Darren Hart
Intel Open Source Technology Center
Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

On Sat, 25 Oct 2014, Brian Silverman wrote:
> On Sat, 25 Oct 2014, Thomas Gleixner wrote:
> > > pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
> > > exit_pi_state_list takes the hb lock first, and most callers of
> > > pi_state_free do too. requeue_pi didn't, which causes lots of problems.
> >
> > "causes lots of problems" is not really a good explanation of the root
> > cause. That wants a proper description of the race, i.e.
> >
> >        CPU 0                    CPU 1
> >        ...
> >
> > I'm surely someone who is familiar with that code, but it took me
> > quite some time to understand what's going on. The casual reader will
> > just go into brain spiral mode and give up.
>
> Thinking about it again, I'm no longer so sure that exit_pi_state_list is
> the only place that it can race against. However, I did catch that one
> with a particularly lucky crashdump, so I know it _can_ happen there. Is
> just giving an example for that good?

Sure. That immediately makes clear what's going on.

Thanks,

	tglx
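A sketch of the kind of CPU 0 / CPU 1 timeline being asked for, based on the
exit_pi_state_list() case caught in the crashdump. The exact interleaving is
still the open question in this thread, so this is illustrative only, not the
authoritative race description:

```
CPU 0 (futex_requeue, hb lock dropped)    CPU 1 (waiter being killed, do_exit)
--------------------------------------    ------------------------------------
free_pi_state(pi_state)                   exit_pi_state_list()
  refcount drops to zero                    spin_lock(&hb->lock)
  unlinks pi_state->list and recycles       walks the dying task's
    the object WITHOUT holding the            pi_state_list and finds the
    hb lock                                   same pi_state
                                            dereferences the recycled object
                                            -> NULL deref / corruption
```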
Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

On Sat, 25 Oct 2014, Thomas Gleixner wrote:
> > pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
> > exit_pi_state_list takes the hb lock first, and most callers of
> > pi_state_free do too. requeue_pi didn't, which causes lots of problems.
>
> "causes lots of problems" is not really a good explanation of the root
> cause. That wants a proper description of the race, i.e.
>
>        CPU 0                    CPU 1
>        ...
>
> I'm surely someone who is familiar with that code, but it took me
> quite some time to understand what's going on. The casual reader will
> just go into brain spiral mode and give up.

Thinking about it again, I'm no longer so sure that exit_pi_state_list is
the only place that it can race against. However, I did catch that one
with a particularly lucky crashdump, so I know it _can_ happen there. Is
just giving an example for that good?

> >  static void free_pi_state(struct futex_pi_state *pi_state)
>
> > @@ -1558,6 +1552,14 @@ retry_private:
> >  		ret = get_futex_value_locked(&curval, uaddr1);
> >
> >  		if (unlikely(ret)) {
> > +			if (flags & FLAGS_SHARED && pi_state != NULL) {
>
> Why is this dependent on "flags & FLAGS_SHARED"? The shared/private
> property has nothing to do with that at all, but I might be missing
> something.

Nothing... Good catch. It was a bad rebase.

Thanks,
Brian
Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

On Thu, 23 Oct 2014, Brian Silverman wrote:

First of all. Nice catch!

> pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
> exit_pi_state_list takes the hb lock first, and most callers of
> pi_state_free do too. requeue_pi didn't, which causes lots of problems.

"causes lots of problems" is not really a good explanation of the root
cause. That wants a proper description of the race, i.e.

       CPU 0                    CPU 1
       ...

I'm surely someone who is familiar with that code, but it took me
quite some time to understand what's going on. The casual reader will
just go into brain spiral mode and give up.

> +/**
> + * Must be called with the hb lock held.
> + */

Having that comment is nice, but there is nothing which enforces it. So
we really should add another argument to that function, i.e. struct
futex_hash_bucket *hb, and verify that the lock is held at least when
lockdep is enabled.

>  static void free_pi_state(struct futex_pi_state *pi_state)

> @@ -1558,6 +1552,14 @@ retry_private:
>  		ret = get_futex_value_locked(&curval, uaddr1);
>
>  		if (unlikely(ret)) {
> +			if (flags & FLAGS_SHARED && pi_state != NULL) {

Why is this dependent on "flags & FLAGS_SHARED"? The shared/private
property has nothing to do with that at all, but I might be missing
something.

> +				/*
> +				 * We will have to lookup the pi_state again, so
> +				 * free this one to keep the accounting correct.
> +				 */
> +				free_pi_state(pi_state);
> +				pi_state = NULL;

Instead of copying the same code over and over, we should change
free_pi_state() to handle being called with a NULL pointer.

Thanks,

	tglx
Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

(CCs more eyeballs)

On Thu, 2014-10-23 at 15:28 -0400, Brian Silverman wrote:
> Here's the test code:

Which took a 2 socket 28 core box (NOPREEMPT) out in short order. With
patchlet applied, looks like it'll stay up (37 minutes and counting),
I'll squeak if it explodes.

Tested-by: Mike Galbraith <umgwanakikb...@gmail.com>

[  387.396020] BUG: unable to handle kernel NULL pointer dereference at 0000000000000b34
[  387.414177] IP: [<ffffffff810d411e>] free_pi_state+0x4e/0xb0
[  387.427638] PGD 8394fe067 PUD 847c37067 PMD 0
[  387.438457] Oops: 0002 [#1] SMP
[  387.446534] Modules linked in: nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) iscsi_ibft(E) iscsi_boot_sysfs(E) af_packet(E) ext4(E) crc16(E) mbcache(E) jbd2(E) joydev(E) hid_generic(E) intel_rapl(E) usbhid(E) x86_pkg_temp_thermal(E) iTCO_wdt(E) intel_powerclamp(E) iTCO_vendor_support(E) coretemp(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) ixgbe(E) gf128mul(E) glue_helper(E) ptp(E) ablk_helper(E) pps_core(E) cryptd(E) sb_edac(E) pcspkr(E) mdio(E) edac_core(E) ipmi_si(E) dca(E) ipmi_msghandler(E) lpc_ich(E) mfd_core(E) wmi(E) acpi_power_meter(E) xhci_pci(E) mei_me(E) i2c_i801(E) acpi_pad(E) processor(E) mei(E) xhci_hcd(E) shpchp(E) button(E) dm_mod(E) btrfs(E) xor(E) raid6_pq(E) sr_mod(E) cdrom(E) sd_mod(E) mgag200(E) syscopyarea(E)
[  387.610406]  sysfillrect(E) sysimgblt(E) i2c_algo_bit(E) drm_kms_helper(E) ehci_pci(E) ttm(E) ahci(E) ehci_hcd(E) crc32c_intel(E) libahci(E) drm(E) usbcore(E) libata(E) usb_common(E) sg(E) scsi_mod(E) autofs4(E)
[  387.651513] CPU: 27 PID: 136696 Comm: futex-exit-race Tainted: G            E  3.18.0-default #51
[  387.672339] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0030.R03.1405061547 05/06/2014
[  387.696066] task: ffff880833002250 ti: ffff880830d04000 task.ti: ffff880830d04000
[  387.713855] RIP: 0010:[<ffffffff810d411e>]  [<ffffffff810d411e>] free_pi_state+0x4e/0xb0
[  387.733015] RSP: 0018:ffff880830d07d78  EFLAGS: 00010046
[  387.746030] RAX: 0000000000000000 RBX: ffff8804592ea340 RCX: 0000000000000bb6
[  387.763089] RDX: ffff8804592ea340 RSI: ffff880866c3fb48 RDI: ffff88083b40cb84
[  387.780167] RBP: ffff880830d07d88 R08: ffffc900136b8a70 R09: ffff88046afe8150
[  387.797255] R10: 0000000000000000 R11: 00000000000101b0 R12: ffff880830d07e08
[  387.814360] R13: 0000000000000000 R14: ffffc900136b8a70 R15: 0000000000000001
[  387.831476] FS:  00007fb2637c5700(0000) GS:ffff88087f1a0000(0000) knlGS:0000000000000000
[  387.850710] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  387.864766] CR2: 0000000000000b34 CR3: 00000008374e7000 CR4: 00000000001407e0
[  387.881901] Stack:
[  387.887707]  880830d07d88 880830d07e58 810d52c4
[  387.905497]  001b0001 880830d07e20 0002018230d07dc8 0001
[  387.923290]  c900136b8a84 880830d07f08 7fb2637d838c 00010001
[  387.941071] Call Trace:
[  387.947864]  [<ffffffff810d52c4>] futex_requeue+0x2b4/0x8e0
[  387.961578]  [<ffffffff810d5e89>] do_futex+0xa9/0x580
[  387.974128]  [<ffffffff81585502>] ? do_nanosleep+0x82/0x110
[  387.987814]  [<ffffffff810c414c>] ? hrtimer_nanosleep+0xac/0x160
[  388.002458]  [<ffffffff810d63d1>] SyS_futex+0x71/0x150
[  388.015181]  [<ffffffff815868a9>] system_call_fastpath+0x12/0x17
[  388.029826] Code: 30 48 85 ff 74 41 48 81 c7 34 0b 00 00 e8 7b 1f 4b 00 48 8b 43 08 48 8b 13 48 89 42 08 48 89 10 48 89 1b 48 89 5b 08 48 8b 43 30 <66> 83 80 34 0b 00 00 01 fb 66 0f 1f 44 00 00 48 8b 73 30 48 8d
[  388.075136] RIP  [<ffffffff810d411e>] free_pi_state+0x4e/0xb0
[  388.089266]  RSP <ffff880830d07d78>
[  388.098386] CR2: 0000000000000b34
Re: [PATCH] futex: fix a race condition between REQUEUE_PI and task death

Here's the test code:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <assert.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <inttypes.h>
#include <signal.h>
#include <linux/futex.h>

// Whether to use a pthread mutex+condvar or do the raw futex operations. Either
// one will break.
// There's less user-space code involved with the non-pthread version, but the
// syscalls I think are relevant end up the same (except they're FUTEX_PRIVATE
// with the pthread version).
#define PTHREAD_MUTEX 0

// The number of processes that repeatedly call RunTest to fork off.
// Making this big (like 2500) makes it reproduce faster, but I have seen it go
// with as little as 4.
// With big values, `ulimit -u unlimited` (or something big at least) is
// necessary before running it (or do it as root).
#define NUMBER_TESTERS 2500

#if PTHREAD_MUTEX
#include <pthread.h>

typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t condition;
} Shared;
#else
typedef volatile uint32_t my_futex __attribute__((aligned(sizeof(int))));

void condition_wait(my_futex *c, my_futex *m) {
  syscall(SYS_futex, c, FUTEX_WAIT_REQUEUE_PI, *c, NULL, m);
}
void condition_signal(my_futex *c, my_futex *m) {
  syscall(SYS_futex, c, FUTEX_CMP_REQUEUE_PI, 1, 0, m, *c);
}

typedef struct {
  my_futex mutex;
  my_futex condition;
} Shared;
#endif

void RunTest(Shared *shared) {
#if PTHREAD_MUTEX
  pthread_mutexattr_t mutexattr;
  assert(pthread_mutexattr_init(&mutexattr) == 0);
  assert(pthread_mutexattr_setprotocol(&mutexattr, PTHREAD_PRIO_INHERIT) == 0);
  assert(pthread_mutex_init(&shared->mutex, &mutexattr) == 0);
  assert(pthread_mutexattr_destroy(&mutexattr) == 0);
  assert(pthread_cond_init(&shared->condition, NULL) == 0);
#else
  shared->mutex = shared->condition = 0;
#endif
  pid_t child = fork();
  if (child == 0) {
#if PTHREAD_MUTEX
    pthread_mutex_lock(&shared->mutex);
    pthread_cond_wait(&shared->condition, &shared->mutex);
#else
    condition_wait(&shared->condition, &shared->mutex);
#endif
    _exit(0);
  } else {
    assert(child != -1);
    // This sleep makes it reproduce way faster, but it will still break
    // eventually without it.
    usleep(2);
#if PTHREAD_MUTEX
    assert(pthread_cond_broadcast(&shared->condition) == 0);
#else
    condition_signal(&shared->condition, &shared->mutex);
#endif
    assert(kill(child, SIGKILL) == 0);
    assert(waitpid(child, NULL, 0) == child);
  }
}

int main() {
  Shared *shared_memory = (Shared *)(mmap(NULL, sizeof(Shared) * NUMBER_TESTERS,
                                          PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0));
  int i;
  for (i = 0; i < NUMBER_TESTERS; ++i) {
    if (fork() == 0) {
      while (1) {
        RunTest(&shared_memory[i]);
      }
    }
  }
}

On Thu, Oct 23, 2014 at 3:22 PM, Brian Silverman <bsilver16...@gmail.com> wrote:
> pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
> exit_pi_state_list takes the hb lock first, and most callers of
> pi_state_free do too. requeue_pi didn't, which causes lots of problems.
> Move the pi_state_free calls in requeue_pi to before it drops the hb
> locks which it's already holding.
>
> I have test code that forks a bunch of processes, which all fork more
> processes that do futex(FUTEX_WAIT_REQUEUE_PI), then do
> futex(FUTEX_CMP_REQUEUE_PI) before killing the waiting processes. That
> test consistently breaks QEMU VMs without this patch.
>
> Although they both make it faster to crash, CONFIG_PREEMPT_RT and lots
> of CPUs are not necessary to reproduce this problem. The symptoms range
> from random reboots to corrupting the test program to corrupting whole
> filesystems to strange BIOS errors. Crashdumps (when they get created at
> all) are usually a mess, to the point of crashing tools used to examine
> them.
>
> Signed-off-by: Brian Silverman <bsilver16...@gmail.com>
> ---
> I am not subscribed to the list so please CC me on any responses.
>
>  kernel/futex.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index 815d7af..dc69775 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -639,6 +639,9 @@ static struct futex_pi_state * alloc_pi_state(void)
>  	return pi_state;
>  }
>
> +/**
> + * Must be called with the hb lock held.
> + */
>  static void free_pi_state(struct futex_pi_state *pi_state)
>  {
>  	if (!atomic_dec_and_test(&pi_state->refcount))
> @@ -1519,15 +1522,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
>  	}
>
>  retry:
> -	if (pi_state != NULL) {
> -		/*
> -		 * We will have to lookup the pi_state again, so free this one
> -		 * to keep the accounting correct.
> -		 */
> -		free_pi_state(pi_state);
> -		pi_state = NULL;
> -	}
> -
>  	ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
>  	if (unlikely(ret != 0))
>  		goto out;
> @@ -1558,6 +1552,14 @@ retry_private:
>  		ret = get_futex_value_locked(&curval, uaddr1);
>
>  		if (unlikely(ret)) {
[PATCH] futex: fix a race condition between REQUEUE_PI and task death

pi_state_free and exit_pi_state_list both clean up futex_pi_state's.
exit_pi_state_list takes the hb lock first, and most callers of
pi_state_free do too. requeue_pi didn't, which causes lots of problems.
Move the pi_state_free calls in requeue_pi to before it drops the hb
locks which it's already holding.

I have test code that forks a bunch of processes, which all fork more
processes that do futex(FUTEX_WAIT_REQUEUE_PI), then do
futex(FUTEX_CMP_REQUEUE_PI) before killing the waiting processes. That
test consistently breaks QEMU VMs without this patch.

Although they both make it faster to crash, CONFIG_PREEMPT_RT and lots
of CPUs are not necessary to reproduce this problem. The symptoms range
from random reboots to corrupting the test program to corrupting whole
filesystems to strange BIOS errors. Crashdumps (when they get created at
all) are usually a mess, to the point of crashing tools used to examine
them.

Signed-off-by: Brian Silverman <bsilver16...@gmail.com>
---
I am not subscribed to the list so please CC me on any responses.

 kernel/futex.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 815d7af..dc69775 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -639,6 +639,9 @@ static struct futex_pi_state * alloc_pi_state(void)
 	return pi_state;
 }
 
+/**
+ * Must be called with the hb lock held.
+ */
 static void free_pi_state(struct futex_pi_state *pi_state)
 {
 	if (!atomic_dec_and_test(&pi_state->refcount))
@@ -1519,15 +1522,6 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
 	}
 
 retry:
-	if (pi_state != NULL) {
-		/*
-		 * We will have to lookup the pi_state again, so free this one
-		 * to keep the accounting correct.
-		 */
-		free_pi_state(pi_state);
-		pi_state = NULL;
-	}
-
 	ret = get_futex_key(uaddr1, flags & FLAGS_SHARED, &key1, VERIFY_READ);
 	if (unlikely(ret != 0))
 		goto out;
@@ -1558,6 +1552,14 @@ retry_private:
 		ret = get_futex_value_locked(&curval, uaddr1);
 
 		if (unlikely(ret)) {
+			if (flags & FLAGS_SHARED && pi_state != NULL) {
+				/*
+				 * We will have to lookup the pi_state again, so
+				 * free this one to keep the accounting correct.
+				 */
+				free_pi_state(pi_state);
+				pi_state = NULL;
+			}
 			double_unlock_hb(hb1, hb2);
 			hb_waiters_dec(hb2);
@@ -1617,6 +1619,10 @@ retry_private:
 		case 0:
 			break;
 		case -EFAULT:
+			if (pi_state != NULL) {
+				free_pi_state(pi_state);
+				pi_state = NULL;
+			}
 			double_unlock_hb(hb1, hb2);
 			hb_waiters_dec(hb2);
 			put_futex_key(&key2);
@@ -1632,6 +1638,10 @@ retry_private:
 			 *   exit to complete.
 			 * - The user space value changed.
 			 */
+			if (pi_state != NULL) {
+				free_pi_state(pi_state);
+				pi_state = NULL;
+			}
 			double_unlock_hb(hb1, hb2);
 			hb_waiters_dec(hb2);
 			put_futex_key(&key2);
@@ -1708,6 +1718,8 @@ retry_private:
 	}
 
 out_unlock:
+	if (pi_state != NULL)
+		free_pi_state(pi_state);
 	double_unlock_hb(hb1, hb2);
 	hb_waiters_dec(hb2);
@@ -1725,8 +1737,6 @@ out_put_keys:
 out_put_key1:
 	put_futex_key(&key1);
 out:
-	if (pi_state != NULL)
-		free_pi_state(pi_state);
 	return ret ? ret : task_count;
 }
-- 
1.7.10.4