Ulrich Weigand made an interesting comment on the glibc code in our internal bug (no longer mirrored; the public copy is Launchpad bug #1640518, "MongoDB Memory corruption", at https://bugs.launchpad.net/bugs/1640518), so I'm mirroring it by hand here.
--- Comment #77 from Ulrich Weigand <[email protected]> ---

According to comment #45, we see an invalid read at __lll_unlock_elision (elision-unlock.c:36) and an invalid write at the next line. Looking at the code, those are:

    int
    __lll_unlock_elision (int *lock, short *adapt_count, int pshared)
    {
      /* When the lock was free we're in a transaction.  */
      if (*lock == 0)
        __libc_tend (0);
      else
        {
          lll_unlock ((*lock), pshared);

          /* Update the adapt count AFTER completing the critical section.
             Doing this here prevents unneeded stalling when entering
             a critical section.  Saving about 8% runtime on P8.  */
          if (*adapt_count > 0)
            (*adapt_count)--;
        }
      return 0;
    }

exactly the two lines accessing *adapt_count (which is a short).

What seems suspicious here is the fact that these lines happen (deliberately) *outside* of the critical section protected by the mutex. Now this should indeed normally be OK. However, I understand it is allowed to use that critical section to also protect the lifetime of the memory object holding the mutex itself. If this is done, I'm wondering whether those two lines outside the critical section may now access that *memory* outside of its protected lifetime ...

Just as a theoretical example, assume you have thread 1 executing a function that allocates a mutex on its stack, locks the mutex, and passes its address to a thread 2. Thread 2 now does some work and then unlocks the mutex. Thread 1, in the meantime, attempts to lock the mutex again, which will block until it is unlocked by thread 2. After it gets the lock, thread 1 immediately unlocks the mutex again and then the function returns.

Now in this scenario, the cross-thread lock-unlock pair cannot use TM, so it will fall back to regular locks, which means the unlock in thread 2 will perform the *adapt_count update. On the other hand, the second lock-unlock pair in thread 1 might just happen to go through with TM, and thus not touch *adapt_count at all. This means there is no synchronization between the update of *adapt_count in thread 2 and the re-use of the stack memory underlying the mutex in thread 1, which could lead to the *adapt_count update being seen *after* that reuse in thread 1, clobbering memory ...

Am I missing something here? Of course, I'm not sure if MongoDB shows that exact pattern. But I guess different patterns might show similar races too. One simple test might be to move the *adapt_count update to before the lll_unlock and check whether the issue still reproduces. Not sure about performance impacts, however.
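To make Ulrich's theoretical scenario concrete, here is a minimal pthreads sketch of the pattern; this is my own illustration, not code from MongoDB or from the bug. POSIX explicitly allows a mutex to be destroyed, and its memory reused, immediately after unlocking it, provided no other thread will touch it afterwards; that is exactly the lifetime guarantee the late *adapt_count store breaks. (Ulrich's scenario also has thread 2 unlock a mutex that thread 1 locked; with glibc's default mutex type this works in practice, though POSIX leaves it undefined.)

    #include <pthread.h>
    #include <stddef.h>

    /* Thread 2: does some work, then unlocks the mutex that thread 1
       locked.  With elision, this unlock takes the fallback path (the
       lock was really acquired), so it first releases *lock and only
       afterwards decrements *adapt_count.  */
    static void *
    thread2 (void *arg)
    {
      pthread_mutex_t *m = arg;
      /* ... do some work under the lock ... */
      pthread_mutex_unlock (m);
      return NULL;
    }

    static void
    run_once (void)
    {
      /* The mutex lives in this stack frame; its critical section is
         the only thing keeping the frame alive for thread 2.  */
      pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
      pthread_t t;

      pthread_mutex_lock (&m);
      pthread_create (&t, NULL, thread2, &m);
      pthread_detach (t);

      pthread_mutex_lock (&m);    /* blocks until thread 2 stores to *lock */
      pthread_mutex_unlock (&m);  /* may commit via TM: never touches
                                     *adapt_count */
    }                             /* frame dies here; nothing orders
                                     thread 2's late (*adapt_count)--
                                     before this point */

    int
    main (void)
    {
      run_once ();
      run_once ();  /* reuses the same stack region, which the stray
                       short-sized store can now clobber */
      return 0;
    }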

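The experiment Ulrich suggests at the end would just reorder the fallback path of __lll_unlock_elision so that the *adapt_count store happens while the lock is still held, along these lines (a sketch of the suggested test, not a reviewed glibc patch):

    int
    __lll_unlock_elision (int *lock, short *adapt_count, int pshared)
    {
      /* When the lock was free we're in a transaction.  */
      if (*lock == 0)
        __libc_tend (0);
      else
        {
          /* Decrement *adapt_count while the lock is still held, so the
             store is ordered before the release of *lock and can no
             longer land after the mutex memory's protected lifetime
             has ended.  */
          if (*adapt_count > 0)
            (*adapt_count)--;

          lll_unlock ((*lock), pshared);
        }
      return 0;
    }

The trade-off is the one the code comment itself calls out: the decrement now happens inside the critical section, so it reintroduces some of the stalling that the out-of-line update was avoiding (the quoted comment claims about 8% runtime saved on P8).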