Given the results of Aaron's experiments over the weekend, here's a summary of 
what we think we're seeing:
 - There is an unsafe interaction between two threads.
 - This interaction can only be observed in a very small time window, and 
even then only rarely.
 - The interaction is observable only when TLE is enabled.  We posit this is 
because TLE makes lock/unlock fast enough for the two threads to land 
inside that time window.
 - The time window is sufficiently small that a relatively fast syscall to 
mprotect is sufficient to disrupt the timing.
 - If all threads are forced to run on separate processors (SMT=1), this is 
sufficient to disrupt the timing.

We believe this is an application problem.  Further "evidence" against a
problem in TLE itself is that TLE has been enabled for ppc64el on Ubuntu
for over 18 months without any similar reports.

Unfortunately, I don't think we have any way to directly debug the
problem, due to the narrow time window.  We have considered hardware
watchpoints.  There is a DAWR (data address watch register) for each
thread, but using it in this case seems impractical.  GDB's
implementation of hardware watchpoints in a multithreaded environment is
such that, when a hardware watchpoint is set, it is set to the same
address for all threads.  So even if we were to script the setting of a
hardware watchpoint under GDB, the time required for GDB to set up the
watchpoint address on the fly would surely exceed the critical time
window.  You could try it, but I wouldn't expect much.

Further debugging seems to require application knowledge or a code crawl
of some kind.  The setting of two flag bytes is the only clue we have.
To my knowledge we have never seen only a single byte clobbered in the
canary, and the two bytes appear to be aligned on a 16-bit boundary.  So
the code doing the clobbering very likely contains a store-halfword (sth
or, less likely, sthx or sthu) instruction.  You could examine the
disassembly of the application for occurrences of sth to narrow the
field of search.
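
A minimal sketch of such a scan in Python, in case it saves some typing.
It assumes you have saved a text disassembly (for example from
objdump -d) and that each instruction line has whitespace before the
mnemonic.  The function name and the sample lines are made up for
illustration; they do not come from the application.

```python
import re

# Mnemonic preceded and followed by whitespace, then the operand field.
# Covers the three halfword stores mentioned above: sth, sthx, sthu.
HALFWORD_STORE = re.compile(r"\s(sth|sthx|sthu)\s+(\S+)")


def find_halfword_stores(lines):
    """Return (line_number, mnemonic, operands) for each halfword store."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = HALFWORD_STORE.search(line)
        if match:
            hits.append((number, match.group(1), match.group(2)))
    return hits


# Hypothetical disassembly lines, only to show the expected input shape.
sample = [
    "    10000778:\tsth     r5,0(r9)",
    "    1000077c:\tstb     r3,0(r9)",
    "    10000780:\tsthx    r5,r9,r10",
]

for number, mnemonic, operands in find_halfword_stores(sample):
    print(number, mnemonic, operands)
```

To run it against the real thing, pass the lines of the saved
disassembly file instead of the sample list.  Expect a lot of hits in a
binary this size; this only narrows the field.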

Alternatively, the clobber could come from two individual byte stores
(stb), in which case you'd be looking for two instructions very close
together of the form:

  stb r<x>,<n>(r<b>)
  stb r<y>,<n+1>(r<b>)

Example:

  stb r5,0(r9)
  stb r6,1(r9)
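
The same kind of scan can be sketched for the two-store case.  This is
only a rough filter under stated assumptions: it works on a text
disassembly, treats "very close together" as within a few lines
(max_gap is an arbitrary choice of mine), and does not check whether
the base register is modified between the two stores.  All names here
are hypothetical.

```python
import re

# "stb rS,D(rA)"; the "r" prefix is optional in case the disassembler
# prints bare register numbers.  Captures the offset D and base rA.
STB = re.compile(r"\sstb\s+r?\d+,(-?\d+)\(r?(\d+)\)")


def find_adjacent_byte_stores(lines, max_gap=3):
    """Return (first_line, second_line) pairs of stb instructions that hit
    consecutive offsets from the same base register within max_gap lines."""
    recent = []  # (line_number, offset, base) for recently seen stb
    pairs = []
    for number, line in enumerate(lines, start=1):
        match = STB.search(line)
        if not match:
            continue
        offset, base = int(match.group(1)), match.group(2)
        # Forget stores that are too far back to count as "close together".
        recent = [r for r in recent if number - r[0] <= max_gap]
        for prev_number, prev_offset, prev_base in recent:
            if prev_base == base and abs(offset - prev_offset) == 1:
                pairs.append((prev_number, number))
        recent.append((number, offset, base))
    return pairs


# Hypothetical input: the first two lines are the example pair above.
sample = [
    "  stb r5,0(r9)",
    "  stb r6,1(r9)",
    "  stb r7,8(r10)",
    "  add r3,r4,r5",
    "  stb r8,4(r9)",
]

print(find_adjacent_byte_stores(sample))
```

The reported numbers are line numbers in the disassembly text, so you
can jump straight to the candidates in an editor.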

Obviously this is a huge application so this doesn't help much in and of
itself, but perhaps if you've already narrowed the problem down
somewhat, this could be helpful.

I am running low on ideas for how we can help you debug the problem.
We'll continue discussing it; if any better thoughts arise, we'll be
sure to let you know.

Bill

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
  MongoDB Memory corruption
