The following are reproduction instructions for the behavior that we are
observing on Ubuntu 16.04 ppc64le. We have run this same test on RHEL 7.1
ppc64le and do not observe any stack corruption there. Building and
running this repro may depend on certain system libraries (SSL, etc.) or
Python libraries being available on the system; please install them as
needed. The particular commit used here is not special: it is simply a
fairly recent one that I know demonstrates the issue.

- git clone https://github.com/mongodb/mongo.git
- cd mongo
- git checkout 3220495083b0d678578a76591f54ee1d7a5ec5df
- git apply acm.nov9.patch
- python ./buildscripts/scons.py CC=/usr/bin/gcc CXX=/usr/bin/g++ \
    CCFLAGS="-mcpu=power8 -mtune=power8 -mcmodel=medium" --ssl --implicit-cache \
    --build-fast-and-loose -j$(echo "$(grep -c processor /proc/cpuinfo)/2" | bc) \
    ./mongo ./mongod ./mongos
- ulimit -c unlimited && python buildscripts/resmoke.py \
    --suites=concurrency_sharded --storageEngine=wiredTiger \
    --excludeWithAnyTags=requires_mmapv1 --dbpathPrefix=... --repeat=500 \
    --continueOnFailure

Note that you should provide an actual directory as the value of the
--dbpathPrefix argument in the last step, as this is where the running
database instances will store data.

You will need to leave this running for several hours, perhaps
overnight. In our runs, we find that about 1% of the repeated runs of
the test fail, dropping a core.
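
As a rough guide to why a long run is needed, the arithmetic can be sketched as below. This is only an illustration: it assumes each repeated run fails independently with the same probability, which the report does not establish, and the function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Probability of seeing at least one failing run out of `runs`, ASSUMING each
// run fails independently with probability `failureRate`. The report only
// gives an observed rate of about 1%; independence is an assumption.
double probAtLeastOneFailure(double failureRate, int runs) {
    return 1.0 - std::pow(1.0 - failureRate, runs);
}

// Expected number of failing runs under the same assumption.
double expectedFailures(double failureRate, int runs) {
    return failureRate * runs;
}
```

Under that assumption, a 1% rate and --repeat=500 give about five expected failures and a better than 99% chance of at least one core, whereas 100 repeats would leave roughly a one in three chance of seeing nothing.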

The core files are typically (but not always!) associated with crashes
of the mongos binary inside one of the several mongo::bsonExtractXXX
functions, where we find our hand-rolled stack canary to be corrupted. A
typical stack trace of a crashing thread looks like:

$ gdb ./mongos core.2016-11-09T23:11:56+00:00
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc64le-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./mongos...done.
[New LWP 3821]
...
[New LWP 3736]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
Core was generated by `/home/acm/opt/src/mongo/mongos --configdb test-configRS/ubuntu1604-ppc-dev.pic.'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00003fff779ff21c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x3fff5d98f140 (LWP 3821))]
(gdb) bt
#0  0x00003fff779ff21c in __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00003fff77a01894 in __GI_abort () at abort.c:89
#2  0x000000005d899504 in mongo::fassertFailedWithLocation (msgid=<optimized out>, file=0x5e6f9570 "src/mongo/bson/util/bson_extract.cpp", line=<optimized out>) at src/mongo/util/assert_util.cpp:172
#3  0x000000005d95cb64 in mongo::fassertWithLocation (line=46, file=0x5e6f9570 "src/mongo/bson/util/bson_extract.cpp", testOK=<optimized out>, msgid=100001) at src/mongo/util/assert_util.h:273
#4  mongo::(anonymous namespace)::Canary::Canary (t=0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>..., this=<synthetic pointer>) at src/mongo/bson/util/bson_extract.cpp:46
#5  mongo::bsonExtractTypedField (object=owned BSONObj 34 bytes @ 0x10008d85a37, fieldName=..., type=mongo::Bool, outElement=0x3fff5d98c730) at src/mongo/bson/util/bson_extract.cpp:83
#6  0x000000005d95cc5c in mongo::bsonExtractBooleanField (object=..., fieldName=..., out=0x3fff5d98c798) at src/mongo/bson/util/bson_extract.cpp:101
#7  0x000000005df5e7ec in mongo::AutoSplitSettingsType::fromBSON (obj=...) at src/mongo/s/balancer_configuration.cpp:400

The hand-rolled canary is implemented as follows:

class Canary {
public:

    static constexpr size_t kSize = 2048;

    __attribute__((always_inline)) explicit Canary(volatile unsigned char* const t) noexcept
        : _t(t) {
        __builtin_memset(const_cast<unsigned char*>(t), kBits, kSize);
        fassert(100001, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
    }

    __attribute__((always_inline)) ~Canary() {
        fassert(100002, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
    }

private:
    static constexpr uint8_t kBits = 0xCD;
    static constexpr size_t kChecksum = kSize * size_t(kBits);

    const volatile unsigned char* const _t;
};

}  // namespace

The setup of the Canary in mongo::bsonExtractField is here:

Status bsonExtractField(const BSONObj& object, StringData fieldName, BSONElement* outElement) {

    volatile unsigned char* const cookie =
        static_cast<unsigned char*>(alloca(Canary::kSize));
    const Canary c(cookie);

    ...

}

In the crash above, the hand-rolled canary detected stack corruption in
the Canary constructor. We memset the bytes, then immediately read them
back to checksum them, and the checksum does not match. We also
sometimes see the checksum fail in the Canary destructor, but the
constructor case is more interesting: what could have happened to the
memory between when we memset it and when we read it back? One
hypothesis would be that we had leaked a pointer to a local variable to
another thread, which then wrote to it. If that were the case, though,
we would expect to see crashes all the time, and on other systems, and
we don't. More details on that below.
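
To make the constructor path concrete, here is a minimal standalone sketch of the same fill-then-verify pattern. It is not the code from the patch: it reuses the constants shown above, but the function names `intact` and `fillAndVerify` are hypothetical, and it reports a mismatch through a return value instead of aborting via fassert.

```cpp
#include <alloca.h>  // alloca: non-standard, available with glibc
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kSize = 2048;
constexpr std::uint8_t kBits = 0xCD;
constexpr std::size_t kChecksum = kSize * std::size_t(kBits);

// Sum the region through a volatile pointer so every byte is really read,
// then compare against the sum expected for an untouched region.
inline bool intact(const volatile unsigned char* t) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < kSize; ++i) sum += t[i];
    return sum == kChecksum;
}

// Mirrors the constructor path: fill a fresh stack region with the pattern,
// then immediately read it back. (Hypothetical helper for illustration.)
bool fillAndVerify() {
    volatile unsigned char* const cookie =
        static_cast<unsigned char*>(alloca(kSize));
    std::memset(const_cast<unsigned char*>(cookie), kBits, kSize);
    return intact(cookie);
}
```

In a single thread with no outside interference, `fillAndVerify()` should return true on every call, since nothing in the function writes to the region between the memset and the read-back. The crash above corresponds to that same check failing.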

Looking at the corrupted memory, we see:

(gdb) frame 4
#4  mongo::(anonymous namespace)::Canary::Canary (t=0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>..., this=<synthetic pointer>) at src/mongo/bson/util/bson_extract.cpp:46
46              fassert(100001, std::accumulate(&_t[0], &_t[kSize], 0UL) == kChecksum);
(gdb) print t
$1 = (volatile unsigned char * const) 0x3fff5d98be90 '\315' <repeats 199 times>, <incomplete sequence \315>...
(gdb) x /2048ub t
0x3fff5d98be90: 205     205     205     205     205     205     205     205
...
0x3fff5d98c010: 205     205     205     205     205     205     205     205
0x3fff5d98c018: 205     205     205     205     205     205     205     205
0x3fff5d98c020: 205     205     205     205     205     205     1       0
0x3fff5d98c028: 205     205     205     205     205     205     205     205
...
0x3fff5d98c680: 205     205     205     205     205     205     205     205
0x3fff5d98c688: 205     205     205     205     205     205     205     205

Interestingly, the corrupted bytes are always two bytes, always either
0x00 or 0x01, and always starting at an offset aligned 0xe.
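
A checksum only says that something changed. To record which bytes changed, and to what values, a scan along the following lines could be used. This is a hypothetical diagnostic helper, not part of the patch referenced above; the name `findCorruptedBytes` is made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical diagnostic helper: returns the (offset, value) of every byte
// in [t, t + size) that differs from `expected`.
std::vector<std::pair<std::size_t, std::uint8_t>> findCorruptedBytes(
    const volatile unsigned char* t, std::size_t size, std::uint8_t expected) {
    std::vector<std::pair<std::size_t, std::uint8_t>> bad;
    for (std::size_t i = 0; i < size; ++i) {
        const std::uint8_t v = t[i];  // read each byte exactly once
        if (v != expected) bad.emplace_back(i, v);
    }
    return bad;
}
```

For the dump above, the two modified bytes are at 0x3fff5d98c026 and 0x3fff5d98c027, so with t at 0x3fff5d98be90 such a scan would report offsets 0x196 and 0x197 with values 1 and 0.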

We have tried several things to narrow down the range of possible
causes.

- We have reproduced with our home built GCC 5.4.
- We have reproduced with the system GCC 5.4.
- We have reproduced with clang-3.9.
- We have run the testcase under the clang address sanitizer, with
  ASAN_OPTIONS=detect_stack_use_after_return=1.
- We have run on different hardware to ensure that this is not bad memory.
- We have run on bare metal to ensure that this is not related to the
  virtualization layer on which most of our ppc64le Ubuntu 16.04 instances run.
- The same test case is part of our continuous integration loop and runs
  nightly across dozens of operating systems and compiler variations, including
  Windows, OS X, and Linux on x86_64 (including Ubuntu 16.04).
- We have run the same test case on RHEL 7.1 ppc64le.

Outside of Ubuntu 16.04 on POWER8, we have not been able to reproduce
the issue in any of these cases. It appears only on Ubuntu 16.04, and
only when running that OS on POWER8.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1640518

Title:
  MongoDB Memory corruption

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/gcc-5/+bug/1640518/+subscriptions
