On 06/07/2012 04:43 PM, Arnt Gulbrandsen wrote:
On 06/07/2012 04:37 PM, Lee Garrett wrote:
I'm tring to debug a 100% CPU lockup with aox. I'm experiencing the
bug in 3.1.3 and in master branch of the git repo (up to commit e96c93d).
There is a IMAP folder with 2000+ new messages, my mail client
accesses it ( UID fetch 1:* (FLAGS) (CHANGEDSINCE 13466) ), the SQL
query takes maybe a sec or two and after that aox "loops" in
Allocator::allocate.
359 if ( taken< capacity ) {
(gdb) bt
#0 0x0000000000547a88 in Allocator::allocate (this=0x21f5270, size=24,
pointers=1) at core/allocator.cpp:359
#1 0x0000000000547b0e in Allocator::allocate (this=0x232b310, size=24,
pointers=1) at core/allocator.cpp:393
#2 0x0000000000547b0e in Allocator::allocate (this=0x2363d60, size=24,
pointers=1) at core/allocator.cpp:393
#3 0x0000000000547b0e in Allocator::allocate (this=0x232abb0, size=24,
pointers=1) at core/allocator.cpp:393
#4 0x0000000000547b0e in Allocator::allocate (this=0x232b7b0, size=24,
pointers=1) at core/allocator.cpp:393
#5 0x0000000000547b0e in Allocator::allocate (this=0x231ab80, size=24,
pointers=1) at core/allocator.cpp:393
#6 0x0000000000547b0e in Allocator::allocate (this=0x21f1850, size=24,
pointers=1) at core/allocator.cpp:393
#7 0x0000000000547b0e in Allocator::allocate (this=0x23aaca0, size=24,
pointers=1) at core/allocator.cpp:393
[...]
Sounds like memory corruption affecting the allocator. Do the 'this'
values ever repeat?
I checked three different backtraces. All the 'this' values on
allocate() within a backtrace are unique.
If you change BlockShift in core/allocator.cpp to e.g. 20 or 24, does
the stack change? If you build with jam -sOPTIM= install, does the bug
go away or change?
I tried jam -sOPTIM= first, the problem was still there. Though it did
enter the "loop" from a different part of the code:
[...]
#270 0x000000000059e008 in Allocator::allocate (this=0x147f4d0, size=16,
pointers=2) at core/allocator.cpp:393
#271 0x000000000059e008 in Allocator::allocate (this=0x13e2010, size=16,
pointers=2) at core/allocator.cpp:393
#272 0x000000000059e008 in Allocator::allocate (this=0x1401830, size=16,
pointers=2) at core/allocator.cpp:393
#273 0x000000000059e008 in Allocator::allocate (this=0x1401c00, size=16,
pointers=2) at core/allocator.cpp:393
#274 0x000000000059d4d1 in Allocator::alloc (s=16, n=2) at
core/allocator.cpp:183
#275 0x0000000000596772 in Garbage::operator new (s=16) at
core/global.cpp:48
#276 0x00000000004ca0ed in PgDataRow (this=0x7fff6b7320d0,
b=0x7ffc347c1948, d=0x7ffc323c0a88) at db/pgmessage.cpp:965
#277 0x00000000004bc456 in Postgres::process (this=0x7ffc347c1748,
type=68 'D') at db/postgres.cpp:579
#278 0x00000000004ba8fb in Postgres::react (this=0x7ffc347c1748,
e=Connection::Read) at db/postgres.cpp:326
#279 0x000000000055bf02 in EventLoop::dispatch (this=0x7ffc34802248,
c=0x7ffc347c1748, r=true, w=false, now=1339083595) at
server/eventloop.cpp:463
#280 0x000000000055b60b in EventLoop::start (this=0x7ffc34802248) at
server/eventloop.cpp:306
#281 0x0000000000562c2d in Server::run (this=0x7fff6b732daf) at
server/server.cpp:645
#282 0x0000000000404501 in main (argc=2, argv=0x7fff6b732eb8) at
archiveopteryx/archiveopteryx.cpp:257
However, changing BlockShift to 20 did fix my problem. The IMAP folder
did finish loading and my aox instance is still responsive. I guess
throwing larger memory chunks at aox did help :)
Thanks for helping me with debugging!
Greets, Lee
17 means "allocate 2^17 bytes from the OS at a time", 128k. 20 is 1MB,
24 16MB.
taken and capacity are always the same size, so that block gets
skipped and AFAICS a new Allocator object gets added to the chain. I
don't known enough of aox's memory allocator to understand the gist of
the problem. What strikes me as odd is that apparently
Allocator::alloc tries to find 4.3 billion 44 byte ranges?
No, that function is alloc( number of bytes, maximum number of pointers
) and the default number of pointers is UINT_MAX, meaning "all of the
bytes are pointers". So it's looking for a 44-byte object which may
contain pointers anywhere.
Arnt