I have a fairly classic master/slave setup.

Response times on the slave are generally good with blips periodically, 
apparently when replication is happening.

Occasionally, however, the process will have one incredibly slow query 
and will peg the CPU at 100%.

The weird thing is that it will remain that way even if we stop querying 
it, stop replication, and then wait for over 20 minutes. The only way 
to fix the problem at that point is to restart Tomcat.

Looking at slow queries around the time of the incident they don't look 
particularly bad - they're predominantly filter queries running under 
dismax and there doesn't seem to be anything unusual about them.

The index is about 266G and the disk it sits on has 30G free. The 
machine has 50G of RAM and is running with -Xmx35G.

Looking at the processes running it appears to be the main Java thread 
that's CPU bound, not the child threads. 
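
In case it helps anyone reproduce the diagnosis, here is a minimal sketch 
of how the busy thread could be matched to a Java stack trace. It assumes 
the decimal thread ID comes from something like top -H, and that a thread 
dump has been saved from jstack, where each thread header carries the 
native ID in hex as nid=0x.... The function name find_thread and the 
sample values are hypothetical, purely for illustration.

```python
import re
from typing import Optional


def find_thread(dump: str, tid: int) -> Optional[str]:
    """Return the stack block for the thread whose native ID equals tid.

    dump is the text of a jstack-style thread dump; tid is the decimal
    thread ID as shown by top -H (an assumption about how it was obtained).
    """
    wanted = f"nid={tid:#x}"  # e.g. 6699 -> "nid=0x1a2b"
    # Thread entries in the dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump.strip()):
        header = block.splitlines()[0]
        # Compare whole tokens so 0x1a2 does not match 0x1a2b.
        if wanted in header.split():
            return block
    return None
```

If the block that comes back is a GC or VM thread rather than a request 
thread, that would point at the JVM itself rather than at a query.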

Stracing the process gives a lot of brk calls (presumably some 
sort of wait loop) with occasional blips of: 


mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 325, {1294683789, 614186000}, ffffffff) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
mmap(0x7fc2e0230000, 121962496, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc2e0230000
mmap(0x7fbca58e0000, 237568, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fbca58e0000

Any ideas about what's happening and whether there's any way to mitigate 
it? If the box at least recovered, then I could run another slave and 
load balance between them, working on the principle that the second box 
would pick up the slack whilst the first box restabilised but, as it is, 
that's not reliable.

Thanks,

Simon
