I have a fairly classic master/slave setup. Response times on the slave are generally good, with periodic blips that appear to coincide with replication.
Occasionally, however, the process will have one incredibly slow query and will peg the CPU at 100%. The weird thing is that it stays that way even if we stop querying it, stop replication, and then wait for over 20 minutes. The only way to fix the problem at that point is to restart Tomcat.

Looking at the slow queries around the time of the incident, they don't look particularly bad: they're predominantly filter queries running under dismax, and there doesn't seem to be anything unusual about them.

The index is about 266G and the disk has 30G free. The machine has 50G of RAM and is running with -Xmx35G.

Looking at the running processes, it appears to be the main Java thread that's CPU bound, not the child threads.

Stracing the process gives a lot of brk calls (presumably some sort of wait loop) with occasional blips of:

    mprotect(0x7fc5721d9000, 4096, PROT_READ) = 0
    futex(0x451c24a4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x451c24a0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x4269dd14, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x4269dd10, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x7fbc941603b4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 325, {1294683789, 614186000}, ffffffff) = 0
    futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
    mprotect(0x7fc5721d8000, 4096, PROT_READ) = 0
    mprotect(0x7fc5721d8000, 4096, PROT_READ|PROT_WRITE) = 0
    futex(0x7fbc94eeb5b4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fbc94eeb5b0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x426a6a28, FUTEX_WAKE_PRIVATE, 1) = 1
    mprotect(0x7fc5721d9000, 4096, PROT_NONE) = 0
    futex(0x41cae8f4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x41cae8f0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
    futex(0x41cae328, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x7fbc941603b4, FUTEX_WAIT_PRIVATE, 327, NULL) = 0
    futex(0x41d19b28, FUTEX_WAKE_PRIVATE, 1) = 0
    mmap(0x7fc2e0230000, 121962496, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc2e0230000
    mmap(0x7fbca58e0000, 237568, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fbca58e0000

Any ideas about what's happening and whether there's any way to mitigate it? If the box at least recovered, I could run another slave and load balance between them, on the principle that the second box would pick up the slack whilst the first restabilised. As it is, that's not reliable.

Thanks,

Simon