It's a known issue that compaction may never complete under heavy
write load. At some point we should probably implement a mechanism to
throttle writes if compaction isn't making enough progress while
updates keep arriving.
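
To make the idea concrete, here is a minimal sketch in Python (not
CouchDB's Erlang, and not existing CouchDB code). It assumes a
hypothetical throttle_delay hook that the write path would consult, fed
with the number of updates still to be copied at the start of two
successive compaction passes. The 0.5 target ratio and the 50 ms cap
are arbitrary illustrative values.

```python
def throttle_delay(prev_backlog, backlog, max_delay_s=0.05, target_ratio=0.5):
    """Seconds to delay each incoming write (hypothetical hook).

    prev_backlog and backlog are the number of updates the compactor had
    left to copy at the start of the previous and the current pass.
    """
    if prev_backlog <= 0 or backlog <= 0:
        return 0.0  # nothing to compare against, or nothing left to copy
    ratio = backlog / prev_backlog
    if ratio <= target_ratio:
        return 0.0  # backlog is shrinking fast enough; do not throttle
    # Ramp linearly from no delay at target_ratio to the full delay once
    # the backlog has stopped shrinking (ratio >= 1.0).
    scale = (ratio - target_ratio) / (1.0 - target_ratio)
    return max_delay_s * min(1.0, scale)
```

With the pass sizes reported below, going from 1473387 to 617008
outstanding updates would not throttle at all, while a backlog that
stops shrinking would get the full delay.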
-Damien
On Dec 22, 2008, at 7:32 PM, Adam Kocoloski wrote:
Hi, I ran into an odd failure mode last week and I thought I'd ask
around here to see if anyone has seen something similar. I have a
CouchDB server (recent trunk) on a large EC2 instance with a DB that
sees a constant update rate of ~50 Hz. I triggered a compaction
when the DB had reached ~27M update sequences (80 GB in total). The
first pass finished after 7h40m, but of course another 1.4M updates
had been written to the original DB. So far, so good.
Unfortunately, the subsequent iterations of copy_compact() ran much
slower than that original pass. After a few passes, the compactor
rate was equal to the new write rate, so it effectively entered a
runaway mode. The stats looked like this:
Pass  1:  7h40m  27870955 docs  1010 Hz
Pass  2:  3h44m   1473387 docs   110 Hz
Pass  3:  2h58m    617008 docs    58 Hz
Pass  4:  2h44m    450607 docs    46 Hz
.....
Pass 23:  4h08m    719541 docs    48 Hz
Pass 24:  1h04m    436105 docs   113 Hz
Pass 25:  21 seconds -- done.
We stopped the new write load sometime after the end of Pass 23, and
the compaction finished soon after that.
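
The rates in the table are docs divided by wall-clock seconds. A pass
only shrinks the backlog if that rate exceeds the incoming write rate,
because the next pass has to copy whatever arrived while the current
one was running. A quick Python check of the arithmetic, assuming a
flat 50 Hz write rate as quoted above (the helper names are made up for
this sketch; Pass 24 is left out because the write load had been
stopped by then):

```python
WRITE_RATE_HZ = 50  # approximate update rate quoted above (assumed constant)


def seconds(hours, minutes):
    """Convert an XhYYm duration from the table into seconds."""
    return hours * 3600 + minutes * 60


def compaction_rate(duration_s, docs):
    """Docs copied per second during one pass."""
    return docs / duration_s


def expected_next_backlog(duration_s, write_rate_hz=WRITE_RATE_HZ):
    """Updates that arrive while a pass of this duration is running."""
    return duration_s * write_rate_hz


# (pass number, duration in seconds, docs copied), taken from the table
passes = [
    (1, seconds(7, 40), 27870955),
    (2, seconds(3, 44), 1473387),
    (3, seconds(2, 58), 617008),
    (4, seconds(2, 44), 450607),
    (23, seconds(4, 8), 719541),
]

for n, duration_s, docs in passes:
    rate = compaction_rate(duration_s, docs)
    arriving = expected_next_backlog(duration_s)
    shrinking = rate > WRITE_RATE_HZ
    print(f"Pass {n}: {rate:.0f} Hz, ~{arriving} new updates, shrinking={shrinking}")
```

Under this simple model the backlog changes by a factor of roughly
(write rate / compaction rate) per pass, so once the compactor drops to
about 50 Hz the passes stop getting shorter.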
We turned the write load back on and have been compacting the DB
once/day ever since. We haven't seen this runaway mode again. I've
reviewed the compaction code a couple of times, but I can't figure
out what would cause such a dramatic slowdown. Our system
monitoring wasn't able to turn up any red flags, either -- in
particular, all the latency/throughput/IOPS stats for the disk
hosting the database were pretty much constant throughout the
lifetime of the compaction.
Best, Adam