On Mar 2, 2011, at 2:33 PM, Wayne Conrad wrote:

> We run a compaction script that compacts every database every night.
> Compaction of our biggest (0.6 TB) database took about 10 hours today.
> Granted, the hardware has poor I/O bandwidth, but even if we improve the
> hardware, a change in strategy could be good. Along with splitting that
> database into more manageable pieces, I hope to write a compaction script
> that only compacts a database sometimes (a la Postgresql's autovacuum). To
> do that, I want some way to estimate whether there's anything to gain from
> compacting any given database.
>
> I thought I could use the doc_del_count returned by GET /<database-name> as a
> gauge of whether to compact or not, but in my tests doc_del_count remained
> the same after compaction. Are there any statistics, however imperfect, that
> could help my code guess when compaction ought to be done?
>
> Best Regards,
> Wayne Conrad
Hi Wayne,

I don't think there's a satisfactory solution to this at the moment, which is
why I've been working with Bob Dionne to add some more detailed statistics to
help inform that kind of decision-making. The idea is to add a new field to
the response to GET /dbname (and GET /db/_design/dname/_info) which will
report the number of bytes allocated for storage of "user data"; i.e. the
latest versions of document bodies and attachments in databases, and KV pairs
and reductions in view indexes. You could then write a script to trigger
compaction if the ratio of "data_size" / disk_size drops below a threshold.

Bob has a pull request in progress for BigCouch; the changes he's making
should apply to CouchDB as well with a little tweaking.

By the way, you're right that doc_del_count does not change before and after
compaction. The document bodies are removed, but a small record is retained
for the purposes of replication and in-progress view index updates.

Regards, Adam
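To make that concrete, here is a rough Python sketch of the kind of trigger
script Adam describes. It assumes the proposed "data_size" field has landed
in the GET /dbname response (it is not present in current releases), and the
COUCH URL, THRESHOLD value, and maybe_compact helper are placeholders, not
anything shipped with CouchDB:

    #!/usr/bin/env python
    # Threshold-based compaction trigger sketch.  Assumes the proposed
    # "data_size" field is reported by GET /dbname; skips databases where
    # it is absent.  Host and threshold are illustrative placeholders.
    import requests

    COUCH = "http://localhost:5984"
    THRESHOLD = 0.5  # compact when less than half the file is live data

    def maybe_compact(dbname):
        info = requests.get("%s/%s" % (COUCH, dbname)).json()
        data_size = info.get("data_size")   # proposed field; may be missing
        disk_size = info.get("disk_size", 0)
        if not data_size or not disk_size:
            return False                    # nothing to estimate with; skip
        if float(data_size) / disk_size < THRESHOLD:
            # Kick off compaction; CouchDB expects a JSON content type here.
            requests.post("%s/%s/_compact" % (COUCH, dbname),
                          headers={"Content-Type": "application/json"})
            return True
        return False

    if __name__ == "__main__":
        for db in requests.get("%s/_all_dbs" % COUCH).json():
            if not db.startswith("_"):      # ignore system databases
                print(db, maybe_compact(db))

Run nightly from cron, something like this would only pay the compaction cost
for databases whose files are mostly dead space, which is the autovacuum-style
behaviour Wayne is after.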
