Re: Database size seems off even after compaction runs.

CGS Tue, 27 Dec 2011 01:55:59 -0800

I understand your confusion about the documentation, but I can say that this behavior is documented. If you read the official documentation carefully, you will see the example with the bank in which, if one transaction is not successful, the transaction is not erased but updated. That is, in this case, if one deletes a document, it's just an update of the old document (i.e., giving it a new revision and marking it as deleted). So, in other words, "nothing is lost, everything is transformed." There is no point in being concerned about security here, because if you use the HTTP method GET on the document, CouchDB will return JSON of the form {"error":"not_found","reason":"deleted"} (compared with a non-existing document, which is reported with the appropriate reason, "document does not exist" or so - I don't remember the message now, but I know it's quite meaningful). Even in the case of somebody breaking into your server and obtaining the file or the admin password, after compaction all the previous versions of the document are no longer there, so no data can be extracted from there.
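
To illustrate the difference between the two GET responses, here is a minimal Python sketch that classifies the JSON body of a CouchDB "not_found" reply. The function name classify_not_found is hypothetical, and the reason string "missing" for a document that never existed is an assumption about what CouchDB returns, not something stated in this thread.

```python
import json


def classify_not_found(body: str) -> str:
    """Classify a CouchDB error body as 'deleted', 'missing', or 'other'."""
    payload = json.loads(body)
    if payload.get("error") != "not_found":
        return "other"
    reason = payload.get("reason")
    if reason == "deleted":
        # The document existed once and now has a deletion tombstone.
        return "deleted"
    if reason == "missing":
        # Assumed reason string for a document id that was never created.
        return "missing"
    return "other"
```

A caller would pass the raw response body of a failed GET, for example classify_not_found('{"error":"not_found","reason":"deleted"}').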


Nevertheless, as was said here, there are a few distinct cases:

1. Using the DELETE method from HTTP. That will ensure the minimum amount of data is written to the hard disk.
2. Using "_deleted":true in combination with HTTP PUT/POST. If no other data are added to the document while sending the request, it has the same effect as the first point.
3. Emptying the document. This will reduce the document size even more, but it will not allow you to reuse the document unless you provide the correct revision of the document (in the other options, no revision is required).

My point in enumerating these options is related to their usage. If you can afford one HTTP request at a time, then using DELETE is probably the best option. But, in many cases, that is a luxury you cannot afford because of the hard disk write speed limitation. In most cases, you would like to use bulk operations. That means buffering your data. At that point, option 1 is no longer available.
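
The buffered, bulk variant of option 2 can be sketched as follows. This is only an illustration: build_bulk_delete is a hypothetical helper, and it assumes you have already buffered the (id, revision) pairs of the documents to delete. The resulting body is meant to be sent with POST to the database's _bulk_docs endpoint.

```python
import json


def build_bulk_delete(docs):
    """Build a _bulk_docs request body that marks each document as deleted.

    `docs` is an iterable of (doc_id, rev) pairs collected by the caller.
    Each stub carries only _id, _rev and _deleted, so no other data is
    written along with the tombstone.
    """
    stubs = [
        {"_id": doc_id, "_rev": rev, "_deleted": True}
        for doc_id, rev in docs
    ]
    return json.dumps({"docs": stubs})
```

For example, build_bulk_delete([("a", "1-abc"), ("b", "2-def")]) yields one JSON body that deletes both documents in a single request.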

As you can see, each of the options has its own advantages, but also disadvantages/limitations. But that is another story already.


This choice of behavior has two major pros:

1. History. If you delete a document which you need later on, the undo action can be done easily by reverting the document revision to the previous one (provided that no compaction was triggered in between the two actions).
2. Hard disk write speed optimization. If you delete a document and you want to reuse the name later, then in the case of the pointer toward the document being simply deleted, you would have to trigger a compaction to avoid a document name conflict. And that is a much slower process than just updating a document.

The only way to delete a document completely is to re-create the physical file containing the database. But if this is more annoying than a few extra bytes per document, then leave the "tombstone" there. If neither of the previously mentioned options is convenient for your project, then CouchDB may not be what you need (I am not discouraging people from using CouchDB, only stating the fact that there is no gain without pain, and using CouchDB is quite a gain in my opinion). Nevertheless, keep in mind that there is a way to reclaim the physical space kept by the deleted documents.


And two more things I would like to clarify from my previous messages:

1. "making the document unavailable" meant that HTTP GET will return "error" when trying to access a deleted document;
2. when I was speaking about my design for the given case, I stated that there are limitations in the specified design (e.g., the race condition and how often you can trigger such a switch), so one can invent another design based on the information (as I said before) that deleting a document completely can be done only by re-creating the database while filtering out the deleted documents (e.g., no "crazy storage blowouts" if you use a round-robin on all your databases, just the temporary inconvenience of adding some extra space to your server system - PC, cluster... - while you perform the space reclaiming procedure).
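
One possible way to re-create a database while filtering out the deleted documents is a filtered replication into a fresh database. The sketch below only builds the two JSON documents involved. The design document name "cleanup", the filter name "no_deleted", and both helper functions are hypothetical. Whether this fits a given setup (for instance, how it interacts with other replication peers) is not covered here.

```python
# Hypothetical names, chosen for illustration only.
DESIGN_DOC_NAME = "cleanup"
FILTER_NAME = "no_deleted"


def build_filter_design_doc():
    """Design document with a replication filter that skips tombstones."""
    return {
        "_id": "_design/" + DESIGN_DOC_NAME,
        "filters": {
            # JavaScript filter function: keep only non-deleted documents.
            FILTER_NAME: "function(doc, req) { return !doc._deleted; }"
        },
    }


def build_replication_request(source, target):
    """Body for a POST to /_replicate that copies `source` into a new `target`."""
    return {
        "source": source,
        "target": target,
        "create_target": True,
        # Filters are referenced as "<design doc name>/<filter name>".
        "filter": DESIGN_DOC_NAME + "/" + FILTER_NAME,
    }
```

The design document would first be saved in the source database. The replication body is then posted to /_replicate, and clients are switched to the target once it has caught up.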


CGS






On 12/25/2011 01:10 AM, Daniel Bryan wrote:

I understand if this is necessary for eventual consistency, but shouldn't
this be better-documented? I generally expected that if I delete sensitive
or unwanted data, or that a user requests that their personal or private
data be deleted, it'll be deleted in a way that's more solid than basically
hiding it. Sure, CouchDB won't let you get at that document, but it's
certainly still there on the disk, and presumably detectable if you
inspected the data structure that holds individual documents. Not a very
good situation vis a vis security. I know that normal unix "deletion"
leaves files technically on disk, but there are ways to allow for that and
prevent it from being an issue.

Even setting data security aside, I've been using CouchDB as a kind of
staging environment for large amounts of data which should ultimately be
elsewhere (different flavours of relational databases, databases belonging to
different organisations, etc.) because it's really easy to implement as an
interface and let people just throw whatever they want into it with a POST.
It's really the perfect tool for that, but pretty soon there'll be tens of
gigabytes a day of data flowing through the system, and most of it just
needs to be indexed for a while before our scheduled scripts pull it all
out, shove it elsewhere and delete it. In this use case, if I'm
understanding this correctly, we'll get crazy storage blowouts unless we
implement a bunch of hacks to switch to new databases after performing
deletions (as well as scripts that make our HTTP reverse proxy
transparently and intelligently route data to the new database - absolutely
not a trivial task in any complex system with many moving parts).

But you know, this all comes with the territory. If the devs say there's a
good reason for documents to stick around after deletion, I believe them,
but I think that's a pretty huge point and I don't know how I've missed it.

What's the way to delete a document if I actually want to really delete the
data? Changing it to a blank document before deleting, and then compacting?

On Sat, Dec 24, 2011 at 2:37 PM, Jens Alfke wrote:

On Dec 23, 2011, at 4:09 PM, Mark Hahn wrote:

1) How exactly could you make this switch without interrupting service?

Replicate database to new db, then atomically switch your proxy or
whatever to the new db from the old one.
Depending on how long the replication takes, there’s a race condition here
where changes made to the old db during the replication won’t be propagated
to the new one; you could either repeat the process incrementally until
this doesn’t happen, or else put the db into read-only mode while you’re
doing the copy.

This might also be helpful: http://tinyurl.com/89lr3fl

2) Wouldn't this procedure create the exact same eventual consistency
problems that deleting documents in a db would?

No; what’s necessary is the revision tree, and the replication will
preserve that. You’re just losing the contents of the deleted revisions
that accidentally got left behind because of the weird way the documents
were deleted.

—Jens
