I understand your confusion about the documentation, but this behavior is documented. If you read the official documentation carefully, you will see the bank example, in which a transaction that does not succeed is not erased but updated. The same applies here: deleting a document is just an update of the old document (i.e., it gets a new revision and is marked as deleted). In other words, "nothing is lost, everything is transformed." There is no reason to be concerned about security here, because if you use HTTP GET on the document, CouchDB will return JSON of the form {"error":"not_found","reason":"deleted"} (a non-existing document is reported with its own reason, "document does not exist" or so - I don't remember the exact message now, but I know it is quite meaningful). Even if somebody breaks into your server and obtains the database file or the admin password, after compaction all the previous versions of the document are no longer there, so no data can be extracted from them.
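
To make the difference between the two answers concrete, here is a minimal Python sketch that classifies the body of a failed GET. The function name `classify_not_found` is made up for this example, and the "missing" reason string for a never-created document is my assumption about the wording, so check it against your own server.

```python
import json

def classify_not_found(body: str) -> str:
    """Classify the JSON error body returned for a failed document GET."""
    payload = json.loads(body)
    if payload.get("error") != "not_found":
        return "other"
    # "deleted" is the reason quoted above for a document with a tombstone.
    if payload.get("reason") == "deleted":
        return "deleted"
    # Any other reason (assumed to be "missing") is treated as "never existed".
    return "never existed"
```

The caller only sees the error body, never the old content, which is what "making the document unavailable" amounts to.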

Nevertheless, as was said here, there are a few distinct cases:
1. Using the HTTP DELETE method. That ensures the minimum amount of data is written to the hard disk.
2. Using "_deleted":true in combination with HTTP PUT/POST. If no other data are added to the document in the request, it has the same effect as the first option.
3. Emptying the document. This will reduce the document size even more, but it will not allow you to reuse the document unless you provide the correct revision of the document (with the other options, no revision is required).
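
The three cases above can be sketched as request shapes. This is only an illustration: the function name `deletion_requests` is hypothetical, nothing is sent over the network, and the sketch passes the current revision in every request on the assumption that the server wants it in order to accept the change.

```python
import json
from urllib.parse import quote

def deletion_requests(db: str, doc_id: str, rev: str) -> dict:
    """Return the three request shapes as (method, path, body) tuples."""
    path = f"/{quote(db, safe='')}/{quote(doc_id, safe='')}"
    return {
        # 1. Plain HTTP DELETE; the revision travels in the query string.
        "delete": ("DELETE", f"{path}?rev={quote(rev, safe='')}", None),
        # 2. An update whose only content is the deletion flag.
        "flag": ("PUT", path, json.dumps({"_rev": rev, "_deleted": True})),
        # 3. An update that empties the document but does not mark it deleted.
        "empty": ("PUT", path, json.dumps({"_rev": rev})),
    }
```

Note that only the first two leave a tombstone; the third leaves an ordinary (empty) document behind.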

My point in enumerating these options is about their usage. If you can afford one HTTP request at a time, then using DELETE is probably the best option. But in many cases that is a luxury you cannot afford because of the hard disk write speed limitation. In most cases, you would like to use bulk operations, which means buffering your data. At that point, option 1 is no longer available.
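
For the buffered case, a minimal sketch of option 2 in bulk form could look like the following. The helper name `bulk_delete_body` is made up; it only builds the JSON body that one would POST to the database's `_bulk_docs` endpoint, and it assumes every buffered document already carries its `_id` and current `_rev`.

```python
import json

def bulk_delete_body(docs: list) -> str:
    """Build a _bulk_docs request body that deletes every buffered document."""
    tombstones = [
        # Keep only the identifiers and the deletion flag; all other
        # fields are dropped so the tombstone stays as small as possible.
        {"_id": d["_id"], "_rev": d["_rev"], "_deleted": True}
        for d in docs
    ]
    return json.dumps({"docs": tombstones})
```

Dropping the other fields matters: if the old content is sent along with "_deleted":true, it ends up stored in the deleted revision.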

As you can see, each of the options has its own advantages, but also disadvantages/limitations. But that is another story already.

This choice of behavior has two major pros:
1. History. If you delete a document that you need later on, the deletion can easily be undone by reverting the document to its previous revision (provided that no compaction was triggered between the two actions).
2. Hard disk write speed optimization. If you delete a document and want to reuse its name later, and the pointer to the document were simply removed, you would have to trigger a compaction to avoid a document name conflict. That is a much slower process than just updating a document.

The only way to delete a document completely is to re-create the physical file containing the database. But if that is more annoying than a few extra bytes per document, then leave the "tombstone" there. If neither of these options is convenient for your project, then CouchDB may not be what you need (I am not discouraging people from using CouchDB, only stating that there is no gain without pain, and using CouchDB is quite a gain in my opinion). Nevertheless, keep in mind that there is a way to reclaim the physical space occupied by the deleted documents.

And two more things I would like to clarify from my previous messages:
1. "making the document unavailable" meant that HTTP GET will return "error" when you try to access a deleted document;
2. when I was speaking about my design for the given case, I stated that the design has limitations (e.g., the race condition and how often you can trigger such a switch), so one can invent another design based on the information (as I said before) that a document can be deleted completely only by re-creating the database while filtering out the deleted documents (e.g., no "crazy storage blowouts" if you use a round-robin on all your databases, just the temporary inconvenience of adding some extra space to your server system - PC, cluster... - while you perform the space reclaiming procedure).
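
One way to sketch "re-creating the database filtering out the deleted documents" is a filtered replication. The names below (`cleanup`, `live_only`, `replication_body`) are hypothetical, and the code only builds the two JSON documents involved; it assumes the design document is saved in the old database first and the second body is then POSTed to `_replicate`.

```python
import json

# Replication filters are JavaScript functions stored in a design document.
# This one lets a document through only if it is not marked as deleted.
FILTER_SOURCE = "function(doc, req) { return !doc._deleted; }"

def filter_design_doc() -> dict:
    """Design document holding the filter (to be saved in the source db)."""
    return {"_id": "_design/cleanup", "filters": {"live_only": FILTER_SOURCE}}

def replication_body(source: str, target: str) -> str:
    """Body for a one-shot replication that skips deleted documents."""
    return json.dumps({
        "source": source,
        "target": target,
        "create_target": True,           # create the new database if needed
        "filter": "cleanup/live_only",   # "<design doc name>/<filter name>"
    })
```

This does not remove the limitations mentioned above: writes that reach the old database during the copy still have to be handled before switching over.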

CGS






On 12/25/2011 01:10 AM, Daniel Bryan wrote:
I understand if this is necessary for eventual consistency, but shouldn't
this be better-documented? I generally expected that if I delete sensitive
or unwanted data, or that a user requests that their personal or private
data be deleted, it'll be deleted in a way that's more solid than basically
hiding it. Sure, CouchDB won't let you get at that document, but it's
certainly still there on the disk, and presumably detectable if you
inspected the data structure that holds individual documents. Not a very
good situation vis a vis security. I know that normal unix "deletion"
leaves files technically on disk, but there are ways to allow for that and
prevent it from being an issue.

Even setting data security aside, I've been using CouchDB as a kind of
staging environment for large amounts of data which should ultimately be
elsewhere (different flavours of relational databases, databases belonging to
different organisations, etc.) because it's really easy to implement as an
interface and let people just throw whatever they want into it with a POST.
It's really the perfect tool for that, but pretty soon there'll be tens of
gigabytes a day of data flowing through the system, and most of it just
needs to be indexed for a while before our scheduled scripts pull it all
out, shove it elsewhere and delete it. In this use case, if I'm
understanding this correctly, we'll get crazy storage blowouts unless we
implement a bunch of hacks to switch to new databases after performing
deletions (as well as scripts that make our HTTP reverse proxy
transparently and intelligently route data to the new database - absolutely
not a trivial task in any complex system with many moving parts).

But you know, this all comes with the territory. If the devs say there's a
good reason for documents to stick around after deletion, I believe them,
but I think that's a pretty huge point and I don't know how I've missed it.

What's the way to delete a document if I actually want to really delete the
data? Changing it to a blank document before deleting, and then compacting?

On Sat, Dec 24, 2011 at 2:37 PM, Jens Alfke <[email protected]> wrote:

On Dec 23, 2011, at 4:09 PM, Mark Hahn wrote:

1) How exactly could you make this switch without interrupting service?
Replicate database to new db, then atomically switch your proxy or
whatever to the new db from the old one.
Depending on how long the replication takes, there’s a race condition here
where changes made to the old db during the replication won’t be propagated
to the new one; you could either repeat the process incrementally until
this doesn’t happen, or else put the db into read-only mode while you’re
doing the copy.

This might also be helpful: http://tinyurl.com/89lr3fl

2) Wouldn't this procedure create the exact same eventual consistency
problems that deleting documents in a db would?
No; what’s necessary is the revision tree, and the replication will
preserve that. You’re just losing the contents of the deleted revisions
that accidentally got left behind because of the weird way the documents
were deleted.

—Jens


