Examining the Proxy and Object Server code, I believe there is a problem when two Proxies attempt to update the same object at exactly the same time (i.e. the two proxies assign identical timestamps to their transactions).
For PUT, POST and DELETE the Object Server will rename the temporary file to <timestamp>.[data|ts] even if a file for <timestamp> already exists. However, in no case will the container update be done unless orig_timestamp is missing or less than the new timestamp.

So a concurrent PUT and DELETE will:

* Result in *both* the .data and the .ts file being created, and *neither* being deleted as "old". Since ".ts" sorts later than ".data", the delete will be effective for subsequent GETs.
* Cause the container to be updated by each Object Server *once*, but different Object Servers may receive the concurrent transactions in different orders. The Container Server will end up with the Object as perceived by the first transaction on the last Object Server (essentially arbitrary).

Even if all three Object Servers perform the two transactions in the same order, the result can be an object that is effectively deleted on the Object Servers but still listed in the Container.

With two concurrent PUTs, the later PUT renames its tempfile to <timestamp>.data, but only the original transaction updates the Container. This can result in different versions of the Object on different servers, and will almost certainly leave the etag held by the Object Server out of sync with the etag held for the Object by the Container Server.

Changing the test "orig_timestamp <" to "orig_timestamp <=" does not really solve the problem; it just makes it harder to catch. While any one Object Server would then be consistent in its interactions with the Container Server, you could still have two Proxy Servers submit updates to Object X with Timestamp Y and have *both* succeed, with some of the Object Servers holding the object as put by Proxy A and others holding it as put by Proxy B.

I believe the intent was for the Auditor to catch this by comparing the etag for each Object in the Container DB with the actual etag.
But I can find no code that references the etag in the Container DB. The Object Auditor compares the calculated MD5 against the etag stored as metadata for that file.

If the Auditor were to cross-validate the etag AND the check in the Object Servers were changed from "orig_timestamp <" to "orig_timestamp <=", then the result would be eventual consistency. There would be a period of reduced resiliency before the incorrectly updated Object Servers were repaired by the Auditor, but given the extremely low frequency of identical timestamps this would probably be acceptable.

However, such a solution would still leave a problem. At most one file can be <timestamp>.data. If the "retain old versions" option is enabled, the older of the two timestamp X versions cannot be retained. If Swift were used as a document retention system this would be very undesirable: if a PUT is successful, the version put should be retained even if it is not the most recent version. Supporting full versioning for clients would be a relatively easy enhancement for Swift, but not if at most one revision at time X can be retained.

A better solution would be to ensure that there can only be one <timestamp>.<extension> file created, which involves recognizing that the timestamp is being used as an increasing version number (albeit not a monotonically increasing version number). Without relying on a single server for a given object, the best way to create a unique version number would be to extend <timestamp> with something that would differ for two Proxy servers putting two different versions of the Object at the same instant:

* A proxy ID. This would preferably be a short configured ID, but any IP address assigned to the Proxy server could serve as a unique extension.
* The md5 hash of the payload, or potentially n bits of it.
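
As a rough illustration, the PUT/DELETE and PUT/PUT cases described above, and the two candidate extensions, can be put into a toy model. This is a sketch under stated assumptions, not Swift's actual code: `ToyObjectServer`, `plain_name` and `extended_name` are hypothetical names, the server only mimics the "rename regardless, update the container only if orig_timestamp is missing or older" behaviour, and the `_` separator and the 8-hex-digit MD5 prefix are arbitrary choices made for the example.

```python
import hashlib

TS = "1300000000.00000"  # one timestamp issued by two proxies at the same instant


def plain_name(timestamp, ext):
    # Naming as described above: <timestamp>.<ext>
    return "%s.%s" % (timestamp, ext)


def extended_name(timestamp, ext, proxy_id=None, payload=b""):
    # Hypothetical naming: <timestamp>_<suffix>.<ext>. The suffix is the proxy
    # ID if one is given, else the first 8 hex digits of the payload's MD5.
    if proxy_id is not None:
        suffix = proxy_id
    else:
        suffix = hashlib.md5(payload).hexdigest()[:8]
    return "%s_%s.%s" % (timestamp, suffix, ext)


class ToyObjectServer:
    """Toy stand-in for one Object Server (an assumption, not Swift code)."""

    def __init__(self):
        self.files = {}              # file name -> payload
        self.orig_timestamp = None   # timestamp of the last accepted update
        self.container_updates = []  # (timestamp, extension) sent to the container

    def write(self, name, timestamp, ext, payload=b""):
        # The rename happens even if a file with this name already exists.
        self.files[name] = payload
        # The container is updated only for a strictly newer timestamp.
        if self.orig_timestamp is None or self.orig_timestamp < timestamp:
            self.container_updates.append((timestamp, ext))
            self.orig_timestamp = timestamp


# Concurrent PUT and DELETE with the same timestamp, plain names.
srv = ToyObjectServer()
srv.write(plain_name(TS, "data"), TS, "data", b"from proxy A")
srv.write(plain_name(TS, "ts"), TS, "ts")
print(sorted(srv.files))      # both files remain; the .ts one sorts last
print(srv.container_updates)  # only the PUT reached the container

# Two concurrent PUTs with the same timestamp, plain names.
srv2 = ToyObjectServer()
name = plain_name(TS, "data")
srv2.write(name, TS, "data", b"from proxy A")
srv2.write(name, TS, "data", b"from proxy B")
print(srv2.files[name], srv2.container_updates)  # B's data, A's container update

# With an extension, the two writers no longer share a file name.
names = [
    extended_name(TS, "data", proxy_id="proxy-b"),
    plain_name(TS, "data"),
    extended_name(TS, "data", proxy_id="proxy-a"),
]
print(sorted(names))  # the unextended name sorts first
```

The `_` separator was picked because it sorts after `.` in ASCII, so in a plain string sort the extended names land after the unextended name for the same base timestamp. A different separator (for example `-`, which sorts before `.`) would reverse that order.
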
My reading of the current code is that either extension would co-exist with the current unextended timestamps; the extended names would just sort later than the unextended name with the identical base timestamp. No change would be required to the GET/HEAD logic, only to PUT/POST/DELETE.

_______________________________________________
Mailing list: https://launchpad.net/~openstack
Post to     : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp