S3 emits different ETags for single- and multi-part uploads. You can use both types of ETag for future conditional GET and PUT operations, but only a single-part upload returns an MD5 hash of the content. A multi-part upload returns an opaque token which is likely a hash of the per-part hashes combined with the number of parts.
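
As a rough illustration of the two ETag shapes, here is a minimal Java sketch. It assumes the commonly observed scheme in which the multi-part ETag is the MD5 of the concatenated binary per-part MD5s, followed by "-" and the part count; I would not treat that as a guarantee from S3. The class and method names are made up for this example and are not part of jclouds.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a jclouds class.
final class MultipartETag {

    // MD5 is a mandatory JDK algorithm, so a missing provider is a broken runtime.
    private static MessageDigest md5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lowercase hex encoding of a digest.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Single-part upload: the ETag is the plain MD5 of the whole payload.
    static String singlePartETag(byte[] data) {
        return hex(md5().digest(data));
    }

    // ASSUMPTION: multi-part ETag = hex(MD5(md5(part1) || md5(part2) || ...)) + "-" + partCount.
    static String multipartETag(List<byte[]> parts) {
        MessageDigest outer = md5();
        for (byte[] part : parts) {
            outer.update(md5().digest(part));
        }
        return hex(outer.digest()) + "-" + parts.size();
    }

    // True only for a bare 32-digit hex string, i.e. something that can be a whole-payload MD5.
    static boolean isPlainMd5(String etag) {
        return etag.matches("[0-9a-fA-F]{32}");
    }

    public static void main(String[] args) {
        byte[] a = "first part".getBytes(StandardCharsets.UTF_8);
        byte[] b = "second part".getBytes(StandardCharsets.UTF_8);
        System.out.println(singlePartETag(a));
        System.out.println(multipartETag(Arrays.asList(a, b)));
    }
}
```

A caller that knows the part boundaries used for the upload could compare the result of multipartETag against the returned value, and isPlainMd5 tells it whether a direct whole-payload MD5 comparison applies at all.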

You can ensure data integrity in transit by comparing the ETag or by providing a Content-MD5 for single-part uploads. Multi-part is more complicated; each upload-part call can carry a Content-MD5, and each call returns the MD5 hash of that part. jclouds supplies the per-part ETag hashes to the final complete multi-part upload call, but it does not provide a way to check the results of the per-part calls or a way to supply a Content-MD5 for each part.

Fixing this requires calculating the MD5 in BaseBlobStore.putMultipartBlob. We could either calculate it beforehand for repeatable Payloads or compare afterwards for InputStream payloads. There is some subtlety to this for providers like Azure, which do not return an MD5 ETag. We would likely want to guard this with a property, since not every caller wants to pay the CPU overhead. Would you like to take a look at this?

If you want a purely application-level fix, look at calling the BlobStore methods initiateMultipartUpload, uploadMultipartPart, and completeMultipartUpload. jclouds internally uses these to implement putBlob when called with new PutOptions().multipart().

On Tue, Sep 22, 2015 at 05:10:18PM +0200, Veit Guna wrote:
> Hi.
>
> We're using jclouds 1.9.1 with the aws-s3 provider. Until now, we have used
> the returned etag of blobStore.putBlob() to manually verify
> against a client provided hash. That worked quite well for us. But since we
> are hitting the 5GB limit of S3, we switched to the multipart() upload
> that jclouds offers. But now, putBlob() returns something like
> <md5-hash>-<number> e.g. 90644a2d0c7b74483f8d2036f3e29fc5-2 that of course
> fails with our validation.
>
> I guess this is due to the fact, that each chunk is hashed separately and
> sent to S3. So there is no complete hash over the whole payload that could
> be returned by putBlob() - is that correct?
>
> During my research I stumbled across this:
>
> https://github.com/jclouds/jclouds/commit/f2d897d9774c2c0225c199c7f2f46971637327d6
>
> Now I'm wondering, what the contract of putBlob() is. Should it only return
> valid etag/hashes otherwise return null?
>
> I'm asking that, because otherwise, I would have to start parsing and
> validating the returned value by myself and skip any
> validation when it isn't a normal md5 hash. My guess is, that this is the
> hash from the last transferred chunk plus
> the chunk number?
>
> Maybe someone can shed some light on this :).
>
> Thanks
> Veit
>

-- 
Andrew Gaul
http://gaul.org/
