Hi Andrew. Thanks for the detailed explanation.
I think an option sounds like the way to go, although I've never checked how
expensive the hash calculation is. Maybe I'll run some benchmarks for that.

Anyway, if jclouds calculated the complete-payload MD5 itself where
necessary, the contract could be kept, even when using multipart. That would
be in addition to checking the hashes returned by the providers.

Maybe I'll find some time to look into this myself.

Cheers
Veit

On 29.09.2015 at 21:01, Andrew Gaul wrote:
> S3 emits different ETags for single- and multi-part uploads.  You can
> use both types of ETags for future conditional GET and PUT operations
> but only single-part upload returns an MD5 hash.  Multi-part upload
> returns an opaque token which is likely a hash of hashes combined with
> number of parts.
>
> You can ensure data integrity in-transit via comparing the ETag or via
> providing a Content-MD5 for single-part uploads.  Multi-part is more
> complicated; each upload part call can have a Content-MD5 and each call
> returns the MD5 hash.  jclouds supplies the per-part ETag hashes to the
> final complete multi-part upload call but does not provide a way to
> check the results of per-part calls or a way to supply a Content-MD5 for
> each.
>
> Fixing this requires calculating the MD5 in
> BaseBlobStore.putMultipartBlob.  We could either calculate it beforehand
> for repeatable Payloads or compare afterwards for InputStream payloads.
> There is some subtlety to this for providers like Azure which do not
> return an MD5 ETag.  We would likely want to guard this with a property
> since not every caller wants to pay the CPU overhead.  Would you like to
> take a look at this?
>
> If you want a purely application fix, look at calling the BlobStore
> methods initiateMultipartUpload, uploadMultipartPart, and
> completeMultipartUpload.  jclouds internally uses these to implement
> putBlob(PutOptions.Builder.multipart()).
>
> On Tue, Sep 22, 2015 at 05:10:18PM +0200, Veit Guna wrote:
>> Hi.
>>
>> We're using jclouds 1.9.1 with the aws-s3 provider. Until now, we have
>> used the returned etag of blobStore.putBlob() to manually verify
>> against a client-provided hash. That worked quite well for us. But
>> since we are hitting the 5GB limit of S3, we switched to the
>> multipart() upload that jclouds offers. But now, putBlob() returns
>> something like <md5-hash>-<number>, e.g.
>> 90644a2d0c7b74483f8d2036f3e29fc5-2, which of course fails our
>> validation.
>>
>> I guess this is due to the fact that each chunk is hashed separately
>> and sent to S3. So there is no complete hash over the whole payload
>> that could be returned by putBlob() - is that correct?
>>
>> During my research I stumbled across this:
>>
>> https://github.com/jclouds/jclouds/commit/f2d897d9774c2c0225c199c7f2f46971637327d6
>>
>> Now I'm wondering what the contract of putBlob() is. Should it only
>> return valid etags/hashes and otherwise return null?
>>
>> I'm asking because otherwise I would have to start parsing and
>> validating the returned value myself and skip any validation when it
>> isn't a normal md5 hash. My guess is that this is the hash from the
>> last transferred chunk plus the chunk number?
>>
>> Maybe someone can shed some light on this :).
>>
>> Thanks
>> Veit
>>
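
As a footnote to the ETag discussion in this thread: a minimal Java sketch of how a caller might tell a plain MD5 ETag from a multipart-style one before validating it. The class and method names (EtagSketch, isPlainMd5, isMultipartStyle, hashOfHashes) are hypothetical and not part of jclouds. The scheme in hashOfHashes is only the assumption mentioned above ("likely a hash of hashes combined with number of parts"), not documented S3 behaviour, so treat it as a guess to test against real uploads.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not jclouds API.
final class EtagSketch {
    private static final Pattern PLAIN_MD5 = Pattern.compile("[0-9a-fA-F]{32}");
    private static final Pattern MULTIPART = Pattern.compile("[0-9a-fA-F]{32}-[0-9]+");

    // Some clients return ETags wrapped in double quotes; tolerate both forms.
    static String stripQuotes(String etag) {
        if (etag.length() >= 2 && etag.startsWith("\"") && etag.endsWith("\"")) {
            return etag.substring(1, etag.length() - 1);
        }
        return etag;
    }

    // True when the ETag looks like a bare 32-digit hex MD5.
    static boolean isPlainMd5(String etag) {
        return PLAIN_MD5.matcher(stripQuotes(etag)).matches();
    }

    // True when the ETag looks like <md5-hash>-<number>.
    static boolean isMultipartStyle(String etag) {
        return MULTIPART.matcher(stripQuotes(etag)).matches();
    }

    static byte[] md5(byte[] data) {
        return newMd5().digest(data);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // ASSUMPTION: MD5 over the concatenated binary MD5 digests of the parts,
    // followed by "-" and the part count. This is a guess at the opaque token.
    static String hashOfHashes(List<byte[]> parts) {
        MessageDigest outer = newMd5();
        for (byte[] part : parts) {
            outer.update(md5(part));
        }
        return hex(outer.digest()) + "-" + parts.size();
    }

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With this, validation against a client-provided MD5 would only run when isPlainMd5(etag) is true. For a multipart-style ETag the caller could either skip the check or, if the part boundaries are known, compare against hashOfHashes.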

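
For the "compare afterwards for InputStream payloads" idea, here is a rough sketch of computing the whole-payload MD5 while the stream is cut into parts. It uses only the JDK. PartSink and uploadInParts are hypothetical stand-ins for whatever uploads each part (for example a call to uploadMultipartPart), and nothing here is actual jclouds code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not jclouds API.
final class WholePayloadMd5Sketch {

    // Stand-in for the per-part upload call. The buffer is reused between
    // calls, so an implementation must consume it before returning.
    interface PartSink {
        void accept(int partNumber, byte[] buffer, int length) throws IOException;
    }

    // Reads the stream in parts of at most partSize bytes, hands each part to
    // the sink, and returns the hex MD5 of the complete payload.
    static String uploadInParts(InputStream in, int partSize, PartSink sink) throws IOException {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        // Every byte read through this wrapper also updates the digest.
        DigestInputStream digesting = new DigestInputStream(in, md5);
        byte[] buffer = new byte[partSize];
        int partNumber = 1;
        int filled;
        while ((filled = fill(digesting, buffer)) > 0) {
            sink.accept(partNumber++, buffer, filled);
        }
        return hex(md5.digest());
    }

    // Fills the buffer as far as possible; returns the number of bytes read.
    private static int fill(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

The returned value could then be compared with the client-provided hash regardless of what ETag the provider reports. The cost is one extra MD5 pass over the data, which is the CPU overhead a guarding property would let callers opt out of.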