[openstack-dev] Feedback about Swift API - Especially about Large Objects

2015-10-09 Thread Pierre SOUCHAY
Hi Swift Developers,

We have been using Swift as an IaaS provider for more than two years now, but 
this mail is about feedback on the API side. I think it would be great to 
include some of these ideas in future revisions of the API.

I’ve been developing a few Swift clients: in HTML (in the Cloudwatt Dashboard) 
with CORS, in Java with a Swing GUI 
(https://github.com/pierresouchay/swiftbrowser), and in Go for Swift-to-filesystem 
sync (https://github.com/pierresouchay/swiftsync/), so I now have a few ideas 
about how to improve the API a bit.

The API is quite straightforward and intuitive to use, and writing a client is 
not that difficult, but unfortunately the Large Object support is not easy at 
all to deal with.

The biggest issue is that there is no way to know whether a file is a large 
object when performing listings in JSON format, since, AFAIK, a large object 
is listed as an object of 0 bytes (so its size in bytes is 0), and its hash is 
the hash of a zero-byte file.

For instance, the listing entry for such an object is:
 {"hash": "d41d8cd98f00b204e9800998ecf8427e", "last_modified": 
"2015-06-04T10:23:57.618760", "bytes": 0, "name": "5G", "content_type": 
"octet/stream"}

which is exactly the hash of a 0-byte file:
$ echo -n | md5
d41d8cd98f00b204e9800998ecf8427e

OK, now let’s try a HEAD:
$ curl -vv -XHEAD -H X-Auth-Token:$TOKEN 
'https://storage.fr1.cloudwatt.com/v1/AUTH_61b8fe6dfd0a4ce69f6622ea7e0f/large_files/5G'
…
< HTTP/1.1 200 OK
< Date: Fri, 09 Oct 2015 19:43:09 GMT
< Content-Length: 50
< Accept-Ranges: bytes
< X-Object-Manifest: large_files/5G/.part-50-
< Last-Modified: Thu, 04 Jun 2015 10:16:33 GMT
< Etag: "479517ec4767ca08ed0547dca003d116"
< X-Timestamp: 1433413437.61876
< Content-Type: octet/stream
< X-Trans-Id: txba36522b0b7743d683a5d-00561818cd

WTF? While the ETag header and the listing hash have the same value for all 
regular files, this is not the case for large files…

Furthermore, the ETag is not the MD5 of the whole file, but the hash of the 
hashes of all the manifest’s segments (as described somewhere, hidden deep in 
the documentation).

Why is this a problem?
---

Imagine a « naive » client using the API which performs some kind of sync.

The client downloads each file, and when it syncs it compares the local MD5 to 
the MD5 from the listing… of course, that hash is the hash of a zero-byte 
file… so it downloads the file again… and again… and again. Unfortunately for 
our naive client, this is exactly the kind of file we don’t want to download 
twice… since the file is probably huge (after all, it has been split for a 
reason, no?)
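
To make the failure mode concrete, here is a minimal sketch in Python of that 
naive comparison (the storage URL, container name and token are placeholders, 
and each file is assumed to have already been downloaded once):

import hashlib
import requests

STORAGE_URL = "https://storage.example.com/v1/AUTH_tenant"  # placeholder
CONTAINER = "large_files"                                    # placeholder
TOKEN = "AUTH_tk..."                                         # placeholder

def local_md5(path):
    # MD5 of the copy we already have on disk.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# JSON listing: each entry carries "name", "bytes" and "hash".
listing = requests.get(
    "%s/%s?format=json" % (STORAGE_URL, CONTAINER),
    headers={"X-Auth-Token": TOKEN},
).json()

for entry in listing:
    # For a DLO manifest, "bytes" is 0 and "hash" is the MD5 of an empty
    # file, so this test never matches the local copy and the (huge) object
    # gets downloaded again on every sync.
    if local_md5(entry["name"]) != entry["hash"]:
        print("re-downloading", entry["name"])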

I think this is really a design flaw, since you need to know everything about 
the Swift API and its extensions to behave properly. The minimum would be to 
at least return the same value as the ETag header in the listing.

OK, let’s continue…

We are not so naive… our Swift sync client knows that 0-byte files need more work.

* First issue: we have to know whether the file is a « real » 0-byte file or 
not. You may think most people do not create 0-byte files anyway… that is 
wrong. I have actually seen two Object Storage middlewares using many 0-byte 
files (for instance to store metadata or to set up some kind of directory-like 
structure). So, in this case, we need to perform a HEAD request on each 0-byte 
file. If you have 1000 files like this, you have to perform 1000 HEAD requests 
to finally learn that none of them is a large file. Not very efficient. Your 
Swift sync client took 1 second to sync 20G of data with the naive approach; 
now you need 5 minutes… the hash of a 0-byte file is not a good idea at all.
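
A sketch of the extra round trips this forces on a client (same placeholder 
URL and token convention as above): every 0-byte entry needs its own HEAD just 
to find out whether an X-Object-Manifest header is present.

import requests

def is_dlo_manifest(storage_url, container, name, token):
    # HEAD the object: the presence of X-Object-Manifest is, today, the only
    # way to tell a DLO manifest apart from a genuine 0-byte file.
    resp = requests.head(
        "%s/%s/%s" % (storage_url, container, name),
        headers={"X-Auth-Token": token},
    )
    return "X-Object-Manifest" in resp.headers

def find_manifests(storage_url, container, token):
    # One HEAD per 0-byte entry: 1000 empty objects means 1000 extra requests.
    listing = requests.get(
        "%s/%s?format=json" % (storage_url, container),
        headers={"X-Auth-Token": token},
    ).json()
    return [e["name"] for e in listing
            if e["bytes"] == 0
            and is_dlo_manifest(storage_url, container, e["name"], token)]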

* Second issue: since the hash is the hash of all the parts (I have an idea 
about why this decision was made, probably for performance reasons), your 
client cannot work on whole files, since the hash of the local file is not the 
hash of the aggregated Swift object (which is the hash of all the hashes in 
the manifest). So it means you cannot work on existing data; you have to either:
 - split all the files the same way as the manifest, compute the MD5 of each 
part, then compute the MD5 of the concatenated hashes and compare it to the 
MD5 on the server, as sketched below… (ok… doable, but I gave up on such a system)
 - keep a local database in your client (when you download, store the REAL 
hash of the file, and remember that you have to compare it to the hash 
returned by the server)
 - perform some kind of crappy heuristic (size + grab the starting bytes of 
each part, or something like that…)
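
For that first option, here is a minimal sketch of what the client has to do, 
assuming it knows (or guesses) the segment size that was used when the object 
was uploaded:

import hashlib

def large_object_etag(path, segment_size):
    # Reproduce the ETag Swift reports for a segmented object: the MD5 of the
    # concatenated hex MD5 digests of the parts. This only works if the local
    # file is split exactly like the server-side segments.
    segment_md5s = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(segment_size)
            if not chunk:
                break
            segment_md5s.append(hashlib.md5(chunk).hexdigest())
    return hashlib.md5("".join(segment_md5s).encode("ascii")).hexdigest()

# Hypothetical example: a 5 GiB file uploaded in 1 GiB segments.
# print(large_object_etag("5G", 1024 * 1024 * 1024))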

* Third issue: if you don’t want to store the parts of your object locally, 
you have to wait for all your HEAD requests to finish, since that is the only 
way to discover all the files referenced by your manifest headers.

To summarize, I think the current API really needs some refinement of the 
listings, since a competent developer may trust the bytes and hash values and 
create an algorithm that does not behave correctly.

Re: [openstack-dev] Feedback about Swift API - Especially about Large Objects

2015-10-09 Thread Clay Gerrard
A lot of these deficiencies are drastically improved with static large
objects - and non-trivial to address (impossible?) with DLO's because of
their dynamic nature.  It's unfortunate, but DLO's don't really serve your
use-case very well - and you should find a way to transition to SLO's [1].

We talked about improving the checksumming behavior in SLO's for the
general naive sync case back at the hack-a-thon before the Vancouver summit
- but it's tricky (MD5 => CRC) - and would probably require an API version
bump.

All we've been able to get done so far is improve the native client
handling [2] - but if using SLO's you may find a similar solution quite
manageable.
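
For reference, a rough sketch of what creating an SLO looks like (the storage 
URL, token, segment names, sizes and checksums below are placeholders): the 
manifest explicitly declares each segment's path, ETag and size, so a sync 
client can later fetch it with ?multipart-manifest=get and compare checksums 
per segment instead of guessing.

import json
import requests

STORAGE_URL = "https://storage.example.com/v1/AUTH_tenant"  # placeholder
TOKEN = "AUTH_tk..."                                         # placeholder

manifest = [
    {"path": "/large_files_segments/5G/part-000",
     "etag": "<md5 of part-000>", "size_bytes": 1073741824},
    {"path": "/large_files_segments/5G/part-001",
     "etag": "<md5 of part-001>", "size_bytes": 1073741824},
]

# PUTting the manifest with ?multipart-manifest=put creates the SLO; Swift
# checks each segment's etag and size before accepting the manifest.
resp = requests.put(
    "%s/large_files/5G?multipart-manifest=put" % STORAGE_URL,
    headers={"X-Auth-Token": TOKEN},
    data=json.dumps(manifest),
)
resp.raise_for_status()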

Thanks for the feedback.

-Clay

1.
http://docs-draft.openstack.org/91/219991/7/check/gate-swift-docs/75fb84c//doc/build/html/overview_large_objects.html#module-swift.common.middleware.slo
2.
https://github.com/openstack/python-swiftclient/commit/ff0b3b02f07de341fa9eb81156ac2a0565d85cd4

On Friday, October 9, 2015, Pierre SOUCHAY 
wrote:

> [...]