Re: [S3tools-general] question regarding --no-check-md5

2014-04-14 Thread WagnerOne
Thanks, Matt.

I looked around for the algorithm in the src, but couldn't discern how things 
worked myself. Thank you for the explanation and reasoning behind how it's done.

The aws s3 sync documentation states that it operates like so:

A local file will require uploading if the size of the local file is different 
than the size of the s3 object, the last modified time of the local file is 
newer than the last modified time of the s3 object, or the local file does not 
exist under the specified bucket and prefix.
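Read literally, that decision logic amounts to something like the following. This is my own illustrative Python sketch of the documented rules, not aws-cli's actual code; the function and parameter names are hypothetical.

```python
def needs_upload(local_size, remote_size, local_mtime, remote_mtime,
                 remote_exists):
    """Sketch of the documented 'aws s3 sync' upload decision.

    Times are epoch seconds; sizes are bytes. Upload when the remote
    object is missing, the sizes differ, or the local file is newer
    than the S3 object's last-modified time.
    """
    if not remote_exists:
        return True          # no object under the bucket/prefix
    if local_size != remote_size:
        return True          # sizes differ
    if local_mtime > remote_mtime:
        return True          # local file is newer than the S3 object
    return False             # otherwise, skip
```

Note that none of these checks involve content hashes, which is the crux of the question below.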

Am I right to interpret this as meaning the aws s3 cli always does a HEAD for 
every object, to get that s3 mod date for comparison against the local mod 
date? I realize an md5 comparison is ultimately preferable, and the aws s3 cli 
doesn't offer that.

It's good to know how these two tools (what I've come to think of as the two 
main cli tools for s3) treat and use that s3 mod date differently.

Mike

On Apr 13, 2014, at 10:52 PM, Matt Domsch m...@domsch.com wrote:

 The algorithm is in S3/FileLists.py compare_filelists().
 
 Check if one side has a file (by name) the other doesn't.  If so, there's 
 nothing to compare.
 
 Check that both files have the same size (as reported by stat() for local 
 files, and in the remote directory listing).
 
 If checking MD5:
   calculate or get the MD5, compare the two values
 
 Date isn't actually compared, because the XML returned by the object listing 
 from S3 doesn't contain the original file date, only the Last-Modified 
 value, which is when the file was uploaded to S3.
 
 Date (really, ctime, mtime, atime) as obtained from local files, when 
 --preserve is used (the default), _is_ stored in the x-amz-meta-s3cmd-attrs 
 metadata value for an object.  But getting this value back from S3 requires 
 doing a HEAD on the object itself, for every object, which is really 
 expensive.  So we don't do that, unless we are comparing MD5s and the MD5 
 (really, ETag) value returned in the directory listing indicates the file 
 was uploaded using multipart upload.  In that case the ETag value isn't the 
 MD5 of the whole file, but only of the last chunk of the file committed to 
 disk (not necessarily even the MD5 of the last chunk of the file), so we 
 don't, in general, get a usable value from the listing.
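A condensed sketch of that decision flow might look like this. This is my own illustrative Python, not the actual compare_filelists() code in S3/FileLists.py; in particular, treating an ETag that contains a dash as the multipart marker is an assumption on my part.

```python
def is_multipart_etag(etag):
    # Multipart-uploaded objects carry an ETag of the form "<hex>-<parts>",
    # which is not the MD5 of the whole file (assumption: dash == multipart).
    return "-" in etag.strip('"')

def files_differ(local_size, remote_size, local_md5=None, remote_etag=None,
                 check_md5=True):
    """Sketch of the comparison described above.

    Returns True (differ), False (same), or None when the listing's
    ETag can't answer the question (multipart upload: a HEAD would be
    needed to fetch stored attributes instead).
    """
    if local_size != remote_size:
        return True                        # size mismatch settles it
    if check_md5 and local_md5 is not None and remote_etag is not None:
        if is_multipart_etag(remote_etag):
            return None                    # undecidable from listing alone
        return local_md5 != remote_etag.strip('"')
    return False                           # same size, MD5 check skipped
```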
 
 Now, if we are syncing from remote to local, we get the x-amz-meta-s3cmd-attrs 
 value for free as a header when we do the GET for the object, so we do 
 use it to set the local values back to what was originally stored there.
 
 So the manpage is correct: date is not used in the comparison for syncing 
 purposes.  One could argue that the expensive HEAD call is still cheaper than 
 calculating the local MD5 of a file, but we can mitigate the local expense 
 with the --cache-file mechanism, such that we only read the local file once 
 and then read its MD5 out of the cache until it changes, so the HEAD isn't 
 cheaper in general.
 
 To detect a change to a file whose size hasn't changed, but its content has, 
 we have to do the HEAD call, and calculate the MD5 of the local file (and use 
 --cache-file to record that for posterity), and compare.
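The caching idea can be sketched like this. This is an illustrative stand-in for what --cache-file provides, not s3cmd's actual implementation; keying the cache on (path, size, mtime) is my assumption about when a cached MD5 can be trusted.

```python
import hashlib
import os

# Hypothetical in-memory stand-in for s3cmd's --cache-file store.
_md5_cache = {}  # (path, size, mtime) -> hex md5

def cached_md5(path):
    """Return the file's MD5, recomputing only when size/mtime change."""
    st = os.stat(path)
    key = (path, st.st_size, st.st_mtime)
    if key not in _md5_cache:
        h = hashlib.md5()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files don't consume RAM.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        _md5_cache[key] = h.hexdigest()
    return _md5_cache[key]
```

On a second sync run over unchanged files, every MD5 comes out of the cache, so the local side costs only a stat() per file.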
 
 
 
 
 On Sun, Apr 13, 2014 at 5:08 PM, WagnerOne wag...@wagnerone.com wrote:
 Hi,
 
 The man page states the following:
 
 --no-check-md5
Do not check MD5 sums when comparing files for [sync].  Only size will be 
 compared. May significantly speed up transfer but may also miss some changed 
 files.
 
 When this says only size will be compared, I take it to mean that out of 
 size and md5, only size will be compared?
 
 Date (and name, of course) is still used in addition to size if 
 --no-check-md5 is passed?
 
 Thanks,
 Mike
 
 --
 wag...@wagnerone.com
 I have no complaints, ever, about anything.-Steve McQueen
 
 
 
 ___
 S3tools-general mailing list
 S3tools-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/s3tools-general
 
 

-- 
wag...@wagnerone.com
Always consider the possibility your assumptions are wrong.-Wheel of Time




[S3tools-general] aws and s3cmd - huge object count syncs, mod dates

2014-04-14 Thread WagnerOne
I've struggled with some huge object count transfers and went back and forth 
between aws s3 cli and s3cmd.

The aws s3 cli seems to edge s3cmd out on speed and RAM consumption when doing 
huge object count transfers. However, the aws s3 cli seems to choke on large 
object counts, and s3cmd offers so much more in terms of feedback and options 
that I hate to ever resort to the aws s3 cli.

I encountered a bug with the aws s3 cli in that it ceased to sync some dirs 
fully, for reasons yet unknown to me. From posts on the aws dev forums, it 
appears others have the same problem with aws s3 sync crapping out. It has 
something to do with high object counts and/or series of subdirs with similar 
names. I'm going to revisit that when I have time.

Matt's modification of the s3cmd --no-check-md5 option to not generate md5s on 
source files (and thus provide a speed boost) helped me be able to switch back. 
I increased the RAM in my s3cmd migration hosts and went back to s3cmd.

Based on my misperception that s3cmd wrote the source object's mod date to the 
s3 mod date, I turned on s3cmd --no-preserve to get s3cmd to mimic the aws s3 
cli (and Amazon Import).

Neither the aws s3 cli nor Import maintains the source object's creation date 
on the s3 side. I thought s3cmd did by default, but I was wrong; further tests 
indicate it doesn't either. Since none of these tools can do this, is it an 
Amazon-side restriction preventing it?

It is perplexing to me why that option is unavailable, as preserving dates 
seems like common practice when using rsync to do data migrations. I know this 
is not apples to apples, but my comparison point (as I expect is not uncommon) 
is years of using rsync.

At any rate, my s3cmd runs (even with 16 GB of RAM in my hosts) are 
occasionally terminated by the kernel for RAM over-consumption. I have to not 
be greedy on source dir selections. :) I expect my use is an edge case. All 
this said, s3cmd does a great job.

Thank you to those of you who develop and maintain it.

Mike

-- 
wag...@wagnerone.com
The advantage of a bad memory is that one enjoys several times the same good 
things for the first time.-Nietzsche





Re: [S3tools-general] question regarding --no-check-md5

2014-04-14 Thread Matt Domsch
aws-cli (after a quick perusal of their source code) uses the LastModified
value (set by S3 to be the time the upload of the object occurred) on
objects in S3, which is obtainable from the ListBucket XML without doing
a HEAD call.  Local files use stat.mtime.  They then calculate the
difference between LastModified and stat.mtime, accounting for time zone
differences.

aws-cli then proceeds to compare the two values; if LastModified is newer
than stat.mtime, syncing local-remote is skipped (the remote is newer than
local); likewise on download, if LastModified is older than the local file,
it too is skipped.
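The upload-direction skip rule described above could be sketched as follows. This is my own illustrative Python, not aws-cli's actual code; normalizing both timestamps to UTC epoch seconds is the assumed way of "accounting for time zone differences."

```python
from datetime import timezone

def skip_upload(last_modified_utc, local_mtime_epoch):
    """Sketch: skip syncing local->remote when the S3 object is newer.

    last_modified_utc:  a naive datetime in UTC, as parsed from the
                        ListBucket XML's LastModified field.
    local_mtime_epoch:  the local file's stat.mtime (epoch seconds).
    """
    remote_epoch = last_modified_utc.replace(tzinfo=timezone.utc).timestamp()
    # Remote is at least as new as local: nothing to upload.
    return remote_epoch >= local_mtime_epoch
```

The download direction is the mirror image: skip when LastModified is older than the local file's mtime.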

Neither tool sets LastModified = stat.mtime on upload (nor can they).
s3cmd gets around this by setting stat.mtime into the file's metadata when
--preserve (the default) is used, but then would have to use a HEAD or GET
call to get it back.  s3cmd does update the local on-disk mtime and atime
when downloading (GETting), because we get the header back, and that's free
then.  Likewise, aws-cli sets both mtime and atime = LastModified on
download.

So aside from aws-cli skipping over newer files in destination, I think
their behavior is identical.