Re: [S3tools-general] question regarding --no-check-md5

2014-05-19 Thread WagnerOne
Hi Matt,

I posted in another thread regarding what is used for file comparisons during 
sync if a multipart ETag is encountered and believe you answered it here 
already.

If a multipart ETag is encountered during sync, s3cmd then reverts to date 
comparison, but not the S3 stored Last-Modified date. Rather it uses the date 
values from x-amz-meta-s3cmd-attrs.

I have been using --no-preserve, so I don't have the x-amz-meta-s3cmd-attrs 
values for comparison. I felt I didn't need that extra metadata on my objects 
in S3 (in retrospect, I would likely have been better off keeping it).

What then happens on sync when a multipart ETag is encountered and there is no 
x-amz-meta-s3cmd-attrs for date comparison?

Are we using size only at that point?

Mike


On Apr 13, 2014, at 10:52 PM, Matt Domsch m...@domsch.com wrote:

 The algorithm is in S3/FileLists.py compare_filelists().
 
 Check if one side has a file (by name) the other doesn't.  If so, there's 
 nothing to compare.
 
 Check that both files have the same size (as reported by stat() for local 
 files, and in the remote directory listing)
 
 If checking MD5:
   calculate or get the MD5, compare the two values
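
The steps above can be sketched roughly as follows. This is only an illustration of the described order of checks (name, then size, then MD5 if enabled), not the actual code of compare_filelists(); the function name compare_lists and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical shape for one side of the comparison:
#   {"path/name": {"size": int, "md5": str}, ...}
FileList = Dict[str, Dict[str, object]]


def compare_lists(src: FileList, dst: FileList,
                  check_md5: bool = True) -> Tuple[List[str], List[str], List[str]]:
    """Return (only_in_src, only_in_dst, changed) following the steps above."""
    # Step 1: names present on one side only; nothing to compare for these.
    only_in_src = sorted(set(src) - set(dst))
    only_in_dst = sorted(set(dst) - set(src))

    changed = []
    for name in sorted(set(src) & set(dst)):
        # Step 2: sizes must match.
        if src[name]["size"] != dst[name]["size"]:
            changed.append(name)
        # Step 3: MD5 is compared only when checking is enabled.
        elif check_md5 and src[name]["md5"] != dst[name]["md5"]:
            changed.append(name)
    return only_in_src, only_in_dst, changed
```

With check_md5=False, a file whose content changed but whose size did not is reported as unchanged, which is the "may also miss some changed files" caveat from the man page.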
 
 Date isn't actually compared, because what's in the XML returned by the 
 object listing from S3 doesn't contain the file date, only the 
 Last-Modified header, which is when the file was uploaded to S3.
 
 Date (really, ctime, mtime, atime) as obtained from local files, when 
 --preserve is used (the default) _is_ stored in the x-amz-meta-s3cmd-attrs 
 metadata value for an object.  But getting this value back from S3 requires 
 doing a HEAD on the object itself, for every object, which is really 
 expensive.  So, we don't do that, unless we are comparing MD5s, and the MD5 
 (really, ETag) value returned in the directory listing indicates the file was 
 uploaded using multipart upload, in which case the ETag value isn't the MD5 
 value for the whole file, but only for the last chunk of the file committed 
 to disk (not necessarily even the MD5 of the last chunk of the file).  We 
 don't, in general, get that value.
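
A minimal sketch of how a multipart ETag might be recognized in a listing. It assumes the commonly observed convention that such ETags look like 32 hex digits followed by a dash and a part count; that format is an assumption here rather than something stated in this thread, and is_multipart_etag is a hypothetical helper name.

```python
import re

# Assumption: a multipart ETag looks like "<32 hex digits>-<part count>",
# while a single-part ETag is a bare 32-digit hex MD5.
_MULTIPART_ETAG = re.compile(r"^[0-9a-f]{32}-\d+$")


def is_multipart_etag(etag: str) -> bool:
    """Return True if the ETag cannot be treated as a whole-file MD5."""
    # Listings usually return the ETag wrapped in double quotes.
    return bool(_MULTIPART_ETAG.match(etag.strip('"').lower()))
```

When this returns True, the ETag cannot be compared against a locally computed MD5, so some other source for the checksum or date is needed.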
 
 Now, if we are syncing from remote to local, we get x-amz-meta-s3cmd-attrs 
 value for free as a header when we do the GET to get the object, so we do 
 use it to set the values back to what were originally stored there.
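
As an illustration of restoring times from that header on download, here is a sketch. It assumes the header value is a slash-separated list of key:value pairs (for example "mtime:1397431234/atime:1397431300"); that layout is an assumption, not something stated in this thread, and both function names are hypothetical.

```python
import os
from typing import Dict


def parse_s3cmd_attrs(value: str) -> Dict[str, str]:
    """Parse an assumed 'key:value/key:value' attribute string into a dict."""
    attrs = {}
    for field in value.split("/"):
        if ":" not in field:
            continue  # skip anything that is not a key:value pair
        key, val = field.split(":", 1)
        attrs[key] = val
    return attrs


def restore_times(path: str, attrs: Dict[str, str]) -> None:
    """Set atime/mtime on a downloaded file from parsed attributes, if present."""
    if "mtime" not in attrs:
        return
    mtime = int(attrs["mtime"])
    atime = int(attrs.get("atime", mtime))
    # os.utime takes (atime, mtime); ctime cannot be set this way.
    os.utime(path, (atime, mtime))
```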
 
 So, the manpage is correct, date is not used in the comparison for syncing 
 purposes.  One could argue that the expensive HEAD call is still cheaper than 
 calculating the local MD5 of a file, but we can mitigate the local expense 
 using the --cache-file mechanism such that we only read the local file once 
 and then read its md5 out of the cache until it changes, so the HEAD isn't 
 cheaper in general.
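
The caching idea can be sketched like this. It is an in-memory illustration only, not the --cache-file implementation (which persists to disk); the class name Md5Cache and the choice of (mtime, size) as the change detector are assumptions made for the example.

```python
import hashlib
import os


class Md5Cache:
    """Remember each file's MD5 until its mtime or size changes."""

    def __init__(self):
        self._entries = {}  # path -> ((mtime_ns, size), md5 hex digest)
        self.reads = 0      # how many times a file was actually read

    def md5(self, path: str) -> str:
        st = os.stat(path)
        key = (st.st_mtime_ns, st.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == key:
            return cached[1]  # unchanged since last time: no file read needed

        digest = hashlib.md5()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        self.reads += 1
        self._entries[path] = (key, digest.hexdigest())
        return self._entries[path][1]
```

The local file is read once; later calls only cost a stat() until the file changes.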
 
 To detect a change to a file whose size hasn't changed, but its content has, 
 we have to do the HEAD call, and calculate the MD5 of the local file (and use 
 --cache-file to record that for posterity), and compare.
 
 
 
 
 On Sun, Apr 13, 2014 at 5:08 PM, WagnerOne wag...@wagnerone.com wrote:
 Hi,
 
 The man page states the following:
 
 --no-check-md5
 Do not check MD5 sums when comparing files for [sync].  Only size will be 
 compared. May significantly speed up transfer but may also miss some changed 
 files.
 
 When this says only size will be compared, I take it to mean that, of size 
 and MD5, only size will be compared?
 
 Are date (and name, of course) still used in addition to size if 
 --no-check-md5 is passed?
 
 Thanks,
 Mike
 
 --
 wag...@wagnerone.com
 I have no complaints, ever, about anything.-Steve McQueen
 
 
 
 --
 S3tools-general mailing list
 S3tools-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/s3tools-general
 
 

-- 
wag...@wagnerone.com
I want to hear the man in the suit say that it's all wrong. I want to hear a 
man with millions of dollars say that he hurts people.-Rollins



Re: [S3tools-general] question regarding --no-check-md5

2014-04-23 Thread Matt Domsch
Yes, I believe it is accurate.  It's also fair to ask why we don't follow
suit and compare local mtime with S3 LastModified, and upload if mtime is
newer.

It's been that way since Michal first wrote the initial sync code back in
September 2007.  Doesn't mean it has to stay that way.


On Wed, Apr 23, 2014 at 8:29 AM, WagnerOne wag...@wagnerone.com wrote:

 Thank you, Matt, for inspecting that and for the continued explanation.

 If I have a local and S3 file pair and the local file is modified such
 that size is not modified, but its date is, a sync with aws cli would copy
 that modified local file over the existing s3 counterpart (due to the
 source file having a newer mtime when compared to S3 LastModified), but a
 sync with s3cmd --no-check-md5 (which I unfortunately often have to use)
 would not.

 Is this statement accurate?

 Mike

 On Apr 14, 2014, at 6:51 PM, Matt Domsch m...@domsch.com wrote:

  aws-cli (after a quick perusal of their source code) uses the
 LastModified value (set by S3 to be the time the upload of the object
 occurred) on objects in S3, which is obtainable from the ListBucket XML,
 without doing a HEAD call.  They then go on to calculate the difference
 between LastModified and the local file's stat.mtime, accounting for time
 zone differences.
 
  aws-cli then proceeds to compare the two values; if LastModified is
 newer than stat.mtime, the local-to-remote sync is skipped (the remote is
 newer than local); likewise on download, if LastModified is older than the
 local file, it too is skipped.
 
  Neither tool sets LastModified = stat.mtime on upload (nor can they).
  s3cmd gets around this by setting stat.mtime into the file's metadata when
 --preserve (the default) is used, but then would have to use a HEAD or GET
 call to get it back.  s3cmd does update the local on-disk mtime and atime
 when downloading (GETting), because we get the header back, and that's free
 then.  Likewise, aws-cli sets both mtime and atime = LastModified on
 download.
 
  So aside from aws-cli skipping over newer files in destination, I think
 their behavior is identical.
 
 
 
 
 
 

 --
 wag...@wagnerone.com
 Never fall into the trap of judging that which you don't understand as
 nonsense. That error can destroy you.-Feist








Re: [S3tools-general] question regarding --no-check-md5

2014-04-14 Thread WagnerOne
Thanks, Matt.

I looked around for the algorithm in the src, but couldn't discern how things 
worked myself. Thank you for the explanation and reasoning behind how it's done.

The aws s3 sync documentation states that it operates like so:

A local file will require uploading if the size of the local file is different 
than the size of the s3 object, the last modified time of the local file is 
newer than the last modified time of the s3 object, or the local file does not 
exist under the specified bucket and prefix.

Am I right to think that the aws s3 cli always does a HEAD for every object to 
get that S3 mod date for comparison with the local mod date? I realize the md5 
comparison is always ultimately preferable and the aws s3 cli doesn't offer 
that.

It's good to know how these two tools (what I've come to think of as the two 
main CLI tools for S3) treat and use that S3 mod date differently.

Mike


-- 
wag...@wagnerone.com
Always consider the possibility your assumptions are wrong.-Wheel of Time




Re: [S3tools-general] question regarding --no-check-md5

2014-04-14 Thread Matt Domsch
aws-cli (after a quick perusal of their source code) uses the LastModified
value (set by S3 to be the time the upload of the object occurred) on
objects in S3, which is obtainable from the ListBucket XML, without doing
a HEAD call.  They then go on to calculate the difference between
LastModified and the local file's stat.mtime, accounting for time zone
differences.

aws-cli then proceeds to compare the two values; if LastModified is newer
than stat.mtime, the local-to-remote sync is skipped (the remote is newer than
local); likewise on download, if LastModified is older than the local file,
it too is skipped.
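
The upload side of that comparison might look something like the sketch below. It follows the rule as described in this thread (upload if the object is missing, the sizes differ, or the local mtime is newer than LastModified); it is not aws-cli's code, and needs_upload and its parameters are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def needs_upload(local_size: int, local_mtime: float,
                 remote: Optional[Tuple[int, datetime]]) -> bool:
    """Decide whether a local file should be uploaded.

    local_mtime is seconds since the epoch (as in os.stat().st_mtime).
    remote is None if the object does not exist, otherwise
    (size, last_modified) with last_modified a timezone-aware datetime.
    """
    if remote is None:
        return True  # nothing under that key yet
    remote_size, remote_last_modified = remote
    if local_size != remote_size:
        return True
    # Convert to an aware UTC datetime so time zones cannot skew the result.
    local_dt = datetime.fromtimestamp(local_mtime, tz=timezone.utc)
    return local_dt > remote_last_modified
```

Under this rule a local file that was modified without changing size is still uploaded, because its mtime is newer than LastModified; a size-only comparison would skip it.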

Neither tool sets LastModified = stat.mtime on upload (nor can they).
s3cmd gets around this by setting stat.mtime into the file's metadata when
--preserve (the default) is used, but then would have to use a HEAD or GET
call to get it back.  s3cmd does update the local on-disk mtime and atime
when downloading (GETting), because we get the header back, and that's free
then.  Likewise, aws-cli sets both mtime and atime = LastModified on
download.

So aside from aws-cli skipping over newer files in destination, I think
their behavior is identical.