When I first set up my mirror, I was told "Because the items are release
artifacts, they are never altered or removed, just added", so I
configured my caching around this design. This does not seem to be the
case in general; it wasn't just a one-off problem on 1827131.
Looking at the original 1827131 issue, I was suspicious: rsync ran over
200 times between when I first got the file, and when rsync later
updated the file, so it seemed unlikely that an incomplete file somehow
slipped through. rsync doesn't normally allow this, but even if it did,
it should have been fixed within 5 minutes, not 18 hours later.
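
A rough sanity check on the "over 200 times" figure, as a small Python sketch. The two timestamps are the 1827131 log entries quoted further down the thread; the flat 5-minute interval is a simplifying assumption, since the cron job described there adds up to 220 seconds of random delay.

```python
from datetime import datetime

# Timestamps taken from the rsync log entries for 1827131.tar.gz (both MDT).
first_seen = datetime(2018, 3, 18, 2, 36, 30)
updated = datetime(2018, 3, 18, 20, 47, 39)

# Assumption: one rsync run every 5 minutes, ignoring the random delay.
CRON_INTERVAL_SECONDS = 5 * 60

gap = updated - first_seen
runs = int(gap.total_seconds() // CRON_INTERVAL_SECONDS)

print(f"gap between first write and update: {gap}")
print(f"approximate rsync runs in that window: {runs}")
```

The gap works out to 18:11:09, or roughly 218 scheduled runs, which is consistent with "over 200".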
From logs, I can tell that 1827131 and 1827165 were each written, then
~18 hours later, updated. With 1827165 I was able to capture the
original files including the .sha1, which rules out any sort of
incomplete copy as the .sha1 wouldn't match if there was any sort of
error or truncation during the copy process.
tl;dr:
My concern is this: Is it expected behaviour that the files are first
published, then later an updated version of the same update number is
published?
And second: Given that the scores are modified between these two
versions, what is the impact on SpamAssassin users who obtain the first
vs the second? There are several score differences, some significant.
On 2018-03-20 11:27, Kevin A. McGrail wrote:
Not sure what you mean. Original ticket was about 1827131.tar.gz
--
Kevin A. McGrail
Asst. Treasurer & VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171
On Tue, Mar 20, 2018 at 1:10 PM, Dave Warren <d...@thedave.ca> wrote:
Interestingly, the file really is changing and wasn't just a poorly timed
copy. Check this out:
Date: Mon, 19 Mar 2018 02:38:08 -0600 (MDT), the files were created:
.d..t...... ./
>f+++++++++ 1827165.tar.gz
>f+++++++++ 1827165.tar.gz.asc
>f+++++++++ 1827165.tar.gz.sha1
And the .sha1 hash validates (which obviously wouldn't happen if I had an
incomplete copy).
# sha1sum 1827165.tar.gz;cat 1827165.tar.gz.sha1
a3abb2aad004a3401acfad9167e77b0ca31ef9c4 1827165.tar.gz
a3abb2aad004a3401acfad9167e77b0ca31ef9c4 /usr/local/spamassassin/automc/tmp/stage/3.4.2/update.tgz
Date: Mon, 19 Mar 2018 20:48:17 -0600 (MDT), the files were updated:
>f.st...... 1827165.tar.gz
>f..t...... 1827165.tar.gz.asc
>f.st...... 1827165.tar.gz.sha1
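
For anyone not used to reading rsync's itemized output, here is a minimal sketch of how the change strings in these listings can be decoded. It follows the `--itemize-changes` description in the rsync man page (an 11-character string, e.g. `>f.st......` with the leading `>` as seen in the 1827131 listings further down the thread) and only covers the fields that matter here. The function name and the returned keys are made up for illustration.

```python
def describe_itemize(code: str) -> dict:
    """Decode the leading fields of an rsync --itemize-changes string."""
    if len(code) < 2:
        raise ValueError("itemize string too short")
    update_types = {">": "received", "<": "sent", "c": "created locally",
                    "h": "hard link", ".": "not transferred"}
    file_types = {"f": "file", "d": "directory", "L": "symlink",
                  "D": "device", "S": "special"}
    # After the two type characters come the attribute flags: c, s, t, ...
    flags = code[2:]
    return {
        "update": update_types.get(code[0], "unknown"),
        "type": file_types.get(code[1], "unknown"),
        "new": flags.startswith("+"),           # "+++++++++" marks a new item
        "size_changed": flags[1:2] == "s",
        "time_changed": flags[2:3] in ("t", "T"),
    }
```

For `>f.st......` this reports a received regular file whose size and modification time both changed, which is what a replaced tarball looks like. `>f+++++++++` reports a newly created file.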
And once again, the .sha1 hash validates the new file:
# sha1sum 1827165.tar.gz;cat 1827165.tar.gz.sha1
ea74b1eb682bbb25c2028ffe01a8e20bd1943885 1827165.tar.gz
ea74b1eb682bbb25c2028ffe01a8e20bd1943885 /usr/local/spamassassin/automc/tmp/mkupdate-with-scores/1827165.tar.gz
I don't know if any of this is actually a problem, but it's not what I
expected to see.
If anyone is curious, I placed copies of the files named -first and
-second as appropriate, including uncompressed copies of the .tar.gz files.
The files are here: https://mirrors.razx.cloud/sa-update-backup/1827165/
This is curiosity more than anything else at this stage; I will leave my
caching less aggressive so that updated files are picked up.
On 2018-03-19 13:36, Kevin A. McGrail wrote:
I would guess you caught it mid copy and it arose because of the caching.
Just a guess but glad we know what's going on.
On Mon, Mar 19, 2018, 15:09 Dave Warren <d...@thedave.ca> wrote:
Howdy. I'm on this list.
Okay, so this is a bit odd: it looks like the file 1827131.tar.gz was
actually modified by rsync many hours after the initial write:
Date: Sun, 18 Mar 2018 02:36:30 -0600 (MDT)
.d..t...... ./
>f+++++++++ 1827131.tar.gz
>f+++++++++ 1827131.tar.gz.asc
>f+++++++++ 1827131.tar.gz.sha1
My cron runs every 5 minutes (with up to 220 seconds variability). I see
"MIRROR.CHECK" being updated (at 03:18, 04:21, 05:23, 06:18, etc.),
confirming rsync was running.
1827131.tar.gz is modified just over 18 hours later:
Date: Sun, 18 Mar 2018 20:47:39 -0600 (MDT)
>f.st...... 1827131.tar.gz
>f..t...... 1827131.tar.gz.asc
>f.st...... 1827131.tar.gz.sha1
I was under the impression that the *.tar.gz* files were immutable, but
looking through my rsync logs, this is definitely not the case. I see
the files being created and later updated nearly daily, although not
every day: on March 8th, 1826189.tar.gz was created and never
modified. The only reference to it is here:
8 Mar 2018 19:46:40 -0700 (MST)
.d..t...... ./
>f+++++++++ 1826189.tar.gz
>f+++++++++ 1826189.tar.gz.asc
>f+++++++++ 1826189.tar.gz.sha1
Due to my belief in the immutable nature of these files, the files were
being cached without verifying whether the on-disk source had changed.
For the moment, I will cache less aggressively, which should resolve the
problem.
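
The thread does not say how the mirror's cache is implemented, so the following is only a sketch of the general idea of "verifying whether the on-disk source had changed": keep the file's size and modification time alongside the cached copy and re-read when either differs. The in-memory dictionary and the `read_cached` name are hypothetical.

```python
import os
from typing import Dict, Tuple

# Hypothetical in-memory cache: path -> ((size, mtime_ns), content)
_cache: Dict[str, Tuple[Tuple[int, int], bytes]] = {}


def read_cached(path: str) -> bytes:
    """Return file content, re-reading when size or mtime has changed."""
    info = os.stat(path)
    signature = (info.st_size, info.st_mtime_ns)
    entry = _cache.get(path)
    if entry is not None and entry[0] == signature:
        return entry[1]  # on-disk file looks unchanged, serve cached copy
    with open(path, "rb") as handle:
        content = handle.read()
    _cache[path] = (signature, content)
    return content
```

This costs one `stat` call per request. Both updates in the logs above changed size and timestamp (`s` and `t` in the itemized output), so a check like this would have noticed them.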
Can anyone confirm why the files are being modified? Is this
intentional/expected?
On 2018-03-19 07:52, Dave Jones wrote:
I found an email address in the SA archives from 2013. Hopefully this
makes it to him.
On 03/19/2018 08:33 AM, Dave Jones wrote:
Is Dave Warren on this list? If no response, does anyone have an old
email with his contact info so I can ask him how his rsyncs are set up?
Dave
On 03/19/2018 08:26 AM, bugzilla-dae...@bugzilla.spamassassin.org
wrote:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7566
Dave Jones <da...@apache.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |da...@apache.org
--- Comment #2 from Dave Jones <da...@apache.org> ---
I guess I can add logic to our hourly script to check sha1 values on the
latest tar.gz to catch rsync'ing issues.
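
A check along those lines might look roughly like the sketch below. It is not the project's actual hourly script. It assumes update tarballs are named `<number>.tar.gz` with a `<number>.tar.gz.sha1` file beside them whose first field is the hex digest, as in the `sha1sum` / `cat` output quoted earlier in the thread. All function names are illustrative.

```python
import hashlib
import re
from pathlib import Path
from typing import Optional


def sha1_of(path: Path) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def latest_tarball(directory: Path) -> Optional[Path]:
    """Pick the highest-numbered <digits>.tar.gz in a directory."""
    numbered = [p for p in directory.glob("*.tar.gz")
                if re.fullmatch(r"\d+\.tar\.gz", p.name)]
    return max(numbered, key=lambda p: int(p.name.split(".")[0]), default=None)


def latest_matches_sidecar(directory: Path) -> bool:
    """Compare the newest tarball's digest with its .sha1 file."""
    tarball = latest_tarball(directory)
    if tarball is None:
        return False
    sidecar = tarball.with_name(tarball.name + ".sha1")
    if not sidecar.exists():
        return False
    fields = sidecar.read_text().split()
    # Only the digest is compared; the path after it names the build host's file.
    expected = fields[0].lower() if fields else ""
    return sha1_of(tarball) == expected
```

Note that this only catches a tarball that disagrees with its own .sha1 file. In the 1827165 case above both versions validated, so a republished update would pass this check.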