When I first set up my mirror, I was told "Because the items are release artifacts, they are never altered or removed, just added.", so I configured my caching around this design. This does not seem to hold in general; it wasn't just a one-off problem with 1827131.

Looking at the original 1827131 issue, I was suspicious: rsync ran over 200 times between when I first got the file and when rsync later updated it, so it seemed unlikely that an incomplete file somehow slipped through. rsync doesn't normally allow this, but even if it did, the file should have been fixed within 5 minutes, not 18 hours later.

From logs, I can tell that 1827131 and 1827165 were each written, then ~18 hours later, updated. With 1827165 I was able to capture the original files including the .sha1, which rules out any sort of incomplete copy as the .sha1 wouldn't match if there was any sort of error or truncation during the copy process.
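
For anyone who wants to repeat the .sha1 check programmatically, here is a minimal sketch. The function names are hypothetical, and it assumes the .sha1 file sits next to the tarball and that its first whitespace-separated token is the hex digest, as in the listings quoted in this thread.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha1_file(tarball, sha1_file=None):
    """Compare a tarball against its companion .sha1 file.

    Assumption: the first token in the .sha1 file is the hex digest;
    anything after it (a file name or path) is ignored.
    """
    tarball = Path(tarball)
    if sha1_file is None:
        sha1_file = tarball.with_name(tarball.name + ".sha1")
    tokens = Path(sha1_file).read_text().split()
    if not tokens:
        return False  # empty .sha1 file, nothing to compare against
    return sha1_of(tarball) == tokens[0].lower()
```

A truncated or partially copied tarball would fail this comparison, which is why a matching .sha1 on both the first and second versions points to two distinct, complete publications.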

tl;dr:

My concern is this: Is it expected behaviour that the files are first published, then later an updated version of the same update number is published?

And second: Given that the scores are modified between these two versions, what is the impact on SpamAssassin users who obtain the first vs the second? There are several score differences, some significant.



On 2018-03-20 11:27, Kevin A. McGrail wrote:
Not sure what you mean. Original ticket was about 1827131.tar.gz

--
Kevin A. McGrail
Asst. Treasurer & VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171

On Tue, Mar 20, 2018 at 1:10 PM, Dave Warren <d...@thedave.ca> wrote:

Interestingly, the file really is changing and this wasn't just a poorly
timed copy. Check this out:

Date: Mon, 19 Mar 2018 02:38:08 -0600 (MDT), the files were created:

.d..t...... ./
   >f+++++++++ 1827165.tar.gz
   >f+++++++++ 1827165.tar.gz.asc
   >f+++++++++ 1827165.tar.gz.sha1

And the .sha1 hash validates (which obviously wouldn't happen if I had an
incomplete copy).

# sha1sum 1827165.tar.gz;cat 1827165.tar.gz.sha1
a3abb2aad004a3401acfad9167e77b0ca31ef9c4  1827165.tar.gz
a3abb2aad004a3401acfad9167e77b0ca31ef9c4 /usr/local/spamassassin/automc/tmp/stage/3.4.2/update.tgz


Date: Mon, 19 Mar 2018 20:48:17 -0600 (MDT), the files were updated:

   >f.st...... 1827165.tar.gz
   >f..t...... 1827165.tar.gz.asc
   >f.st...... 1827165.tar.gz.sha1

And once again, the .sha1 hash validates the new file:

# sha1sum 1827165.tar.gz;cat 1827165.tar.gz.sha1
ea74b1eb682bbb25c2028ffe01a8e20bd1943885  1827165.tar.gz
ea74b1eb682bbb25c2028ffe01a8e20bd1943885 /usr/local/spamassassin/automc/tmp/mkupdate-with-scores/1827165.tar.gz


I don't know if any of this is actually a problem, but it's not what I
expected to see.

If anyone is curious, I placed copies of the files named -first and
-second as appropriate, including uncompressed copies of the .tar.gz files.
The files are here: https://mirrors.razx.cloud/sa-update-backup/1827165/

This is curiosity more than anything else at this stage. I will keep my
caching less aggressive so that updated files are picked up.


On 2018-03-19 13:36, Kevin A. McGrail wrote:

I would guess you caught it mid-copy and it arose because of the caching.
Just a guess, but glad we know what's going on.

On Mon, Mar 19, 2018, 15:09 Dave Warren <d...@thedave.ca> wrote:

Howdy. I'm on this list.

Okay, so this is a bit odd: it looks like the file 1827131.tar.gz was
actually modified by rsync many hours after the initial write:

Date: Sun, 18 Mar 2018 02:36:30 -0600 (MDT)
.d..t...... ./
   >f+++++++++ 1827131.tar.gz
   >f+++++++++ 1827131.tar.gz.asc
   >f+++++++++ 1827131.tar.gz.sha1

My cron runs every 5 minutes (with up to 220 seconds variability). I see
"MIRROR.CHECK" being updated (at 03:18, 04:21, 05:23, 06:18, etc.),
confirming rsync was running.


1827131.tar.gz is modified just over 18 hours later:

Date: Sun, 18 Mar 2018 20:47:39 -0600 (MDT)
   >f.st...... 1827131.tar.gz
   >f..t...... 1827131.tar.gz.asc
   >f.st...... 1827131.tar.gz.sha1
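
For readers less familiar with the flag strings in these logs: they are rsync's itemized-changes output, an 11-character code of the form YXcstpoguax. The sketch below decodes the slots that matter here. The function and table names are hypothetical, and the `u` slot is given a generic label because its meaning differs between rsync versions.

```python
# Attribute slots 3..11 of rsync's itemized string (YXcstpoguax).
ATTR_NAMES = [
    "checksum",     # c
    "size",         # s
    "mtime",        # t
    "permissions",  # p
    "owner",        # o
    "group",        # g
    "u-slot",       # u: meaning depends on the rsync version
    "acl",          # a
    "xattr",        # x
]


def decode_itemized(flags):
    """Decode an 11-character rsync itemized-changes string."""
    if len(flags) < 11:
        raise ValueError("expected at least 11 characters, got %r" % flags)
    update, ftype, attrs = flags[0], flags[1], flags[2:11]
    created = set(attrs) == {"+"}  # all '+' means a newly created item
    changed = []
    if not created:
        # '.' (or a space) marks an unchanged attribute
        changed = [n for n, ch in zip(ATTR_NAMES, attrs) if ch not in ". "]
    return {"update": update, "type": ftype,
            "created": created, "changed": changed}


print(decode_itemized(">f+++++++++"))
print(decode_itemized(">f.st......"))
```

Read this way, `>f+++++++++` is a newly received file, while `>f.st......` is an existing file received again because its size and modification time both differ. The `.asc` line, `>f..t......`, shows only a modification time change.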

I was under the impression that the *.tar.gz* files were immutable, but
my rsync logs show this is definitely not the case: I see the files
being created and later updated nearly daily. It is not every day,
though. On March 8th, 1826189.tar.gz was created and never modified;
the only reference to it is here:

8 Mar 2018 19:46:40 -0700 (MST)
.d..t...... ./
   >f+++++++++ 1826189.tar.gz
   >f+++++++++ 1826189.tar.gz.asc
   >f+++++++++ 1826189.tar.gz.sha1


Due to my belief in the immutable nature of these files, the files were
being cached without verifying whether the on-disk source had changed.
For the moment, I will cache less aggressively which should resolve the
problem.
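
As an illustration of what less aggressive caching could look like, here is a minimal sketch that revalidates each cached file against the size and modification time of the on-disk source before serving it. This is not the actual mirror configuration; the class and function names are hypothetical, and a real deployment would more likely rely on the web server's or proxy's own revalidation settings.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    size: int
    mtime_ns: int


def fingerprint(path):
    """Cheap change detector: size plus modification time."""
    st = os.stat(path)
    return Fingerprint(st.st_size, st.st_mtime_ns)


class RevalidatingCache:
    """In-memory cache that re-reads a file when its fingerprint changes."""

    def __init__(self):
        self._entries = {}  # path -> (Fingerprint, bytes)

    def get(self, path):
        current = fingerprint(path)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == current:
            return cached[1]  # source unchanged, serve the cached copy
        with open(path, "rb") as fh:
            data = fh.read()
        self._entries[path] = (current, data)
        return data
```

Size and modification time are the same two attributes that rsync flagged as changed (`s` and `t`) in the logs above, so a check like this would have noticed both republished tarballs.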


Can anyone confirm why the files are being modified? Is this
intentional/expected?




On 2018-03-19 07:52, Dave Jones wrote:

I found an email address in the SA archives from 2013.  Hopefully this
makes it to him.

On 03/19/2018 08:33 AM, Dave Jones wrote:

Is Dave Warren on this list?  If no response, does anyone have an old
email with his contact info so I can ask him how his rsyncs are set up?

Dave

On 03/19/2018 08:26 AM, bugzilla-dae...@bugzilla.spamassassin.org wrote:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7566

Dave Jones <da...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |da...@apache.org

--- Comment #2 from Dave Jones <da...@apache.org> ---
I guess I can add logic to our hourly script to check sha1 values on
the latest tar.gz to catch rsync'ing issues.









