Re: Checksum mismatch in single repo

2023-10-27 Thread Nathan Hartman
On Fri, Oct 27, 2023 at 6:42 AM Pierre Fourès wrote:

> Hi Felix,
>
> Your SMART data looks good to me, except for the hard drive temperature.
> 53°C seems rather high to me. Still, this should not be the cause of your
> corrupted data.
>
> Two data-corruption problems on the same server, which look independent of
> each other and occurred a long time apart, remind me of a server that
> caused me a lot of trouble until I discovered it had memory defects. I
> suspected hard disk failure and/or hard drive data corruption, but couldn't
> nail it down with smartctl or with the badblocks utility. I eventually
> nailed the problem down by testing extensively with the stress utility,
> which showed that in some runs the memory was corrupting data (which ended
> up corrupting data on disk). I had to run the tests many times to spot the
> defect. Subtle defects are really hard to spot.
>
> I would advise you to do a full scan of this server to find where the
> problem is, so that you can file this trail of problems as definitively
> solved. In my situation, which was similar to yours, the problems occurred
> too far apart to justify committing resources to a thorough investigation.
> This period of uncertainty and intuitive distrust of the server caused us
> hidden costs such as stress and fatigue. Having experienced it, if that
> happened again, I would prefer to rule this out quickly rather than know
> the problem is lying dormant.
>
> Here are some links which might be relevant to you:
>   - https://en.wikipedia.org/wiki/Badblocks
>   - https://wiki.archlinux.org/title/Badblocks
>   - https://man.archlinux.org/man/stress.1
>   - https://wiki.archlinux.org/title/Stress_testing
>   - https://www.memtest.org/
>
> Best Regards,
> Pierre.
>


I can speak to RAM corruption as well. In one instance, we were
experiencing the strangest problems and blamed just about everything until
I ran the above memtest utility and it showed tremendous numbers of memory
errors. When I opened up the hardware, I found dust on and around the
memory. I cleaned that very thoroughly, put the system back together, and
ran memtest overnight or over a weekend with zero errors. Evidently, dust
can be conductive enough to act like a bunch of resistors across pins that
shouldn't have resistors across them.

As trivial as that sounds, I recommend checking for things like dust, and
since heat was mentioned, I'd check for fans that don't spin very freely. I
also recommend running memtest over a weekend, and finally, I am in the
camp that believes ECC RAM is a good idea, so I'd suggest checking
whether you are using ECC RAM.
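
As a rough way to check whether ECC error reporting is active on a Linux
host, the kernel's EDAC subsystem exposes per-controller error counters in
sysfs. The sketch below reads them. It assumes an EDAC driver for the
memory controller is loaded; the function name read_edac_counts and its
edac_root parameter are made up for this example.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs.

    An empty result means no EDAC memory controller is registered, which is
    what one would typically see without ECC RAM or without an EDAC driver.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts
```

Calling read_edac_counts() with no arguments on the host (not inside the
VM) returns an empty dict when nothing is registered. A non-zero corrected
count that keeps growing would be a reason to look at the memory more
closely.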

Hope this helps,
Nathan


Re: Checksum mismatch in single repo

2023-10-27 Thread Pierre Fourès
Hi Felix,

Your SMART data looks good to me, except for the hard drive temperature.
53°C seems rather high to me. Still, this should not be the cause of your
corrupted data.

Two data-corruption problems on the same server, which look independent of
each other and occurred a long time apart, remind me of a server that
caused me a lot of trouble until I discovered it had memory defects. I
suspected hard disk failure and/or hard drive data corruption, but couldn't
nail it down with smartctl or with the badblocks utility. I eventually
nailed the problem down by testing extensively with the stress utility,
which showed that in some runs the memory was corrupting data (which ended
up corrupting data on disk). I had to run the tests many times to spot the
defect. Subtle defects are really hard to spot.
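
A cheap complement to stress or memtest is to hash the same large file
many times and check that every pass agrees. This is only a sketch, not a
replacement for a real memory test: the helper names digest_of and
count_unstable_reads are invented here, and the file to hash is whatever
large file you have at hand. Repeated reads of a file that fits in RAM are
typically served from the page cache, so a disagreement between passes
would hint at memory or CPU trouble rather than at the disk.

```python
import hashlib


def digest_of(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def count_unstable_reads(path, passes=20):
    """Hash the same file `passes` times.

    Returns how many of the later passes disagree with the first one.
    On healthy hardware, with nobody writing to the file, this is 0.
    """
    reference = digest_of(path)
    return sum(1 for _ in range(passes - 1) if digest_of(path) != reference)
```

As with the stress runs described above, a single clean run proves little.
It would need to be repeated many times before a subtle defect shows up.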

I would advise you to do a full scan of this server to find where the
problem is, so that you can file this trail of problems as definitively
solved. In my situation, which was similar to yours, the problems occurred
too far apart to justify committing resources to a thorough investigation.
This period of uncertainty and intuitive distrust of the server caused us
hidden costs such as stress and fatigue. Having experienced it, if that
happened again, I would prefer to rule this out quickly rather than know
the problem is lying dormant.

Here are some links which might be relevant to you:
  - https://en.wikipedia.org/wiki/Badblocks
  - https://wiki.archlinux.org/title/Badblocks
  - https://man.archlinux.org/man/stress.1
  - https://wiki.archlinux.org/title/Stress_testing
  - https://www.memtest.org/

Best Regards,
Pierre.


Re: Checksum mismatch in single repo

2023-10-27 Thread Felix Natter

Hello Daniel,

thank you for your quick answer. I reply inline:

On 27.10.23 08:23, Daniel Sahlberg wrote:
> On Fri, 27 Oct 2023 at 07:30, Felix Natter wrote:
>
> > Dear svn experts,
> >
> > I do a daily dump+backup of my svn server. Without any known trigger
> > (no server crash, except about 2 months ago I had a single I/O error
> > on the ProxMox virtualization server), the dump of one repo failed
> > with:
> >
> > svnadmin: E200046: LZ4 decompression failed
> >
> > The svnadmin verify I ran to double check that also failed for that
> > one repo:
> >
> > verifying /repos/X/Y...
> > * Error verifying repository metadata.
> > svnadmin: E160004: Checksum mismatch in item at offset 18983705 of
> > length 11921122 bytes in file /repos/X/Y/db/revs/0/221
> >
> > After I restored X/Y from the last backup, and ran a
> > dump/backup/verify, everything has been fine for 4 days now.
>
> Good thing you did the dump/backup and verify steps!
>
> Do I understand the issue occurred about a week ago, you restored the
> backup and now it has been working fine for the last 4 days? As
> compared with the known I/O error 2 months ago (i.e., a lot earlier)?

Yes, the I/O error occurred earlier and did not have consequences for
"svnadmin dump/verify".

With the current (4 days ago) corruption, I did not see any I/O errors.
SMART is also green (please see below).

> > I couldn't find an error in the system logs (especially no I/O
> > errors).
> > The repos are on a HDD (in my experience they last longer than SSDs
> > with lots of write activity, i.e. daily dumps/backups/etc...).
> >
> > Question: Can I rule out software failure?
>
> It is difficult to rule out, but there are not many reports of this
> failure so I would guess it is more likely to be a corrupted bit of
> data on your HDD.

Ok, thanks.

> > I am running svn 1.14.1
> > on ALMA Linux 8.x. Shall I install on a new HDD?
>
> You should probably check the SMART stats on the drive (on the
> virtualisation host!) or any other indications you might have on an
> upcoming failure to see if the HDD is indeed the issue.

I do not see a single problem in "smartctl -a /dev/sda" (I started a
long test with -t long earlier this week):

https://pastebin.com/7hi31CUg

But then I never identified a failing HDD using SMART...

Many Thanks and Best Regards,

Felix

> > No action needed?
> > Any other advice?
> >
> > Many Thanks in Advance and Best Regards,
> > Felix
>
> Kind regards,
> Daniel Sahlberg


--

SIDACT GmbH
Simulation Data Analysis and Compression Technologies

Felix Natter
Software Developer

Auguststraße 29
53229 Bonn
Germany

Phone    :   +49 228 5348 0430
Direct   :   +49 228 4097 7118
Email    : felix.nat...@sidact.com
Web  : http://www.sidact.com/


Re: Checksum mismatch in single repo

2023-10-27 Thread Daniel Sahlberg
On Fri, 27 Oct 2023 at 07:30, Felix Natter wrote:

> Dear svn experts,
>
> I do a daily dump+backup of my svn server. Without any known trigger
> (no server crash, except about 2 months ago I had a single I/O error on
> the ProxMox virtualization server), the dump of one repo failed with:
>
> svnadmin: E200046: LZ4 decompression failed
>
> The svnadmin verify I ran to double check that also failed for that one
> repo:
>
> verifying /repos/X/Y...
> * Error verifying repository metadata.
> svnadmin: E160004: Checksum mismatch in item at offset 18983705 of length
> 11921122 bytes in file /repos/X/Y/db/revs/0/221
>
> After I restored X/Y from the last backup, and ran a dump/backup/verify,
> everything has been fine for 4 days now.
>
Good thing you did the dump/backup and verify steps!
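
For what it's worth, a daily verify step along these lines can be scripted
so that a failing repository is reported by name. The following is a
minimal sketch, assuming svnadmin is on the PATH. The function name
verify_repos and its injectable run parameter are made up for this
example, and the repository paths are whatever your server uses.

```python
import subprocess


def verify_repos(repo_paths, run=subprocess.run):
    """Run `svnadmin verify -q` on each repository path.

    Returns {path: stderr_text} for every repository whose verify exited
    with a non-zero status; an empty dict means all of them passed.
    `run` is injectable so the function can be exercised without svnadmin.
    """
    failures = {}
    for path in repo_paths:
        result = run(
            ["svnadmin", "verify", "-q", path],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failures[path] = result.stderr.strip()
    return failures
```

Called as verify_repos(["/repos/X/Y"]), a corruption like the one reported
would be expected to come back as an entry whose text contains the E160004
message.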

Do I understand the issue occurred about a week ago, you restored the
backup and now it has been working fine for the last 4 days? As compared
with the known I/O error 2 months ago (i.e., a lot earlier)?


> I couldn't find an error in the system logs (especially no I/O errors).
> The repos are on a HDD (in my experience they last longer than SSDs
> with lots of write activity, i.e. daily dumps/backups/etc...).
>
> Question: Can I rule out software failure?
>
It is difficult to rule out, but there are not many reports of this failure
so I would guess it is more likely to be a corrupted bit of data on your
HDD.
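
If the corrupted rev file was kept before the restore, comparing it with
the copy from the backup could show how much data actually differs. The
sketch below assumes both copies are available. diff_byte_range is a
made-up helper name, and the offset and length would come from the
svnadmin error message. As a rule of thumb only, a handful of flipped bits
would fit a memory fault better, while a long run of differing bytes would
point more towards storage.

```python
def diff_byte_range(path_a, path_b, offset, length, chunk_size=1 << 20):
    """Compare `length` bytes starting at `offset` in two files.

    Returns (differing_bytes, differing_bits, first_diff_offset); the last
    value is None when the two ranges are identical.
    """
    diff_bytes = diff_bits = 0
    first = None
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        fa.seek(offset)
        fb.seek(offset)
        pos, remaining = offset, length
        while remaining > 0:
            a = fa.read(min(chunk_size, remaining))
            b = fb.read(min(chunk_size, remaining))
            if not a and not b:
                break  # both files ended before the requested range did
            if len(a) != len(b):
                raise ValueError("files differ in size within the requested range")
            if a != b:  # only walk byte by byte when the chunks differ
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        diff_bytes += 1
                        diff_bits += bin(x ^ y).count("1")
                        if first is None:
                            first = pos + i
            pos += len(a)
            remaining -= len(a)
    return diff_bytes, diff_bits, first
```

With the numbers from the error message this would be called with
offset=18983705 and length=11921122 on the two copies of db/revs/0/221.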


> I am running svn 1.14.1
> on ALMA Linux 8.x. Shall I install on a new HDD?
>
You should probably check the SMART stats on the drive (on the
virtualisation host!) or any other indications you might have of an
upcoming failure to see if the HDD is indeed the issue.
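
When scripting such a check, note that smartctl reports its findings as a
bit mask in its exit status, so a non-zero status is not a single error
code. A small sketch of decoding it follows. The descriptions are
paraphrased from the smartctl man page, and the names SMARTCTL_EXIT_BITS
and decode_smartctl_status are made up for this example.

```python
# Meaning of each bit in smartctl's exit status (paraphrased from smartctl(8)).
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART or other ATA command failed, or a SMART data checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_smartctl_status(status):
    """Return the descriptions of all bits set in a smartctl exit status."""
    return [msg for bit, msg in SMARTCTL_EXIT_BITS.items() if status & (1 << bit)]
```

The status could be obtained on the virtualisation host with, for example,
subprocess.run(["smartctl", "-a", "/dev/sda"]).returncode. An empty list
means smartctl itself flagged nothing, which does not rule out a failing
drive.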


> No action needed?
> Any other advice?
>
> Many Thanks in Advance and Best Regards,
> Felix
>

Kind regards,
Daniel Sahlberg