On Fri, Oct 27, 2023 at 6:42 AM Pierre Fourès <pierre.fou...@gmail.com> wrote:
> Hi Felix,
>
> Your SMART data looks good to me, except for the hard drive temperature.
> Experiencing 53°C looks quite a lot to me. Yet, this should not be the
> cause of your corrupted data.
>
> Two data-corruption problems on the same server which look independent
> from each other, and occurred at quite a long interval from each
> other, remind me of a server that caused me lots of trouble until I
> discovered it had memory defects. I suspected hard disk failure and/or hard
> drive data corruption, but couldn't nail it with smartctl nor with the
> badblocks utility. I eventually nailed the problem when doing extensive
> tests with the stress utility, showing that in some runs, the memory was
> corrupting data (which ended up corrupting data on disk). I had to run the
> tests many times to spot the defect. Subtle defects are really hard to
> spot.
>
> IMO, I would advise you to do a full scan of this server to spot where the
> problem is in order to file this trail of problems as definitively solved.
> In my situation, similar to yours, the problems occurred too distantly
> from each other to commit resources to investigate thoroughly. This period
> of uncertainty and intuitive distrust of the server caused us hidden
> costs like stress and fatigue. Having experienced it, if that happened
> again, I would prefer to rule out this situation quickly instead of knowing
> it dormant.
>
> Here are some links which might be relevant to you:
> - https://en.wikipedia.org/wiki/Badblocks
> - https://wiki.archlinux.org/title/Badblocks
> - https://man.archlinux.org/man/stress.1
> - https://wiki.archlinux.org/title/Stress_testing
> - https://www.memtest.org/
>
> Best Regards,
> Pierre.

I can speak to RAM corruption as well. In one instance, we were experiencing the strangest problems and blamed just about everything until I ran the above memtest utility and it showed tremendous numbers of memory errors.
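
For anyone wondering what a memory test does in principle, here is a minimal Python sketch of the write-then-verify idea. It is only an illustration under my own assumptions, not how memtest is implemented: the function name `pattern_test`, its parameters, and the four bit patterns are hypothetical choices. A user-space script can only touch the pages the operating system hands it, so this is no substitute for booting memtest and letting it run for hours.

```python
def pattern_test(size_bytes: int, passes: int = 1) -> int:
    """Fill a buffer with known bit patterns and count bytes that read back wrong.

    Hypothetical helper for illustration only; returns the total number of
    mismatched bytes seen across all passes and patterns.
    """
    mismatches = 0
    # All zeros, all ones, and two alternating-bit patterns.
    patterns = (0x00, 0xFF, 0xAA, 0x55)
    for _ in range(passes):
        for pattern in patterns:
            # Allocate a buffer in which every byte holds the pattern.
            buf = bytearray([pattern]) * size_bytes
            # Read the buffer back: every byte should still equal the pattern.
            mismatches += size_bytes - buf.count(pattern)
    return mismatches


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # 16 MiB per pattern, an arbitrary choice
    bad = pattern_test(size, passes=2)
    print(f"mismatched bytes: {bad}")
```

On healthy hardware the count should be zero. A zero result from a short run like this proves very little, which is why repeated, long-running tests are what finally expose subtle defects.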
When I opened up the hardware, I found dust on and around the memory. I cleaned that very thoroughly, put the system back together, and ran memtest overnight or over a weekend with zero errors. Evidently, dust can be conductive enough to act like a bunch of resistors across pins that shouldn't have resistors across them.

As trivial as that sounds, I recommend checking for things like dust, and since heat was mentioned, I'd check for fans that don't spin very freely. I also recommend running memtest over a weekend. Finally, I am with the camp that believes ECC RAM is a good idea, so I'd suggest checking whether you are using ECC RAM.

Hope this helps,
Nathan