Hi Sebastian,
actually it's hard to tell what's happening with this OSD... Maybe it's
less fragmented and hence benefits from sequential reading. IIRC
you're using spinning drives, which are very susceptible to access
patterns.
Thanks,
Igor
On 3/17/2022 11:54 PM, Sebastian Mazza wrote:
Hi Igor,
thank you very much for your explanation. I much appreciate it.
You were right, as always. :-)
There was not a single corrupted object. I ran `time ceph-bluestore-tool
fsck --path /var/lib/ceph/osd/ceph-$X` and `time ceph-bluestore-tool fsck
--path /var/lib/ceph/osd/ceph-$X --deep yes`…
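For reference, a minimal sketch of that check run over several OSDs; the OSD
IDs, the default /var/lib/ceph/osd paths, and the systemd unit names are
assumptions for a typical package-based install (an OSD must be stopped before
fsck can open its store):
```
# Hypothetical loop over local OSD IDs; adjust IDs and paths to the cluster.
for X in 0 1 2; do
    systemctl stop ceph-osd@$X          # fsck needs exclusive access
    # plain fsck: checks BlueStore metadata only
    time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X
    # deep fsck: also reads all object data and verifies checksums
    time ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-$X --deep yes
    systemctl start ceph-osd@$X
done
```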
Hi Sebastian,
I don't think you have got tons of corrupted objects. The tricky thing
about the bug is that corruption might occur only if new allocation happened
in a pretty short period: when the OSD is starting but hasn't applied
deferred writes yet. This mostly applies to BlueFS/RocksDB perfo…
Hi Igor,
great that you were able to reproduce it!
I did read your comments on issue #54547. Am I right that I probably have
hundreds of corrupted objects on my EC pools (CephFS and RBD)? But I only ever
noticed it when a RocksDB was damaged. A deep scrub should find the other errors,
right?
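For completeness, a sketch of how deep scrubs can be requested explicitly and
the results inspected; the OSD and PG IDs below are placeholders:
```
# ask one OSD to deep-scrub the PGs it is primary for
ceph osd deep-scrub osd.7

# or target a single placement group (placeholder PG id)
ceph pg deep-scrub 2.1a

# any inconsistencies found by scrubbing surface here
ceph health detail
rados list-inconsistent-obj 2.1a --format=json-pretty
```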
Hi Sebastian,
the proper parameter name is 'osd fast shutdown'.
As with any other OSD config parameter, one can use either ceph.conf or the
'ceph config set osd.N osd_fast_shutdown false' command to adjust it.
I'd recommend the latter form.
And yeah, from my last experiments it looks like setting…
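Putting the two forms Igor mentions side by side; a sketch assuming a release
with the centralized config database (the osd.7 id is a placeholder):
```
# preferred: via the monitors' config database
ceph config set osd.7 osd_fast_shutdown false    # one OSD
ceph config set osd osd_fast_shutdown false      # all OSDs
ceph config get osd.7 osd_fast_shutdown          # verify

# ceph.conf alternative (takes effect after an OSD restart):
#   [osd]
#   osd fast shutdown = false
```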
Hello Igor,
I'm glad I could be of help. Thank you for your explanation!
> And I was right, this is related to the deferred write procedure and apparently
> fast shutdown mode.
Does that mean I can prevent the error in the meantime, before you can fix the
root cause, by disabling osd_fast_shutdown?
Hi Igor!
I hope I've hit the jackpot now. I have logs with OSD debug level 20 for
bluefs, bdev, and bluestore. The log file ceph-osd.4.log shows 2 consecutive
startups of osd.4, where the second startup results in:
```
rocksdb: Corruption: Bad table magic number: expected 98635183903770…
```
Hi Sebastian,
I submitted a ticket, https://tracker.ceph.com/issues/54409, which shows
my analysis based on your previous log (from 21-02-2022), which wasn't
verbose enough at the debug-bluestore level to draw a final conclusion.
Unfortunately the last logs (from 24-02-2022) you shared don't include…
Hi Igor,
I let Ceph rebuild OSD.7. Then I added
```
[osd]
debug bluefs = 20
debug bdev = 20
debug bluestore = 20
```
to the ceph.conf of all 3 nodes and shut down all 3 nodes without writing
anything to the pools on the HDDs (the Debian VM was not even running).
Immediately…
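As an aside, the same debug levels can usually be raised at runtime instead of
via ceph.conf; a sketch using `ceph tell`, assuming all OSDs are up:
```
# raise BlueStore-related debug levels on all running OSDs
ceph tell osd.* config set debug_bluefs 20
ceph tell osd.* config set debug_bdev 20
ceph tell osd.* config set debug_bluestore 20
```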
Hi Alexander,
thank you for your suggestion! All my nodes have ECC memory. However, I have
now checked that it was recognized correctly on every system (dmesg | grep
EDAC). Furthermore, I checked whether an error had occurred by using `edac-util`
and also by searching the logs of the mainboard BMCs. Ev…
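For anyone repeating these ECC checks, a sketch of the commands involved,
assuming the edac-utils package is installed:
```
# confirm the kernel's EDAC driver detected the memory controllers
dmesg | grep -i edac

# EDAC driver status and recorded correctable/uncorrectable error counts
edac-util --status
edac-util --report=full
```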
I have another suggestion: check the RAM, just in case, with memtest86
or https://github.com/martinwhitaker/pcmemtest (which is a fork of
memtest86+). Ignore the suggestion if you have ECC RAM.
On Tue, 22 Feb 2022 at 15:45, Igor Fedotov wrote:
>
> Hi Sebastian,
>
> On 2/22/2022 3:01 AM, Sebastian Mazza
Hi Sebastian,
On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
Hey Igor!
thanks a lot for the new logs - looks like they provide some insight.
I'm glad the logs are helpful.
At this point I think the root cause is apparently a race between deferred
write replay and some DB maintenance task…
Hey Igor!
> thanks a lot for the new logs - looks like they provide some insight.
I'm glad the logs are helpful.
> At this point I think the root cause is apparently a race between deferred
> write replay and some DB maintenance task happening on OSD startup. It seems
> that deferred write…
Hey Sebastian,
thanks a lot for the new logs - looks like they provide some insight.
At this point I think the root cause is apparently a race between
deferred write replay and some DB maintenance task happening on OSD
startup. It seems that deferred write replay updates a block extent
which…
Hi Igor,
today (21-02-2022) at 13:49:28.452+0100, I crashed OSD 7 again. And this
time I have logs with "debug bluefs = 20" and "debug bdev = 20" for every OSD
in the cluster! It was the OSD with ID 7 again. So the HDD has now failed
for the third time! Coincidence? Probably not…
The import…
Hi Igor,
please find the startup log under the following link:
https://we.tl/t-E6CadpW1ZL
It also includes the "normal" log of that OSD from the day before the crash
and the RocksDB sst file with the "Bad table magic number" (db/001922.sst).
Best regards,
Sebastian
> On 21.02.2022, at
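Since the damaged db/001922.sst was shared, one way to examine such a file
independently of Ceph is RocksDB's sst_dump tool; a sketch, assuming an
sst_dump binary from a matching RocksDB build is available:
```
# verify the SST file's structure and checksums
sst_dump --file=db/001922.sst --command=check

# print table properties (including the table magic number)
sst_dump --file=db/001922.sst --show_properties
```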
Hi Sebastian,
could you please share the failing OSD's startup log?
Thanks,
Igor
On 2/20/2022 5:10 PM, Sebastian Mazza wrote:
Hi Igor,
it happened again. One of the OSDs that crashed last time has a corrupted
RocksDB again. Unfortunately I do not have debug logs from the OSDs again. I
was collecting…
Hi Igor,
it happened again. One of the OSDs that crashed last time has a corrupted
RocksDB again. Unfortunately I do not have debug logs from the OSDs again. I
was collecting hundreds of gigabytes of OSD debug logs over the last two months.
But this week, I disabled the debug logging, because I d…
On 1/26/2022 1:18 AM, Sebastian Mazza wrote:
Hey Igor,
thank you for your response!
Do you suggest disabling the HDD write caching and / or bluefs_buffered_io
for production clusters?
Generally the upstream recommendation is to disable disk write caching; there were
multiple complaints it might negatively impact the performance…
Hey Igor,
thank you for your response!
>>
>> Do you suggest disabling the HDD write caching and / or
>> bluefs_buffered_io for production clusters?
>>
> Generally the upstream recommendation is to disable disk write caching; there
> were multiple complaints it might negatively impact the performance…
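A sketch of how both knobs are commonly turned off; the device name is a
placeholder, and the hdparm setting does not persist across reboots without a
udev rule or similar:
```
# disable the volatile write cache on a SATA drive (placeholder device)
hdparm -W 0 /dev/sdb
hdparm -W /dev/sdb          # verify: "write-caching = 0 (off)"

# disable buffered BlueFS reads for all OSDs
# (may need an OSD restart to take effect, depending on the release)
ceph config set osd bluefs_buffered_io false
```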
Hey Sebastian,
thanks a lot for the update, please see more questions inline.
Thanks,
Igor
On 1/22/2022 2:13 AM, Sebastian Mazza wrote:
Hey Igor,
thank you for your response and your suggestions.
I've tried to simulate every imaginable load that the cluster might have done
before the three OSDs crashed.
Hey Igor,
thank you for your response and your suggestions.
>> I've tried to simulate every imaginable load that the cluster might have
>> done before the three OSDs crashed.
>> I rebooted the servers many times while the cluster was under load. If more
>> than a single node was rebooted at the same time…
Hey Sebastian,
thanks a lot for your help and the update.
On 1/21/2022 4:58 PM, Sebastian Mazza wrote:
Hi Igor,
I want to give you a short update, since I have now tried for quite some time
to reproduce the problem as you suggested. I've tried to simulate every
imaginable load that the cluster might…
Hi Igor,
I want to give you a short update, since I have now tried for quite some time
to reproduce the problem as you suggested. I've tried to simulate every
imaginable load that the cluster might have done before the three OSDs crashed.
I rebooted the servers many times while the cluster was under load…
Hi Mazzystr,
thank you very much for your suggestion! The OSDs did find the bluestore block
device and I do not use any USB drives. All failed OSDs are on SATA drives
connected to AMD CPUs / chipsets.
It now seems clear that the problem is that one of the RocksDBs is corrupted on
each of the failed…
Hey Sebastian,
On 12/22/2021 1:53 AM, Sebastian Mazza wrote:
9) Would you be able to run some long-lasting (and potentially data-corrupting)
experiments on this cluster in an attempt to pinpoint the issue? I'm thinking
about periodic OSD shutdowns under load to catch the corrupting event…
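A minimal sketch of such an experiment, assuming systemd-managed OSDs and a
client workload (e.g. rados bench) running in parallel; the OSD id and sleep
intervals are placeholders:
```
# periodically restart one OSD under load to catch the
# corruption window around startup / deferred-write replay
while true; do
    systemctl stop ceph-osd@7
    sleep 30                 # keep it down briefly
    systemctl start ceph-osd@7
    sleep 300                # let deferred writes accumulate again
done
```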
Hi Igor,
I now fixed my wrong OSD debug config to:
```
[osd.7]
debug bluefs = 20
debug bdev = 20
```
and you can download the debug log from: https://we.tl/t-3e4do1PQGj
Thanks,
Sebastian
> On 21.12.2021, at 19:44, Igor Fedotov wrote:
>
> Hi Sebastian,
>
> first of all I'm not sure…
Hi Mehmet,
thank you for your suggestion. I have now checked the kernel log, but I didn't
see anything interesting. However, I copied the parts that seem to be related to
the SATA disks of the failed OSDs. Maybe you see more than I do.
[1.815801] ata7: SATA link down (SStatus 0 SControl 300)
Hi Sebastian,
first of all I'm not sure this issue has the same root cause as Francois'
one. Highly likely it's just another BlueFS/RocksDB data corruption,
which is indicated in the same way.
In this respect I would rather mention this one, reported just yesterday:
https://lists.ceph.io/hyperk
Hi,
This
> fsck failed: (5) Input/output error
Sounds like a hardware issue.
Did you have a look at "dmesg"?
Hth
Mehmet
On 21 December 2021 at 17:47:35 CET, Sebastian Mazza wrote:
>Hi all,
>
>after a reboot of a cluster, 3 OSDs cannot be started. The OSDs exit with the
>following error message…