Finally, the problem was solved by replacing all of the hardware of the failing
server except the hard disks.
The last test I did before replacing the server was cross-exchanging the SSD
disks between the failing server (node A) and one of the healthy servers
(node B) and recreating the cluster.
In this test w
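Before swapping SSDs between nodes like this, ceph-volume can show which OSDs keep their WAL/DB on a given device. A quick sketch (the device path is a placeholder):

  # list all LVM-based OSDs and the devices backing their data and block.db
  ceph-volume lvm list
  # or query just the SSD being swapped
  ceph-volume lvm list /dev/sdc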
Did you remove and recreate the OSDs that used the SSD for their WAL/DB?
Or did you try to do something to avoid having to do that? The WAL/DB is an
integral part of the OSD, and changing the SSD would destroy the OSDs involved
unless you attempted some sort of dd copy. If you did that, then any corruption for
t
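For reference, the destroy-and-recreate cycle that replacing a WAL/DB SSD implies looks roughly like this. A sketch only; OSD id 7 and the device paths are placeholders, and flags may differ slightly between Ceph releases:

  ceph osd out 7                            # stop new data landing on the OSD
  systemctl stop ceph-osd@7
  ceph osd purge 7 --yes-i-really-mean-it   # remove it from the CRUSH and OSD maps
  ceph-volume lvm zap /dev/sdb --destroy    # wipe the old data disk
  # recreate the OSD with its RocksDB/WAL on the replacement SSD
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc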
Hi Brian,
The issue started with the mon service failing, and after that both OSDs on
that node failed to start.
The mon service is on the SSD disk, and the WAL/DB of the OSDs are on that SSD
too, via LVM.
I have replaced the SSD disk with a new one and changed the SATA port and
cable, but the problem still remains.
All disk te
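The usual non-destructive checks for a disk like this are along these lines (the device path is a placeholder):

  smartctl -a /dev/sdc       # health status, error log, pending/reallocated sectors
  smartctl -t long /dev/sdc  # start an extended self-test; read results later with -a
  badblocks -sv /dev/sdc     # read-only surface scan; avoid -w, which overwrites data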
Hello
Wasn't this originally an issue with the mon store? Now you are getting a
checksum error from an OSD? I think some hardware in this node is just
hosed.
On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani
wrote:
> Hi there,
>
> I changed the SATA port and cable of the SSD disk and also updated Ceph t
Hi there,
I changed the SATA port and cable of the SSD disk, updated Ceph to version
12.2.3, and rebuilt the OSDs,
but when recovery starts the OSDs fail with this error:
2018-02-21 21:12:18.037974 7f3479fe2d00 -1 bluestore(/var/lib/ceph/osd/ceph-7)
_verify_csum bad crc32c/0x1000 checksum at blob offset
but the disks pass all the tests with smartctl and badblocks, and there isn't
any error on the disks. Because the SSD contains the WAL/DB of the OSDs, it's
difficult to test it on other cluster nodes.
On Wed, Feb 21, 2018 at 4:58 PM, wrote:
> Could the problem be related to some faulty hardware (RAID controller,
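When SMART and badblocks come back clean but BlueStore still reports checksum errors, the OSD's own consistency check can help localize the fault. A sketch, run against the stopped OSD, using the path from the error above:

  systemctl stop ceph-osd@7
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7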
Could the problem be related to some faulty hardware (RAID controller, port, cable) but not the disk?
Does the "faulty" disk work OK in another server?
Behnam Loghmani wrote on 21/02/18 16:09:
Hi there,
I replaced the SSD on the problematic node with a new one and reconfigured
OSDs and the MON service o
Hi there,
I replaced the SSD on the problematic node with a new one and reconfigured the
OSDs and MON service on it,
but the problem occurred again with:
"rocksdb: submit_transaction error: Corruption: block checksum mismatch
code = 2"
I am completely confused now.
On Tue, Feb 20, 2018 at 5:16 PM, Be
Hi Caspar,
I checked the filesystem and there isn't any error on it.
The disk is an SSD and it doesn't show any attribute related to wear level in
smartctl, and the filesystem is mounted with default options and no discard.
My Ceph structure on this node is like this:
it has osd, mon, and rgw services
1 SS
Hi Behnam,
I would first recommend running a filesystem check on the monitor disk
to see if there are any inconsistencies.
Is the disk the monitor is running on a spinning disk or an SSD?
If it is an SSD, you should check the wear-level stats through smartctl.
Maybe trim (discard) is enabled on th
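Those checks might look roughly like this (the device path and mount point are placeholders; the exact SMART wear attribute name varies by SSD vendor):

  smartctl -A /dev/sdc | grep -Ei 'wear|lifetime|used'  # vendor-specific wear attribute
  findmnt -no OPTIONS /var/lib/ceph                     # look for "discard" in the mount options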
I checked the disk that the monitor is on with smartctl and it didn't return
any error, and it doesn't have any Current_Pending_Sector.
Do you recommend any disk checks to make sure that this disk has a problem,
so that I can send the report to the provider to replace the disk?
On Sat, Feb 17, 2018
The disk that the monitor is on... there isn't anything for you to configure
about a monitor WAL, though, so I'm not sure how that enters into it?
On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani
wrote:
> Thanks for your reply
>
> Do you mean that the problem is with the disk I use for WAL and DB?
Thanks for your reply
Do you mean that the problem is with the disk I use for WAL and DB?
On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum wrote:
>
> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani
> wrote:
>
>> Hi there,
>>
>> I have a Ceph cluster version 12.2.2 on CentOS 7.
>>
>> It is a te
On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani
wrote:
> Hi there,
>
> I have a Ceph cluster version 12.2.2 on CentOS 7.
>
> It is a testing cluster and I set it up 2 weeks ago.
> After some days, I saw that one of the three mons had stopped (out of
> quorum) and I can't start it anymore.
> I
Hi there,
I have a Ceph cluster version 12.2.2 on CentOS 7.
It is a testing cluster and I set it up 2 weeks ago.
After some days, I saw that one of the three mons had stopped (out of
quorum) and I can't start it anymore.
I checked the mon service log and the output shows this error:
"""
mon.
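For a mon that has dropped out of quorum, the usual first look is the quorum state plus the unit log. A sketch, with "node1" as a placeholder mon id:

  ceph quorum_status --format json-pretty    # which mons are in quorum, who is missing
  systemctl status ceph-mon@node1            # is the daemon running or crash-looping?
  journalctl -u ceph-mon@node1 --since "1 hour ago"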