Re: [ceph-users] mon service failed to start

2018-02-23 Thread Behnam Loghmani
Finally, the problem was solved by replacing all the hardware of the failing server except the hard disks. The last test I did before replacing the server was cross-exchanging the SSDs between the failing server (node A) and one of the healthy servers (node B) and recreating the cluster. In this test w

Re: [ceph-users] mon service failed to start

2018-02-22 Thread David Turner
Did you remove and recreate the OSDs that used the SSD for their WAL/DB? Or did you try to do something to not have to do that? That is an integral part of the OSD and changing the SSD would destroy the OSDs involved unless you attempted some sort of dd. If you did that, then any corruption for t

Re: [ceph-users] mon service failed to start

2018-02-22 Thread Behnam Loghmani
Hi Brian, The issue started with the mon service failing, and after that both OSDs on that node failed to start. The mon service is on an SSD, and the WAL/DB of the OSDs is on that SSD too, with LVM. I have replaced the SSD with a new one and changed the SATA port and cable, but the problem still remains. All disk te

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Brian :
Hello, Wasn't this originally an issue with the mon store, and now you are getting a checksum error from an OSD? I think some hardware in this node is just hosed. On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani wrote: > Hi there, > > I changed SATA port and cable of SSD disk and also update ceph t

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
Hi there, I changed the SATA port and cable of the SSD, updated Ceph to version 12.2.3, and rebuilt the OSDs, but when recovery starts the OSDs fail with this error: 2018-02-21 21:12:18.037974 7f3479fe2d00 -1 bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum at blob offset
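[Editor's illustration] The `_verify_csum bad crc32c/0x1000` message above suggests a CRC-32C checksum kept per 0x1000-byte (4 KiB) chunk of a blob. The following is a minimal Python sketch of that general technique, not BlueStore's code: the names `crc32c`, `block_checksums` and `find_bad_blocks` are hypothetical, and the CRC uses the standard CRC-32C parameters, which may differ from Ceph's own seed and finalization.

```python
# Illustrative sketch only: per-block CRC-32C verification.
# Function names are hypothetical; this is not Ceph's implementation.

BLOCK_SIZE = 0x1000  # 4 KiB, matching the "/0x1000" in the log line


def _make_table():
    """Build the lookup table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Standard CRC-32C (initial value and final XOR of 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def block_checksums(blob: bytes) -> list:
    """Compute one checksum per BLOCK_SIZE chunk of the blob."""
    return [crc32c(blob[off:off + BLOCK_SIZE])
            for off in range(0, len(blob), BLOCK_SIZE)]


def find_bad_blocks(blob: bytes, stored: list) -> list:
    """Return the byte offsets of chunks whose checksum differs from the stored one."""
    current = block_checksums(blob)
    return [i * BLOCK_SIZE
            for i, (got, want) in enumerate(zip(current, stored))
            if got != want]


if __name__ == "__main__":
    blob = bytes(range(256)) * 48          # 12 KiB, i.e. three chunks
    stored = block_checksums(blob)         # checksums recorded at write time
    damaged = bytearray(blob)
    damaged[0x1000 + 5] ^= 0x01            # flip one bit in the second chunk
    print([hex(off) for off in find_bad_blocks(bytes(damaged), stored)])
```

A single flipped bit anywhere between write and read (RAM, controller, cable, or the device itself) is enough to make such a check fail, which is why a checksum error alone does not say which component is at fault.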

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
But the disks pass all the tests with smartctl and badblocks, and there isn't any error on the disks. Because the SSD contains the WAL/DB of the OSDs, it's difficult to test it on other cluster nodes. On Wed, Feb 21, 2018 at 4:58 PM, wrote: > Could the problem be related with some faulty hardware (RAID-controller,

Re: [ceph-users] mon service failed to start

2018-02-21 Thread knawnd
Could the problem be related to some faulty hardware (RAID controller, port, cable) rather than the disk? Does the "faulty" disk work OK on another server? Behnam Loghmani wrote on 21/02/18 16:09: Hi there, I changed the SSD on the problematic node with the new one and reconfigure OSDs and MON service o

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
Hi there, I replaced the SSD on the problematic node with a new one and reconfigured the OSDs and the MON service on it, but the problem occurred again with: "rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 2" I am completely confused now. On Tue, Feb 20, 2018 at 5:16 PM, Be

Re: [ceph-users] mon service failed to start

2018-02-20 Thread Behnam Loghmani
Hi Caspar, I checked the filesystem and there isn't any error on it. The disk is an SSD and it doesn't have any attribute related to wear level in smartctl, and the filesystem is mounted with default options and no discard. My Ceph structure on this node is like this: it has osd, mon, rgw services 1 SS

Re: [ceph-users] mon service failed to start

2018-02-19 Thread Caspar Smit
Hi Behnam, I would first recommend running a filesystem check on the monitor disk to see if there are any inconsistencies. Is the disk the monitor is running on a spinning disk or an SSD? If it is an SSD, you should check the wear level stats through smartctl. Maybe trim (discard) enabled on th

Re: [ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
I checked the disk that the monitor is on with smartctl; it didn't return any error and it doesn't have any Current_Pending_Sector. Do you recommend any disk checks to confirm that this disk has a problem, so that I can send the report to the provider to get the disk replaced? On Sat, Feb 17, 2018

Re: [ceph-users] mon service failed to start

2018-02-16 Thread Gregory Farnum
The disk that the monitor is on...there isn't anything for you to configure about a monitor WAL though so I'm not sure how that enters into it? On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani wrote: > Thanks for your reply > > Do you mean, that's the problem with the disk I use for WAL and DB?

Re: [ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
Thanks for your reply. Do you mean that the problem is with the disk I use for WAL and DB? On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum wrote: > > On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani > wrote: > >> Hi there, >> >> I have a Ceph cluster version 12.2.2 on CentOS 7. >> >> It is a te

Re: [ceph-users] mon service failed to start

2018-02-16 Thread Gregory Farnum
On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani wrote: > Hi there, > > I have a Ceph cluster version 12.2.2 on CentOS 7. > > It is a testing cluster and I have set it up 2 weeks ago. > after some days, I see that one of the three mons has stopped(out of > quorum) and I can't start it anymore. > I

[ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
Hi there, I have a Ceph cluster version 12.2.2 on CentOS 7. It is a testing cluster and I set it up 2 weeks ago. After some days, I saw that one of the three mons had stopped (out of quorum) and I can't start it anymore. I checked the mon service log and the output shows this error: """ mon.