Re: [ceph-users] mon service failed to start

2018-02-23 Thread Behnam Loghmani
Finally, the problem was solved by replacing all of the hardware of the failing
server except the hard disks.

The last test I did before replacing the server was to cross-exchange the SSD
disks between the failing server (node A) and one of the healthy servers
(node B) and recreate the cluster.
In this test we saw node A fail again with the "Corruption: block checksum
mismatch" error, so we concluded that something is wrong with the board of
node A and that the disks are healthy.

During troubleshooting, the following tests were performed:
1- recreating the OSDs
2- changing the SSD disk
3- changing the SATA port and cable
4- cross-exchanging the SSD disks between nodes

To those who helped me with this problem, I sincerely thank you so much.

Best regards,
Behnam Loghmani



On Thu, Feb 22, 2018 at 3:18 PM, David Turner  wrote:

> Did you remove and recreate the OSDs that used the SSD for their WAL/DB?
> Or did you try to do something to not have to do that?  That is an integral
> part of the OSD and changing the SSD would destroy the OSDs involved unless
> you attempted some sort of dd.  If you did that, then any corruption for
> the mon very well might still persist.
>
> On Thu, Feb 22, 2018 at 3:44 AM Behnam Loghmani 
> wrote:
>
>> Hi Brian,
>>
>> The issue started with failing mon service and after that both OSDs on
>> that node failed to start.
>> Mon service is on SSD disk and WAL/DB of OSDs on that SSD too with lvm.
>> I have changed SSD disk with new one, and changing SATA port and cable
>> but the problem is still remaining.
>> All disk tests are fine and disk doesn't have any error.
>>
>> On Wed, Feb 21, 2018 at 10:03 PM, Brian :  wrote:
>>
>>> Hello
>>>
>>> Wasn't this originally an issue with mon store now you are getting a
>>> checksum error from an OSD? I think some hardware here in this node is just
>>> hosed.
>>>
>>>
>>> On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani <
>>> behnam.loghm...@gmail.com> wrote:
>>>
 Hi there,

 I changed SATA port and cable of SSD disk and also update ceph to
 version 12.2.3 and rebuild OSDs
 but when recovery starts OSDs failed with this error:


 2018-02-21 21:12:18.037974 7f3479fe2d00 -1 
 bluestore(/var/lib/ceph/osd/ceph-7)
 _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x84c097b0,
 expected 0xaf1040a2, device location [0x1~1000], logical extent
 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
 2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable
 to read osd superblock
 2018-02-21 21:12:18.038009 7f3479fe2d00  1 
 bluestore(/var/lib/ceph/osd/ceph-7)
 umount
 2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
 shutdown
 2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
 2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
 [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
 centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/
 ceph-12.2.3/src/rocksdb/db/db_impl.cc:217] Shutdown: ca
 nceling all background work
 2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
 2018/02/21-21:12:18.041514) [/home/jenkins-build/build/
 workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
 AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/
 release/12.2.3/rpm/el7/BUILD/ceph-12.
 2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
 level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
 MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
 0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
 hutdown in progress: Database shutdown or Column
 2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
 2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
 "job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
 "output_level": 1, "num_output_files": 1, "total_ou
 tput_size": 902552, "num_input_records": 4470, "num_output_records":
 4377, "num_subcompactions": 1, "num_single_delete_mismatches": 0,
 "num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0,
 0]}
 2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
 {"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
 "file_number": 249}
 2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
 [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
 centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/
 ceph-12.2.3/src/rocksdb/db/db_impl.cc:343] Shutdown com
 plete
 2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
 2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
 shutdown
 2018-02-21 

Re: [ceph-users] mon service failed to start

2018-02-22 Thread David Turner
Did you remove and recreate the OSDs that used the SSD for their WAL/DB?
Or did you try to do something to not have to do that?  That is an integral
part of the OSD and changing the SSD would destroy the OSDs involved unless
you attempted some sort of dd.  If you did that, then any corruption for
the mon very well might still persist.
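
For reference, a minimal sketch of what "remove and recreate" means here,
assuming a ceph-volume lvm deployment and hypothetical names (osd id 7, data
device /dev/sdb, DB/WAL LVs vg0/db-b and vg0/wal-b); adapt the ids and devices
to the actual layout:

  ceph osd out 7
  systemctl stop ceph-osd@7
  ceph osd purge 7 --yes-i-really-mean-it      # drops the CRUSH entry, auth key and OSD id
  ceph-volume lvm zap /dev/sdb                 # wipe the old data device
  ceph-volume lvm create --bluestore --data /dev/sdb \
      --block.db vg0/db-b --block.wal vg0/wal-b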

On Thu, Feb 22, 2018 at 3:44 AM Behnam Loghmani 
wrote:

> Hi Brian,
>
> The issue started with failing mon service and after that both OSDs on
> that node failed to start.
> Mon service is on SSD disk and WAL/DB of OSDs on that SSD too with lvm.
> I have changed SSD disk with new one, and changing SATA port and cable but
> the problem is still remaining.
> All disk tests are fine and disk doesn't have any error.
>
> On Wed, Feb 21, 2018 at 10:03 PM, Brian :  wrote:
>
>> Hello
>>
>> Wasn't this originally an issue with mon store now you are getting a
>> checksum error from an OSD? I think some hardware here in this node is just
>> hosed.
>>
>>
>> On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani <
>> behnam.loghm...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I changed SATA port and cable of SSD disk and also update ceph to
>>> version 12.2.3 and rebuild OSDs
>>> but when recovery starts OSDs failed with this error:
>>>
>>>
>>> 2018-02-21 21:12:18.037974 7f3479fe2d00 -1
>>> bluestore(/var/lib/ceph/osd/ceph-7) _verify_csum bad crc32c/0x1000 checksum
>>> at blob offset 0x0, got 0x84c097b0, expected 0xaf1040a2, device location
>>> [0x1~1000], logical extent 0x0~1000, object
>>> #-1:7b3f43c4:::osd_superblock:0#
>>> 2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable
>>> to read osd superblock
>>> 2018-02-21 21:12:18.038009 7f3479fe2d00  1
>>> bluestore(/var/lib/ceph/osd/ceph-7) umount
>>> 2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
>>> shutdown
>>> 2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
>>> 2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
>>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:217]
>>> Shutdown: ca
>>> nceling all background work
>>> 2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
>>> 2018/02/21-21:12:18.041514)
>>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.
>>> 2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
>>> level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
>>> MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
>>> 0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
>>> hutdown in progress: Database shutdown or Column
>>> 2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
>>> 2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
>>> "job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
>>> "output_level": 1, "num_output_files": 1, "total_ou
>>> tput_size": 902552, "num_input_records": 4470, "num_output_records":
>>> 4377, "num_subcompactions": 1, "num_single_delete_mismatches": 0,
>>> "num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0, 0]}
>>> 2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
>>> {"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
>>> "file_number": 249}
>>> 2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
>>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:343]
>>> Shutdown com
>>> plete
>>> 2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
>>> 2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
>>> shutdown
>>> 2018-02-21 21:12:18.043784 7f3479fe2d00  1 stupidalloc 0x0x55e991f05db0
>>> shutdown
>>> 2018-02-21 21:12:18.043786 7f3479fe2d00  1 stupidalloc 0x0x55e991f05e20
>>> shutdown
>>> 2018-02-21 21:12:18.043826 7f3479fe2d00  1 bdev(0x55e992254600
>>> /dev/vg0/wal-b) close
>>> 2018-02-21 21:12:18.301531 7f3479fe2d00  1 bdev(0x55e992255800
>>> /dev/vg0/db-b) close
>>> 2018-02-21 21:12:18.545488 7f3479fe2d00  1 bdev(0x55e992254400
>>> /var/lib/ceph/osd/ceph-7/block) close
>>> 2018-02-21 21:12:18.650473 7f3479fe2d00  1 bdev(0x55e992254000
>>> /var/lib/ceph/osd/ceph-7/block) close
>>> 2018-02-21 21:12:18.93 7f3479fe2d00 -1  ** ERROR: osd init failed:
>>> (22) Invalid argument
>>>
>>>
>>> On Wed, Feb 21, 2018 at 5:06 PM, Behnam Loghmani <
>>> behnam.loghm...@gmail.com> wrote:
>>>
 but disks pass all the tests with smartctl, badblocks and there isn't
 any error on disks. because the ssd has contain WAL/DB of OSDs it's
 difficult to test it on 

Re: [ceph-users] mon service failed to start

2018-02-22 Thread Behnam Loghmani
Hi Brian,

The issue started with the mon service failing, and after that both OSDs on
that node failed to start.
The mon service is on the SSD disk, and the WAL/DB of the OSDs are on that SSD
too, via LVM.
I have replaced the SSD disk with a new one and changed the SATA port and
cable, but the problem still remains.
All disk tests pass and the disk doesn't show any errors.

On Wed, Feb 21, 2018 at 10:03 PM, Brian :  wrote:

> Hello
>
> Wasn't this originally an issue with mon store now you are getting a
> checksum error from an OSD? I think some hardware here in this node is just
> hosed.
>
>
> On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani <
> behnam.loghm...@gmail.com> wrote:
>
>> Hi there,
>>
>> I changed SATA port and cable of SSD disk and also update ceph to version
>> 12.2.3 and rebuild OSDs
>> but when recovery starts OSDs failed with this error:
>>
>>
>> 2018-02-21 21:12:18.037974 7f3479fe2d00 -1 
>> bluestore(/var/lib/ceph/osd/ceph-7)
>> _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x84c097b0,
>> expected 0xaf1040a2, device location [0x1~1000], logical extent
>> 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
>> 2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable
>> to read osd superblock
>> 2018-02-21 21:12:18.038009 7f3479fe2d00  1 
>> bluestore(/var/lib/ceph/osd/ceph-7)
>> umount
>> 2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
>> shutdown
>> 2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
>> 2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
>> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MA
>> CHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:217]
>> Shutdown: ca
>> nceling all background work
>> 2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
>> 2018/02/21-21:12:18.041514) [/home/jenkins-build/build/wor
>> kspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
>> E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.
>> 3/rpm/el7/BUILD/ceph-12.
>> 2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
>> level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
>> MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
>> 0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
>> hutdown in progress: Database shutdown or Column
>> 2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
>> 2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
>> "job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
>> "output_level": 1, "num_output_files": 1, "total_ou
>> tput_size": 902552, "num_input_records": 4470, "num_output_records":
>> 4377, "num_subcompactions": 1, "num_single_delete_mismatches": 0,
>> "num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0, 0]}
>> 2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
>> {"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
>> "file_number": 249}
>> 2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
>> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
>> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MA
>> CHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:343]
>> Shutdown com
>> plete
>> 2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
>> 2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
>> shutdown
>> 2018-02-21 21:12:18.043784 7f3479fe2d00  1 stupidalloc 0x0x55e991f05db0
>> shutdown
>> 2018-02-21 21:12:18.043786 7f3479fe2d00  1 stupidalloc 0x0x55e991f05e20
>> shutdown
>> 2018-02-21 21:12:18.043826 7f3479fe2d00  1 bdev(0x55e992254600
>> /dev/vg0/wal-b) close
>> 2018-02-21 21:12:18.301531 7f3479fe2d00  1 bdev(0x55e992255800
>> /dev/vg0/db-b) close
>> 2018-02-21 21:12:18.545488 7f3479fe2d00  1 bdev(0x55e992254400
>> /var/lib/ceph/osd/ceph-7/block) close
>> 2018-02-21 21:12:18.650473 7f3479fe2d00  1 bdev(0x55e992254000
>> /var/lib/ceph/osd/ceph-7/block) close
>> 2018-02-21 21:12:18.93 7f3479fe2d00 -1  ** ERROR: osd init failed:
>> (22) Invalid argument
>>
>>
>> On Wed, Feb 21, 2018 at 5:06 PM, Behnam Loghmani <
>> behnam.loghm...@gmail.com> wrote:
>>
>>> but disks pass all the tests with smartctl, badblocks and there isn't
>>> any error on disks. because the ssd has contain WAL/DB of OSDs it's
>>> difficult to test it on other cluster nodes
>>>
>>> On Wed, Feb 21, 2018 at 4:58 PM,  wrote:
>>>
 Could the problem be related with some faulty hardware
 (RAID-controller, port, cable) but not disk? Does "faulty" disk works OK on
 other server?

 Behnam Loghmani wrote on 21/02/18 16:09:

> Hi there,
>
> I changed the SSD on the problematic node with the new one and
> reconfigure OSDs and MON service on it.
> but the problem occurred again with:
>

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Brian :
Hello

Wasn't this originally an issue with the mon store, and now you are getting a
checksum error from an OSD? I think some hardware in this node is just hosed.


On Wed, Feb 21, 2018 at 5:46 PM, Behnam Loghmani 
wrote:

> Hi there,
>
> I changed SATA port and cable of SSD disk and also update ceph to version
> 12.2.3 and rebuild OSDs
> but when recovery starts OSDs failed with this error:
>
>
> 2018-02-21 21:12:18.037974 7f3479fe2d00 -1 bluestore(/var/lib/ceph/osd/ceph-7)
> _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x84c097b0,
> expected 0xaf1040a2, device location [0x1~1000], logical extent
> 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
> 2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable to
> read osd superblock
> 2018-02-21 21:12:18.038009 7f3479fe2d00  1 bluestore(/var/lib/ceph/osd/ceph-7)
> umount
> 2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
> shutdown
> 2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
> 2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
> MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:217]
> Shutdown: ca
> nceling all background work
> 2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
> 2018/02/21-21:12:18.041514) [/home/jenkins-build/build/wor
> kspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABL
> E_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.
> 2.3/rpm/el7/BUILD/ceph-12.
> 2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
> level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
> MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
> 0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
> hutdown in progress: Database shutdown or Column
> 2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
> 2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
> "job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
> "output_level": 1, "num_output_files": 1, "total_ou
> tput_size": 902552, "num_input_records": 4470, "num_output_records": 4377,
> "num_subcompactions": 1, "num_single_delete_mismatches": 0,
> "num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0, 0]}
> 2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
> "file_number": 249}
> 2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
> MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/ceph-12.2.3/src/rocksdb/db/db_impl.cc:343]
> Shutdown com
> plete
> 2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
> 2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
> shutdown
> 2018-02-21 21:12:18.043784 7f3479fe2d00  1 stupidalloc 0x0x55e991f05db0
> shutdown
> 2018-02-21 21:12:18.043786 7f3479fe2d00  1 stupidalloc 0x0x55e991f05e20
> shutdown
> 2018-02-21 21:12:18.043826 7f3479fe2d00  1 bdev(0x55e992254600
> /dev/vg0/wal-b) close
> 2018-02-21 21:12:18.301531 7f3479fe2d00  1 bdev(0x55e992255800
> /dev/vg0/db-b) close
> 2018-02-21 21:12:18.545488 7f3479fe2d00  1 bdev(0x55e992254400
> /var/lib/ceph/osd/ceph-7/block) close
> 2018-02-21 21:12:18.650473 7f3479fe2d00  1 bdev(0x55e992254000
> /var/lib/ceph/osd/ceph-7/block) close
> 2018-02-21 21:12:18.93 7f3479fe2d00 -1  ** ERROR: osd init failed:
> (22) Invalid argument
>
>
> On Wed, Feb 21, 2018 at 5:06 PM, Behnam Loghmani <
> behnam.loghm...@gmail.com> wrote:
>
>> but disks pass all the tests with smartctl, badblocks and there isn't any
>> error on disks. because the ssd has contain WAL/DB of OSDs it's difficult
>> to test it on other cluster nodes
>>
>> On Wed, Feb 21, 2018 at 4:58 PM,  wrote:
>>
>>> Could the problem be related with some faulty hardware (RAID-controller,
>>> port, cable) but not disk? Does "faulty" disk works OK on other server?
>>>
>>> Behnam Loghmani wrote on 21/02/18 16:09:
>>>
 Hi there,

 I changed the SSD on the problematic node with the new one and
 reconfigure OSDs and MON service on it.
 but the problem occurred again with:

 "rocksdb: submit_transaction error: Corruption: block checksum mismatch
 code = 2"

 I get fully confused now.



 On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani <
 behnam.loghm...@gmail.com > wrote:

 Hi Caspar,

 I checked the filesystem and there isn't any error on filesystem.
 The disk is SSD and it doesn't any attribute related to Wear level
 in smartctl and filesystem is
 mounted with default options and 

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
Hi there,

I changed the SATA port and cable of the SSD disk, updated Ceph to version
12.2.3, and rebuilt the OSDs,
but when recovery starts the OSDs fail with this error:


2018-02-21 21:12:18.037974 7f3479fe2d00 -1 bluestore(/var/lib/ceph/osd/ceph-7)
_verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x84c097b0,
expected 0xaf1040a2, device location [0x1~1000], logical extent
0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2018-02-21 21:12:18.038002 7f3479fe2d00 -1 osd.7 0 OSD::init() : unable to
read osd superblock
2018-02-21 21:12:18.038009 7f3479fe2d00  1 bluestore(/var/lib/ceph/osd/ceph-7)
umount
2018-02-21 21:12:18.038282 7f3479fe2d00  1 stupidalloc 0x0x55e99236c620
shutdown
2018-02-21 21:12:18.038308 7f3479fe2d00  1 freelist shutdown
2018-02-21 21:12:18.038336 7f3479fe2d00  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/
ceph-12.2.3/src/rocksdb/db/db_impl.cc:217] Shutdown: ca
nceling all background work
2018-02-21 21:12:18.041561 7f3465561700  4 rocksdb: (Original Log Time
2018/02/21-21:12:18.041514) [/home/jenkins-build/build/
workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/
AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/
release/12.2.3/rpm/el7/BUILD/ceph-12.
2.3/src/rocksdb/db/compaction_job.cc:621] [default] compacted to: base
level 1 max bytes base 268435456 files[5 0 0 0 0 0 0] max score 0.00,
MB/sec: 2495.2 rd, 10.1 wr, level 1, files in(5, 0) out(1) MB in(213.6,
0.0) out(0.9), read-write-amplify(1.0) write-amplify(0.0) S
hutdown in progress: Database shutdown or Column
2018-02-21 21:12:18.041569 7f3465561700  4 rocksdb: (Original Log Time
2018/02/21-21:12:18.041545) EVENT_LOG_v1 {"time_micros": 1519234938041530,
"job": 3, "event": "compaction_finished", "compaction_time_micros": 89747,
"output_level": 1, "num_output_files": 1, "total_ou
tput_size": 902552, "num_input_records": 4470, "num_output_records": 4377,
"num_subcompactions": 1, "num_single_delete_mismatches": 0,
"num_single_delete_fallthrough": 44, "lsm_state": [5, 0, 0, 0, 0, 0, 0]}
2018-02-21 21:12:18.041663 7f3479fe2d00  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1519234938041657, "job": 4, "event": "table_file_deletion",
"file_number": 249}
2018-02-21 21:12:18.042144 7f3479fe2d00  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
centos7/MACHINE_SIZE/huge/release/12.2.3/rpm/el7/BUILD/
ceph-12.2.3/src/rocksdb/db/db_impl.cc:343] Shutdown com
plete
2018-02-21 21:12:18.043474 7f3479fe2d00  1 bluefs umount
2018-02-21 21:12:18.043775 7f3479fe2d00  1 stupidalloc 0x0x55e991f05d40
shutdown
2018-02-21 21:12:18.043784 7f3479fe2d00  1 stupidalloc 0x0x55e991f05db0
shutdown
2018-02-21 21:12:18.043786 7f3479fe2d00  1 stupidalloc 0x0x55e991f05e20
shutdown
2018-02-21 21:12:18.043826 7f3479fe2d00  1 bdev(0x55e992254600
/dev/vg0/wal-b) close
2018-02-21 21:12:18.301531 7f3479fe2d00  1 bdev(0x55e992255800
/dev/vg0/db-b) close
2018-02-21 21:12:18.545488 7f3479fe2d00  1 bdev(0x55e992254400
/var/lib/ceph/osd/ceph-7/block) close
2018-02-21 21:12:18.650473 7f3479fe2d00  1 bdev(0x55e992254000
/var/lib/ceph/osd/ceph-7/block) close
2018-02-21 21:12:18.93 7f3479fe2d00 -1  ** ERROR: osd init failed: (22)
Invalid argument
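
As a next step I can do a full read pass over the WAL/DB volumes and check the
kernel log, to see whether the corruption really comes from the device
underneath BlueStore; a sketch only, using the LV names from the log above:

  ceph-volume lvm list                     # confirm which LVs back block/db/wal for each OSD
  dd if=/dev/vg0/db-b of=/dev/null bs=1M   # full read pass; a media error would surface here
  dd if=/dev/vg0/wal-b of=/dev/null bs=1M
  dmesg | grep -iE 'ata|i/o error|blk'     # look for link resets, CRC or I/O errors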


On Wed, Feb 21, 2018 at 5:06 PM, Behnam Loghmani 
wrote:

> but disks pass all the tests with smartctl, badblocks and there isn't any
> error on disks. because the ssd has contain WAL/DB of OSDs it's difficult
> to test it on other cluster nodes
>
> On Wed, Feb 21, 2018 at 4:58 PM,  wrote:
>
>> Could the problem be related with some faulty hardware (RAID-controller,
>> port, cable) but not disk? Does "faulty" disk works OK on other server?
>>
>> Behnam Loghmani wrote on 21/02/18 16:09:
>>
>>> Hi there,
>>>
>>> I changed the SSD on the problematic node with the new one and
>>> reconfigure OSDs and MON service on it.
>>> but the problem occurred again with:
>>>
>>> "rocksdb: submit_transaction error: Corruption: block checksum mismatch
>>> code = 2"
>>>
>>> I get fully confused now.
>>>
>>>
>>>
>>> On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani <
>>> behnam.loghm...@gmail.com > wrote:
>>>
>>> Hi Caspar,
>>>
>>> I checked the filesystem and there isn't any error on filesystem.
>>> The disk is SSD and it doesn't any attribute related to Wear level
>>> in smartctl and filesystem is
>>> mounted with default options and no discard.
>>>
>>> my ceph structure on this node is like this:
>>>
>>> it has osd,mon,rgw services
>>> 1 SSD for OS and WAL/DB
>>> 2 HDD
>>>
>>> OSDs are created by ceph-volume lvm.
>>>
>>> the whole SSD is on 1 vg.
>>> OS is on root lv
>>> OSD.1 DB is on db-a
>>> OSD.1 WAL is on wal-a
>>> OSD.2 DB is on db-b
>>> OSD.2 WAL is on wal-b
>>>
>>> output of lvs:
>>>
>>> 

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
But the disks pass all the tests with smartctl and badblocks, and there aren't
any errors on the disks. Because the SSD contains the WAL/DB of the OSDs, it is
difficult to test it on other cluster nodes.
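
The tests referred to are roughly of this form; a sketch with a hypothetical
device name /dev/sdc, using only non-destructive modes:

  smartctl -a /dev/sdc        # SMART health, attributes and error log
  smartctl -t long /dev/sdc   # start an extended self-test; read the result later with -a
  badblocks -sv /dev/sdc      # read-only surface scan (default mode); never use -w on a disk holding data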

On Wed, Feb 21, 2018 at 4:58 PM,  wrote:

> Could the problem be related with some faulty hardware (RAID-controller,
> port, cable) but not disk? Does "faulty" disk works OK on other server?
>
> Behnam Loghmani wrote on 21/02/18 16:09:
>
>> Hi there,
>>
>> I changed the SSD on the problematic node with the new one and
>> reconfigure OSDs and MON service on it.
>> but the problem occurred again with:
>>
>> "rocksdb: submit_transaction error: Corruption: block checksum mismatch
>> code = 2"
>>
>> I get fully confused now.
>>
>>
>>
>> On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani <
>> behnam.loghm...@gmail.com > wrote:
>>
>> Hi Caspar,
>>
>> I checked the filesystem and there isn't any error on filesystem.
>> The disk is SSD and it doesn't any attribute related to Wear level in
>> smartctl and filesystem is
>> mounted with default options and no discard.
>>
>> my ceph structure on this node is like this:
>>
>> it has osd,mon,rgw services
>> 1 SSD for OS and WAL/DB
>> 2 HDD
>>
>> OSDs are created by ceph-volume lvm.
>>
>> the whole SSD is on 1 vg.
>> OS is on root lv
>> OSD.1 DB is on db-a
>> OSD.1 WAL is on wal-a
>> OSD.2 DB is on db-b
>> OSD.2 WAL is on wal-b
>>
>> output of lvs:
>>
>>data-a data-a -wi-a-
>>data-b data-b -wi-a-
>>db-a   vg0-wi-a-
>>db-b   vg0-wi-a-
>>root   vg0-wi-ao
>>wal-a  vg0-wi-a-
>>wal-b  vg0-wi-a-
>>
>> after making a heavy write on the radosgw, OSD.1 and OSD.2 has
>> stopped with "block checksum
>> mismatch" error.
>> Now on this node MON and OSDs services has stopped working with this
>> error
>>
>> I think my issue is related to this bug:
>> http://tracker.ceph.com/issues/22102
>> 
>>
>> I ran
>> #ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1 --deep 1
>> but it returns the same error:
>>
>> *** Caught signal (Aborted) **
>>   in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block
>> checksum mismatch
>>   ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
>> luminous (stable)
>>   1: (()+0x3eb0b1) [0x55f779e6e0b1]
>>   2: (()+0xf5e0) [0x7fbf61ae15e0]
>>   3: (gsignal()+0x37) [0x7fbf604d31f7]
>>   4: (abort()+0x148) [0x7fbf604d48e8]
>>   5: (RocksDBStore::get(std::string const&, char const*, unsigned
>> long,
>> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>>   6: (BlueStore::Collection::get_onode(ghobject_t const&,
>> bool)+0x545) [0x55f779cd8f75]
>>   7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>>   8: (main()+0xde0) [0x55f779baab90]
>>   9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>>   10: (()+0x1bc59f) [0x55f779c3f59f]
>> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal
>> (Aborted) **
>>   in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>>
>>   ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
>> luminous (stable)
>>   1: (()+0x3eb0b1) [0x55f779e6e0b1]
>>   2: (()+0xf5e0) [0x7fbf61ae15e0]
>>   3: (gsignal()+0x37) [0x7fbf604d31f7]
>>   4: (abort()+0x148) [0x7fbf604d48e8]
>>   5: (RocksDBStore::get(std::string const&, char const*, unsigned
>> long,
>> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>>   6: (BlueStore::Collection::get_onode(ghobject_t const&,
>> bool)+0x545) [0x55f779cd8f75]
>>   7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>>   8: (main()+0xde0) [0x55f779baab90]
>>   9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>>   10: (()+0x1bc59f) [0x55f779c3f59f]
>>   NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>>  -1> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort:
>> Corruption: block checksum mismatch
>>   0> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal
>> (Aborted) **
>>   in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>>
>>   ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
>> luminous (stable)
>>   1: (()+0x3eb0b1) [0x55f779e6e0b1]
>>   2: (()+0xf5e0) [0x7fbf61ae15e0]
>>   3: (gsignal()+0x37) [0x7fbf604d31f7]
>>   4: (abort()+0x148) [0x7fbf604d48e8]
>>   5: (RocksDBStore::get(std::string const&, char const*, unsigned
>> long,
>> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>>   6: (BlueStore::Collection::get_onode(ghobject_t const&,
>> bool)+0x545) [0x55f779cd8f75]
>>   7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>>   8: (main()+0xde0) [0x55f779baab90]
>>   9: (__libc_start_main()+0xf5) 

Re: [ceph-users] mon service failed to start

2018-02-21 Thread knawnd
Could the problem be related to some faulty hardware (RAID controller, port, cable) rather than the disk itself?
Does the "faulty" disk work OK in another server?


Behnam Loghmani wrote on 21/02/18 16:09:

Hi there,

I changed the SSD on the problematic node with the new one and reconfigure OSDs 
and MON service on it.
but the problem occurred again with:

"rocksdb: submit_transaction error: Corruption: block checksum mismatch code = 
2"

I get fully confused now.



On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani > wrote:


Hi Caspar,

I checked the filesystem and there isn't any error on filesystem.
The disk is SSD and it doesn't any attribute related to Wear level in 
smartctl and filesystem is
mounted with default options and no discard.

my ceph structure on this node is like this:

it has osd,mon,rgw services
1 SSD for OS and WAL/DB
2 HDD

OSDs are created by ceph-volume lvm.

the whole SSD is on 1 vg.
OS is on root lv
OSD.1 DB is on db-a
OSD.1 WAL is on wal-a
OSD.2 DB is on db-b
OSD.2 WAL is on wal-b

output of lvs:

   data-a data-a -wi-a-
   data-b data-b -wi-a-
   db-a   vg0    -wi-a-
   db-b   vg0    -wi-a-
   root   vg0    -wi-ao
   wal-a  vg0    -wi-a-
   wal-b  vg0    -wi-a-

after making a heavy write on the radosgw, OSD.1 and OSD.2 has stopped with 
"block checksum
mismatch" error.
Now on this node MON and OSDs services has stopped working with this error

I think my issue is related to this bug: 
http://tracker.ceph.com/issues/22102


I ran
#ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1 --deep 1
but it returns the same error:

*** Caught signal (Aborted) **
  in thread 7fbf6c923d00 thread_name:ceph-bluestore-
2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block 
checksum mismatch
  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable)
  1: (()+0x3eb0b1) [0x55f779e6e0b1]
  2: (()+0xf5e0) [0x7fbf61ae15e0]
  3: (gsignal()+0x37) [0x7fbf604d31f7]
  4: (abort()+0x148) [0x7fbf604d48e8]
  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) 
[0x55f779cd8f75]
  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
  8: (main()+0xde0) [0x55f779baab90]
  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
  10: (()+0x1bc59f) [0x55f779c3f59f]
2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal (Aborted) **
  in thread 7fbf6c923d00 thread_name:ceph-bluestore-

  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable)
  1: (()+0x3eb0b1) [0x55f779e6e0b1]
  2: (()+0xf5e0) [0x7fbf61ae15e0]
  3: (gsignal()+0x37) [0x7fbf604d31f7]
  4: (abort()+0x148) [0x7fbf604d48e8]
  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) 
[0x55f779cd8f75]
  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
  8: (main()+0xde0) [0x55f779baab90]
  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
  10: (()+0x1bc59f) [0x55f779c3f59f]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
to interpret this.

     -1> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: 
block checksum mismatch
  0> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal 
(Aborted) **
  in thread 7fbf6c923d00 thread_name:ceph-bluestore-

  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable)
  1: (()+0x3eb0b1) [0x55f779e6e0b1]
  2: (()+0xf5e0) [0x7fbf61ae15e0]
  3: (gsignal()+0x37) [0x7fbf604d31f7]
  4: (abort()+0x148) [0x7fbf604d48e8]
  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545) 
[0x55f779cd8f75]
  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
  8: (main()+0xde0) [0x55f779baab90]
  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
  10: (()+0x1bc59f) [0x55f779c3f59f]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
to interpret this.



Could you please help me to recover this node or find a way to prove SSD 
disk problem.

Best regards,
Behnam Loghmani




On Mon, Feb 19, 2018 at 1:35 PM, Caspar Smit > wrote:

Hi Behnam,

I would firstly recommend running a filesystem check on the monitor 
disk first to see if
there are any inconsistencies.

Is the disk 

Re: [ceph-users] mon service failed to start

2018-02-21 Thread Behnam Loghmani
Hi there,

I replaced the SSD on the problematic node with a new one and reconfigured the
OSDs and MON service on it,
but the problem occurred again with:

"rocksdb: submit_transaction error: Corruption: block checksum mismatch
code = 2"

I am completely confused now.



On Tue, Feb 20, 2018 at 5:16 PM, Behnam Loghmani 
wrote:

> Hi Caspar,
>
> I checked the filesystem and there isn't any error on filesystem.
> The disk is SSD and it doesn't any attribute related to Wear level in
> smartctl and filesystem is mounted with default options and no discard.
>
> my ceph structure on this node is like this:
>
> it has osd,mon,rgw services
> 1 SSD for OS and WAL/DB
> 2 HDD
>
> OSDs are created by ceph-volume lvm.
>
> the whole SSD is on 1 vg.
> OS is on root lv
> OSD.1 DB is on db-a
> OSD.1 WAL is on wal-a
> OSD.2 DB is on db-b
> OSD.2 WAL is on wal-b
>
> output of lvs:
>
>   data-a data-a -wi-a-
>
>   data-b data-b -wi-a-
>   db-a   vg0-wi-a-
>
>   db-b   vg0-wi-a-
>
>   root   vg0-wi-ao
>
>   wal-a  vg0-wi-a-
>
>   wal-b  vg0-wi-a-
>
> after making a heavy write on the radosgw, OSD.1 and OSD.2 has stopped
> with "block checksum mismatch" error.
> Now on this node MON and OSDs services has stopped working with this error
>
> I think my issue is related to this bug: http://tracker.ceph.com/
> issues/22102
>
> I ran
> #ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1 --deep 1
> but it returns the same error:
>
> *** Caught signal (Aborted) **
>  in thread 7fbf6c923d00 thread_name:ceph-bluestore-
> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block
> checksum mismatch
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (()+0x3eb0b1) [0x55f779e6e0b1]
>  2: (()+0xf5e0) [0x7fbf61ae15e0]
>  3: (gsignal()+0x37) [0x7fbf604d31f7]
>  4: (abort()+0x148) [0x7fbf604d48e8]
>  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
> [0x55f779cd8f75]
>  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>  8: (main()+0xde0) [0x55f779baab90]
>  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>  10: (()+0x1bc59f) [0x55f779c3f59f]
> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal (Aborted) **
>  in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (()+0x3eb0b1) [0x55f779e6e0b1]
>  2: (()+0xf5e0) [0x7fbf61ae15e0]
>  3: (gsignal()+0x37) [0x7fbf604d31f7]
>  4: (abort()+0x148) [0x7fbf604d48e8]
>  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
> [0x55f779cd8f75]
>  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>  8: (main()+0xde0) [0x55f779baab90]
>  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>  10: (()+0x1bc59f) [0x55f779c3f59f]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
> -1> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption:
> block checksum mismatch
>  0> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal
> (Aborted) **
>  in thread 7fbf6c923d00 thread_name:ceph-bluestore-
>
>  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>  1: (()+0x3eb0b1) [0x55f779e6e0b1]
>  2: (()+0xf5e0) [0x7fbf61ae15e0]
>  3: (gsignal()+0x37) [0x7fbf604d31f7]
>  4: (abort()+0x148) [0x7fbf604d48e8]
>  5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
> ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
>  6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
> [0x55f779cd8f75]
>  7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
>  8: (main()+0xde0) [0x55f779baab90]
>  9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
>  10: (()+0x1bc59f) [0x55f779c3f59f]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed
> to interpret this.
>
>
>
> Could you please help me to recover this node or find a way to prove SSD
> disk problem.
>
> Best regards,
> Behnam Loghmani
>
>
>
>
> On Mon, Feb 19, 2018 at 1:35 PM, Caspar Smit 
> wrote:
>
>> Hi Behnam,
>>
>> I would firstly recommend running a filesystem check on the monitor disk
>> first to see if there are any inconsistencies.
>>
>> Is the disk where the monitor is running on a spinning disk or SSD?
>>
>> If SSD you should check the Wear level stats through smartctl.
>> Maybe trim (discard) enabled on the filesystem mount? (discard could
>> cause problems/corruption in combination with certain SSD firmwares)
>>
>> Caspar
>>
>> 2018-02-16 23:03 GMT+01:00 Behnam Loghmani :
>>
>>> I checked the disk that monitor is on it with smartctl and it didn't
>>> return any error and it doesn't have any 

Re: [ceph-users] mon service failed to start

2018-02-20 Thread Behnam Loghmani
Hi Caspar,

I checked the filesystem and there aren't any errors on it.
The disk is an SSD; it doesn't expose any wear-level attribute in smartctl, and
the filesystem is mounted with default options and no discard.

My Ceph layout on this node is as follows:

it has osd,mon,rgw services
1 SSD for OS and WAL/DB
2 HDD

OSDs are created by ceph-volume lvm.

the whole SSD is on 1 vg.
OS is on root lv
OSD.1 DB is on db-a
OSD.1 WAL is on wal-a
OSD.2 DB is on db-b
OSD.2 WAL is on wal-b

output of lvs:

  data-a data-a -wi-a-
  data-b data-b -wi-a-
  db-a   vg0    -wi-a-
  db-b   vg0    -wi-a-
  root   vg0    -wi-ao
  wal-a  vg0    -wi-a-
  wal-b  vg0    -wi-a-
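
The OSDs were created against these LVs with ceph-volume, roughly like this
(a sketch assuming the vg/lv names shown above):

  ceph-volume lvm create --bluestore --data data-a/data-a --block.db vg0/db-a --block.wal vg0/wal-a
  ceph-volume lvm create --bluestore --data data-b/data-b --block.db vg0/db-b --block.wal vg0/wal-b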

After heavy writes through the radosgw, OSD.1 and OSD.2 stopped with a "block
checksum mismatch" error.
Now the MON and OSD services on this node have stopped working with this error.

I think my issue is related to this bug:
http://tracker.ceph.com/issues/22102

I ran
#ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1 --deep 1
but it returns the same error:

*** Caught signal (Aborted) **
 in thread 7fbf6c923d00 thread_name:ceph-bluestore-
2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block
checksum mismatch
 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
 1: (()+0x3eb0b1) [0x55f779e6e0b1]
 2: (()+0xf5e0) [0x7fbf61ae15e0]
 3: (gsignal()+0x37) [0x7fbf604d31f7]
 4: (abort()+0x148) [0x7fbf604d48e8]
 5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
 6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
[0x55f779cd8f75]
 7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
 8: (main()+0xde0) [0x55f779baab90]
 9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
 10: (()+0x1bc59f) [0x55f779c3f59f]
2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal (Aborted) **
 in thread 7fbf6c923d00 thread_name:ceph-bluestore-

 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
 1: (()+0x3eb0b1) [0x55f779e6e0b1]
 2: (()+0xf5e0) [0x7fbf61ae15e0]
 3: (gsignal()+0x37) [0x7fbf604d31f7]
 4: (abort()+0x148) [0x7fbf604d48e8]
 5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
 6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
[0x55f779cd8f75]
 7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
 8: (main()+0xde0) [0x55f779baab90]
 9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
 10: (()+0x1bc59f) [0x55f779c3f59f]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

-1> 2018-02-20 16:44:30.128787 7fbf6c923d00 -1 abort: Corruption: block
checksum mismatch
 0> 2018-02-20 16:44:30.131334 7fbf6c923d00 -1 *** Caught signal
(Aborted) **
 in thread 7fbf6c923d00 thread_name:ceph-bluestore-

 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
 1: (()+0x3eb0b1) [0x55f779e6e0b1]
 2: (()+0xf5e0) [0x7fbf61ae15e0]
 3: (gsignal()+0x37) [0x7fbf604d31f7]
 4: (abort()+0x148) [0x7fbf604d48e8]
 5: (RocksDBStore::get(std::string const&, char const*, unsigned long,
ceph::buffer::list*)+0x1ce) [0x55f779d2b5ce]
 6: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x545)
[0x55f779cd8f75]
 7: (BlueStore::_fsck(bool, bool)+0x1bb5) [0x55f779cf1a75]
 8: (main()+0xde0) [0x55f779baab90]
 9: (__libc_start_main()+0xf5) [0x7fbf604bfc05]
 10: (()+0x1bc59f) [0x55f779c3f59f]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.



Could you please help me recover this node, or find a way to prove that the
SSD disk is the problem?

Best regards,
Behnam Loghmani




On Mon, Feb 19, 2018 at 1:35 PM, Caspar Smit  wrote:

> Hi Behnam,
>
> I would firstly recommend running a filesystem check on the monitor disk
> first to see if there are any inconsistencies.
>
> Is the disk where the monitor is running on a spinning disk or SSD?
>
> If SSD you should check the Wear level stats through smartctl.
> Maybe trim (discard) enabled on the filesystem mount? (discard could cause
> problems/corruption in combination with certain SSD firmwares)
>
> Caspar
>
> 2018-02-16 23:03 GMT+01:00 Behnam Loghmani :
>
>> I checked the disk that monitor is on it with smartctl and it didn't
>> return any error and it doesn't have any Current_Pending_Sector.
>> Do you recommend any disk checks to make sure that this disk has problem
>> and then I can send the report to the provider for replacing the disk
>>
>> On Sat, Feb 17, 2018 at 1:09 AM, Gregory Farnum 
>> wrote:
>>
>>> The disk that the monitor is on...there isn't anything for you to
>>> configure about a monitor WAL though so I'm not sure how that enters into
>>> it?
>>>
>>> On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani <
>>> behnam.loghm...@gmail.com> wrote:
>>>
 Thanks for your reply

 Do you mean, that's the problem with the disk I use 

Re: [ceph-users] mon service failed to start

2018-02-19 Thread Caspar Smit
Hi Behnam,

I would first recommend running a filesystem check on the monitor disk to see
if there are any inconsistencies.

Is the disk the monitor is running on a spinning disk or an SSD?

If it's an SSD, you should check the wear-level stats through smartctl.
Maybe trim (discard) is enabled on the filesystem mount? (discard can cause
problems/corruption in combination with certain SSD firmware.)
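
Concretely, something along these lines (a sketch; /dev/sda and an XFS root
are assumptions, and the filesystem check must only be run while unmounted or
from rescue media):

  findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS /              # look for "discard" in the mount options
  smartctl -a /dev/sda | grep -iE 'wear|media|percent'   # wear-level attributes; names vary per vendor
  xfs_repair -n /dev/sda2                                # read-only check (use fsck -n for ext4)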

Caspar

2018-02-16 23:03 GMT+01:00 Behnam Loghmani :

> I checked the disk that monitor is on it with smartctl and it didn't
> return any error and it doesn't have any Current_Pending_Sector.
> Do you recommend any disk checks to make sure that this disk has problem
> and then I can send the report to the provider for replacing the disk
>
> On Sat, Feb 17, 2018 at 1:09 AM, Gregory Farnum 
> wrote:
>
>> The disk that the monitor is on...there isn't anything for you to
>> configure about a monitor WAL though so I'm not sure how that enters into
>> it?
>>
>> On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani <
>> behnam.loghm...@gmail.com> wrote:
>>
>>> Thanks for your reply
>>>
>>> Do you mean, that's the problem with the disk I use for WAL and DB?
>>>
>>> On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum 
>>> wrote:
>>>

 On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <
 behnam.loghm...@gmail.com> wrote:

> Hi there,
>
> I have a Ceph cluster version 12.2.2 on CentOS 7.
>
> It is a testing cluster and I have set it up 2 weeks ago.
> after some days, I see that one of the three mons has stopped(out of
> quorum) and I can't start it anymore.
> I checked the mon service log and the output shows this error:
>
> """
> mon.XX@-1(probing) e4 preinit clean up potentially inconsistent
> store state
> rocksdb: submit_transaction_sync error: Corruption: block checksum
> mismatch
>

 This bit is the important one. Your disk is bad and it’s feeding back
 corrupted data.




> code = 2 Rocksdb transaction:
>  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
> MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
> LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
> MonitorDBStore::clear(std::set&)' thread
> 7f45a1e52e40 time 2018-02-16 17:37:07.040846
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
> MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
> 581: FAILE
> D assert(r >= 0)
> """
>
> the only solution I found is to remove this mon from quorum and remove
> all mon data and re-add this mon to quorum again.
> and ceph goes to the healthy status again.
>
> but now after some days this mon has stopped and I face the same
> problem again.
>
> My cluster setup is:
> 4 osd hosts
> total 8 osds
> 3 mons
> 1 rgw
>
> this cluster has setup with ceph-volume lvm and wal/db separation on
> logical volumes.
>
> Best regards,
> Behnam Loghmani
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

>>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
I checked the disk that the monitor is on with smartctl; it didn't return any
errors and it doesn't have any Current_Pending_Sector.
Do you recommend any disk checks to confirm that this disk has a problem, so
that I can send the report to the provider to get the disk replaced?
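
Beyond the basic smartctl output, the kind of evidence I can collect for the
provider looks like this (a sketch with a hypothetical device name /dev/sda):

  smartctl -t long /dev/sda               # extended self-test; result shows up later in smartctl -a
  smartctl -l error /dev/sda              # the drive's SMART error log
  dmesg | grep -iE 'sda|ata|i/o error'    # kernel-side I/O errors, link resets, CRC errors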

On Sat, Feb 17, 2018 at 1:09 AM, Gregory Farnum  wrote:

> The disk that the monitor is on...there isn't anything for you to
> configure about a monitor WAL though so I'm not sure how that enters into
> it?
>
> On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani <
> behnam.loghm...@gmail.com> wrote:
>
>> Thanks for your reply
>>
>> Do you mean, that's the problem with the disk I use for WAL and DB?
>>
>> On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum 
>> wrote:
>>
>>>
>>> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <
>>> behnam.loghm...@gmail.com> wrote:
>>>
 Hi there,

 I have a Ceph cluster version 12.2.2 on CentOS 7.

 It is a testing cluster and I have set it up 2 weeks ago.
 after some days, I see that one of the three mons has stopped(out of
 quorum) and I can't start it anymore.
 I checked the mon service log and the output shows this error:

 """
 mon.XX@-1(probing) e4 preinit clean up potentially inconsistent
 store state
 rocksdb: submit_transaction_sync error: Corruption: block checksum
 mismatch

>>>
>>> This bit is the important one. Your disk is bad and it’s feeding back
>>> corrupted data.
>>>
>>>
>>>
>>>
 code = 2 Rocksdb transaction:
  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
 centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
 LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
 MonitorDBStore::clear(std::set&)' thread
 7f45a1e52e40 time 2018-02-16 17:37:07.040846
 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
 centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/
 ceph-12.2.2/src/mon/MonitorDBStore.h: 581: FAILE
 D assert(r >= 0)
 """

 the only solution I found is to remove this mon from quorum and remove
 all mon data and re-add this mon to quorum again.
 and ceph goes to the healthy status again.

 but now after some days this mon has stopped and I face the same
 problem again.

 My cluster setup is:
 4 osd hosts
 total 8 osds
 3 mons
 1 rgw

 this cluster has setup with ceph-volume lvm and wal/db separation on
 logical volumes.

 Best regards,
 Behnam Loghmani


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-16 Thread Gregory Farnum
The disk that the monitor is on... there isn't anything for you to configure
about a monitor WAL, though, so I'm not sure how that enters into it.

On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani 
wrote:

> Thanks for your reply
>
> Do you mean, that's the problem with the disk I use for WAL and DB?
>
> On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum 
> wrote:
>
>>
>> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <
>> behnam.loghm...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I have a Ceph cluster version 12.2.2 on CentOS 7.
>>>
>>> It is a testing cluster and I have set it up 2 weeks ago.
>>> after some days, I see that one of the three mons has stopped(out of
>>> quorum) and I can't start it anymore.
>>> I checked the mon service log and the output shows this error:
>>>
>>> """
>>> mon.XX@-1(probing) e4 preinit clean up potentially inconsistent
>>> store state
>>> rocksdb: submit_transaction_sync error: Corruption: block checksum
>>> mismatch
>>>
>>
>> This bit is the important one. Your disk is bad and it’s feeding back
>> corrupted data.
>>
>>
>>
>>
>>> code = 2 Rocksdb transaction:
>>>  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
>>> LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
>>> MonitorDBStore::clear(std::set&)' thread
>>> 7f45a1e52e40 time 2018-02-16 17:37:07.040846
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
>>> 581: FAILE
>>> D assert(r >= 0)
>>> """
>>>
>>> the only solution I found is to remove this mon from quorum and remove
>>> all mon data and re-add this mon to quorum again.
>>> and ceph goes to the healthy status again.
>>>
>>> but now after some days this mon has stopped and I face the same problem
>>> again.
>>>
>>> My cluster setup is:
>>> 4 osd hosts
>>> total 8 osds
>>> 3 mons
>>> 1 rgw
>>>
>>> this cluster has setup with ceph-volume lvm and wal/db separation on
>>> logical volumes.
>>>
>>> Best regards,
>>> Behnam Loghmani
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
Thanks for your reply.

Do you mean that the problem is with the disk I use for the WAL and DB?

On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum  wrote:

>
> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani 
> wrote:
>
>> Hi there,
>>
>> I have a Ceph cluster version 12.2.2 on CentOS 7.
>>
>> It is a testing cluster and I have set it up 2 weeks ago.
>> after some days, I see that one of the three mons has stopped(out of
>> quorum) and I can't start it anymore.
>> I checked the mon service log and the output shows this error:
>>
>> """
>> mon.XX@-1(probing) e4 preinit clean up potentially inconsistent
>> store state
>> rocksdb: submit_transaction_sync error: Corruption: block checksum
>> mismatch
>>
>
> This bit is the important one. Your disk is bad and it’s feeding back
> corrupted data.
>
>
>
>
>> code = 2 Rocksdb transaction:
>>  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
>> 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
>> centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
>> LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
>> MonitorDBStore::clear(std::set&)' thread
>> 7f45a1e52e40 time 2018-02-16 17:37:07.040846
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
>> 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
>> centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/
>> ceph-12.2.2/src/mon/MonitorDBStore.h: 581: FAILE
>> D assert(r >= 0)
>> """
>>
>> the only solution I found is to remove this mon from quorum and remove
>> all mon data and re-add this mon to quorum again.
>> and ceph goes to the healthy status again.
>>
>> but now after some days this mon has stopped and I face the same problem
>> again.
>>
>> My cluster setup is:
>> 4 osd hosts
>> total 8 osds
>> 3 mons
>> 1 rgw
>>
>> this cluster has setup with ceph-volume lvm and wal/db separation on
>> logical volumes.
>>
>> Best regards,
>> Behnam Loghmani
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon service failed to start

2018-02-16 Thread Gregory Farnum
On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani 
wrote:

> Hi there,
>
> I have a Ceph cluster version 12.2.2 on CentOS 7.
>
> It is a testing cluster and I have set it up 2 weeks ago.
> after some days, I see that one of the three mons has stopped(out of
> quorum) and I can't start it anymore.
> I checked the mon service log and the output shows this error:
>
> """
> mon.XX@-1(probing) e4 preinit clean up potentially inconsistent store
> state
> rocksdb: submit_transaction_sync error: Corruption: block checksum
> mismatch
>

This bit is the important one. Your disk is bad and it’s feeding back
corrupted data.




> code = 2 Rocksdb transaction:
>  0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
> LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
> MonitorDBStore::clear(std::set&)' thread
> 7f45a1e52e40 time 2018-02-16 17:37:07.040846
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
> 581: FAILE
> D assert(r >= 0)
> """
>
> the only solution I found is to remove this mon from quorum and remove all
> mon data and re-add this mon to quorum again.
> and ceph goes to the healthy status again.
>
> but now after some days this mon has stopped and I face the same problem
> again.
>
> My cluster setup is:
> 4 osd hosts
> total 8 osds
> 3 mons
> 1 rgw
>
> this cluster has setup with ceph-volume lvm and wal/db separation on
> logical volumes.
>
> Best regards,
> Behnam Loghmani
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mon service failed to start

2018-02-16 Thread Behnam Loghmani
Hi there,

I have a Ceph cluster version 12.2.2 on CentOS 7.

It is a testing cluster that I set up 2 weeks ago.
After a few days, I saw that one of the three mons had stopped (out of quorum)
and I couldn't start it anymore.
I checked the mon service log, and the output shows this error:

"""
mon.XX@-1(probing) e4 preinit clean up potentially inconsistent store
state
rocksdb: submit_transaction_sync error: Corruption: block checksum mismatch
code = 2 Rocksdb transaction:
 0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUI
LD/ceph-12.2.2/src/mon/MonitorDBStore.h: In function 'void
MonitorDBStore::clear(std::set&)' thread
7f45a1e52e40 time 2018-02-16 17:37:07.040846
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
581: FAILE
D assert(r >= 0)
"""

The only solution I found is to remove this mon from the quorum, delete all of
its mon data, and re-add it to the quorum again; then Ceph returns to a healthy
status.
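
The remove/re-add cycle follows the standard monitor procedure; a sketch,
assuming the mon is named XX and the default /var/lib/ceph paths:

  systemctl stop ceph-mon@XX
  ceph mon remove XX
  rm -rf /var/lib/ceph/mon/ceph-XX
  ceph auth get mon. -o /tmp/mon.keyring
  ceph mon getmap -o /tmp/monmap
  ceph-mon -i XX --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
  chown -R ceph:ceph /var/lib/ceph/mon/ceph-XX
  systemctl start ceph-mon@XX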

But now, after a few more days, this mon has stopped and I face the same
problem again.

My cluster setup is:
4 osd hosts
total 8 osds
3 mons
1 rgw

This cluster was set up with ceph-volume lvm, with WAL/DB separation on
logical volumes.

Best regards,
Behnam Loghmani
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com