Re: [ceph-users] mds daemon damaged

2018-07-13 Thread Dan van der Ster
Hi Kevin,

Are your OSDs bluestore or filestore?
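
For example, the type shows up in the OSD metadata dump (osd.1 here only
because it's the one in your log; any OSD id works):

# ceph osd metadata 1 | grep osd_objectstore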

-- dan

On Thu, Jul 12, 2018 at 11:30 PM Kevin  wrote:
>
> Sorry for the long posting but trying to cover everything
>
> I woke up to find my cephfs filesystem down. This was in the logs
>
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc
> 0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head
>
> I had one standby MDS, but as far as I can tell it did not fail over.
> This was in the logs
>
> (insufficient standby MDS daemons available)
>
> Currently my ceph looks like this
>   cluster:
>     id: ..
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 mds daemon damaged
>
>   services:
>     mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
>     mgr: ids27(active)
>     mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
>     osd: 5 osds: 5 up, 5 in
>
>   data:
>     pools:   3 pools, 202 pgs
>     objects: 1013k objects, 4018 GB
>     usage:   12085 GB used, 6544 GB / 18630 GB avail
>     pgs:     201 active+clean
>              1   active+clean+scrubbing+deep
>
>   io:
>     client:   0 B/s rd, 0 op/s rd, 0 op/s wr
>
> I started trying to get the damaged MDS back online
>
> Based on this page
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is
> unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not
> readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
>
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is
> unreadable
> Errors: 0
>
> cephfs-journal-tool journal reset - (I think this command might have
> worked)
>
> Next up, tried to reset the filesystem
>
> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>
> Each time same errors
>
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared:
> MDS_DAMAGE (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27
> assigned to filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal
> 0x200: (5) Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds
> daemon damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc
> 0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1
> filesystem is degraded; 1 mds daemon damaged
>
> Tried to 'fail' mds.ds27
> # ceph mds fail ds27
> # failed mds gid 1929168
>
> Command worked, but each time I run the reset command the same errors
> above appear
>
> Online searches say the object read error has to be removed. But there's
> no object listed. This web page is the closest to the issue
> http://tracker.ceph.com/issues/20863
>
> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
> completes but still have the same issue above
>
> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby
> and has data it should become live. If it was not
> I assume we will lose the filesystem at this point
>
> Why didn't the standby MDS failover?
>
> Just looking for any way to recover the cephfs, thanks!
>


Re: [ceph-users] mds daemon damaged

2018-07-13 Thread Oliver Freyermuth
Hi Kevin,

On 13.07.2018 at 04:21, Kevin wrote:
> That thread looks exactly like what I'm experiencing. Not sure why my 
> repeated googles didn't find it!

maybe the thread was still too "fresh" for Google's indexing. 

> 
> I'm running 12.2.6 and CentOS 7
> 
> And yes, I recently upgraded from jewel to luminous following the 
> instructions of changing the repo and then updating. Everything has been 
> working fine up until this point
> 
> Given that previous thread I feel at a bit of a loss as to what to try now 
> since that thread ended with no resolution I could see.

I hope the thread is still continuing, given that another affected person just 
commented on it. 
We had also planned to upgrade our production cluster to 12.2.6 (also on 
CentOS 7) over the weekend, since we are affected by two Ceph-fuse bugs that 
have been causing inconsistent directory contents for months and are fixed in 
12.2.6. But given this situation, we'd rather live with that a bit longer and 
hold off on the update... 

> 
> Thanks for pointing that out though, it seems like almost the exact same 
> situation
> 
> On 2018-07-12 18:23, Oliver Freyermuth wrote:
>> Hi,
>>
>> all this sounds an awful lot like:
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/027992.html
>> In that case, things started with an update to 12.2.6. Which version
>> are you running?
>>
>> Cheers,
>> Oliver
>>
>> On 12.07.2018 at 23:30, Kevin wrote:
>>> Sorry for the long posting but trying to cover everything
>>>
>>> I woke up to find my cephfs filesystem down. This was in the logs
>>>
>>> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a 
>>> != expected 0x1c08241c on 2:292cf221:::200.:head
>>>
>>> I had one standby MDS, but as far as I can tell it did not fail over. This 
>>> was in the logs
>>>
>>> (insufficient standby MDS daemons available)
>>>
>>> Currently my ceph looks like this
>>>   cluster:
>>>     id: ..
>>>     health: HEALTH_ERR
>>>     1 filesystem is degraded
>>>     1 mds daemon damaged
>>>
>>>   services:
>>>     mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
>>>     mgr: ids27(active)
>>>     mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
>>>     osd: 5 osds: 5 up, 5 in
>>>
>>>   data:
>>>     pools:   3 pools, 202 pgs
>>>     objects: 1013k objects, 4018 GB
>>>     usage:   12085 GB used, 6544 GB / 18630 GB avail
>>>     pgs: 201 active+clean
>>>  1   active+clean+scrubbing+deep
>>>
>>>   io:
>>>     client:   0 B/s rd, 0 op/s rd, 0 op/s wr
>>>
>>> I started trying to get the damaged MDS back online
>>>
>>> Based on this page 
>>> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>>>
>>> # cephfs-journal-tool journal export backup.bin
>>> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
>>> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not 
>>> readable, attempt object-by-object dump with `rados`
>>> Error ((5) Input/output error)
>>>
>>> # cephfs-journal-tool event recover_dentries summary
>>> Events by type:
>>> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is 
>>> unreadable
>>> Errors: 0
>>>
>>> cephfs-journal-tool journal reset - (I think this command might have worked)
>>>
>>> Next up, tried to reset the filesystem
>>>
>>> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>>>
>>> Each time same errors
>>>
>>> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE 
>>> (was: 1 mds daemon damaged)
>>> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned 
>>> to filesystem test-cephfs-1 as rank 0
>>> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: 
>>> (5) Input/output error
>>> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon 
>>> damaged (MDS_DAMAGE)
>>> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a 
>>> != expected 0x1c08241c on 2:292cf221:::200.:head
>>> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem 
>>> is degraded; 1 mds daemon damaged
>>>
>>> Tried to 'fail' mds.ds27
>>> # ceph mds fail ds27
>>> # failed mds gid 1929168
>>>
>>> Command worked, but each time I run the reset command the same errors above 
>>> appear
>>>
>>> Online searches say the object read error has to be removed. But there's no 
>>> object listed. This web page is the closest to the issue
>>> http://tracker.ceph.com/issues/20863
>>>
>>> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it 
>>> completes but still have the same issue above
>>>
>>> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and 
>>> has data it should become live. If it was not
>>> I assume we will lose the filesystem at this point
>>>
>>> Why didn't the standby MDS failover?
>>>
>>> Just looking for any way to recover the cephfs, thanks!
>>>
>>> 

Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Oliver Freyermuth
Hi,

all this sounds an awful lot like:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-July/027992.html
In that case, things started with an update to 12.2.6. Which version are you 
running? 
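
A quick way to check all daemons at once, in case some have already been
restarted on newer packages:

# ceph versions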

Cheers,
Oliver

On 12.07.2018 at 23:30, Kevin wrote:
> Sorry for the long posting but trying to cover everything
> 
> I woke up to find my cephfs filesystem down. This was in the logs
> 
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a != 
> expected 0x1c08241c on 2:292cf221:::200.:head
> 
> I had one standby MDS, but as far as I can tell it did not fail over. This 
> was in the logs
> 
> (insufficient standby MDS daemons available)
> 
> Currently my ceph looks like this
>   cluster:
>     id: ..
>     health: HEALTH_ERR
>     1 filesystem is degraded
>     1 mds daemon damaged
> 
>   services:
>     mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
>     mgr: ids27(active)
>     mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
>     osd: 5 osds: 5 up, 5 in
> 
>   data:
>     pools:   3 pools, 202 pgs
>     objects: 1013k objects, 4018 GB
>     usage:   12085 GB used, 6544 GB / 18630 GB avail
>     pgs: 201 active+clean
>  1   active+clean+scrubbing+deep
> 
>   io:
>     client:   0 B/s rd, 0 op/s rd, 0 op/s wr
> 
> I started trying to get the damaged MDS back online
> 
> Based on this page 
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> 
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not 
> readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
> 
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is 
> unreadable
> Errors: 0
> 
> cephfs-journal-tool journal reset - (I think this command might have worked)
> 
> Next up, tried to reset the filesystem
> 
> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
> 
> Each time same errors
> 
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE 
> (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned to 
> filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200: (5) 
> Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon 
> damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a != 
> expected 0x1c08241c on 2:292cf221:::200.:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is 
> degraded; 1 mds daemon damaged
> 
> Tried to 'fail' mds.ds27
> # ceph mds fail ds27
> # failed mds gid 1929168
> 
> Command worked, but each time I run the reset command the same errors above 
> appear
> 
> Online searches say the object read error has to be removed. But there's no 
> object listed. This web page is the closest to the issue
> http://tracker.ceph.com/issues/20863
> 
> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it 
> completes but still have the same issue above
> 
> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and 
> has data it should become live. If it was not
> I assume we will lose the filesystem at this point
> 
> Why didn't the standby MDS failover?
> 
> Just looking for any way to recover the cephfs, thanks!
> 


Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Patrick Donnelly
On Thu, Jul 12, 2018 at 3:55 PM, Patrick Donnelly  wrote:
>> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
>> completes but still have the same issue above
>>
>> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and
>> has data it should become live. If it was not
>> I assume we will lose the filesystem at this point
>>
>> Why didn't the standby MDS failover?
>>
>> Just looking for any way to recover the cephfs, thanks!
>
> I think it's time to do a scrub on the PG containing that object.

Sorry, I didn't read the part of the email that said you did that :) Did
you confirm that, after the deep scrub finished, the PG is
active+clean? It looks like you're still scrubbing that PG.
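
Something like this should show the current PG state and when the last deep
scrub actually completed (2.4 taken from your log):

# ceph pg 2.4 query | grep -E '"state"|last_deep_scrub_stamp'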

-- 
Patrick Donnelly


Re: [ceph-users] mds daemon damaged

2018-07-12 Thread Patrick Donnelly
On Thu, Jul 12, 2018 at 2:30 PM, Kevin  wrote:
> Sorry for the long posting but trying to cover everything
>
> I woke up to find my cephfs filesystem down. This was in the logs
>
> 2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 0x6fc2f65a
> != expected 0x1c08241c on 2:292cf221:::200.:head

Since this came from the OSD, you should look at resolving that
problem. What you've done below is blow the journal away, which hasn't
helped you any, because (a) your journal is now probably lost without a
lot of manual intervention, and (b) the "new" journal is still written
to the same bad backing device/file, so it's probably still unusable,
as you found out.
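
If you want to see where that journal object actually lives, something along
these lines should work; the pool name and the full object name are
placeholders here since they're truncated in your paste (the rank-0 journal
lives in the 200.* objects of the metadata pool):

# rados -p <metadata-pool> ls | grep '^200\.'
# ceph osd map <metadata-pool> <200.* header object from the listing>
# rados -p <metadata-pool> stat <200.* header object from the listing>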

> I had one standby MDS, but as far as I can tell it did not fail over. This
> was in the logs

If a rank becomes damaged, standbys will not take over. You must mark
it repaired first.
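
i.e., once the underlying object is readable again, something like:

# ceph mds repaired test-cephfs-1:0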

> (insufficient standby MDS daemons available)
>
> Currently my ceph looks like this
>   cluster:
>     id: ..
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 mds daemon damaged
>
>   services:
>     mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
>     mgr: ids27(active)
>     mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
>     osd: 5 osds: 5 up, 5 in
>
>   data:
>     pools:   3 pools, 202 pgs
>     objects: 1013k objects, 4018 GB
>     usage:   12085 GB used, 6544 GB / 18630 GB avail
>     pgs:     201 active+clean
>              1   active+clean+scrubbing+deep
>
>   io:
>     client:   0 B/s rd, 0 op/s rd, 0 op/s wr
>
> I started trying to get the damaged MDS back online
>
> Based on this page
> http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> # cephfs-journal-tool journal export backup.bin
> 2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is unreadable
> 2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not
> readable, attempt object-by-object dump with `rados`
> Error ((5) Input/output error)
>
> # cephfs-journal-tool event recover_dentries summary
> Events by type:
> 2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is
> unreadable
> Errors: 0
>
> cephfs-journal-tool journal reset - (I think this command might have worked)
>
> Next up, tried to reset the filesystem
>
> ceph fs reset test-cephfs-1 --yes-i-really-mean-it
>
> Each time same errors
>
> 2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: MDS_DAMAGE
> (was: 1 mds daemon damaged)
> 2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 assigned
> to filesystem test-cephfs-1 as rank 0
> 2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 0x200:
> (5) Input/output error
> 2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds daemon
> damaged (MDS_DAMAGE)
> 2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 0x6fc2f65a
> != expected 0x1c08241c on 2:292cf221:::200.:head
> 2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 filesystem is
> degraded; 1 mds daemon damaged
>
> Tried to 'fail' mds.ds27
> # ceph mds fail ds27
> # failed mds gid 1929168
>
> Command worked, but each time I run the reset command the same errors above
> appear
>
> Online searches say the object read error has to be removed. But there's no
> object listed. This web page is the closest to the issue
> http://tracker.ceph.com/issues/20863
>
> Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it
> completes but still have the same issue above
>
> Final option is to attempt removing mds.ds27. If mds.ds29 was a standby and
> has data it should become live. If it was not
> I assume we will lose the filesystem at this point
>
> Why didn't the standby MDS failover?
>
> Just looking for any way to recover the cephfs, thanks!

I think it's time to do a scrub on the PG containing that object.
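
Roughly, with 2.4 from your log:

# ceph pg deep-scrub 2.4

and once it has finished, check whether the scrub recorded any inconsistencies
before deciding on anything further:

# rados list-inconsistent-obj 2.4 --format=json-pretty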

-- 
Patrick Donnelly


[ceph-users] mds daemon damaged

2018-07-12 Thread Kevin

Sorry for the long posting but trying to cover everything

I woke up to find my cephfs filesystem down. This was in the logs

2018-07-11 05:54:10.398171 osd.1 [ERR] 2.4 full-object read crc 
0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head


I had one standby MDS, but as far as I can tell it did not fail over. 
This was in the logs


(insufficient standby MDS daemons available)
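
For reference, the damaged rank and the remaining standbys can also be listed
with, e.g.:

# ceph mds stat
# ceph fs dump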

Currently my ceph looks like this
  cluster:
    id: ..
    health: HEALTH_ERR
            1 filesystem is degraded
            1 mds daemon damaged

  services:
    mon: 6 daemons, quorum ds26,ds27,ds2b,ds2a,ds28,ds29
    mgr: ids27(active)
    mds: test-cephfs-1-0/1/1 up , 3 up:standby, 1 damaged
    osd: 5 osds: 5 up, 5 in

  data:
    pools:   3 pools, 202 pgs
    objects: 1013k objects, 4018 GB
    usage:   12085 GB used, 6544 GB / 18630 GB avail
    pgs:     201 active+clean
             1   active+clean+scrubbing+deep

  io:
    client:   0 B/s rd, 0 op/s rd, 0 op/s wr

I started trying to get the damaged MDS back online

Based on this page 
http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#disaster-recovery-experts


# cephfs-journal-tool journal export backup.bin
2018-07-12 13:35:15.675964 7f3e1389bf00 -1 Header 200. is 
unreadable
2018-07-12 13:35:15.675977 7f3e1389bf00 -1 journal_export: Journal not 
readable, attempt object-by-object dump with `rados`

Error ((5) Input/output error)

# cephfs-journal-tool event recover_dentries summary
Events by type:
2018-07-12 13:36:03.000590 7fc398a18f00 -1 Header 200. is 
unreadable
Errors: 0


cephfs-journal-tool journal reset - (I think this command might have 
worked)


Next up, tried to reset the filesystem

ceph fs reset test-cephfs-1 --yes-i-really-mean-it

Each time same errors

2018-07-12 11:56:35.760449 mon.ds26 [INF] Health check cleared: 
MDS_DAMAGE (was: 1 mds daemon damaged)
2018-07-12 11:56:35.856737 mon.ds26 [INF] Standby daemon mds.ds27 
assigned to filesystem test-cephfs-1 as rank 0
2018-07-12 11:56:35.947801 mds.ds27 [ERR] Error recovering journal 
0x200: (5) Input/output error
2018-07-12 11:56:36.900807 mon.ds26 [ERR] Health check failed: 1 mds 
daemon damaged (MDS_DAMAGE)
2018-07-12 11:56:35.945544 osd.0 [ERR] 2.4 full-object read crc 
0x6fc2f65a != expected 0x1c08241c on 2:292cf221:::200.:head
2018-07-12 12:00:00.000142 mon.ds26 [ERR] overall HEALTH_ERR 1 
filesystem is degraded; 1 mds daemon damaged


Tried to 'fail' mds.ds27
# ceph mds fail ds27
# failed mds gid 1929168

Command worked, but each time I run the reset command the same errors 
above appear


Online searches say the object read error has to be removed. But there's 
no object listed. This web page is the closest to the issue

http://tracker.ceph.com/issues/20863

Recommends fixing error by hand. Tried running deep scrub on pg 2.4, it 
completes but still have the same issue above
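
The crc error above was reported by both osd.0 and osd.1; which OSDs actually
hold pg 2.4 can be double-checked with:

# ceph pg map 2.4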


Final option is to attempt removing mds.ds27. If mds.ds29 was a standby 
and has data it should become live. If it was not

I assume we will lose the filesystem at this point

Why didn't the standby MDS failover?

Just looking for any way to recover the cephfs, thanks!
