Re: [ceph-users] PG_DAMAGED Possible data damage: 1 pg inconsistent

2018-02-22 Thread Yoann Moulin
On 22/02/2018 at 05:23, Brad Hubbard wrote:
> On Wed, Feb 21, 2018 at 6:40 PM, Yoann Moulin  wrote:
>> Hello,
>>
>> I migrated my cluster from Jewel to Luminous 3 weeks ago (using the
>> ceph-ansible playbook). A few days later, ceph status reported "PG_DAMAGED
>> Possible data damage: 1 pg inconsistent". I tried to repair the PG without
>> success, then tried to stop the OSD, flush the journal and restart the
>> OSD, but it refused to start due to a bad journal. I decided to destroy
>> the OSD and recreate it from scratch. After that, everything seemed
>> to be all right, but I have just noticed that exactly the same error is back
>> on the same PG and the same OSD (78).
>>
>>> $ ceph health detail
>>> HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
>>> OSD_SCRUB_ERRORS 3 scrub errors
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>> pg 11.5f is active+clean+inconsistent, acting [78,154,170]
>>
>>> $ ceph -s
>>>   cluster:
>>> id: f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
>>> health: HEALTH_ERR
>>> 3 scrub errors
>>> Possible data damage: 1 pg inconsistent
>>>
>>>   services:
>>> mon: 3 daemons, quorum 
>>> iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
>>> mgr: iccluster001(active), standbys: iccluster009, iccluster017
>>> mds: cephfs-3/3/3 up  
>>> {0=iccluster022.iccluster.epfl.ch=up:active,1=iccluster006.iccluster.epfl.ch=up:active,2=iccluster014.iccluster.epfl.ch=up:active}
>>> osd: 180 osds: 180 up, 180 in
>>> rgw: 6 daemons active
>>>
>>>   data:
>>> pools:   29 pools, 10432 pgs
>>> objects: 82862k objects, 171 TB
>>> usage:   515 TB used, 465 TB / 980 TB avail
>>> pgs: 10425 active+clean
>>>  6 active+clean+scrubbing+deep
>>>  1 active+clean+inconsistent
>>>
>>>   io:
>>> client:   21538 B/s wr, 0 op/s rd, 33 op/s wr
>>
>>> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
>>> (stable)
>>
>> Short log :
>>
>>> 2018-02-21 09:08:33.408396 7fb7b8222700  0 log_channel(cluster) log [DBG] : 
>>> 11.5f repair starts
>>> 2018-02-21 09:08:33.727277 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>>> 11.5f shard 78: soid 
>>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9- 
>>> b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 
>>> dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd  od d46bb5a1 
>>> alloc_hint [0 0 0])
>>> 2018-02-21 09:08:33.727290 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>>> 11.5f shard 154: soid 
>>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>>>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>>>  od d46bb5a1 alloc_hint [0 0 0])
>>> 2018-02-21 09:08:33.727293 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>>> 11.5f shard 170: soid 
>>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>>>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>>>  od d46bb5a1 alloc_hint [0 0 0])
>>> 2018-02-21 09:08:33.727295 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>>> 11.5f soid 
>>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head: 
>>> failed to pick suitable auth object
>>> 2018-02-21 09:08:33.727333 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>>> 11.5f repair 3 errors, 0 fixed
>>
>> I set "debug_osd 20/20" on osd.78 and start the repair again, the log file 
>> is here :
>>
>> ceph-post-file: 1ccac8ea-0947-4fe4-90b1-32d1048548f1
>>
>> What can I do in this situation?
> 
> Take a look and see if http://tracker.ceph.com/issues/21388 is
> relevant as well as the debugging and advice therein.

Indeed, it looks similar to my issue.

I posted a comment directly on the tracker. Thanks.

Best regards,

-- 
Yoann Moulin
EPFL IC-IT


Re: [ceph-users] PG_DAMAGED Possible data damage: 1 pg inconsistent

2018-02-21 Thread Brad Hubbard
On Wed, Feb 21, 2018 at 6:40 PM, Yoann Moulin  wrote:
> Hello,
>
> I migrated my cluster from Jewel to Luminous 3 weeks ago (using the
> ceph-ansible playbook). A few days later, ceph status reported "PG_DAMAGED
> Possible data damage: 1 pg inconsistent". I tried to repair the PG without
> success, then tried to stop the OSD, flush the journal and restart the
> OSD, but it refused to start due to a bad journal. I decided to destroy
> the OSD and recreate it from scratch. After that, everything seemed
> to be all right, but I have just noticed that exactly the same error is back
> on the same PG and the same OSD (78).
>
>> $ ceph health detail
>> HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
>> OSD_SCRUB_ERRORS 3 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 11.5f is active+clean+inconsistent, acting [78,154,170]
>
>> $ ceph -s
>>   cluster:
>> id: f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
>> health: HEALTH_ERR
>> 3 scrub errors
>> Possible data damage: 1 pg inconsistent
>>
>>   services:
>> mon: 3 daemons, quorum 
>> iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
>> mgr: iccluster001(active), standbys: iccluster009, iccluster017
>> mds: cephfs-3/3/3 up  
>> {0=iccluster022.iccluster.epfl.ch=up:active,1=iccluster006.iccluster.epfl.ch=up:active,2=iccluster014.iccluster.epfl.ch=up:active}
>> osd: 180 osds: 180 up, 180 in
>> rgw: 6 daemons active
>>
>>   data:
>> pools:   29 pools, 10432 pgs
>> objects: 82862k objects, 171 TB
>> usage:   515 TB used, 465 TB / 980 TB avail
>> pgs: 10425 active+clean
>>  6 active+clean+scrubbing+deep
>>  1 active+clean+inconsistent
>>
>>   io:
>> client:   21538 B/s wr, 0 op/s rd, 33 op/s wr
>
>> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
>> (stable)
>
> Short log :
>
>> 2018-02-21 09:08:33.408396 7fb7b8222700  0 log_channel(cluster) log [DBG] : 
>> 11.5f repair starts
>> 2018-02-21 09:08:33.727277 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f shard 78: soid 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9- 
>> b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 
>> dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd  od d46bb5a1 
>> alloc_hint [0 0 0])
>> 2018-02-21 09:08:33.727290 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f shard 154: soid 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>>  od d46bb5a1 alloc_hint [0 0 0])
>> 2018-02-21 09:08:33.727293 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f shard 170: soid 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
>> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>>  od d46bb5a1 alloc_hint [0 0 0])
>> 2018-02-21 09:08:33.727295 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f soid 
>> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head: 
>> failed to pick suitable auth object
>> 2018-02-21 09:08:33.727333 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
>> 11.5f repair 3 errors, 0 fixed
>
> I set "debug_osd 20/20" on osd.78 and start the repair again, the log file is 
> here :
>
> ceph-post-file: 1ccac8ea-0947-4fe4-90b1-32d1048548f1
>
> What can I do in this situation?

Take a look and see if http://tracker.ceph.com/issues/21388 is
relevant as well as the debugging and advice therein.
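
For reference, a generic way to see which shards and digests disagree (a sketch
only, not necessarily the exact procedure from that tracker issue) is to dump
the inconsistency report recorded by the last scrub of the pg:

$ rados list-inconsistent-obj 11.5f --format=json-pretty
  # lists, per object, each shard's data_digest/omap_digest and the errors
  # flagged by the scrub, which helps judge whether this matches issue 21388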

>
> Thanks for your help.
>
> --
> Yoann Moulin
> EPFL IC-IT



-- 
Cheers,
Brad


[ceph-users] PG_DAMAGED Possible data damage: 1 pg inconsistent

2018-02-21 Thread Yoann Moulin
Hello,

I migrated my cluster from Jewel to Luminous 3 weeks ago (using the
ceph-ansible playbook). A few days later, ceph status reported "PG_DAMAGED
Possible data damage: 1 pg inconsistent". I tried to repair the PG without
success, then tried to stop the OSD, flush the journal and restart the
OSD, but it refused to start due to a bad journal. I decided to destroy the
OSD and recreate it from scratch. After that, everything seemed
to be all right, but I have just noticed that exactly the same error is back
on the same PG and the same OSD (78).
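
For reference, the steps above correspond roughly to the following commands (a
sketch only; systemd OSD units, FileStore journals, OSD id 78 and the pg id
reported below are assumed):

$ ceph pg repair 11.5f              # ask the primary to repair the pg
$ systemctl stop ceph-osd@78        # stop the OSD cleanly
$ ceph-osd -i 78 --flush-journal    # flush the FileStore journal to the store
$ systemctl start ceph-osd@78       # bring the OSD back up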

> $ ceph health detail
> HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 3 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 11.5f is active+clean+inconsistent, acting [78,154,170]
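
To re-check the pg after any change, a deep scrub can be scheduled on the pg id
reported above and the health inspected again once it has finished:

$ ceph pg deep-scrub 11.5f
$ ceph health detail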

> $ ceph -s
>   cluster:
> id: f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
> health: HEALTH_ERR
> 3 scrub errors
> Possible data damage: 1 pg inconsistent
>  
>   services:
> mon: 3 daemons, quorum 
> iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
> mgr: iccluster001(active), standbys: iccluster009, iccluster017
> mds: cephfs-3/3/3 up  
> {0=iccluster022.iccluster.epfl.ch=up:active,1=iccluster006.iccluster.epfl.ch=up:active,2=iccluster014.iccluster.epfl.ch=up:active}
> osd: 180 osds: 180 up, 180 in
> rgw: 6 daemons active
>  
>   data:
> pools:   29 pools, 10432 pgs
> objects: 82862k objects, 171 TB
> usage:   515 TB used, 465 TB / 980 TB avail
> pgs: 10425 active+clean
>  6 active+clean+scrubbing+deep
>  1 active+clean+inconsistent
>  
>   io:
> client:   21538 B/s wr, 0 op/s rd, 33 op/s wr

> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
> (stable)

Short log :

> 2018-02-21 09:08:33.408396 7fb7b8222700  0 log_channel(cluster) log [DBG] : 
> 11.5f repair starts
> 2018-02-21 09:08:33.727277 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f shard 78: soid 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9- 
> b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 
> dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd  od d46bb5a1 
> alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727290 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f shard 154: soid 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>  od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727293 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f shard 170: soid 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head 
> omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544
>  osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd 
>  od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727295 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f soid 
> 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head: 
> failed to pick suitable auth object
> 2018-02-21 09:08:33.727333 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 
> 11.5f repair 3 errors, 0 fixed
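
The object named in the log looks like an RGW bucket index object, so the
mismatch is in its omap. Its omap can be dumped through the acting primary with
rados (the bucket index pool name below is the default one and only an
assumption here):

$ rados -p default.rgw.buckets.index listomapkeys .dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19
$ rados -p default.rgw.buckets.index listomapvals .dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19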

I set "debug_osd 20/20" on osd.78 and start the repair again, the log file is 
here :

ceph-post-file: 1ccac8ea-0947-4fe4-90b1-32d1048548f1
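
For reference, a sketch of the commands matching the description above (the
exact invocation may have differed, and the log path is the default one, so
only an assumption):

$ ceph tell osd.78 injectargs '--debug_osd 20/20'   # raise the OSD debug level at runtime
$ ceph pg repair 11.5f                              # start the repair again
$ ceph-post-file /var/log/ceph/ceph-osd.78.log      # upload the resulting log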

What can I do in this situation?

Thanks for your help.

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com