Re: [ceph-users] PG_DAMAGED Possible data damage: 1 pg inconsistent
On 22/02/2018 at 05:23, Brad Hubbard wrote:
> On Wed, Feb 21, 2018 at 6:40 PM, Yoann Moulin wrote:
>> Hello,
>>
>> I migrated my cluster from jewel to luminous 3 weeks ago (using the
>> ceph-ansible playbook). A few days after, ceph status told me
>> "PG_DAMAGED Possible data damage: 1 pg inconsistent". I tried to
>> repair the PG without success, then I tried to stop the OSD, flush
>> the journal and restart it, but the OSD refused to start due to a
>> bad journal. I decided to destroy the OSD and recreate it from
>> scratch. After that, everything seemed to be all right, but I just
>> noticed that I have exactly the same error again on the same PG on
>> the same OSD (78).
>>
>> [...]
>>
>> What can I do in that situation?
>
> Take a look and see if http://tracker.ceph.com/issues/21388 is
> relevant, as well as the debugging and advice therein.

Indeed, it looks similar to my issue. I sent a comment directly on the
tracker. Thanks.

Best regards,

--
Yoann Moulin
EPFL IC-IT
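Whatever the outcome on the tracker, the usual way to confirm that the
inconsistency is really gone afterwards is to force a fresh deep scrub
and watch the PG return to a clean state. A minimal sketch, assuming
the PG id 11.5f from the report:

  # Trigger a deep scrub of the PG, then watch cluster events until
  # it reports active+clean again.
  $ ceph pg deep-scrub 11.5f
  $ ceph -w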
Re: [ceph-users] PG_DAMAGED Possible data damage: 1 pg inconsistent
On Wed, Feb 21, 2018 at 6:40 PM, Yoann Moulin wrote:
> Hello,
>
> I migrated my cluster from jewel to luminous 3 weeks ago (using the
> ceph-ansible playbook). A few days after, ceph status told me
> "PG_DAMAGED Possible data damage: 1 pg inconsistent". I tried to
> repair the PG without success, then I tried to stop the OSD, flush
> the journal and restart it, but the OSD refused to start due to a
> bad journal. I decided to destroy the OSD and recreate it from
> scratch. After that, everything seemed to be all right, but I just
> noticed that I have exactly the same error again on the same PG on
> the same OSD (78).
>
> [...]
>
> I set "debug_osd 20/20" on osd.78 and started the repair again; the
> log file is here:
>
> ceph-post-file: 1ccac8ea-0947-4fe4-90b1-32d1048548f1
>
> What can I do in that situation?

Take a look and see if http://tracker.ceph.com/issues/21388 is
relevant, as well as the debugging and advice therein.

> Thanks for your help.

--
Cheers,
Brad
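Beyond the PG-level health summary, scrub records per-object,
per-shard detail that is worth inspecting before deciding how to act
on an error like the omap_digest mismatch above. A minimal sketch of
the standard query, assuming the PG id from the report and a deep
scrub recent enough for its results to still be available:

  # List which objects in the PG were flagged inconsistent, with the
  # omap/data digests recorded for each shard.
  $ rados list-inconsistent-obj 11.5f --format=json-pretty

  # The PG-level summary shown earlier in the thread:
  $ ceph health detail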
[ceph-users] PG_DAMAGED Possible data damage: 1 pg inconsistent
Hello,

I migrated my cluster from jewel to luminous 3 weeks ago (using the
ceph-ansible playbook). A few days after, ceph status told me
"PG_DAMAGED Possible data damage: 1 pg inconsistent". I tried to
repair the PG without success, then I tried to stop the OSD, flush the
journal and restart it, but the OSD refused to start due to a bad
journal. I decided to destroy the OSD and recreate it from scratch.
After that, everything seemed to be all right, but I just noticed that
I have exactly the same error again on the same PG on the same OSD
(78).

> $ ceph health detail
> HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 3 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>     pg 11.5f is active+clean+inconsistent, acting [78,154,170]

> $ ceph -s
>   cluster:
>     id:     f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
>     health: HEALTH_ERR
>             3 scrub errors
>             Possible data damage: 1 pg inconsistent
>
>   services:
>     mon: 3 daemons, quorum iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
>     mgr: iccluster001(active), standbys: iccluster009, iccluster017
>     mds: cephfs-3/3/3 up {0=iccluster022.iccluster.epfl.ch=up:active,1=iccluster006.iccluster.epfl.ch=up:active,2=iccluster014.iccluster.epfl.ch=up:active}
>     osd: 180 osds: 180 up, 180 in
>     rgw: 6 daemons active
>
>   data:
>     pools:   29 pools, 10432 pgs
>     objects: 82862k objects, 171 TB
>     usage:   515 TB used, 465 TB / 980 TB avail
>     pgs:     10425 active+clean
>              6     active+clean+scrubbing+deep
>              1     active+clean+inconsistent
>
>   io:
>     client: 21538 B/s wr, 0 op/s rd, 33 op/s wr

> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

Short log:

> 2018-02-21 09:08:33.408396 7fb7b8222700  0 log_channel(cluster) log [DBG] : 11.5f repair starts
> 2018-02-21 09:08:33.727277 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f shard 78: soid 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727290 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f shard 154: soid 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727293 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f shard 170: soid 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head omap_digest 0x29fdd712 != omap_digest 0xd46bb5a1 from auth oi 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head(98394'20014544 osd.78.0:1623704 dirty|omap|data_digest|omap_digest s 0 uv 20014543 dd od d46bb5a1 alloc_hint [0 0 0])
> 2018-02-21 09:08:33.727295 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f soid 11:fb71fe10:::.dir.c9724aff-5fa0-4dd9-b494-57bdb48fab4e.314528.19:head: failed to pick suitable auth object
> 2018-02-21 09:08:33.727333 7fb7b8222700 -1 log_channel(cluster) log [ERR] : 11.5f repair 3 errors, 0 fixed

I set "debug_osd 20/20" on osd.78 and started the repair again; the
log file is here:

ceph-post-file: 1ccac8ea-0947-4fe4-90b1-32d1048548f1

What can I do in that situation?

Thanks for your help.

--
Yoann Moulin
EPFL IC-IT
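For reference, the troubleshooting steps described above map onto a
handful of standard commands. A minimal sketch, assuming a FileStore
OSD (journal flushing does not apply to BlueStore) and the default
systemd unit name and log path, which may differ on other deployments:

  # Ask the PG's primary to repair the inconsistency (tried first).
  $ ceph pg repair 11.5f

  # Stop the OSD, flush its FileStore journal, and restart it.
  $ systemctl stop ceph-osd@78
  $ ceph-osd -i 78 --flush-journal
  $ systemctl start ceph-osd@78

  # Raise debug logging on osd.78, re-run the repair, then upload the
  # resulting log for the developers; ceph-post-file prints an id like
  # the one quoted above. The log path is the default and may differ.
  $ ceph tell osd.78 injectargs '--debug_osd 20/20'
  $ ceph pg repair 11.5f
  $ ceph-post-file /var/log/ceph/ceph-osd.78.log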