Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())
Excuse the top-posting. When looking at the logs it helps to filter by the
actual thread that crashed.

$ grep 7f08af3b6700 ceph-osd.27.log.last.error.txt | tail -15
-1001> 2019-10-30 12:55:41.498823 7f08af3b6700 1 -- 129.20.199.93:6803/977508 --> 129.20.199.7:0/2975967502 -- osd_op_reply(283046730 rbd_data.384d296b8b4567.0f99 [set-alloc-hint object_size 4194304 write_size 4194304,write 3145728~4096] v194345'6696469 uv6696469 ondisk = 0) v8 -- 0x5598ed521440 con 0
 -651> 2019-10-30 12:55:42.211634 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
 -565> 2019-10-30 12:55:42.775786 7f08af3b6700 1 -- 129.20.177.3:6802/977508 --> 129.20.177.2:6823/3002168 -- MOSDScrubReserve(5.2d8 REJECT e194345) v1 -- 0x5598ed7e4000 con 0
 -457> 2019-10-30 12:55:43.390134 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 194345'4406723, trimmed: , trimmed_dups: , clear_divergent_priors: 0
 -435> 2019-10-30 12:55:43.850768 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 194345'1735861, trimmed: , trimmed_dups: , clear_divergent_priors: 0
 -335> 2019-10-30 12:55:44.637635 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 194345'7602452, trimmed: , trimmed_dups: , clear_divergent_priors: 0
 -325> 2019-10-30 12:55:44.682357 7f08af3b6700 1 -- 129.20.177.3:6802/977508 --> 129.20.177.1:6802/3802 -- osd_repop(client.108792126.1:283046901 6.369 e194345/194339 6:96f81e66:::rbd_data.384d296b8b4567.0f99:head v 194345'6696470) v2 -- 0x5598ee591600 con 0
 -324> 2019-10-30 12:55:44.682450 7f08af3b6700 1 -- 129.20.177.3:6802/977508 --> 129.20.177.2:6821/6004637 -- osd_repop(client.108792126.1:283046901 6.369 e194345/194339 6:96f81e66:::rbd_data.384d296b8b4567.0f99:head v 194345'6696470) v2 -- 0x5598cf2ad600 con 0
 -323> 2019-10-30 12:55:44.682510 7f08af3b6700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 194345'6696470, trimmed: , trimmed_dups: , clear_divergent_priors: 0
  -20> 2019-10-30 12:55:46.366704 7f08af3b6700 1 -- 129.20.177.3:6802/977508 --> 129.20.177.2:6806/1848108 -- pg_scan(digest 2.1d9 2:9b97b661:::rb.0.a7bb39.238e1f29.00107c9b:head-MAX e 194345/194345) v2 -- 0x5598efc0bb80 con 0
    0> 2019-10-30 12:55:46.496423 7f08af3b6700 -1 /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f08af3b6700 time 2019-10-30 12:55:46.487842
2019-10-30 12:55:46.557930 7f08af3b6700 -1 *** Caught signal (Aborted) ** in thread 7f08af3b6700 thread_name:tp_osd_tp
    0> 2019-10-30 12:55:46.557930 7f08af3b6700 -1 *** Caught signal (Aborted) ** in thread 7f08af3b6700 thread_name:tp_osd_tp

Since PrimaryLogPG::on_local_recover() prints the object id when the function
is entered at debug level 10, I'd suggest gathering a log at a higher
'debug_osd' level (I'd suggest 20) to be sure about which object is causing
the issue.
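For example, something like this should do it at runtime (a sketch; if the OSD
aborts again right at startup, setting 'debug osd = 20' in the [osd] section of
ceph.conf and restarting the daemon achieves the same):

# Raise the debug level on the crashing OSD (osd.27 per the log above),
# let it hit the recovery path again, then collect /var/log/ceph/ceph-osd.27.log.
$ ceph tell osd.27 injectargs '--debug_osd 20/20'

# Level 20 is very verbose, so revert once the crash has been captured.
$ ceph tell osd.27 injectargs '--debug_osd 1/5'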
334 void PrimaryLogPG::on_local_recover(
335   const hobject_t &hoid,
336   const ObjectRecoveryInfo &_recovery_info,
337   ObjectContextRef obc,
338   bool is_delete,
339   ObjectStore::Transaction *t
340   )
341 {
342   dout(10) << __func__ << ": " << hoid << dendl;

On Wed, Oct 30, 2019 at 11:43 PM Jérémy Gardais wrote:
>
> The "best" health I was able to get was:
> HEALTH_ERR norecover flag(s) set; 1733/37482459 objects misplaced (0.005%); 5
> scrub errors; Possible data damage: 2 pgs inconsistent; Degraded data
> redundancy: 7461/37482459 objects degraded (0.020%), 24 pgs degraded, 2 pgs
> undersized
> OSDMAP_FLAGS norecover flag(s) set
> OBJECT_MISPLACED 1733/37482459 objects misplaced (0.005%)
> OSD_SCRUB_ERRORS 5 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>     pg 2.2ba is active+clean+inconsistent, acting [42,29,30]
>     pg 2.2bb is active+clean+inconsistent, acting [25,42,18]
> PG_DEGRADED Degraded data redundancy: 7461/37482459 objects degraded
> (0.020%), 24 pgs degraded, 2 pgs undersized
>     pg 2.3e is active+recovery_wait+degraded, acting [27,31,5]
>     pg 2.9d is active+recovery_wait+degraded, acting [27,22,37]
>     pg 2.a3 is active+recovery_wait+degraded, acting [27,30,35]
>     pg 2.136 is active+recovery_wait+degraded, acting [27,18,22]
>     pg 2.150 is active+recovery_wait+degraded, acting [27,19,35]
>     pg 2.15e is active+recovery_wait+degraded, acting [27,11,36]
>     pg 2.1d9 is stuck undersized for 14023.243179, current state
> active+undersized+degraded+remapped+backfill_wait, last acting [25,30]
>     pg 2.20f is
Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())
The "best" health I was able to get was:
HEALTH_ERR norecover flag(s) set; 1733/37482459 objects misplaced (0.005%); 5 scrub errors; Possible data damage: 2 pgs inconsistent; Degraded data redundancy: 7461/37482459 objects degraded (0.020%), 24 pgs degraded, 2 pgs undersized
OSDMAP_FLAGS norecover flag(s) set
OBJECT_MISPLACED 1733/37482459 objects misplaced (0.005%)
OSD_SCRUB_ERRORS 5 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
    pg 2.2ba is active+clean+inconsistent, acting [42,29,30]
    pg 2.2bb is active+clean+inconsistent, acting [25,42,18]
PG_DEGRADED Degraded data redundancy: 7461/37482459 objects degraded (0.020%), 24 pgs degraded, 2 pgs undersized
    pg 2.3e is active+recovery_wait+degraded, acting [27,31,5]
    pg 2.9d is active+recovery_wait+degraded, acting [27,22,37]
    pg 2.a3 is active+recovery_wait+degraded, acting [27,30,35]
    pg 2.136 is active+recovery_wait+degraded, acting [27,18,22]
    pg 2.150 is active+recovery_wait+degraded, acting [27,19,35]
    pg 2.15e is active+recovery_wait+degraded, acting [27,11,36]
    pg 2.1d9 is stuck undersized for 14023.243179, current state active+undersized+degraded+remapped+backfill_wait, last acting [25,30]
    pg 2.20f is active+recovery_wait+degraded, acting [27,30,2]
    pg 2.2a1 is active+recovery_wait+degraded, acting [27,18,35]
    pg 2.2b7 is active+recovery_wait+degraded, acting [27,18,36]
    pg 2.386 is active+recovery_wait+degraded, acting [27,42,17]
    pg 2.391 is active+recovery_wait+degraded, acting [27,15,36]
    pg 2.448 is stuck undersized for 51520.798900, current state active+recovery_wait+undersized+degraded+remapped, last acting [27,38]
    pg 2.456 is active+recovery_wait+degraded, acting [27,5,43]
    pg 2.45a is active+recovery_wait+degraded, acting [27,43,36]
    pg 2.45f is active+recovery_wait+degraded, acting [27,16,36]
    pg 2.46c is active+recovery_wait+degraded, acting [27,30,38]
    pg 2.4bf is active+recovery_wait+degraded, acting [27,39,18]
    pg 2.522 is active+recovery_wait+degraded, acting [27,17,3]
    pg 2.535 is active+recovery_wait+degraded, acting [27,29,36]
    pg 2.55a is active+recovery_wait+degraded, acting [27,29,18]
    pg 5.23f is active+recovery_wait+degraded, acting [27,39,18]
    pg 5.356 is active+recovery_wait+degraded, acting [27,36,15]
    pg 5.4a6 is active+recovery_wait+degraded, acting [29,40,30]

After that, the flapping started again:
2019-10-30 12:55:46.772593 mon.r730xd1 [INF] osd.38 failed (root=default,datacenter=IPR,room=11B,rack=baie2,host=r740xd1) (connection refused reported by osd.22)
2019-10-30 12:55:46.850239 mon.r730xd1 [INF] osd.27 failed (root=default,datacenter=IPR,room=11B,rack=baie2,host=r730xd3) (connection refused reported by osd.19)
2019-10-30 12:55:56.714029 mon.r730xd1 [WRN] Health check update: 2 osds down (OSD_DOWN)

Setting the "norecover" flag allows these 2 OSDs to come back up, and limits the flapping and the many backfills.

In both osd.27 and osd.38 logs, I still find these entries before each FAILED assert:
   -2> 2019-10-30 12:52:31.999571 7fb5c125b700 5 -- 129.20.177.3:6802/870834 >> 129.20.177.3:6808/810999 conn(0x564c31d31000 :6802 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=97 cs=7 l=0). rx osd.25 seq 2600 0x564c3a02e1c0 MOSDPGPush(2.1d9 194334/194298 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3
   -1> 2019-10-30 12:52:31.999633 7fb5c125b700 1 -- 129.20.177.3:6802/870834 <== osd.25 129.20.177.3:6808/810999 2600 MOSDPGPush(2.1d9 194334/194298 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3 909+0+0 (4137937376 0 0) 0x564c3a02e1c0 con 0x564c31d31000
    0> 2019-10-30 12:52:32.008605 7fb59b03d700 -1 /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void
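For reference, the "norecover" handling is just the standard flag commands;
roughly what I ran (nodown/noout work along the same lines):

# Stop recovery I/O cluster-wide while the flapping OSDs stabilise.
$ ceph osd set norecover

# ... restart the crashed OSDs, investigate ...

# Re-enable recovery once the OSDs stay up.
$ ceph osd unset norecover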
Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())
Thus spake Brad Hubbard (bhubb...@redhat.com) on mercredi 30 octobre 2019 à 12:50:50:
> Maybe you should set nodown and noout while you do these maneuvers?
> That will minimise peering and recovery (data movement).

As the commands don't take too long, I just had a few slow requests before the
osd was back online. Thanks for the nodown|noout tip.

> > snapid 22772 from osd.29 and osd.42 :
> > ceph-objectstore-tool --pgid 2.2ba --data-path /var/lib/ceph/osd/ceph-29/ '["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]' remove
> > ceph-objectstore-tool --pgid 2.2ba --data-path /var/lib/ceph/osd/ceph-42/ '["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]' remove
>
> That looks right.

Done, preceded by some dump, get-attrs,… commands. Yeah, not sure how useful
that really is, but just to be cautious ^^

The PG still looks inconsistent. I asked for a deep-scrub of 2.2ba, still
waiting. `list-inconsistent-obj` and `list-inconsistent-snapset` return "No
scrub information available for pg 2.2ba" for the moment.

I also tried to manage pg 2.371 with:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-27/ '["2.371",{"oid":"rbd_data.0c16b76b8b4567.000420bb","key":"","snapid":22822,"hash":3394498417,"max":0,"pool":2,"namespace":"","max":0}]' remove

This one doesn't look inconsistent anymore, but I also asked for a deep-scrub.

> You should probably try and work out what caused the issue and take
> steps to minimise the likelihood of a recurrence. This is not expected
> behaviour in a correctly configured and stable environment.

Yes… I'll wait a little bit to see what happens with these commands first and
keep an eye on the cluster health and logs…

--
Gardais Jérémy
Institut de Physique de Rennes
Université Rennes 1
Téléphone: 02-23-23-68-60
Mail & bonnes pratiques: http://fr.wikipedia.org/wiki/Nétiquette
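For the record, the scrub/check cycle I am running while waiting looks roughly
like this (same pg ids as above):

# Request a deep-scrub; fresh inconsistency reports only appear once it completes.
$ ceph pg deep-scrub 2.2ba
$ ceph pg deep-scrub 2.371

# These should return data again instead of "No scrub information available":
$ rados list-inconsistent-obj 2.2ba --format=json-pretty
$ rados list-inconsistent-snapset 2.2ba --format=json-pretty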
Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())
On Tue, Oct 29, 2019 at 9:09 PM Jérémy Gardais wrote:
>
> Thus spake Brad Hubbard (bhubb...@redhat.com) on mardi 29 octobre 2019 à 08:20:31:
> > Yes, try and get the pgs healthy, then you can just re-provision the down
> > OSDs.
> >
> > Run a scrub on each of these pgs and then use the commands on the
> > following page to find out more information for each case.
> >
> > https://docs.ceph.com/docs/luminous/rados/troubleshooting/troubleshooting-pg/
> >
> > Focus on the commands 'list-missing', 'list-inconsistent-obj', and
> > 'list-inconsistent-snapset'.
> >
> > Let us know if you get stuck.
> >
> > P.S. There are several threads about these sorts of issues in this
> > mailing list that should turn up when doing a web search.
>
> I found this thread :
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg53116.html

That looks like the same issue.

> And I am starting to get additional information to solve PG 2.2ba:
>
> 1. rados list-inconsistent-snapset 2.2ba --format=json-pretty
> {
>     "epoch": 192223,
>     "inconsistents": [
>         {
>             "name": "rbd_data.b4537a2ae8944a.425f",
>             "nspace": "",
>             "locator": "",
>             "snap": 22772,
>             "errors": [ "headless" ]
>         },
>         {
>             "name": "rbd_data.b4537a2ae8944a.425f",
>             "nspace": "",
>             "locator": "",
>             "snap": "head",
>             "snapset": {
>                 "snap_context": {
>                     "seq": 22806,
>                     "snaps": [ 22805, 22804, 22674, 22619, 20536, 17248, 14270 ]
>                 },
>                 "head_exists": 1,
>                 "clones": [
>                     { "snap": 17248, "size": 4194304, "overlap": "[0~2269184,2277376~1916928]", "snaps": [ 17248 ] },
>                     { "snap": 20536, "size": 4194304, "overlap": "[0~2269184,2277376~1916928]", "snaps": [ 20536 ] },
>                     { "snap": 22625, "size": 4194304, "overlap": "[0~2269184,2277376~1916928]", "snaps": [ 22619 ] },
>                     { "snap": 22674, "size": 4194304, "overlap": "[266240~4096]", "snaps": [ 22674 ] },
>                     { "snap": 22805, "size": 4194304, "overlap": "[0~942080,958464~901120,1875968~16384,1908736~360448,2285568~1908736]", "snaps": [ 22805, 22804 ] }
>                 ]
>             },
>             "errors": [ "extra_clones" ],
>             "extra clones": [ 22772 ]
>         }
>     ]
> }
>
> 2.a ceph-objectstore-tool from osd.29 and osd.42 :
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba --op list rbd_data.b4537a2ae8944a.425f
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":17248,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":20536,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22625,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22674,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22805,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":-2,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
>
> 2.b ceph-objectstore-tool from osd.30 :
> ceph-objectstore-tool --data-path
Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())
Thus spake Brad Hubbard (bhubb...@redhat.com) on mardi 29 octobre 2019 à 08:20:31:
> Yes, try and get the pgs healthy, then you can just re-provision the down
> OSDs.
>
> Run a scrub on each of these pgs and then use the commands on the
> following page to find out more information for each case.
>
> https://docs.ceph.com/docs/luminous/rados/troubleshooting/troubleshooting-pg/
>
> Focus on the commands 'list-missing', 'list-inconsistent-obj', and
> 'list-inconsistent-snapset'.
>
> Let us know if you get stuck.
>
> P.S. There are several threads about these sorts of issues in this
> mailing list that should turn up when doing a web search.

I found this thread:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg53116.html

And I am starting to get additional information to solve PG 2.2ba:

1. rados list-inconsistent-snapset 2.2ba --format=json-pretty
{
    "epoch": 192223,
    "inconsistents": [
        {
            "name": "rbd_data.b4537a2ae8944a.425f",
            "nspace": "",
            "locator": "",
            "snap": 22772,
            "errors": [ "headless" ]
        },
        {
            "name": "rbd_data.b4537a2ae8944a.425f",
            "nspace": "",
            "locator": "",
            "snap": "head",
            "snapset": {
                "snap_context": {
                    "seq": 22806,
                    "snaps": [ 22805, 22804, 22674, 22619, 20536, 17248, 14270 ]
                },
                "head_exists": 1,
                "clones": [
                    { "snap": 17248, "size": 4194304, "overlap": "[0~2269184,2277376~1916928]", "snaps": [ 17248 ] },
                    { "snap": 20536, "size": 4194304, "overlap": "[0~2269184,2277376~1916928]", "snaps": [ 20536 ] },
                    { "snap": 22625, "size": 4194304, "overlap": "[0~2269184,2277376~1916928]", "snaps": [ 22619 ] },
                    { "snap": 22674, "size": 4194304, "overlap": "[266240~4096]", "snaps": [ 22674 ] },
                    { "snap": 22805, "size": 4194304, "overlap": "[0~942080,958464~901120,1875968~16384,1908736~360448,2285568~1908736]", "snaps": [ 22805, 22804 ] }
                ]
            },
            "errors": [ "extra_clones" ],
            "extra clones": [ 22772 ]
        }
    ]
}

2.a ceph-objectstore-tool from osd.29 and osd.42 :
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba --op list rbd_data.b4537a2ae8944a.425f
["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":17248,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":20536,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22625,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22674,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22805,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":-2,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]

2.b ceph-objectstore-tool from osd.30 :
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30/ --pgid 2.2ba --op list rbd_data.b4537a2ae8944a.425f
["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":17248,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
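Before removing anything I plan to keep a copy of the extra clone, along these
lines (output file names are just examples, and the OSD has to be stopped while
ceph-objectstore-tool runs):

# JSON object id of the extra clone (snapid 22772), as listed above.
$ OBJ='["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]'

# Dump the object info and save its data before any remove.
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba "$OBJ" dump > /root/2.2ba_22772.json
$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba "$OBJ" get-bytes /root/2.2ba_22772.bytes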
Re: [ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())
Yes, try and get the pgs healthy, then you can just re-provision the down OSDs.

Run a scrub on each of these pgs and then use the commands on the following
page to find out more information for each case.

https://docs.ceph.com/docs/luminous/rados/troubleshooting/troubleshooting-pg/

Focus on the commands 'list-missing', 'list-inconsistent-obj', and
'list-inconsistent-snapset'.

Let us know if you get stuck.

P.S. There are several threads about these sorts of issues in this mailing
list that should turn up when doing a web search.

On Tue, Oct 29, 2019 at 5:06 AM Jérémy Gardais wrote:
>
> Hello,
>
> For several weeks, I have had some OSDs flapping before being marked out of
> the cluster by Ceph…
> I was hoping for some Ceph magic and just gave it some time to auto-heal
> (so I could do all the side work…), but it was a bad idea (what a
> surprise :D). I also got some inconsistent PGs, but I was waiting for a
> quieter cluster health before trying to fix them.
>
> Now that I have more time, and with 6 OSDs down+out on my 5-node cluster
> and 1~2 OSDs still flapping from time to time, I am asking myself whether
> these PGs might be the (one?) source of my problem.
>
> The last OSD error, on osd.28, gave these logs:
> -2> 2019-10-28 12:57:47.346460 7fefbdc4d700 5 -- 129.20.177.2:6811/47803 >> 129.20.177.3:6808/4141402 conn(0x55de8211a000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2058 cs=1 l=0). rx osd.25 seq 169 0x55dea57b3600 MOSDPGPush(2.1d9 191810/191810 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3
> -1> 2019-10-28 12:57:47.346517 7fefbdc4d700 1 -- 129.20.177.2:6811/47803 <== osd.25 129.20.177.3:6808/4141402 169 MOSDPGPush(2.1d9 191810/191810 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3 909+0+0 (1239474936 0 0) 0x55dea57b3600 con 0x55de8211a000
> 0> 2019-10-28 12:57:47.353680 7fef99441700 -1 /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fef99441700 time 2019-10-28 12:57:47.347132
> /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: 354: FAILED assert(recovery_info.oi.legacy_snaps.size())
>
> ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55de72039f32]
> 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr, bool, ObjectStore::Transaction*)+0x135b) [0x55de71be330b]
> 3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x31d) [0x55de71d4fadd]
> 4: (ReplicatedBackend::_do_push(boost::intrusive_ptr)+0x18f) [0x55de71d4fd7f]
> 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x2d1) [0x55de71d5ff11]
> 6: (PGBackend::handle_message(boost::intrusive_ptr)+0x50) [0x55de71c7d030]
> 7: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x5f1) [0x55de71be87b1]
> 8: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f7) [0x55de71a63e97]
> 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x57) [0x55de71cf5077]
> 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x55de71a94e1c]
> 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d) [0x55de7203fbbd]
> 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55de72041b80]
> 13: (()+0x8064) [0x7fefc12b5064]
> 14: (clone()+0x6d) [0x7fefc03a962d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
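If it helps, those checks can be scripted over all the suspect pgs, roughly
like this (pg list abbreviated; take it from 'ceph health detail'):

# Sketch: dump scrub results for each inconsistent/degraded pg.
$ for pg in 2.2ba 2.2bb 2.1d9; do
    echo "=== $pg ==="
    rados list-inconsistent-obj "$pg" --format=json-pretty
    rados list-inconsistent-snapset "$pg" --format=json-pretty
  done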
[ceph-users] Inconsistents + FAILED assert(recovery_info.oi.legacy_snaps.size())
Hello,

For several weeks, I have had some OSDs flapping before being marked out of
the cluster by Ceph…
I was hoping for some Ceph magic and just gave it some time to auto-heal (so I
could do all the side work…), but it was a bad idea (what a surprise :D). I
also got some inconsistent PGs, but I was waiting for a quieter cluster health
before trying to fix them.

Now that I have more time, and with 6 OSDs down+out on my 5-node cluster and
1~2 OSDs still flapping from time to time, I am asking myself whether these
PGs might be the (one?) source of my problem.

The last OSD error, on osd.28, gave these logs:
-2> 2019-10-28 12:57:47.346460 7fefbdc4d700 5 -- 129.20.177.2:6811/47803 >> 129.20.177.3:6808/4141402 conn(0x55de8211a000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2058 cs=1 l=0). rx osd.25 seq 169 0x55dea57b3600 MOSDPGPush(2.1d9 191810/191810 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3
-1> 2019-10-28 12:57:47.346517 7fefbdc4d700 1 -- 129.20.177.2:6811/47803 <== osd.25 129.20.177.3:6808/4141402 169 MOSDPGPush(2.1d9 191810/191810 [PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926, version: 127481'7241006, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.0001426e:5926@127481'7241006, size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3 909+0+0 (1239474936 0 0) 0x55dea57b3600 con 0x55de8211a000
 0> 2019-10-28 12:57:47.353680 7fef99441700 -1 /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fef99441700 time 2019-10-28 12:57:47.347132
/build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: 354: FAILED assert(recovery_info.oi.legacy_snaps.size())

ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55de72039f32]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr, bool, ObjectStore::Transaction*)+0x135b) [0x55de71be330b]
 3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x31d) [0x55de71d4fadd]
 4: (ReplicatedBackend::_do_push(boost::intrusive_ptr)+0x18f) [0x55de71d4fd7f]
 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x2d1) [0x55de71d5ff11]
 6: (PGBackend::handle_message(boost::intrusive_ptr)+0x50) [0x55de71c7d030]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x5f1) [0x55de71be87b1]
 8: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f7) [0x55de71a63e97]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr const&)+0x57) [0x55de71cf5077]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x55de71a94e1c]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d) [0x55de7203fbbd]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55de72041b80]
 13: (()+0x8064) [0x7fefc12b5064]
 14: (clone()+0x6d) [0x7fefc03a962d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5
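A quick way to isolate the crash context from a log like the one above is to
filter on the crashing thread id (a sketch; the path assumes the default log
location):

# 7fef99441700 is the thread from the FAILED assert line.
$ grep 7fef99441700 /var/log/ceph/ceph-osd.28.log | tail -50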