Re: [ceph-users] Omap issues - metadata creating too many
Hi, I had the default, so it was on (according to the Ceph docs). I turned it off, but the issue persists. I noticed Bryan Stillwell (cc-ing him) had the same issue (he reported it yesterday) - I tried his tips about compacting, but it doesn't do anything. However, I have to add to his last point: this happens even with bluestore. Is there anything we can do to clean up the omap manually?

Josef

On 18/12/2018 23:19, J. Eric Ivancich wrote:

On 12/17/18 9:18 AM, Josef Zelenka wrote:

Hi everyone, I'm running a Luminous 12.2.5 cluster with 6 hosts on Ubuntu 16.04 - 12 HDDs for data each, plus 2 SSD metadata OSDs (three nodes have an additional SSD I added to have more space to rebalance the metadata). Currently the cluster is used mainly as radosgw storage, with 28 TB of data in total and 2x replication for both the metadata and data pools (a cephfs instance runs alongside it, but I don't think it's the perpetrator - this likely happened before we had it). All pools aside from the cephfs data pool and the radosgw data pool are located on the SSDs. Now, the interesting thing: at random times the metadata OSDs fill their entire capacity with OMAP data and go read-only, and we currently have no option other than deleting and re-creating them. The fill-up comes at a random time, doesn't seem to be triggered by anything, and isn't caused by a data influx. It seems like some kind of bug to me, to be honest, but I'm not certain - has anyone else seen this behavior with their radosgw?

Thanks a lot

Hi Josef,

Do you have rgw_dynamic_resharding turned on? Try turning it off and see if the behavior continues.

One theory is that dynamic resharding is triggered and possibly not completing. This could add a lot of data to omap for the incomplete bucket index shards. After a delay it tries resharding again, possibly failing again, and adding more data to the omap. And this continues.

If this is the ultimate issue, we have some commits on the upstream luminous branch that are designed to address this set of issues. But we should first see whether this is the cause.

Eric
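For the record, turning resharding off and compacting boils down to something like the following - a sketch only (the rgw section name matches this setup, the OSD id is an example, and reshard cancel may not exist on older 12.2.x builds):

  # in ceph.conf, in the rgw section, then restart the radosgw:
  # [client.radosgw.radosgw-s2]
  # rgw_dynamic_resharding = false

  radosgw-admin reshard list
  radosgw-admin reshard cancel --bucket=<bucket>   # for any job that looks stuck
  # online compaction of a single OSD's key-value store:
  ceph daemon osd.12 compact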
[ceph-users] Omap issues - metadata creating too many
Hi everyone, I'm running a Luminous 12.2.5 cluster with 6 hosts on Ubuntu 16.04 - 12 HDDs for data each, plus 2 SSD metadata OSDs (three nodes have an additional SSD I added to have more space to rebalance the metadata). Currently the cluster is used mainly as radosgw storage, with 28 TB of data in total and 2x replication for both the metadata and data pools (a cephfs instance runs alongside it, but I don't think it's the perpetrator - this likely happened before we had it). All pools aside from the cephfs data pool and the radosgw data pool are located on the SSDs. Now, the interesting thing: at random times the metadata OSDs fill their entire capacity with OMAP data and go read-only, and we currently have no option other than deleting and re-creating them. The fill-up comes at a random time, doesn't seem to be triggered by anything, and isn't caused by a data influx. It seems like some kind of bug to me, to be honest, but I'm not certain - has anyone else seen this behavior with their radosgw?

Thanks a lot

Josef Zelenka
Cloudevelops
Re: [ceph-users] pgs incomplete and inactive
The full ratio was ignored - that's most likely why this happened. I can't delete PGs, because it's only KBs' worth of space - the OSD is 40 GB and 39.8 GB is taken up by omap - which is why I can't move/extract anything. Any clue on how to compact or move away the omap dir?

On 27/08/18 12:34, Paul Emmerich wrote:

Don't ever let an OSD run 100% full, that's usually bad news. Two ways to salvage this:

1. You can try to extract the PGs with ceph-objectstore-tool and inject them into another OSD; Ceph will find them and recover.

2. You seem to be using Filestore, so you should easily be able to just delete a whole PG on the full OSD's file system to make space (preferably one that is already recovered and active+clean even without the dead OSD).

Paul

2018-08-27 10:44 GMT+02:00 Josef Zelenka:

Hi, I've had a very ugly thing happen to me over the weekend. Some of my OSDs in a root that handles metadata pools overflowed to 100% disk usage due to omap size (even though I had a 97% full ratio, which is odd) and refused to start. There were some PGs on those OSDs that went away with them. I have tried compacting the omap, moving files away etc., but nothing works - I can't export the PGs, I get errors like this:

2018-08-27 04:42:33.436182 7fcb53382580 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1535359353436170, "job": 1, "event": "recovery_started", "log_files": [5504, 5507]}
2018-08-27 04:42:33.436194 7fcb53382580 4 rocksdb: [/build/ceph-12.2.5/src/rocksdb/db/db_impl_open.cc:482] Recovering log #5504 mode 2
2018-08-27 04:42:35.422502 7fcb53382580 4 rocksdb: [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
2018-08-27 04:42:35.431613 7fcb53382580 4 rocksdb: [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-08-27 04:42:35.431716 7fcb53382580 -1 rocksdb: IO error: No space left on device/var/lib/ceph/osd/ceph-5//current/omap/005507.sst: No space left on device
Mount failed with '(1) Operation not permitted'
2018-08-27 04:42:35.432945 7fcb53382580 -1 filestore(/var/lib/ceph/osd/ceph-5/) mount(1723): Error initializing rocksdb :

I decided to take the loss, mark the OSDs as lost and remove them from the cluster; however, that left 4 PGs hanging in incomplete+inactive state, which apparently prevents my radosgw from starting. Is there another way to export/import the PGs into their new OSDs, or to recreate them? I'm running Luminous 12.2.5 on Ubuntu 16.04.

Thanks
Josef
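For completeness, these are the commands I was attempting, with the OSD stopped first - a sketch with ids/paths from my cluster (the omap is rocksdb here, per the log above); both steps still fail for me with the out-of-space errors:

  systemctl stop ceph-osd@5
  # offline compaction of the filestore omap:
  ceph-kvstore-tool rocksdb /var/lib/ceph/osd/ceph-5/current/omap compact
  # export a PG so it can be injected into a healthy OSD:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
      --journal-path /var/lib/ceph/osd/ceph-5/journal \
      --op export --pgid <pgid> --file /root/pg.export
  # ...and on the target OSD:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
      --journal-path /var/lib/ceph/osd/ceph-7/journal \
      --op import --file /root/pg.export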
[ceph-users] pgs incomplete and inactive
Hi, I've had a very ugly thing happen to me over the weekend. Some of my OSDs in a root that handles metadata pools overflowed to 100% disk usage due to omap size (even though I had a 97% full ratio, which is odd) and refused to start. There were some PGs on those OSDs that went away with them. I have tried compacting the omap, moving files away etc., but nothing works - I can't export the PGs, I get errors like this:

2018-08-27 04:42:33.436182 7fcb53382580 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1535359353436170, "job": 1, "event": "recovery_started", "log_files": [5504, 5507]}
2018-08-27 04:42:33.436194 7fcb53382580 4 rocksdb: [/build/ceph-12.2.5/src/rocksdb/db/db_impl_open.cc:482] Recovering log #5504 mode 2
2018-08-27 04:42:35.422502 7fcb53382580 4 rocksdb: [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
2018-08-27 04:42:35.431613 7fcb53382580 4 rocksdb: [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-08-27 04:42:35.431716 7fcb53382580 -1 rocksdb: IO error: No space left on device/var/lib/ceph/osd/ceph-5//current/omap/005507.sst: No space left on device
Mount failed with '(1) Operation not permitted'
2018-08-27 04:42:35.432945 7fcb53382580 -1 filestore(/var/lib/ceph/osd/ceph-5/) mount(1723): Error initializing rocksdb :

I decided to take the loss, mark the OSDs as lost and remove them from the cluster; however, that left 4 PGs hanging in incomplete+inactive state, which apparently prevents my radosgw from starting. Is there another way to export/import the PGs into their new OSDs, or to recreate them? I'm running Luminous 12.2.5 on Ubuntu 16.04.

Thanks
Josef
Re: [ceph-users] OSD had suicide timed out
The only reason I can think of is some kind of network issue, even though different clusters run on the same switch with the same settings and we don't register any issues there. One thing I recall - one of my colleagues was testing something on this cluster and, after he finished, he deleted a big bucket (a few million objects). Is it possible that there are some orphaned files from that action that break our OSDs somehow? I can't think of anything else.

Josef

On 09/08/18 04:07, Brad Hubbard wrote:

If, in the above case, osd 13 was not too busy to respond (resource shortage) then you need to find out why else osd 5, etc. could not contact it.

On Wed, Aug 8, 2018 at 6:47 PM, Josef Zelenka wrote:

Checked the system load on the host with the OSD that is currently suiciding and it's fine; however, I can see noticeably higher IO (around 700), though that seems more like a symptom of the constant flapping/attempting to come up (it's an SSD-based Ceph, so this shouldn't cause it much harm). I had a look at one of the OSDs sending the you_died messages and it seems it's attempting to contact osd.13, but ultimately fails.

8/0 13574/13574/13574) [5,11] r=0 lpr=13574 crt=13592'3654839 lcod 13592'3654838 mlcod 13592'3654838 active+clean] publish_stats_to_osd 13593:9552151
2018-08-08 10:45:16.112344 7effa1d8c700 15 osd.5 pg_epoch: 13593 pg[14.6( v 13592'3654839 (13294'3653334,13592'3654839] local-lis/les=13574/13575 n=945 ec=126/126 lis/c 13574/13574 les/c/f 13575/13578/0 13574/13574/13574) [5,11] r=0 lpr=13574 crt=13592'3654839 lcod 13592'3654838 mlcod 13592'3654838 active+clean] publish_stats_to_osd 13593:9552152
2018-08-08 10:45:16.679484 7eff9a57d700 15 osd.5 pg_epoch: 13593 pg[11.15( v 13575'34486 (9987'32956,13575'34486] local-lis/les=13574/13575 n=1 ec=115/115 lis/c 13574/13574 les/c/f 13575/13575/0 13574/13574/13574) [5,10] r=0 lpr=13574 crt=13572'34485 lcod 13572'34485 mlcod 13572'34485 active+clean] publish_stats_to_osd 13593:2966967
2018-08-08 10:45:17.818135 7effb95a4700 1 -- 10.12.125.1:6803/1319081 <== osd.13 10.12.125.3:0/735946 18 osd_ping(ping e13589 stamp 2018-08-08 10:45:17.817238) v4 2004+0+0 (4218069135 0 0) 0x55bb638ba800 con 0x55bb65e79800
2018-08-08 10:45:17.818176 7effb9da5700 1 -- 10.12.3.15:6809/1319081 <== osd.13 10.12.3.17:0/735946 18 osd_ping(ping e13589 stamp 2018-08-08 10:45:17.817238) v4 2004+0+0 (4218069135 0 0) 0x55bb63cd8c00 con 0x55bb65e7b000
2018-08-08 10:45:18.919053 7effb95a4700 1 -- 10.12.125.1:6803/1319081 <== osd.13 10.12.125.3:0/735946 19 osd_ping(ping e13589 stamp 2018-08-08 10:45:18.918149) v4 2004+0+0 (1428835292 0 0) 0x55bb638bb200 con 0x55bb65e79800
2018-08-08 10:45:18.919598 7effb9da5700 1 -- 10.12.3.15:6809/1319081 <== osd.13 10.12.3.17:0/735946 19 osd_ping(ping e13589 stamp 2018-08-08 10:45:18.918149) v4 2004+0+0 (1428835292 0 0) 0x55bb63cd8a00 con 0x55bb65e7b000
2018-08-08 10:45:21.679563 7eff9a57d700 15 osd.5 pg_epoch: 13593 pg[11.15( v 13575'34486 (9987'32956,13575'34486] local-lis/les=13574/13575 n=1 ec=115/115 lis/c 13574/13574 les/c/f 13575/13575/0 13574/13574/13574) [5,10] r=0 lpr=13574 crt=13572'34485 lcod 13572'34485 mlcod 13572'34485 active+clean] publish_stats_to_osd 13593:2966968
2018-08-08 10:45:23.020715 7effb95a4700 1 -- 10.12.125.1:6803/1319081 <== osd.13 10.12.125.3:0/735946 20 osd_ping(ping e13589 stamp 2018-08-08 10:45:23.018994) v4 2004+0+0 (1018071233 0 0) 0x55bb63bb7200 con 0x55bb65e79800
2018-08-08 10:45:23.020837 7effb9da5700 1 -- 10.12.3.15:6809/1319081 <== osd.13 10.12.3.17:0/735946 20 osd_ping(ping e13589 stamp 2018-08-08 10:45:23.018994) v4 2004+0+0 (1018071233 0 0) 0x55bb63cd8c00 con 0x55bb65e7b000
2018-08-08 10:45:26.679513 7eff8e565700 15 osd.5 pg_epoch: 13593 pg[11.15( v 13575'34486 (9987'32956,13575'34486] local-lis/les=13574/13575 n=1 ec=115/115 lis/c 13574/13574 les/c/f 13575/13575/0 13574/13574/13574) [5,10] r=0 lpr=13574 crt=13572'34485 lcod 13572'34485 mlcod 13572'34485 active+clean] publish_stats_to_osd 13593:2966969
2018-08-08 10:45:28.921091 7effb95a4700 1 -- 10.12.125.1:6803/1319081 <== osd.13 10.12.125.3:0/735946 21 osd_ping(ping e13589 stamp 2018-08-08 10:45:28.920140) v4 2004+0+0 (2459835898 0 0) 0x55bb638ba800 con 0x55bb65e79800
2018-08-08 10:45:28.922026 7effb9da5700 1 -- 10.12.3.15:6809/1319081 <== osd.13 10.12.3.17:0/735946 21 osd_ping(ping e13589 stamp 2018-08-08 10:45:28.920140) v4 2004+0+0 (2459835898 0 0) 0x55bb63cd8c00 con 0x55bb65e7b000
2018-08-08 10:45:31.679828 7eff9a57d700 15 osd.5 pg_epoch: 13593 pg[11.15( v 13575'34486 (9987'32956,13575'34486] local-lis/les=13574/13575 n=1 ec=115/115 lis/c 13574/13574 les/c/f 13575/13575/0 13574/13574/13574) [5,10] r=0 lpr=13574 crt=13572'34485 lcod 13572'34485 mlcod 13572'34485 active+clean] publish_stats_to_osd 13593:2966970
2018-08-08 10:45:33.022697 7effb95a4700 1 -- 10.12.125.1:6803/1319081 <== osd.13 10.12.125.3:0/7
Re: [ceph-users] OSD had suicide timed out
Thank you for your suggestion. I tried it, and it really seems like the other OSDs think the OSD is dead (if I understand this right); however, the networking seems absolutely fine between the nodes (no issues in graphs etc.).

-13> 2018-08-08 09:13:58.466119 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.12 10.12.3.17:6807/4624236 81 osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (687351303 0 0) 0x55731eb73e00 con 0x55731e7d4800
-12> 2018-08-08 09:13:58.466140 7fe054542700 1 -- 10.12.3.17:0/706864 <== osd.11 10.12.3.16:6812/19232 81 osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (687351303 0 0) 0x55733c391200 con 0x55731e7a5800
-11> 2018-08-08 09:13:58.466147 7fe053540700 1 -- 10.12.125.3:0/706864 <== osd.11 10.12.125.2:6811/19232 82 osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (3111562112 0 0) 0x55731eb66800 con 0x55731e7a4000
-10> 2018-08-08 09:13:58.466164 7fe054542700 1 -- 10.12.3.17:0/706864 <== osd.11 10.12.3.16:6812/19232 82 osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (3111562112 0 0) 0x55733c391200 con 0x55731e7a5800
-9> 2018-08-08 09:13:58.466164 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.12 10.12.3.17:6807/4624236 82 osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (3111562112 0 0) 0x55731eb73e00 con 0x55731e7d4800
-8> 2018-08-08 09:13:58.466176 7fe053540700 1 -- 10.12.3.17:0/706864 <== osd.9 10.12.3.16:6813/10016600 81 osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (687351303 0 0) 0x55731eb66800 con 0x55731e732000
-7> 2018-08-08 09:13:58.466200 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.10 10.12.3.16:6810/2017908 81 osd_ping(ping_reply e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (687351303 0 0) 0x55731eb73e00 con 0x55731e796800
-6> 2018-08-08 09:13:58.466208 7fe053540700 1 -- 10.12.3.17:0/706864 <== osd.9 10.12.3.16:6813/10016600 82 osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (3111562112 0 0) 0x55731eb66800 con 0x55731e732000
-5> 2018-08-08 09:13:58.466222 7fe053d41700 1 -- 10.12.3.17:0/706864 <== osd.10 10.12.3.16:6810/2017908 82 osd_ping(you_died e13452 stamp 2018-08-08 09:13:58.464608) v4 2004+0+0 (3111562112 0 0) 0x55731eb73e00 con 0x55731e796800
-4> 2018-08-08 09:13:59.748336 7fe040531700 1 -- 10.12.3.17:6802/706864 --> 10.12.3.16:6800/1677830 -- mgrreport(unknown.13 +0-0 packed 742 osd_metrics=1) v5 -- 0x55731fa4af00 con 0
-3> 2018-08-08 09:13:59.748538 7fe040531700 1 -- 10.12.3.17:6802/706864 --> 10.12.3.16:6800/1677830 -- pg_stats(64 pgs tid 0 v 0) v1 -- 0x55733cbf4c00 con 0
-2> 2018-08-08 09:14:00.953804 7fe0525a1700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7fe03f52f700' had timed out after 15
-1> 2018-08-08 09:14:00.953857 7fe0525a1700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7fe03f52f700' had suicide timed out after 150
0> 2018-08-08 09:14:00.970742 7fe03f52f700 -1 *** Caught signal (Aborted) **

Could it be that the suiciding OSDs are rejecting the pings somehow? I'm quite confused about what's really going on here; it seems completely random to me.

On 08/08/18 01:51, Brad Hubbard wrote:

Try to work out why the other osds are saying this one is down. Is it because this osd is too busy to respond, or something else? debug_ms = 1 will show you some message debugging which may help.

On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka wrote:

To follow up, I did some further digging with debug_osd=20/20 and it appears as if there's no traffic to the OSD, even though it comes UP for the cluster (this started happening on another OSD in the cluster today, same stuff):

-27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.17 10.12.3.17:6811/19661 says i am down in 12566
-26> 2018-08-07 14:10:55.146542 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.12 10.12.125.3:6807/4624236 says i am down in 12566
-25> 2018-08-07 14:10:55.146551 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.13 10.12.3.17:6805/186262 says i am down in 12566
-24> 2018-08-07 14:10:55.146564 7f9fce3cd700 20 osd.0 12559 share_map_peer 0x56308a9d already has epoch 12566
-23> 2018-08-07 14:10:55.146576 7f9fcebce700 20 osd.0 12559 share_map_peer 0x56308abb9800 already has epoch 12566
-22> 2018-08-07 14:10:55.146590 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308abb1000 already has epoch 12566
-21> 2018-08-07 14:10:55.146600 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.15 10.12.125.3:6813/49064793 says i am down in 12566
-20> 2018-08-07 14:10:55.146609 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.16 10.12.3.17:6801/1018363 says i am down in 12566
-19> 20
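For anyone following along, applying Brad's debug_ms suggestion boils down to this (the osd id is an example):

  ceph tell osd.13 injectargs '--debug_ms 1'
  # ...watch /var/log/ceph/ceph-osd.13.log for the heartbeat traffic...
  ceph tell osd.13 injectargs '--debug_ms 0'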
Re: [ceph-users] OSD had suicide timed out
To follow up, I did some further digging with debug_osd=20/20 and it appears as if there's no traffic to the OSD, even though it comes UP for the cluster (this started happening on another OSD in the cluster today, same stuff):

-27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.17 10.12.3.17:6811/19661 says i am down in 12566
-26> 2018-08-07 14:10:55.146542 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.12 10.12.125.3:6807/4624236 says i am down in 12566
-25> 2018-08-07 14:10:55.146551 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.13 10.12.3.17:6805/186262 says i am down in 12566
-24> 2018-08-07 14:10:55.146564 7f9fce3cd700 20 osd.0 12559 share_map_peer 0x56308a9d already has epoch 12566
-23> 2018-08-07 14:10:55.146576 7f9fcebce700 20 osd.0 12559 share_map_peer 0x56308abb9800 already has epoch 12566
-22> 2018-08-07 14:10:55.146590 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308abb1000 already has epoch 12566
-21> 2018-08-07 14:10:55.146600 7f9fce3cd700 10 osd.0 12560 handle_osd_ping osd.15 10.12.125.3:6813/49064793 says i am down in 12566
-20> 2018-08-07 14:10:55.146609 7f9fcebce700 10 osd.0 12560 handle_osd_ping osd.16 10.12.3.17:6801/1018363 says i am down in 12566
-19> 2018-08-07 14:10:55.146619 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.11 10.12.3.16:6812/19232 says i am down in 12566
-18> 2018-08-07 14:10:55.146643 7f9fcf3cf700 20 osd.0 12559 share_map_peer 0x56308a9d already has epoch 12566
-17> 2018-08-07 14:10:55.146653 7f9fcf3cf700 10 osd.0 12560 handle_osd_ping osd.15 10.12.3.17:6812/49064793 says i am down in 12566
-16> 2018-08-07 14:10:55.448468 7f9fcabdd700 10 osd.0 12560 tick_without_osd_lock
-15> 2018-08-07 14:10:55.448491 7f9fcabdd700 20 osd.0 12559 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
-14> 2018-08-07 14:10:55.448497 7f9fcabdd700 20 osd.0 12560 scrub_time_permit should run between 0 - 24 now 14 = yes
-13> 2018-08-07 14:10:55.448525 7f9fcabdd700 20 osd.0 12560 scrub_load_below_threshold loadavg 2.31 < daily_loadavg 2.68855 and < 15m avg 2.63 = yes
-12> 2018-08-07 14:10:55.448535 7f9fcabdd700 20 osd.0 12560 sched_scrub load_is_low=1
-11> 2018-08-07 14:10:55.448555 7f9fcabdd700 10 osd.0 12560 sched_scrub 15.112 scheduled at 2018-08-07 15:03:15.052952 > 2018-08-07 14:10:55.448494
-10> 2018-08-07 14:10:55.448563 7f9fcabdd700 20 osd.0 12560 sched_scrub done
-9> 2018-08-07 14:10:55.448565 7f9fcabdd700 10 osd.0 12559 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 bytes; target 25 obj/sec or 5120 k bytes/sec
-8> 2018-08-07 14:10:55.448568 7f9fcabdd700 20 osd.0 12559 promote_throttle_recalibrate new_prob 1000
-7> 2018-08-07 14:10:55.448569 7f9fcabdd700 10 osd.0 12559 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
-6> 2018-08-07 14:10:55.507159 7f9faab9d700 20 osd.0 op_wq(5) _process empty q, waiting
-5> 2018-08-07 14:10:55.812434 7f9fb5bb3700 20 osd.0 op_wq(7) _process empty q, waiting
-4> 2018-08-07 14:10:56.236584 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9fa7396700' had timed out after 60
-3> 2018-08-07 14:10:56.236618 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f9fb33ae700' had timed out after 60
-2> 2018-08-07 14:10:56.236621 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had timed out after 15
-1> 2018-08-07 14:10:56.236640 7f9fcd42e700 1 heartbeat_map is_healthy 'OSD::peering_tp thread 0x7f9fba3bc700' had suicide timed out after 150
0> 2018-08-07 14:10:56.245420 7f9fba3bc700 -1 *** Caught signal (Aborted) ** in thread 7f9fba3bc700 thread_name:tp_peering

The OSD cyclically crashes and comes back up. I tried modifying the recovery etc. timeouts, but no luck - the situation is still the same. Regarding the radosgw: across all nodes, after starting the rgw process, I only get this:

2018-08-07 14:32:17.852785 7f482dcaf700 2 RGWDataChangesLog::ChangesRenewThread: start

I found this thread in the ceph mailing list (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018956.html), but I'm not sure if this is the same thing (albeit with the same error), as I don't use s3 acls/expiration in my cluster (if it's set to a default, I'm not aware of it).

On 06/08/18 16:30, Josef Zelenka wrote:

Hi, I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the configuration is 3 nodes, 6 drives each (though I have encountered this on a different cluster with similar hardware, only with HDDs instead of SSDs - same usage). I have recently seen a bug(?) where one of the OSDs suddenly spikes in iops and constantly restarts (trying to load the journal/filemap apparently), which renders the radosgw (the primary usage of this cluster) unable t
[ceph-users] OSD had suicide timed out
Hi, I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the configuration is 3 nodes, 6 drives each (though I have encountered this on a different cluster with similar hardware, only with HDDs instead of SSDs - same usage). I have recently seen a bug(?) where one of the OSDs suddenly spikes in iops and constantly restarts (trying to load the journal/filemap apparently), which renders the radosgw (the primary usage of this cluster) unable to write. The only thing that helps here is stopping the OSD, but that helps only until another one does a similar thing. Any clue on the cause of this? Logs of the OSD when it crashes are below.

Thanks
Josef

-9920> 2018-08-06 12:12:10.588227 7f8e7afcb700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
-9919> 2018-08-06 12:12:10.607070 7f8e7a7ca700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had timed out after 60
--
-1> 2018-08-06 14:12:52.428994 7f8e7982b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8e56f9a700' had suicide timed out after 150
0> 2018-08-06 14:12:52.432088 7f8e56f9a700 -1 *** Caught signal (Aborted) ** in thread 7f8e56f9a700 thread_name:tp_osd_tp

ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
1: (()+0xa7cab4) [0x55868269aab4]
2: (()+0x11390) [0x7f8e7e51d390]
3: (()+0x1026d) [0x7f8e7e51c26d]
4: (pthread_mutex_lock()+0x7d) [0x7f8e7e515dbd]
5: (Mutex::Lock(bool)+0x49) [0x5586826bb899]
6: (PG::lock(bool) const+0x33) [0x55868216ace3]
7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x844) [0x558682101044]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x5586826e27f4]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5586826e5830]
10: (()+0x76ba) [0x7f8e7e5136ba]
11: (clone()+0x6d) [0x7f8e7d58a41d]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 0 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
1/ 1 reserver
1/ 5 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 1
max_new 1000
log_file /var/log/ceph/ceph-osd.7.log
--- end dump of recent events ---
Re: [ceph-users] Best way to replace OSD
Hi, our procedure is usually (assuming the cluster was OK before the failure, with 2 replicas as the crush rule):

1. Stop the OSD process (to keep it from flapping up and down and putting load on the cluster).
2. Wait for the reweight to drop to 0 (happens after 5 min, I think - it can be set manually, but I let it happen by itself).
3. Remove the OSD from the cluster (ceph auth del, ceph osd crush remove, ceph osd rm).
4. Note down the journal partitions if needed.
5. Umount the drive and replace the disk with the new one.
6. Ensure permissions are set to ceph:ceph in /dev.
7. mklabel gpt on the new drive.
8. Create the new OSD with ceph-disk prepare (it gets added to the crushmap automatically).

Your procedure sounds reasonable to me; as far as I'm concerned, you shouldn't have to wait for rebalancing after you remove the OSD. All this might not be 100% by the Ceph book, but it works for us :) A rough command-line version of our steps follows below.

Josef

On 06/08/18 16:15, Iztok Gregori wrote:

Hi Everyone,

Which is the best way to replace a failing (SMART Health Status: HARDWARE IMPENDING FAILURE) OSD hard disk? Normally I will:

1. set the OSD as out
2. wait for rebalancing
3. stop the OSD on the osd-server (unmount if needed)
4. purge the OSD from CEPH
5. physically replace the disk with the new one
6. with ceph-deploy:
6a. zap the new disk (just in case)
6b. create the new OSD
7. add the new osd to the crush map
8. wait for rebalancing

My questions are:

- Is my procedure reasonable?
- What if I skip #2 and, instead of waiting for rebalancing, I directly purge the OSD?
- Is it better to reweight the OSD before taking it out?

I'm running a Luminous (12.2.2) cluster with 332 OSDs; failure domain is host.

Thanks,
Iztok
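In concrete commands, our steps look roughly like this - a sketch only (the OSD id and device names are examples; we're on filestore, hence the separate journal partition):

  systemctl stop ceph-osd@12
  ceph osd out 12                 # or wait for it to be marked out by itself
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12
  # swap the physical disk, then recreate the OSD with its journal:
  ceph-disk prepare /dev/sdX /dev/sdY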
[ceph-users] Erasure coded pools - overhead, data distribution
ID  CLASS WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
    hdd   3.63860  1.0      3725G  2906G  819G   78.01 1.30 119
69  hdd   1.81929  1.0      1862G  1316G  546G   70.66 1.18 53
70  hdd   1.81929  0.95000  1862G  1224G  638G   65.73 1.10 49
120 hdd   5.45789  1.0      5588G  3900G  1687G  69.80 1.17 151
147 hdd   5.45789  1.0      5588G  869G   4719G  15.56 0.26 8
11  hdd   5.45799  1.0      5588G  4174G  1414G  74.70 1.25 162
20  hdd   7.27699  1.0      7451G  5430G  2021G  72.88 1.22 221
71  hdd   1.81929  1.0      1862G  1300G  562G   69.83 1.17 55
72  hdd   1.81929  1.0      1862G  1093G  769G   58.70 0.98 44
73  hdd   3.63860  1.0      3725G  2706G  1019G  72.63 1.21 112
74  hdd   1.81929  1.0      1862G  1295G  567G   69.54 1.16 50
75  hdd   1.81929  1.0      1862G  1127G  735G   60.53 1.01 45
116 hdd   7.27730  1.0      7451G  4775G  2676G  64.09 1.07 181
121 hdd   3.63860  1.0      3725G  2163G  1562G  58.06 0.97 84
148 hdd   5.45789  1.0      5588G  832G   4756G  14.90 0.25 14
149 hdd   5.45789  1.0      5588G  776G   4812G  13.89 0.23 13
19  hdd   7.27699  1.0      7451G  5664G  1787G  76.01 1.27 206
76  hdd   1.81929  1.0      1862G  1476G  386G   79.26 1.33 59
77  hdd   1.81929  0.95000  1862G  1513G  349G   81.24 1.36 60
78  hdd   1.81929  1.0      1862G  1503G  359G   80.70 1.35 65
79  hdd   3.63860  1.0      3725G  2705G  1020G  72.62 1.21 104
81  hdd   1.81929  1.0      1862G  1315G  547G   70.63 1.18 50
108 hdd   1.81929  1.0      1862G  1706G  156G   91.59 1.53 61
140 hdd   10.91399 1.0      11175G 8090G  3085G  72.39 1.21 313
150 hdd   5.45789  1.0      5588G  939G   4649G  16.81 0.28 16
151 hdd   5.45789  1.0      5588G  900G   4688G  16.10 0.27 12
122 hdd   1.81929  1.0      1862G  1731G  131G   92.93 1.55 73
135 hdd   1.81929  1.0      1862G  1605G  256G   86.21 1.44 65
136 hdd   1.81929  1.0      1862G  1441G  421G   77.36 1.29 58
137 hdd   1.81929  1.0      1862G  1693G  169G   90.93 1.52 70
138 hdd   1.81929  1.0      1862G  1275G  587G   68.46 1.14 49
139 hdd   1.81929  1.0      1862G  1705G  157G   91.54 1.53 66
141 hdd   10.91399 1.0      11175G 8657G  2518G  77.47 1.30 299
152 hdd   5.45789  1.0      5588G  999G   4589G  17.88 0.30 11
10  hdd   5.45799  1.0      5588G  3825G  1763G  68.45 1.14 156
82  hdd   5.45789  1.0      5588G  3839G  1749G  68.69 1.15 152
83  hdd   1.81929  1.0      1862G  1231G  631G   66.11 1.11 49
84  hdd   1.81929  1.0      1862G  1273G  589G   68.37 1.14 49
85  hdd   1.81929  1.0      1862G  1429G  432G   76.76 1.28 59
114 hdd   5.45789  1.0      5588G  3455G  2133G  61.83 1.03 138
123 hdd   5.45789  1.0      5588G  3678G  1910G  65.82 1.10 146
142 hdd   10.91399 1.0      11175G 7359G  3816G  65.85 1.10 298
153 hdd   5.45789  1.0      5588G  986G   4602G  17.64 0.30 17
144 hdd   0        1.0      5588G  1454M  5587G  0.03  0    0
145 hdd   0        1.0      5588G  1446M  5587G  0.03  0    0
146 hdd   0        1.0      5588G  1455M  5587G  0.03  0    0
                   TOTAL    579T   346T   232T   59.80
MIN/MAX VAR: 0/1.55 STDDEV: 23.04

Thanks in advance for any help; I find it very hard to wrap my head around this.

Josef Zelenka
Cloudevelops
Re: [ceph-users] NFS-ganesha with RGW
Hi, thanks for the quick reply. As for 1.: I mentioned that I'm running Ubuntu 16.04, kernel 4.4.0-121 - it seems the platform package (nfs-ganesha-ceph) does not include the RGW FSAL. As for 2.: nfsd was running - after rebooting I managed to get Ganesha to bind, and rpcbind is running, though I still can't mount the RGW export due to timeouts. I suspect my conf might be wrong, but I'm not sure how to verify that. I've set up my ganesha.conf with the FSAL and RGW blocks - do I need anything else?

EXPORT
{
    Export_ID=1;
    Path = "/";
    Pseudo = "/";
    Access_Type = RW;
    SecType = "sys";
    NFS_Protocols = 4;
    Transport_Protocols = TCP;

    # optional, permit unsquashed access by client "root" user
    #Squash = No_Root_Squash;

    FSAL {
        Name = RGW;
        User_Id = "<rgw user with access key/secret>";
        Access_Key_Id = "";
        Secret_Access_Key = "";
    }

    RGW {
        cluster = "ceph";
        name = "client.radosgw.radosgw-s2";
        ceph_conf = "/etc/ceph/ceph.conf";
        init_args = "-d --debug-rgw=16";
    }
}

Josef

On 30/05/18 13:18, Matt Benjamin wrote:

Hi Josef,

1. You do need the Ganesha fsal driver to be present; I don't know your platform and os version, so I couldn't look up what packages you might need to install (or if the platform package does not build the RGW fsal).

2. The most common reason for ganesha.nfsd to fail to bind to a port is that a Linux kernel nfsd is already running - can you make sure that's not the case; meanwhile you -do- need rpcbind to be running.

Matt

On Wed, May 30, 2018 at 6:03 AM, Josef Zelenka wrote:

Hi everyone, I'm currently trying to set up an NFS-Ganesha instance that mounts RGW storage; however, I haven't been successful. I'm running Ceph Luminous 12.2.4 and Ubuntu 16.04. I tried compiling Ganesha from source (latest version), but I didn't manage to get the mount running with that, as Ganesha refused to bind to the IPv6 interface - I assume this is a Ganesha issue, but I didn't find any relevant info on what might cause it; my network setup should allow for that. Then I installed ganesha-2.6 from the official repos and set up the config for RGW as per the official howto (http://docs.ceph.com/docs/master/radosgw/nfs/), but I'm getting:

Could not dlopen module:/usr/lib/x86_64-linux-gnu/ganesha/libfsalrgw.so Error:/usr/lib/x86_64-linux-gnu/ganesha/libfsalrgw.so: cannot open shared object file: No such file or directory

and lo and behold, libfsalrgw.so isn't present in the folder. I installed the nfs-ganesha and nfs-ganesha-fsal packages. I tried googling around, but I didn't find any relevant info or walkthroughs for this setup, so I'm asking: was anyone successful in setting this up? I can see that even the Red Hat solution is still in progress, so I'm not sure this even works.

Thanks for any help,
Josef
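For reference, the mount attempt that times out looks like this (the host is a placeholder; with the config above, Ganesha should be exporting the NFSv4 pseudo-root at "/"):

  mount -t nfs -o nfsvers=4,proto=tcp,port=2049 <ganesha-host>:/ /mnt/rgw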
[ceph-users] NFS-ganesha with RGW
Hi everyone, I'm currently trying to set up an NFS-Ganesha instance that mounts RGW storage; however, I haven't been successful. I'm running Ceph Luminous 12.2.4 and Ubuntu 16.04. I tried compiling Ganesha from source (latest version), but I didn't manage to get the mount running with that, as Ganesha refused to bind to the IPv6 interface - I assume this is a Ganesha issue, but I didn't find any relevant info on what might cause it; my network setup should allow for that. Then I installed ganesha-2.6 from the official repos and set up the config for RGW as per the official howto (http://docs.ceph.com/docs/master/radosgw/nfs/), but I'm getting:

Could not dlopen module:/usr/lib/x86_64-linux-gnu/ganesha/libfsalrgw.so Error:/usr/lib/x86_64-linux-gnu/ganesha/libfsalrgw.so: cannot open shared object file: No such file or directory

and lo and behold, libfsalrgw.so isn't present in the folder. I installed the nfs-ganesha and nfs-ganesha-fsal packages. I tried googling around, but I didn't find any relevant info or walkthroughs for this setup, so I'm asking: was anyone successful in setting this up? I can see that even the Red Hat solution is still in progress, so I'm not sure this even works.

Thanks for any help,
Josef
[ceph-users] Issues with RBD when rebooting
Hi, we are running a Jewel cluster (54 OSDs, six nodes, Ubuntu 16.04) that serves as a backend for OpenStack (Newton) VMs. Today we had to reboot one of the nodes (replicated pool, 2x) and some of our VMs oopsed with filesystem issues (mainly database VMs, postgresql) - is there a reason for this to happen? If data is replicated, the VMs shouldn't even notice we rebooted one of the nodes, right? Maybe I just don't understand how this works correctly, but I hope someone around here can either tell me why this is happening or how to fix it.

Thanks
Josef
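Follow-up question: should we have set any flags first? My understanding (an assumption on my part) is that for planned maintenance you would do:

  ceph osd set noout     # before taking the node down, so its OSDs aren't marked out
  # ...reboot the node...
  ceph osd unset noout   # once its OSDs have rejoined

though I wouldn't expect that alone to change what the VMs see during the reboot itself.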
Re: [ceph-users] Cephfs write fail when node goes down
Client's kernel is 4.4.0. Regarding the hung osd requests, I'll have to check - the issue is gone now, so I'm not sure I'll find what you are suggesting. It's rather odd, because Ceph's failover has worked for us every time, so I'm trying to figure out whether it is a Ceph or an app issue.

On 15/05/18 02:57, Yan, Zheng wrote:

On Mon, May 14, 2018 at 5:37 PM, Josef Zelenka <josef.zele...@cloudevelops.com> wrote:

Hi everyone, we've encountered an unusual thing in our setup (4 nodes, 48 OSDs, 3 monitors - ceph Jewel, Ubuntu 16.04 with kernel 4.4.0). Yesterday we were doing a HW upgrade of the nodes, so they went down one by one - the cluster was in good shape during the upgrade, as we've done this numerous times and we're quite sure that the redundancy wasn't screwed up while doing this. However, during this upgrade one of the clients that does backups to cephfs (mounted via the kernel driver) failed to write the backup file correctly to the cluster, with the following trace after we turned off one of the nodes:

[2585732.529412] 8800baa279a8 813fb2df 880236230e00 8802339c
[2585732.529414] 8800baa28000 88023fc96e00 7fff 8800baa27b20
[2585732.529415] 81840ed0 8800baa279c0 818406d5
[2585732.529417] Call Trace:
[2585732.529505] [] ? cpumask_next_and+0x2f/0x40
[2585732.529558] [] ? bit_wait+0x60/0x60
[2585732.529560] [] schedule+0x35/0x80
[2585732.529562] [] schedule_timeout+0x1b5/0x270
[2585732.529607] [] ? kvm_clock_get_cycles+0x1e/0x20
[2585732.529609] [] ? bit_wait+0x60/0x60
[2585732.529611] [] io_schedule_timeout+0xa4/0x110
[2585732.529613] [] bit_wait_io+0x1b/0x70
[2585732.529614] [] __wait_on_bit_lock+0x4e/0xb0
[2585732.529652] [] __lock_page+0xbb/0xe0
[2585732.529674] [] ? autoremove_wake_function+0x40/0x40
[2585732.529676] [] pagecache_get_page+0x17d/0x1c0
[2585732.529730] [] ? ceph_pool_perm_check+0x48/0x700 [ceph]
[2585732.529732] [] grab_cache_page_write_begin+0x26/0x40
[2585732.529738] [] ceph_write_begin+0x48/0xe0 [ceph]
[2585732.529739] [] generic_perform_write+0xce/0x1c0
[2585732.529763] [] ? file_update_time+0xc9/0x110
[2585732.529769] [] ceph_write_iter+0xf89/0x1040 [ceph]
[2585732.529792] [] ? __alloc_pages_nodemask+0x159/0x2a0
[2585732.529808] [] new_sync_write+0x9b/0xe0
[2585732.529811] [] __vfs_write+0x26/0x40
[2585732.529812] [] vfs_write+0xa9/0x1a0
[2585732.529814] [] SyS_write+0x55/0xc0
[2585732.529817] [] entry_SYSCALL_64_fastpath+0x16/0x71

is there any hang osd request in /sys/kernel/debug/ceph//osdc?

I have encountered this behavior on Luminous, but not on Jewel. Anyone who has a clue why the write fails? As far as I'm concerned, it should always work if all the PGs are available.

Thanks
Josef
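For reference, the check Zheng is referring to, spelled out (the wildcard stands for the per-client directory under debugfs, which is named after the cluster fsid and client id):

  # requires debugfs mounted at /sys/kernel/debug
  cat /sys/kernel/debug/ceph/*/osdc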
[ceph-users] Cephfs write fail when node goes down
Hi everyone, we've encountered an unusual thing in our setup (4 nodes, 48 OSDs, 3 monitors - ceph Jewel, Ubuntu 16.04 with kernel 4.4.0). Yesterday we were doing a HW upgrade of the nodes, so they went down one by one - the cluster was in good shape during the upgrade, as we've done this numerous times and we're quite sure that the redundancy wasn't screwed up while doing this. However, during this upgrade one of the clients that does backups to cephfs (mounted via the kernel driver) failed to write the backup file correctly to the cluster, with the following trace after we turned off one of the nodes:

[2585732.529412] 8800baa279a8 813fb2df 880236230e00 8802339c
[2585732.529414] 8800baa28000 88023fc96e00 7fff 8800baa27b20
[2585732.529415] 81840ed0 8800baa279c0 818406d5
[2585732.529417] Call Trace:
[2585732.529505] [] ? cpumask_next_and+0x2f/0x40
[2585732.529558] [] ? bit_wait+0x60/0x60
[2585732.529560] [] schedule+0x35/0x80
[2585732.529562] [] schedule_timeout+0x1b5/0x270
[2585732.529607] [] ? kvm_clock_get_cycles+0x1e/0x20
[2585732.529609] [] ? bit_wait+0x60/0x60
[2585732.529611] [] io_schedule_timeout+0xa4/0x110
[2585732.529613] [] bit_wait_io+0x1b/0x70
[2585732.529614] [] __wait_on_bit_lock+0x4e/0xb0
[2585732.529652] [] __lock_page+0xbb/0xe0
[2585732.529674] [] ? autoremove_wake_function+0x40/0x40
[2585732.529676] [] pagecache_get_page+0x17d/0x1c0
[2585732.529730] [] ? ceph_pool_perm_check+0x48/0x700 [ceph]
[2585732.529732] [] grab_cache_page_write_begin+0x26/0x40
[2585732.529738] [] ceph_write_begin+0x48/0xe0 [ceph]
[2585732.529739] [] generic_perform_write+0xce/0x1c0
[2585732.529763] [] ? file_update_time+0xc9/0x110
[2585732.529769] [] ceph_write_iter+0xf89/0x1040 [ceph]
[2585732.529792] [] ? __alloc_pages_nodemask+0x159/0x2a0
[2585732.529808] [] new_sync_write+0x9b/0xe0
[2585732.529811] [] __vfs_write+0x26/0x40
[2585732.529812] [] vfs_write+0xa9/0x1a0
[2585732.529814] [] SyS_write+0x55/0xc0
[2585732.529817] [] entry_SYSCALL_64_fastpath+0x16/0x71

I have encountered this behavior on Luminous, but not on Jewel. Anyone who has a clue why the write fails? As far as I'm concerned, it should always work if all the PGs are available.

Thanks
Josef
[ceph-users] RGW multisite sync issues
Hi everyone, I'm currently setting up RGW multisite (one cluster is Jewel (primary), the other is Luminous - this is only for testing; in production we will have the same version, Jewel, on both), but I can't get bucket synchronization to work. Data gets synchronized fine when I upload it, but when I delete it from the primary cluster, only the metadata of the file is deleted on the secondary one - the files are still there (I can see it in rados df - the pool stays the same size). Also, none of the older buckets start synchronizing to the secondary cluster. It's been quite a headache so far. Anyone who knows what might be wrong? I can supply any needed info.

Thanks
Josef Zelenka
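For the record, the sync state I've been looking at on the secondary - standard radosgw-admin queries (the source zone name is a placeholder):

  radosgw-admin sync status
  radosgw-admin metadata sync status
  radosgw-admin data sync status --source-zone=<primary-zone>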
Re: [ceph-users] Radosgw halts writes during recovery, recovery info issues
Forgot to mention - we are running Jewel, 10.2.10.

On 26/03/18 11:30, Josef Zelenka wrote:

Hi everyone, I'm currently fighting an issue in a cluster we have for a customer. It's used for a lot of small files (113M currently) that are pulled via radosgw. We have 3 nodes, 24 OSDs in total. The index etc. pools are migrated to a separate root called "ssd"; that root is on SSD drives only - each node has one SSD in this root. We did this because we had an issue where, if a normal OSD (an HDD) crashed, the entire rgw stopped working. Today one of the SSDs crashed, and after changing the drive, putting a new one in and starting recovery, RGW halted writes. Reads worked OK, but we couldn't upload any more files. The non-data pools all have size set to 3, so there should still be 2 healthy copies of the index data. Also, when recovery started, no recovery I/O was shown in the ceph -s output, so we checked it through df; after the SSD backfilled, ceph -s went from X degraded PGs back to OK instantly. Does anyone know how to fix this? I don't think writes should be halted during recovery.

Thanks
Josef Z
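One thing worth checking in a setup like this (the pool name is an assumption - substitute whatever your index pool is called): if min_size equals size, losing a single copy is enough to block IO on the affected PGs until they recover.

  ceph osd pool get default.rgw.buckets.index size
  ceph osd pool get default.rgw.buckets.index min_size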
Re: [ceph-users] Mapping faulty pg to file on cephfs
Oh, sorry, forgot to mention - this cluster is running jewel :(

On 13/02/18 12:10, John Spray wrote:

On Tue, Feb 13, 2018 at 10:38 AM, Josef Zelenka <josef.zele...@cloudevelops.com> wrote:

Hi everyone, one of the clusters we are running for a client recently had a power outage. It's currently in a working state; however, 3 PGs were left inconsistent, with this type of error in the log (when I attempt to ceph pg repair it):

2018-02-13 09:47:17.534912 7f3735626700 -1 log_channel(cluster) log [ERR] : repair 15.1e32 15:4c7eed31:::10002110e12.004b:head on disk size (0) does not match object info size (4194304) adjusted for ondisk to (4194304)

I know this can be fixed by truncating the on-disk object to the expected size, but it clearly means we've lost some data. This cluster is used for cephfs only, so I'd like to find which files on the cephfs were affected. I know the OSDs for that PG, and I know which PG and which object were affected, so I hope it's possible. I found a 2015 entry in the mailing list that does the reverse thing (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005384.html), as in: map a file to a pg/object. I have 230 TB of data in that cluster in a lot of files, so mapping them all would take a long time. I hope there is a way to do this; if people here have any idea or experience with this, it'd be great.

We added a tool in luminous that does this:
http://docs.ceph.com/docs/master/cephfs/disaster-recovery/#finding-files-affected-by-lost-data-pgs

John

Thanks
Josef Zelenka
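For jewel, a manual fallback that should work here (based on how cephfs names its data objects - the part of the object name before the dot is the file's inode number in hex; the mount point is a placeholder):

  # object 10002110e12.004b belongs to inode 0x10002110e12
  find /mnt/cephfs -inum $(printf '%d' 0x10002110e12)

With 230 TB the find will take a while, but it only has to walk the metadata, not the data.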
[ceph-users] Mapping faulty pg to file on cephfs
Hi everyone, one of the clusters we are running for a client recently had a power outage. It's currently in a working state; however, 3 PGs were left inconsistent, with this type of error in the log (when I attempt to ceph pg repair it):

2018-02-13 09:47:17.534912 7f3735626700 -1 log_channel(cluster) log [ERR] : repair 15.1e32 15:4c7eed31:::10002110e12.004b:head on disk size (0) does not match object info size (4194304) adjusted for ondisk to (4194304)

I know this can be fixed by truncating the on-disk object to the expected size, but it clearly means we've lost some data. This cluster is used for cephfs only, so I'd like to find which files on the cephfs were affected. I know the OSDs for that PG, and I know which PG and which object were affected, so I hope it's possible. I found a 2015 entry in the mailing list that does the reverse thing (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005384.html), as in: map a file to a pg/object. I have 230 TB of data in that cluster in a lot of files, so mapping them all would take a long time. I hope there is a way to do this; if people here have any idea or experience with this, it'd be great.

Thanks
Josef Zelenka
[ceph-users] Inconsistent PG - failed to pick suitable auth object
Hi everyone, I'm having issues with one of our clusters regarding a seemingly unfixable inconsistent PG. We are running Ubuntu 16.04, ceph 10.2.7, 96 OSDs on 8 nodes. After a power outage we had some inconsistent PGs; I managed to fix all of them but this one. Here's an excerpt from the logs (it outputs this every time I issue a ceph pg repair command):

2018-01-29 12:49:35.126066 7f09ffd1e700 -1 log_channel(cluster) log [ERR] : 3.c04 shard 44: soid 3:203d2906:::benchmark_data_mon3_3417_object685:head data_digest 0x8d3f3b5b != data_digest 0xdbdd31f0 from auth oi 3:203d2906:::benchmark_data_mon3_3417_object685:head(112873'834220 client.79854137.0:686 dirty|data_digest|omap_digest s 65536 uv 834220 dd dbdd31f0 od )
2018-01-29 12:49:35.126087 7f09ffd1e700 -1 log_channel(cluster) log [ERR] : 3.c04 shard 97: soid 3:203d2906:::benchmark_data_mon3_3417_object685:head data_digest 0x8d3f3b5b != data_digest 0xdbdd31f0 from auth oi 3:203d2906:::benchmark_data_mon3_3417_object685:head(112873'834220 client.79854137.0:686 dirty|data_digest|omap_digest s 65536 uv 834220 dd dbdd31f0 od ), attr name mismatch '_', attr name mismatch 'snapset'
2018-01-29 12:49:35.126091 7f09ffd1e700 -1 log_channel(cluster) log [ERR] : 3.c04 soid 3:203d2906:::benchmark_data_mon3_3417_object685:head: failed to pick suitable auth object
2018-01-29 12:49:35.126164 7f09ffd1e700 -1 log_channel(cluster) log [ERR] : deep-scrub 3.c04 3:203d2906:::benchmark_data_mon3_3417_object685:head no '_' attr
2018-01-29 12:49:35.126170 7f09ffd1e700 -1 log_channel(cluster) log [ERR] : deep-scrub 3.c04 3:203d2906:::benchmark_data_mon3_3417_object685:head no 'snapset' attr
2018-01-29 12:50:11.670123 7f09f3d06700 -1 log_channel(cluster) log [ERR] : 3.c04 deep-scrub 5 errors
2018-01-29 13:30:13.839317 7f596c5d2700 -1 log_channel(cluster) log [ERR] : 3.c04 shard 44: soid 3:203d2906:::benchmark_data_mon3_3417_object685:head data_digest 0x8d3f3b5b != data_digest 0xdbdd31f0 from auth oi 3:203d2906:::benchmark_data_mon3_3417_object685:head(112873'834220 client.79854137.0:686 dirty|data_digest|omap_digest s 65536 uv 834220 dd dbdd31f0 od )
2018-01-29 13:30:13.839335 7f596c5d2700 -1 log_channel(cluster) log [ERR] : 3.c04 shard 97 missing 3:203d2906:::benchmark_data_mon3_3417_object685:head
2018-01-29 13:30:13.839339 7f596c5d2700 -1 log_channel(cluster) log [ERR] : 3.c04 soid 3:203d2906:::benchmark_data_mon3_3417_object685:head: failed to pick suitable auth object
2018-01-29 13:30:52.850323 7f596c5d2700 -1 log_channel(cluster) log [ERR] : 3.c04 repair stat mismatch, got 4084/4085 objects, 0/0 clones, 4084/4084 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 16824119169/16824119169 bytes, 0/0 hit_set_archive bytes.
2018-01-29 13:30:52.850379 7f596c5d2700 -1 log_channel(cluster) log [ERR] : 3.c04 repair 3 errors, 1 fixed
2018-01-29 13:51:33.138881 7f59605ba700 -1 log_channel(cluster) log [ERR] : 3.c04 shard 44: soid 3:203d2906:::benchmark_data_mon3_3417_object685:head data_digest 0x8d3f3b5b != data_digest 0xdbdd31f0 from auth oi 3:203d2906:::benchmark_data_mon3_3417_object685:head(112873'834220 client.79854137.0:686 dirty|data_digest|omap_digest s 65536 uv 834220 dd dbdd31f0 od )
2018-01-29 13:51:33.138895 7f59605ba700 -1 log_channel(cluster) log [ERR] : 3.c04 shard 97 missing 3:203d2906:::benchmark_data_mon3_3417_object685:head
2018-01-29 13:51:33.138898 7f59605ba700 -1 log_channel(cluster) log [ERR] : 3.c04 soid 3:203d2906:::benchmark_data_mon3_3417_object685:head: failed to pick suitable auth object

When I try to find info about the object itself, I get this (after a deep scrub):

rados list-inconsistent-obj 3.c04 --format=json-pretty
{
    "epoch": 114466,
    "inconsistents": []
}

I tried deleting the object from the primary and repairing, truncating the object to the same size on both primary and secondary, and even copying the identical object from the secondary to the primary, but nothing seems to work. Any pointers regarding this?

Thanks
Josef Zelenka
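One more thing I'm considering - removing the bad shard with the OSD stopped and letting repair re-copy it from the surviving replica. A sketch only, with my osd/pg/object substituted in, and I'm not certain it's safe, so treat it as a last resort:

  systemctl stop ceph-osd@44
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-44 \
      --journal-path /var/lib/ceph/osd/ceph-44/journal \
      --pgid 3.c04 benchmark_data_mon3_3417_object685 remove
  systemctl start ceph-osd@44
  ceph pg repair 3.c04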
Re: [ceph-users] how to get bucket or object's ACL?
Hi, this should be possible via the s3cmd tool:

  s3cmd info s3://<bucket>[/<object>]

for example:

  s3cmd info s3://PP-2015-Tut/

Here is more info: https://kunallillaney.github.io/s3cmd-tutorial/ - I have successfully used this tool in the past for ACL management, so I hope it's going to work for you too.

JZ

On 29/01/18 11:23, 13605702...@163.com wrote:

hi

how to get the bucket or object's ACL in command line?

thanks

13605702...@163.com
Re: [ceph-users] Cluster crash - FAILED assert(interval.last > last)
I have posted logs/strace from our OSDs with details to a ticket in the Ceph bug tracker - see http://tracker.ceph.com/issues/21142. You can see where exactly the OSDs crash etc.; this may be of help if someone decides to debug it.

JZ

On 10/01/18 22:05, Josef Zelenka wrote:

Hi, today we had a disastrous crash - we are running a 3-node, 24-OSD cluster (8 per node) with SSDs for the blockdb and HDDs for the bluestore data. This cluster is used as a radosgw backend, storing a big number of thumbnails for a file-hosting site - around 110M files in total. We were adding an interface to the nodes, which required a restart, but after restarting one of the nodes, a lot of the OSDs were kicked out of the cluster and rgw stopped working. We have a lot of PGs down and unfound at the moment. The OSDs can't be started (aside from some - that's a mystery) due to this error - FAILED assert(interval.last > last) - they just periodically restart. So far the cluster is broken and we can't seem to bring it back up. We tried fscking the OSDs via the ceph-objectstore-tool, but it was no good. The root of all this seems to be the FAILED assert(interval.last > last) error, but I can't find any info on it or how to fix it. Has anyone else encountered it? We're running Luminous on Ubuntu 16.04.

Thanks
Josef Zelenka
Cloudevelops
Re: [ceph-users] How to speed up backfill
Hi, our recovery slowed down significantly towards the end; however, it was still about five times faster than the original speed. We suspect this is somehow caused by threading (more objects transferred - more threads used), but this is only an assumption.

On 11/01/18 05:02, shadow_lin wrote:

Hi,
I had tried these two methods, and for backfilling it seems only osd-max-backfills has an effect. How was your recovery speed when it came to the last few PGs or objects?

2018-01-11
shadow_lin

From: Josef Zelenka <josef.zele...@cloudevelops.com>
Sent: 2018-01-11 04:53
Subject: Re: [ceph-users] How to speed up backfill
To: "shadow_lin" <shadow_...@163.com>
Cc:

Hi, I had the same issue a few days back. I tried playing around with these two:

ceph tell 'osd.*' injectargs '--osd-max-backfills <n>'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active <n>'

and it helped greatly (increased our recovery speed 20x), but be careful not to overload your systems.

On 10/01/18 17:50, shadow_lin wrote:

Hi all,
I am playing with the settings for backfill to try to find out how to control its speed. So far I have only found that "osd max backfills" affects the backfill speed. But once all PGs that need backfilling have begun, I can't find any way to speed it up further - especially when it comes to the last PG to recover, where the speed is only a few MB/s (when multiple PGs are being backfilled, the speed can be more than 600 MB/s in my test).

I am a little confused about the backfill and recovery settings. Backfilling is a kind of recovery, but the recovery settings seem to be only about replaying PG logs to recover a PG. Would changing "osd recovery max active" or other recovery settings have any effect on backfilling? I tried "osd recovery op priority" and "osd recovery max active" with no luck.

Any advice would be greatly appreciated. Thanks.

2018-01-11
lin.yunfan
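For concreteness, an example invocation with made-up values (tune to your hardware - setting these too high can saturate your disks and starve client IO):

  ceph tell 'osd.*' injectargs '--osd-max-backfills 8'
  ceph tell 'osd.*' injectargs '--osd-recovery-max-active 8'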
[ceph-users] Cluster crash - FAILED assert(interval.last > last)
Hi, today we had a disastrous crash - we are running a 3-node, 24-OSD cluster (8 per node) with SSDs for the blockdb and HDDs for the bluestore data. This cluster is used as a radosgw backend, storing a big number of thumbnails for a file-hosting site - around 110M files in total. We were adding an interface to the nodes, which required a restart, but after restarting one of the nodes, a lot of the OSDs were kicked out of the cluster and rgw stopped working. We have a lot of PGs down and unfound at the moment. The OSDs can't be started (aside from some - that's a mystery) due to this error - FAILED assert(interval.last > last) - they just periodically restart. So far the cluster is broken and we can't seem to bring it back up. We tried fscking the OSDs via the ceph-objectstore-tool, but it was no good. The root of all this seems to be the FAILED assert(interval.last > last) error, but I can't find any info on it or how to fix it. Has anyone else encountered it? We're running Luminous on Ubuntu 16.04.

Thanks
Josef Zelenka
Cloudevelops
[ceph-users] determining the source of io in the cluster
Hi everyone, we have recently deployed a Luminous (12.2.1) cluster on Ubuntu - three OSD nodes and three monitors; every OSD node has 3x 2TB SSD plus an NVMe drive for the blockdb. We use it as a backend for our OpenStack cluster, so we store volumes there. In the last few days, the read ops/s rose to a constant 10k-25k (it fluctuates between those two) and doesn't seem to go down. I can see that the IO/read ops come from the pool where we store VM volumes, but I can't trace the issue to a particular volume. Is that even possible? Any experience with debugging this? Any info or advice is greatly appreciated. Thanks

Josef Zelenka
Cloudevelops
[ceph-users] A new SSD for journals - everything sucks?
Hello everyone, lately we've had issues with buying the SSDs we use for journaling (Kingston stopped making them - the Kingston V300), so we decided to switch to a different model and started researching which one would be the best price/value for us. We compared five models to check whether they fit our needs: SSDNow V300, HyperX Fury, SSDNow KC400, SSDNow UV400 and SSDNow A400. The best one is still the V300, with the highest iops at 59,001. Second best, and still usable, was the HyperX Fury with 45,000 iops. The other three had terrible results; the max iops we got were around 13,000 with the dsync and direct flags. We also tested Samsung SSDs (the EVO series) and got similarly bad results. To get to the root of my question: I am pretty sure we are not the only ones affected by the V300's death. Is there anyone else out there with benchmarking data/knowledge about good price/performance SSDs for Ceph journaling? I can also share the complete benchmarking data my coworker collected if someone is interested.

Thanks
Josef Zelenka
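For anyone who wants to compare numbers, a typical single-threaded sync-write test of this kind looks like the following - a sketch, not necessarily the exact command we ran (/dev/sdX is a placeholder, and the test writes to the raw device, so it destroys data on it):

  fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

The --direct=1/--sync=1 pair corresponds to the direct and dsync flags mentioned above; it's what separates journal-worthy SSDs from the rest.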
Re: [ceph-users] Large amount of files - cephfs?
Hi everyone, thanks for the advice. We discussed it and we're going to test it out with cephfs first; object storage is a possibility if it misbehaves. Hopefully it will go well :)

On 28/09/17 08:20, Henrik Korkuc wrote:

On 17-09-27 14:57, Josef Zelenka wrote:

Hi, we are currently working on a Ceph solution for one of our customers. They run a file hosting service and need to store approximately 100 million pictures (thumbnails). Their current code works with FTP, which they use as storage. We thought we could use cephfs for this, but I am not sure how it would behave with that many files, how the performance would be affected, etc. Is cephfs usable in this scenario, or would radosgw+swift be better (they'd likely have to rewrite some of the code, so we'd prefer not to do this)? We already have some experience with cephfs for storing bigger files, streaming etc., so I'm not completely new to this, but I thought it'd be better to ask more experienced users. Some advice on this would be greatly appreciated. Thanks,

Josef

Depending on your OSD count, you should be able to put 100M files there. As others mentioned, depending on your workload, metadata may be a bottleneck. If metadata is not a concern, then you just need enough OSDs to distribute the RADOS objects. You should be fine with a few million objects per OSD; going with tens of millions per OSD may be more problematic, as you get larger memory usage, slower OSDs, and slow backfill/recovery.
[ceph-users] Large amount of files - cephfs?
Hi, we are currently working on a Ceph solution for one of our customers. They run a file hosting service and need to store approximately 100 million pictures (thumbnails). Their current code works with FTP, which they use as storage. We thought we could use cephfs for this, but I am not sure how it would behave with that many files, how the performance would be affected, etc. Is cephfs usable in this scenario, or would radosgw+swift be better (they'd likely have to rewrite some of the code, so we'd prefer not to do this)? We already have some experience with cephfs for storing bigger files, streaming etc., so I'm not completely new to this, but I thought it'd be better to ask more experienced users. Some advice on this would be greatly appreciated. Thanks,

Josef
[ceph-users] RADOSGW S3 api ACLs
Hello everyone, I've been struggling for the past few days with setting up ACLs for buckets on my radosgw. I want to use the buckets with the S3 API, and I want them to have ACLs set up like this: every file that gets pushed into the bucket is automatically readable by everyone and writable only by a specific user. Currently I am able to set the ACLs I want on existing files, but I want a setup that applies them automatically, i.e. to the entire bucket. Can anyone shed some light on ACLs in the S3 API and RGW?

Thanks
Josef Zelenka
Cloudevelops
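The closest I've got so far is doing it per object from the client side with s3cmd, rather than as a bucket-wide default (bucket and file names are placeholders):

  # upload new objects with a public-read ACL:
  s3cmd put --acl-public thumbnail.jpg s3://mybucket/
  # or fix up everything already in the bucket:
  s3cmd setacl --acl-public --recursive s3://mybucket/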