Re: [ceph-users] Upgrading ceph and mapped rbds
I had several kernel-mapped rbds as well as ceph-fuse mounted CephFS clients when I upgraded from Jewel to Luminous. I rolled out the client upgrades over a few weeks after the upgrade. I had tested that my client use cases would be fine running Jewel against a Luminous cluster, so there were no surprises for me when I did it in production.

On Tue, Apr 3, 2018, 11:21 PM Konstantin Shalygin wrote:
> > The VMs are XenServer VMs with virtual disks saved on the NFS server
> > which has the RBD mounted ... So there is no migration from my POV, as there
> > is no second storage to migrate to ...
>
> All your pain is self-inflicted.
>
> Just FYI, clients are not interrupted when you upgrade ceph. Clients will
> be interrupted only when you change settings, e.g. if you (suddenly) change crush
> tunables or a minimum required version (for this reason clients
> must be upgraded before the cluster).
>
> k
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
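A handy pre-flight check for the "clients before cluster" rule above: on Luminous and later, the ceph features command reports what every connected daemon and client supports. A minimal sketch of summarizing its JSON output follows; the exact JSON shape used here is an assumption from memory, so verify it against ceph features -f json on your own cluster.

```python
import json

def client_releases(features_json):
    """Collect the release names reported for connected clients.

    Assumes the `ceph features` JSON groups entries per daemon type,
    each group carrying 'features', 'release' and 'num' fields.
    """
    data = json.loads(features_json)
    return {g["release"]: g["num"] for g in data.get("client", [])}

# Hypothetical sample in the assumed shape:
sample = json.dumps({
    "mon": [{"features": "0x3ffddff8eea4fffb", "release": "luminous", "num": 3}],
    "client": [
        {"features": "0x7fddff8ee84bffb", "release": "jewel", "num": 5},
        {"features": "0x3ffddff8eea4fffb", "release": "luminous", "num": 12},
    ],
})
releases = client_releases(sample)
```

If any group still reports jewel, old kernel clients or librbd consumers are still attached and should be upgraded before raising tunables.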
Re: [ceph-users] amount of PGs/pools/OSDs for your openstack / Ceph
The general recommendation is to target around 100 PGs/OSD. Have you tried the https://ceph.com/pgcalc/ tool?

On Wed, 4 Apr 2018 at 21:38, Osama Hasebou wrote:
> Hi Everyone,
>
> I would like to know what kind of setup the Ceph community has been using
> for their OpenStack Ceph configuration when it comes to the number of pools,
> OSDs, and their PGs.
>
> The Ceph documentation mentions this only briefly for small clusters, and I would
> like to know from your experience how many PGs you have created in practice for
> your OpenStack pools, for a ceph cluster in the 1-2 PB capacity or
> 400-600 OSD range that performs well without issues.
>
> Hope to hear from you!
>
> Thanks.
>
> Regards,
> Ossi
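For reference, the arithmetic behind the pgcalc tool boils down to one formula: total PGs per pool ≈ (OSDs × target-PGs-per-OSD) / replica size, snapped to a power of two. A sketch under that assumption (the 25% rounding tolerance mirrors how pgcalc picks between the two nearest powers of two, as I recall it; double-check against the tool itself):

```python
import math

def pg_count(osds, pool_size, target_per_osd=100):
    """Rule-of-thumb PG count for a single pool spanning all OSDs."""
    raw = osds * target_per_osd / pool_size
    lower = 2 ** math.floor(math.log2(raw))
    # Accept the lower power of two unless it is more than 25% below
    # the raw value; otherwise go one power higher.
    return lower if lower >= raw * 0.75 else lower * 2

# e.g. a 3-replica pool spread over ~500 OSDs, targeting 100 PGs/OSD.
pgs = pg_count(500, 3)
```

Note that when a cluster hosts several pools (as OpenStack setups do), the per-OSD budget has to be shared between them, which is exactly what pgcalc's percent-of-data column handles.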
Re: [ceph-users] bluestore OSD did not start at system-boot
On Thu, Apr 5, 2018 at 6:33 AM, Ansgar Jazdzewski wrote:
> hi folks,
>
> i just figured out that my OSDs did not start because the filesystem
> is not mounted.

Would love to see some ceph-volume logs (both ceph-volume.log and ceph-volume-systemd.log), because we do try several times with timeouts before giving up. If the filesystem is not available, the systemd units should keep trying for a while.

> So I wrote a script to hack my way around it:
> #
> #! /usr/bin/env bash
>
> DATA=( $(ceph-volume lvm list | grep -e 'osd id\|osd fsid' | awk '{print $3}' | tr '\n' ' ') )
>
> OSDS=$(( ${#DATA[@]} / 2 ))
>
> for OSD in $(seq 0 $(( OSDS - 1 ))); do
>     ceph-volume lvm activate "${DATA[OSD * 2]}" "${DATA[OSD * 2 + 1]}"
> done
> #
>
> I'm sure that this is not the way it should be!? So any help is
> welcome to figure out why my BlueStore OSDs are not mounted at
> boot time.
>
> Thanks,
> Ansgar
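For what it's worth, newer ceph-volume releases also ship ceph-volume lvm activate --all, which activates every OSD it can discover in one go (worth checking whether your version has it). The id/fsid pairing the script above relies on can be sketched in Python; the sample listing below assumes the "osd id" / "osd fsid" line format that the script's grep targets:

```python
import re

def osd_pairs(listing):
    """Pair up 'osd id' and 'osd fsid' values from `ceph-volume lvm list` text output."""
    ids = re.findall(r"osd id\s+(\S+)", listing)
    fsids = re.findall(r"osd fsid\s+(\S+)", listing)
    if len(ids) != len(fsids):
        raise ValueError("unpaired osd id / osd fsid lines")
    return list(zip(ids, fsids))

# Abbreviated sample in the assumed layout (fsids taken from this digest):
sample = """
====== osd.0 =======
  osd id        0
  osd fsid      c349b2ba-690f-4a36-b6f6-2cc0d0839f29
====== osd.1 =======
  osd id        1
  osd fsid      9d7a103a-f590-4842-bd3d-e9da27c3fb09
"""
pairs = osd_pairs(sample)
# Each pair is what `ceph-volume lvm activate <id> <fsid>` expects.
```

The advantage over the awk/tr one-liner is that mismatched counts fail loudly instead of silently shifting every argument pair by one.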
Re: [ceph-users] how the files in /var/lib/ceph/osd/ceph-0 are generated
On Fri, Apr 6, 2018 at 10:27 PM, Jeffrey Zhang wrote:
> Yes, I am using ceph-volume.
>
> And I found where the keyring comes from.
>
> bluestore saves all this information at the start of the disk
> (BDEV_LABEL_BLOCK_SIZE=4096);
> this area is used for saving labels, including the keyring, whoami, etc.

Correct, this is documented here: http://docs.ceph.com/docs/master/ceph-volume/lvm/activate/#summary (see step 4): recreate all the files needed with ceph-bluestore-tool prime-osd-dir by pointing it at the OSD block device.

> These can be read through ceph-bluestore-tool show-label:
>
> $ ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0
> {
>     "/var/lib/ceph/osd/ceph-0/block": {
>         "osd_uuid": "c349b2ba-690f-4a36-b6f6-2cc0d0839f29",
>         "size": 2147483648,
>         "btime": "2018-04-04 10:22:25.216117",
>         "description": "main",
>         "bluefs": "1",
>         "ceph_fsid": "14941be9-c327-4a17-8b86-be50ee2f962e",
>         "kv_backend": "rocksdb",
>         "magic": "ceph osd volume v026",
>         "mkfs_done": "yes",
>         "osd_key": "AQDgNsRaVtsRIBAA6pmOf7y2GBufyE83nHwVvg==",
>         "ready": "ready",
>         "whoami": "0"
>     }
> }
>
> So while mounting /var/lib/ceph/osd/ceph-0, ceph dumps this content into
> the tmpfs folder.
>
> On Fri, Apr 6, 2018 at 10:21 PM, David Turner wrote:
>>
>> Likely the differences you're seeing between /dev/sdb1 and tmpfs have to do
>> with how ceph-disk vs ceph-volume manage the OSDs and what their defaults
>> are. ceph-disk will create partitions on devices, while ceph-volume
>> configures LVM on the block device. Also, with bluestore you do not have a
>> standard filesystem, so ceph-volume creates a mock folder with the
>> necessary information in /var/lib/ceph/osd/ceph-0 to track the information
>> for the OSD and how to start it.
>>
>> On Wed, Apr 4, 2018 at 6:20 PM Gregory Farnum wrote:
>>>
>>> On Tue, Apr 3, 2018 at 6:30 PM Jeffrey Zhang wrote:
>>>>
>>>> I am testing ceph Luminous; the environment is:
>>>> - centos 7.4
>>>> - ceph luminous (ceph official repo)
>>>> - ceph-deploy 2.0
>>>> - bluestore + separate wal and db
>>>>
>>>> I found the ceph osd folder `/var/lib/ceph/osd/ceph-0` is mounted from
>>>> tmpfs. But where do the files in that folder come from, like `keyring`
>>>> and `whoami`?
>>>
>>> These are generated as part of the initialization process. I don't know
>>> the exact commands involved, but the keyring for instance will draw from the
>>> results of "ceph osd new" (which is invoked by one of the ceph-volume setup
>>> commands). That and whoami are part of the basic information an OSD needs to
>>> communicate with a monitor.
>>> -Greg
>>>
>>>> $ ls -alh /var/lib/ceph/osd/ceph-0/
>>>> lrwxrwxrwx. 1 ceph ceph 24 Apr 3 16:49 block -> /dev/ceph-pool/osd0.data
>>>> lrwxrwxrwx. 1 root root 22 Apr 3 16:49 block.db -> /dev/ceph-pool/osd0-db
>>>> lrwxrwxrwx. 1 root root 23 Apr 3 16:49 block.wal -> /dev/ceph-pool/osd0-wal
>>>> -rw-------. 1 ceph ceph 37 Apr 3 16:49 ceph_fsid
>>>> -rw-------. 1 ceph ceph 37 Apr 3 16:49 fsid
>>>> -rw-------. 1 ceph ceph 55 Apr 3 16:49 keyring
>>>> -rw-------. 1 ceph ceph  6 Apr 3 16:49 ready
>>>> -rw-------. 1 ceph ceph 10 Apr 3 16:49 type
>>>> -rw-------. 1 ceph ceph  2 Apr 3 16:49 whoami
>>>>
>>>> I guess they may be loaded from bluestore, but I can not find any clue
>>>> for this.
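Since the show-label output is plain JSON, the effect of ceph-bluestore-tool prime-osd-dir (step 4 in the activate doc linked in this thread) can be roughly sketched. This is only an illustration of how the tmpfs files map onto label fields, not the real implementation; the label values are the ones quoted in this thread:

```python
import json
import pathlib
import tempfile

def prime_osd_dir(show_label_json, osd_dir):
    """Sketch: recreate the tmpfs files of an OSD dir from a bluestore label."""
    label = next(iter(json.loads(show_label_json).values()))  # one block device
    files = {
        "whoami": label["whoami"],
        "ceph_fsid": label["ceph_fsid"],
        "fsid": label["osd_uuid"],
        "keyring": "[osd.%s]\n\tkey = %s\n" % (label["whoami"], label["osd_key"]),
    }
    d = pathlib.Path(osd_dir)
    d.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (d / name).write_text(content)
    return files

label_json = """{ "/var/lib/ceph/osd/ceph-0/block": {
    "osd_uuid": "c349b2ba-690f-4a36-b6f6-2cc0d0839f29",
    "ceph_fsid": "14941be9-c327-4a17-8b86-be50ee2f962e",
    "osd_key": "AQDgNsRaVtsRIBAA6pmOf7y2GBufyE83nHwVvg==",
    "whoami": "0" } }"""

files = prime_osd_dir(label_json, tempfile.mkdtemp())
```

This also explains why the directory can safely live on tmpfs: everything in it is derivable from the label at the start of the block device.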
Re: [ceph-users] Fwd: Separate --block.wal --block.db bluestore not working as expected.
On Sat, Apr 7, 2018 at 11:59 AM, Gary Verhulp wrote:
>
> I’m trying to create bluestore osds with separate --block.wal --block.db
> devices on a write-intensive SSD.
>
> I’ve split the SSD (/dev/sda) into two partitions, sda1 and sda2, for db and
> wal.
>
> It seems to me the osd uuid is getting changed and I’m only able to start the
> last OSD.
>
> Do I need to create a new partition or logical volume on the SSD for each
> OSD?

Correct! This is what is needed for each OSD. You are re-using the same partitions for the other OSD, which is why you are getting the following message:

2018-04-06 19:45:43.730515 7fe91a9cfd00 -1 bluestore(/dev/sda1) _check_or_set_bdev_label bdev /dev/sda1 fsid eb6cbcb3-f644-4973-b745-0e4389ef719c does not match our fsid 9d7a103a-f590-4842-bd3d-e9da27c3fb09

> I’m sure this is a simple fail in my understanding of how it is supposed to
> be provisioned.
> Any advice would be appreciated.
>
> Thanks,
> Gary
>
> [root@osdhost osd]# ceph-volume lvm prepare --bluestore --data /dev/sdc --block.wal /dev/sda2 --block.db /dev/sda1
> Running command: sudo vgcreate --force --yes ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034 /dev/sdc
>  stdout: Physical volume "/dev/sdc" successfully created.
>  stdout: Volume group "ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034" successfully created
> Running command: sudo lvcreate --yes -l 100%FREE -n osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09 ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034
>  stdout: Logical volume "osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09" created.
> Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
> Running command: chown -R ceph:ceph /dev/dm-2
> Running command: sudo ln -s /dev/ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034/osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09 /var/lib/ceph/osd/ceph-1/block
> Running command: sudo ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1/activate.monmap
>  stderr: got monmap epoch 1
> Running command: ceph-authtool /var/lib/ceph/osd/ceph-1/keyring --create-keyring --name osd.1 --add-key AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g==
>  stdout: creating /var/lib/ceph/osd/ceph-1/keyring
>  stdout: added entity osd.1 auth auth(auid = 18446744073709551615 key=AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g== with 0 caps)
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/
> Running command: chown -R ceph:ceph /dev/sda2
> Running command: chown -R ceph:ceph /dev/sda1
> Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --key --bluestore-block-wal-path /dev/sda2 --bluestore-block-db-path /dev/sda1 --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 9d7a103a-f590-4842-bd3d-e9da27c3fb09 --setuser ceph --setgroup ceph
>  stderr: 2018-04-06 19:41:44.519662 7f734f2e4d00 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode label at offset 102: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
>  stderr: 2018-04-06 19:41:44.520939 7f734f2e4d00 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode label at offset 102: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
>  stderr: 2018-04-06 19:41:44.521190 7f734f2e4d00 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode label at offset 102: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
>  stderr: 2018-04-06 19:41:44.521454 7f734f2e4d00 -1 bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
>  stderr: 2018-04-06 19:41:47.307648 7f734f2e4d00 -1 key AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g==
>  stderr: 2018-04-06 19:41:48.068161 7f734f2e4d00 -1 created object store /var/lib/ceph/osd/ceph-1/ for osd.1 fsid 1ff50434-64ad-42bd-9a70-1968e4a9a813
>
> [root@osdhost osd]# ceph-bluestore-tool show-label --dev /dev/sda1
> {
>     "/dev/sda1": {
>         "osd_uuid": "9d7a103a-f590-4842-bd3d-e9da27c3fb09",
>         "size": 200043171840,
>         "btime": "2018-04-06 19:41:44.523894",
>         "description": "bluefs db"
>     }
> }
>
> [root@osdhost osd]# ceph-volume lvm prepare --bluestore --data /dev/sdd --block.wal /dev/sda2 --block.db /dev/sda1
> Running command: sudo vgcreate --force --yes ceph-cc91203d-de5c-4d27-8c48-a58663075e67 /dev/sdd
>  stdout: Physical volume "/dev/sdd" successfully created.
>  stdout: Volume group
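To sketch the fix: give every OSD its own logical volume on the SSD instead of re-using sda1/sda2. The volume group name and the sizes below are hypothetical placeholders; only the one-db-and-one-wal-per-OSD structure is the point (ceph-volume accepts vg/lv notation for --block.db and --block.wal):

```python
def db_wal_plan(ssd_vg, osd_devices, db_size="30G", wal_size="2G"):
    """Generate per-OSD lvcreate + ceph-volume commands (names/sizes hypothetical)."""
    cmds = []
    for i, dev in enumerate(osd_devices):
        # One dedicated db LV and one dedicated wal LV per data device.
        cmds.append("lvcreate -L %s -n db-%d %s" % (db_size, i, ssd_vg))
        cmds.append("lvcreate -L %s -n wal-%d %s" % (wal_size, i, ssd_vg))
        cmds.append("ceph-volume lvm prepare --bluestore --data %s "
                    "--block.db %s/db-%d --block.wal %s/wal-%d"
                    % (dev, ssd_vg, i, ssd_vg, i))
    return cmds

plan = db_wal_plan("ceph-ssd", ["/dev/sdc", "/dev/sdd"])
```

Each prepare run then labels its own db/wal volumes, so the second OSD no longer overwrites the fsid of the first.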
Re: [ceph-users] Ceph scrub logs: _scan_snaps no head for $object?
How do you resolve these issues?

Apr 7 22:39:21 c03 ceph-osd: 2018-04-07 22:39:21.928484 7f0826524700 -1 osd.13 pg_epoch: 19008 pg[17.13( v 19008'6019891 (19008'6018375,19008'6019891] local-lis/les=18980/18981 n=3825 ec=3636/3636 lis/c 18980/18980 les/c/f 18981/18982/0 18980/18980/18903) [4,13,0] r=1 lpr=18980 luod=0'0 crt=19008'6019891 lcod 19008'6019890 active] _scan_snaps no head for 17:cbf61056:::rbd_data.239f5274b0dc51.0ff2:15 (have MIN)
[ceph-users] Fwd: Separate --block.wal --block.db bluestore not working as expected.
I’m trying to create bluestore osds with separate --block.wal --block.db devices on a write-intensive SSD. I’ve split the SSD (/dev/sda) into two partitions, sda1 and sda2, for db and wal. It seems to me the osd uuid is getting changed and I’m only able to start the last OSD. Do I need to create a new partition or logical volume on the SSD for each OSD? I’m sure this is a simple fail in my understanding of how it is supposed to be provisioned. Any advice would be appreciated.

Thanks,
Gary

[root@osdhost osd]# ceph-volume lvm prepare --bluestore --data /dev/sdc --block.wal /dev/sda2 --block.db /dev/sda1
Running command: sudo vgcreate --force --yes ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034 /dev/sdc
 stdout: Physical volume "/dev/sdc" successfully created.
 stdout: Volume group "ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034" successfully created
Running command: sudo lvcreate --yes -l 100%FREE -n osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09 ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034
 stdout: Logical volume "osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09" created.
Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
Running command: chown -R ceph:ceph /dev/dm-2
Running command: sudo ln -s /dev/ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034/osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09 /var/lib/ceph/osd/ceph-1/block
Running command: sudo ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-1/activate.monmap
 stderr: got monmap epoch 1
Running command: ceph-authtool /var/lib/ceph/osd/ceph-1/keyring --create-keyring --name osd.1 --add-key AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g==
 stdout: creating /var/lib/ceph/osd/ceph-1/keyring
 stdout: added entity osd.1 auth auth(auid = 18446744073709551615 key=AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g== with 0 caps)
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/
Running command: chown -R ceph:ceph /dev/sda2
Running command: chown -R ceph:ceph /dev/sda1
Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --key --bluestore-block-wal-path /dev/sda2 --bluestore-block-db-path /dev/sda1 --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 9d7a103a-f590-4842-bd3d-e9da27c3fb09 --setuser ceph --setgroup ceph
 stderr: 2018-04-06 19:41:44.519662 7f734f2e4d00 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode label at offset 102: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
 stderr: 2018-04-06 19:41:44.520939 7f734f2e4d00 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode label at offset 102: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
 stderr: 2018-04-06 19:41:44.521190 7f734f2e4d00 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode label at offset 102: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding
 stderr: 2018-04-06 19:41:44.521454 7f734f2e4d00 -1 bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
 stderr: 2018-04-06 19:41:47.307648 7f734f2e4d00 -1 key AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g==
 stderr: 2018-04-06 19:41:48.068161 7f734f2e4d00 -1 created object store /var/lib/ceph/osd/ceph-1/ for osd.1 fsid 1ff50434-64ad-42bd-9a70-1968e4a9a813

[root@osdhost osd]# ceph-bluestore-tool show-label --dev /dev/sda1
{
    "/dev/sda1": {
        "osd_uuid": "9d7a103a-f590-4842-bd3d-e9da27c3fb09",
        "size": 200043171840,
        "btime": "2018-04-06 19:41:44.523894",
        "description": "bluefs db"
    }
}

[root@osdhost osd]# ceph-volume lvm prepare --bluestore --data /dev/sdd --block.wal /dev/sda2 --block.db /dev/sda1
Running command: sudo vgcreate --force --yes ceph-cc91203d-de5c-4d27-8c48-a58663075e67 /dev/sdd
 stdout: Physical volume "/dev/sdd" successfully created.
 stdout: Volume group "ceph-cc91203d-de5c-4d27-8c48-a58663075e67" successfully created
Running command: sudo lvcreate --yes -l 100%FREE -n osd-block-eb6cbcb3-f644-4973-b745-0e4389ef719c ceph-cc91203d-de5c-4d27-8c48-a58663075e67
 stdout: Logical volume "osd-block-eb6cbcb3-f644-4973-b745-0e4389ef719c" created.
Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-6
Running command: chown -R ceph:ceph /dev/dm-8
Running command: sudo ln -s /dev/ceph-cc91203d-de5c-4d27-8c48-a58663075e67/osd-block-eb6cbcb3-f644-4973-b745-0e4389ef719c /var/lib/ceph/osd/ceph-6/block
Running command: sudo ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o
Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority
ok, now I understand. Thanks for all these helpful answers!

On Sat, Apr 7, 2018, 15:26 David Turner wrote:
> I'm seconding what Greg is saying. There is no reason to set nobackfill
> and norecover just for restarting OSDs. That will only cause the problems
> you're seeing without giving you any benefit. There are reasons to use
> norecover and nobackfill, but unless you're manually editing the crush map,
> having osds consistently segfault, or for some other reason really just
> need to stop the io from recovery, they aren't the flags for you. Even
> at that, nobackfill is most likely what you need and norecover is still
> probably not helpful.
>
> On Wed, Apr 4, 2018, 6:59 PM Gregory Farnum wrote:
>
>> On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski wrote:
>>
>>> Greg, thanks for your reply!
>>>
>>> I think your idea makes sense. I've done some tests and it's quite hard
>>> for me to understand; I'll try to explain my situation in a few steps
>>> below.
>>> I think that ceph shows progress in recovery, but it can only handle
>>> objects which haven't really changed. It won't try to repair objects
>>> which are really degraded, because of the norecover flag. Am I right?
>>> After a while I see blocked requests (as you can see below).
>>
>> Yeah, so the implementation of this is a bit funky. Basically, when the
>> OSD gets a map specifying norecovery, it will prevent any new recovery ops
>> from starting once it processes that map. But it doesn't change the state
>> of the PGs out of recovery; they just won't queue up more work.
>>
>> So probably the existing recovery IO was from OSDs that weren't
>> up-to-date yet. Or maybe there's a bug in the norecover implementation; it
>> definitely looks a bit fragile.
>>
>> But really I just wouldn't use that command. It's an expert flag that you
>> shouldn't use except in some extreme wonky cluster situations (and even
>> those may no longer exist in modern Ceph).
>> For the use case you shared in your first email, I'd just stick with noout.
>> -Greg
>>
>>> - FEW SEC AFTER START OSD -
>>> # ceph status
>>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>      health HEALTH_WARN
>>>             140 pgs degraded
>>>             1 pgs recovering
>>>             92 pgs recovery_wait
>>>             140 pgs stuck unclean
>>>             recovery 942/5772119 objects degraded (0.016%)
>>>             noout,nobackfill,norecover flag(s) set
>>>      monmap e10: 3 mons at
>>> {node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>>             election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>      osdmap e18727: 36 osds: 36 up, 30 in
>>>             flags noout,nobackfill,norecover
>>>       pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
>>>             25204 GB used, 17124 GB / 42329 GB avail
>>>             942/5772119 objects degraded (0.016%)
>>>                 1332 active+clean
>>>                   92 active+recovery_wait+degraded
>>>                   47 active+degraded
>>>                    1 active+recovering+degraded
>>> recovery io 31608 kB/s, 4 objects/s
>>> client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s
>>>
>>> - 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -
>>> # ceph status
>>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>      health HEALTH_WARN
>>>             140 pgs degraded
>>>             1 pgs recovering
>>>             109 pgs recovery_wait
>>>             140 pgs stuck unclean
>>>             80 requests are blocked > 32 sec
>>>             recovery 847/5775929 objects degraded (0.015%)
>>>             noout,nobackfill,norecover flag(s) set
>>>      monmap e10: 3 mons at
>>> {node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>>             election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>      osdmap e18727: 36 osds: 36 up, 30 in
>>>             flags noout,nobackfill,norecover
>>>       pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
>>>             25234 GB used, 17094 GB / 42329 GB avail
>>>             847/5775929 objects degraded (0.015%)
>>>                 1332 active+clean
>>>                  109 active+recovery_wait+degraded
>>>                   30 active+degraded  <-- degraded objects count got stuck
>>>                    1
active+recovering+degraded
>>> recovery io 3743 kB/s, 0 objects/s  <-- depending on when the command runs,
>>> this line shows 0 objects/s or is absent
>>> client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s
>>>
>>> - FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVERY, NOBACKFILL -
>>> # ceph status
>>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>      health HEALTH_WARN
>>>             134 pgs degraded
>>>             134 pgs recovery_wait
>>>             134 pgs stuck degraded
>>>             134 pgs stuck
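As a side note, when watching whether recovery is actually progressing (as opposed to the stuck counts in the dumps above), the "recovery N/M objects degraded (P%)" line is easy to track programmatically; a small sketch using a line from this thread:

```python
import re

def degraded_ratio(status_text):
    """Extract (degraded, total, percent) from a `ceph status` recovery line."""
    m = re.search(r"recovery (\d+)/(\d+) objects degraded", status_text)
    if not m:
        return None
    degraded, total = int(m.group(1)), int(m.group(2))
    return degraded, total, 100.0 * degraded / total

line = "recovery 942/5772119 objects degraded (0.016%)"
d, t, pct = degraded_ratio(line)
```

Polling this in a loop makes it obvious whether the degraded count is falling (recovery running) or frozen (the norecover situation described above).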
Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority
I'm seconding what Greg is saying. There is no reason to set nobackfill and norecover just for restarting OSDs. That will only cause the problems you're seeing without giving you any benefit. There are reasons to use norecover and nobackfill, but unless you're manually editing the crush map, having osds consistently segfault, or for some other reason really just need to stop the io from recovery, they aren't the flags for you. Even at that, nobackfill is most likely what you need and norecover is still probably not helpful.

On Wed, Apr 4, 2018, 6:59 PM Gregory Farnum wrote:
> On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski wrote:
>
>> Greg, thanks for your reply!
>>
>> I think your idea makes sense. I've done some tests and it's quite hard
>> for me to understand; I'll try to explain my situation in a few steps
>> below.
>> I think that ceph shows progress in recovery, but it can only handle
>> objects which haven't really changed. It won't try to repair objects
>> which are really degraded, because of the norecover flag. Am I right?
>> After a while I see blocked requests (as you can see below).
>
> Yeah, so the implementation of this is a bit funky. Basically, when the
> OSD gets a map specifying norecovery, it will prevent any new recovery ops
> from starting once it processes that map. But it doesn't change the state
> of the PGs out of recovery; they just won't queue up more work.
>
> So probably the existing recovery IO was from OSDs that weren't up-to-date
> yet. Or maybe there's a bug in the norecover implementation; it definitely
> looks a bit fragile.
>
> But really I just wouldn't use that command. It's an expert flag that you
> shouldn't use except in some extreme wonky cluster situations (and even
> those may no longer exist in modern Ceph). For the use case you shared in
> your first email, I'd just stick with noout.
> -Greg
>
>> - FEW SEC AFTER START OSD -
>> # ceph status
>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>      health HEALTH_WARN
>>             140 pgs degraded
>>             1 pgs recovering
>>             92 pgs recovery_wait
>>             140 pgs stuck unclean
>>             recovery 942/5772119 objects degraded (0.016%)
>>             noout,nobackfill,norecover flag(s) set
>>      monmap e10: 3 mons at
>> {node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>             election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>      osdmap e18727: 36 osds: 36 up, 30 in
>>             flags noout,nobackfill,norecover
>>       pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
>>             25204 GB used, 17124 GB / 42329 GB avail
>>             942/5772119 objects degraded (0.016%)
>>                 1332 active+clean
>>                   92 active+recovery_wait+degraded
>>                   47 active+degraded
>>                    1 active+recovering+degraded
>> recovery io 31608 kB/s, 4 objects/s
>> client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s
>>
>> - 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -
>> # ceph status
>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>      health HEALTH_WARN
>>             140 pgs degraded
>>             1 pgs recovering
>>             109 pgs recovery_wait
>>             140 pgs stuck unclean
>>             80 requests are blocked > 32 sec
>>             recovery 847/5775929 objects degraded (0.015%)
>>             noout,nobackfill,norecover flag(s) set
>>      monmap e10: 3 mons at
>> {node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>             election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>      osdmap e18727: 36 osds: 36 up, 30 in
>>             flags noout,nobackfill,norecover
>>       pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
>>             25234 GB used, 17094 GB / 42329 GB avail
>>             847/5775929 objects degraded (0.015%)
>>                 1332 active+clean
>>                  109 active+recovery_wait+degraded
>>                   30 active+degraded  <-- degraded objects count got stuck
>>                    1 active+recovering+degraded
>> recovery io 3743 kB/s, 0 objects/s  <-- depending on when the command runs,
>> this line shows 0 objects/s or is absent
>> client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s
>>
>> - FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVERY, NOBACKFILL -
>> # ceph status
>>     cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>      health HEALTH_WARN
>>             134 pgs degraded
>>             134 pgs recovery_wait
>>             134 pgs stuck degraded
>>             134 pgs stuck unclean
>>             recovery 591/5778179 objects degraded (0.010%)
>>      monmap e10: 3 mons at
>> {node-19=172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>             election epoch 724, quorum 0,1,2
Re: [ceph-users] jewel ceph has PG mapped always to the same OSD's
Deep scrub doesn't help. After some steps (not sure what the exact list is) ceph does remap this pg to another osd, but the PG doesn't move:

# ceph pg map 11.206
osdmap e176314 pg 11.206 (11.206) -> up [955,198,801] acting [787,697]

It hangs in this state forever, and 'ceph pg 11.206 query' hangs as well.

On Sat, Apr 7, 2018 at 12:42 AM, Konstantin Danilov wrote:
> David,
>
>> What happens when you deep-scrub this PG?
> we haven't tried to deep-scrub it, will try.
>
>> What do the OSD logs show for any lines involving the problem PGs?
> Nothing special was logged about this particular osd, except that it's
> degraded.
> Yet the osd spends quite a large portion of its CPU time in
> snappy/leveldb/jemalloc libs.
> In the logs there are a lot of messages from leveldb about moving data
> between levels.
> Needless to mention, this PG is from the RGW index bucket, so it's metadata
> only and gets a relatively high load. As of now we have 3 more PGs with the
> same behavior from the rgw data pool (the cluster has almost all of its data
> in RGW).
>
>> Was anything happening on your cluster just before this started happening
>> at first?
> The cluster got many updates in the week before the issue, but nothing
> particularly noticeable.
> SSD OSDs were split in two, about 10% of OSDs were removed. Some networking
> issues appeared.
>
> Thanks
>
> On Fri, Apr 6, 2018 at 10:07 PM, David Turner wrote:
>>
>> What happens when you deep-scrub this PG? What do the OSD logs show for
>> any lines involving the problem PGs? Was anything happening on your cluster
>> just before this started happening at first?
>>
>> On Fri, Apr 6, 2018 at 2:29 PM Konstantin Danilov wrote:
>>>
>>> Hi all, we have a strange issue on one cluster.
>>>
>>> One PG is mapped to a particular set of OSDs, say X, Y and Z, no matter
>>> how we change the crush map.
>>> The whole picture is next:
>>>
>>> * This is ceph version 10.2.7; all monitors and osds have the same
>>> version
>>> * One PG eventually got into the 'active+degraded+incomplete' state.
It was active+clean for a long time
>>> and already has some data. We can't detect the event which led it
>>> to this state. Probably it happened after some osd was removed from the
>>> cluster.
>>> * This PG has all 3 required OSDs up and running, and all of them are
>>> online (pool_sz=3, min_pool_sz=2)
>>> * All requests to the pg get stuck forever; historic_ops shows they are
>>> waiting on "waiting_for_degraded_pg"
>>> * ceph pg query hangs forever
>>> * We can't copy the data to another pool either - the copying process
>>> hangs and then fails with (34) Numerical result out of range
>>> * We tried restarting osds, nodes, and mons with no effect
>>> * Eventually we found that shutting down osd Z (not primary) does solve
>>> the issue, but only before ceph marks this osd out. If we try to change
>>> the weight of this osd or remove it from the cluster the problem appears
>>> again. The cluster works only while osd Z is down but not out and has the
>>> default weight
>>> * Then we found that no matter what we do with the crushmap,
>>> osdmaptool --test-map-pgs-dump always puts this PG on the same set of
>>> osds - [X, Y] (in this osdmap Z is already down). We updated the crush map
>>> to remove the nodes with OSDs X, Y and Z completely, compiled it,
>>> imported it back into the osdmap, ran osdmaptool, and always got the same
>>> results
>>> * After several node restarts and setting osd Z down but not out, we
>>> now have 3 more PGs with the same behaviour, but 'pinned' to other
>>> osds
>>> * We ran osdmaptool from luminous ceph to check whether the upmap
>>> extension had somehow gotten into this osdmap - it has not.
>>>
>>> So this is where we are now. Has anyone seen something like this? Any
>>> ideas are welcome.
>>> Thanks

--
Kostiantyn Danilov aka koder.ua
Principal software engineer, Mirantis

skype:koder.ua
http://koder-ua.blogspot.com/
http://mirantis.com
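The 'ceph pg map' output quoted at the top of this thread shows the problem directly: the up set (where CRUSH wants the PG) and the acting set (where it actually is) disagree, and acting holds only two replicas for a pool_sz=3 pool. A quick sketch that parses that line (assuming the output format shown above) and flags the stuck remap:

```python
import re

def parse_pg_map(line):
    """Parse `ceph pg map` output into (up, acting) OSD id lists."""
    m = re.search(r"up \[([\d,]+)\] acting \[([\d,]+)\]", line)
    if not m:
        raise ValueError("unrecognized pg map line: %r" % line)
    up = [int(x) for x in m.group(1).split(",")]
    acting = [int(x) for x in m.group(2).split(",")]
    return up, acting

line = "osdmap e176314 pg 11.206 (11.206) -> up [955,198,801] acting [787,697]"
up, acting = parse_pg_map(line)
# A healthy PG converges until up == acting; here they never do.
stuck_remap = set(up) != set(acting)
undersized = len(acting) < 3  # pool_sz=3 in this thread
```

Running this across the output of 'ceph pg dump' style data is a cheap way to enumerate every PG whose remap never completes, instead of checking them one by one.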