Re: [ceph-users] Upgrading ceph and mapped rbds

2018-04-07 Thread David Turner
I had several kernel mapped rbds as well as ceph-fuse mounted CephFS
clients when I upgraded from Jewel to Luminous. I rolled out the client
upgrades over a few weeks after the upgrade. I had tested that the client
use cases I had would be fine running Jewel connecting to a Luminous
cluster, so there weren't any surprises for me when I did it in production.
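
For reference, a minimal pre-flight sketch of that compatibility check
(assuming Luminous mons; these are standard ceph CLI commands, nothing
specific to my setup):

    ceph features                                # release/features each connected client reports
    ceph osd dump | grep min_compat_client       # current client requirement, if any
    # only after every kernel-rbd / ceph-fuse client has been upgraded:
    ceph osd set-require-min-compat-client jewel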

On Tue, Apr 3, 2018, 11:21 PM Konstantin Shalygin  wrote:

> > The VMs are XenServer VMs with virtual disks saved on the NFS server
> which has the RBD mounted … so there is no migration from my POV, as there
> is no second storage to migrate to ...
>
>
>
> All your pain is self-inflicted.
>
> Just FYI, clients are not interrupted when you upgrade Ceph. Clients will
> be interrupted only when you change what the cluster requires of them, for
> example if you (suddenly) change crush tunables or the minimum required
> client version (for this reason clients must be upgraded before the cluster).
>
>
>
>
> k


Re: [ceph-users] amount of PGs/pools/OSDs for your openstack / Ceph

2018-04-07 Thread Christian Wuerdig
The general recommendation is to target around 100 PG/OSD. Have you tried
the https://ceph.com/pgcalc/ tool?
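
Roughly what pgcalc does, as a back-of-the-envelope sketch (my own example
numbers, not the tool's exact logic): per pool, PGs ≈ (OSDs * 100 * share of
data) / replica size, rounded to a nearby power of two. For example:

    # ~500 OSDs, 3x replication, one pool holding ~80% of the data
    (500 * 100 * 0.80) / 3 ≈ 13333   -> round to 16384 PGs
    # a small pool holding ~1% of the data
    (500 * 100 * 0.01) / 3 ≈ 167     -> round to 128 or 256 PGs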

On Wed, 4 Apr 2018 at 21:38, Osama Hasebou  wrote:

> Hi Everyone,
>
> I would like to know what kind of setup the Ceph community has been using
> for their OpenStack Ceph configuration when it comes to the number of pools,
> OSDs and their PGs.
>
> The Ceph documentation briefly mentions this for small cluster sizes, and I
> would like to know from your experience how many PGs you have actually
> created for your OpenStack pools on a Ceph cluster in the 1-2 PB capacity
> range, or with 400-600 OSDs, that performs well without issues.
>
> Hope to hear from you!
>
> Thanks.
>
> Regards,
> Ossi
>


Re: [ceph-users] bluestore OSD did not start at system-boot

2018-04-07 Thread Alfredo Deza
On Thu, Apr 5, 2018 at 6:33 AM, Ansgar Jazdzewski
 wrote:
> hi folks,
>
> I just figured out that my OSDs did not start because the filesystem
> is not mounted.

Would love to see some ceph-volume logs (both ceph-volume.log and
ceph-volume-systemd.log) because we do try several times with timeouts
before giving up.

If the filesystem is not available, the systemd units should keep
trying for a while.
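
If it helps while gathering those, a quick sketch (log locations assumed
from ceph-volume's defaults):

    less /var/log/ceph/ceph-volume.log
    less /var/log/ceph/ceph-volume-systemd.log
    # check that the per-OSD activation units exist and are enabled
    systemctl list-units 'ceph-volume@*'
    # one-shot activation of every ceph-volume OSD found on the host
    ceph-volume lvm activate --all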


>
> So I wrote a script to hack my way around it
> #
> #! /usr/bin/env bash
>
> # collect the "osd id" / "osd fsid" value pairs from ceph-volume's inventory
> DATA=( $(ceph-volume lvm list | grep -e 'osd id\|osd fsid' | awk
> '{print $3}' | tr '\n' ' ') )
>
> # every OSD contributes two entries (id, fsid)
> OSDS=$(( ${#DATA[@]}/2 ))
>
> # activate each OSD explicitly: ceph-volume lvm activate <osd id> <osd fsid>
> for OSD in $(seq 0 $(($OSDS-1))); do
>  ceph-volume lvm activate "${DATA[( $OSD*2 )]}" "${DATA[( $OSD*2+1 )]}"
> done
> #
>
> I'm sure that this is not the way it should be!? So any help is
> welcome to figure out why my BlueStore OSDs are not mounted at
> boot-time.
>
> Thanks,
> Ansgar


Re: [ceph-users] how the files in /var/lib/ceph/osd/ceph-0 are generated

2018-04-07 Thread Alfredo Deza
On Fri, Apr 6, 2018 at 10:27 PM, Jeffrey Zhang
 wrote:
> Yes, I am using ceph-volume.
>
> And I found where the keyring comes from.
>
> bluestore saves all of this information at the start of the disk
> (BDEV_LABEL_BLOCK_SIZE=4096);
> this area is used for saving labels, including the keyring, whoami, etc.

Correct, this is documented here:
http://docs.ceph.com/docs/master/ceph-volume/lvm/activate/#summary
(see step 4):

   Recreate all the files needed with ceph-bluestore-tool
prime-osd-dir by pointing it to the OSD block device.
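
For reference, a sketch of that step run by hand (the device path and OSD id
here are just the ones from the listing further down in this thread):

    ceph-bluestore-tool prime-osd-dir \
        --dev /dev/ceph-pool/osd0.data \
        --path /var/lib/ceph/osd/ceph-0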


>
> these can be read through ceph-bluestore-tool show-label
>
> $ ceph-bluestore-tool  show-label --path /var/lib/ceph/osd/ceph-0
> {
> "/var/lib/ceph/osd/ceph-0/block": {
> "osd_uuid": "c349b2ba-690f-4a36-b6f6-2cc0d0839f29",
> "size": 2147483648,
> "btime": "2018-04-04 10:22:25.216117",
> "description": "main",
> "bluefs": "1",
> "ceph_fsid": "14941be9-c327-4a17-8b86-be50ee2f962e",
> "kv_backend": "rocksdb",
> "magic": "ceph osd volume v026",
> "mkfs_done": "yes",
> "osd_key": "AQDgNsRaVtsRIBAA6pmOf7y2GBufyE83nHwVvg==",
> "ready": "ready",
> "whoami": "0"
> }
> }
>
> So when mounting /var/lib/ceph/osd/ceph-0, ceph dumps this content into
> the tmpfs folder.
>
> On Fri, Apr 6, 2018 at 10:21 PM, David Turner  wrote:
>>
>> Likely the differences you're seeing between /dev/sdb1 and tmpfs have to do
>> with how ceph-disk vs ceph-volume manage the OSDs and what their defaults
>> are.  ceph-disk creates partitions on devices while ceph-volume
>> configures LVM on the block device.  Also, with bluestore you do not have a
>> standard filesystem, so ceph-volume creates a mock folder at
>> /var/lib/ceph/osd/ceph-0 to hold the information needed to track the OSD
>> and start it.
>>
>> On Wed, Apr 4, 2018 at 6:20 PM Gregory Farnum  wrote:
>>>
>>> On Tue, Apr 3, 2018 at 6:30 PM Jeffrey Zhang
>>>  wrote:

 I am testing ceph Luminous, the environment is

 - centos 7.4
 - ceph luminous ( ceph offical repo)
 - ceph-deploy 2.0
 - bluestore + separate wal and db

 I found the ceph osd folder `/var/lib/ceph/osd/ceph-0` is mounted
 from tmpfs. But where the files in that folder come from? like
 `keyring`,
 `whoami`?
>>>
>>>
>>> These are generated as part of the initialization process. I don't know
>>> the exact commands involved, but the keyring for instance will draw from the
>>> results of "ceph osd new" (which is invoked by one of the ceph-volume setup
>>> commands). That and whoami are part of the basic information an OSD needs to
>>> communicate with a monitor.
>>> -Greg
>>>


 $ ls -alh /var/lib/ceph/osd/ceph-0/
 lrwxrwxrwx.  1 ceph ceph   24 Apr  3 16:49 block ->
 /dev/ceph-pool/osd0.data
 lrwxrwxrwx.  1 root root   22 Apr  3 16:49 block.db ->
 /dev/ceph-pool/osd0-db
 lrwxrwxrwx.  1 root root   23 Apr  3 16:49 block.wal ->
 /dev/ceph-pool/osd0-wal
 -rw---.  1 ceph ceph   37 Apr  3 16:49 ceph_fsid
 -rw---.  1 ceph ceph   37 Apr  3 16:49 fsid
 -rw---.  1 ceph ceph   55 Apr  3 16:49 keyring
 -rw---.  1 ceph ceph6 Apr  3 16:49 ready
 -rw---.  1 ceph ceph   10 Apr  3 16:49 type
 -rw---.  1 ceph ceph2 Apr  3 16:49 whoami

 I guess they may be loaded from bluestore. But I can not find any clue
 for this.


Re: [ceph-users] Fwd: Separate --block.wal --block.db bluestore not working as expected.

2018-04-07 Thread Alfredo Deza
On Sat, Apr 7, 2018 at 11:59 AM, Gary Verhulp  wrote:
>
>
>
>
>
> I’m trying to create bluestore osds with separate --block.wal --block.db
> devices on a write intensive SSD
>
>
>
> I’ve split the SSD (/dev/sda) into two partitions, sda1 and sda2, for db and
> wal
>
>
>
>
>
> It seems to me the osd uuid is getting changed and I’m only able to start the
> last OSD
>
>
>
> Do I need to create a new partition or logical volume on the SSD for each
> OSD?

Correct! This is what is needed for each OSD. You are re-using the
same partitions for the other OSD, which is why you are getting the
following message:



2018-04-06 19:45:43.730515 7fe91a9cfd00 -1 bluestore(/dev/sda1)
_check_or_set_bdev_label bdev /dev/sda1 fsid
eb6cbcb3-f644-4973-b745-0e4389ef719c does not match our fsid
9d7a103a-f590-4842-bd3d-e9da27c3fb09
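
One possible layout, sketched with placeholder LV names and sizes (not a
sizing recommendation): give every OSD its own db and wal volume on the
shared SSD, for example via LVM, and point ceph-volume at those:

    pvcreate /dev/sda
    vgcreate ceph-db-wal /dev/sda
    lvcreate -L 30G -n db-sdc  ceph-db-wal
    lvcreate -L 2G  -n wal-sdc ceph-db-wal
    lvcreate -L 30G -n db-sdd  ceph-db-wal
    lvcreate -L 2G  -n wal-sdd ceph-db-wal

    ceph-volume lvm prepare --bluestore --data /dev/sdc \
        --block.db ceph-db-wal/db-sdc --block.wal ceph-db-wal/wal-sdc
    ceph-volume lvm prepare --bluestore --data /dev/sdd \
        --block.db ceph-db-wal/db-sdd --block.wal ceph-db-wal/wal-sdd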



>
>
>
> I’m sure this is a simple fail in my understanding of how it is supposed to
> be provisioned.
>
> Any advice would be appreciated.
>
>
>
> Thanks,
>
> Gary
>
>
>
>
>
> [root@osdhost osd]# ceph-volume lvm prepare --bluestore --data /dev/sdc
> --block.wal /dev/sda2 --block.db /dev/sda1
>
> Running command: sudo vgcreate --force --yes
> ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034 /dev/sdc
>
> stdout: Physical volume "/dev/sdc" successfully created.
>
> stdout: Volume group "ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034"
> successfully created
>
> Running command: sudo lvcreate --yes -l 100%FREE -n
> osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09
> ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034
>
> stdout: Logical volume "osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09"
> created.
>
> Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1
>
> Running command: chown -R ceph:ceph /dev/dm-2
>
> Running command: sudo ln -s
> /dev/ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034/osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09
> /var/lib/ceph/osd/ceph-1/block
>
> Running command: sudo ceph --cluster ceph --name client.bootstrap-osd
> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o
> /var/lib/ceph/osd/ceph-1/activate.monmap
>
> stderr: got monmap epoch 1
>
> Running command: ceph-authtool /var/lib/ceph/osd/ceph-1/keyring
> --create-keyring --name osd.1 --add-key
> AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g==
>
> stdout: creating /var/lib/ceph/osd/ceph-1/keyring
>
> stdout: added entity osd.1 auth auth(auid = 18446744073709551615
> key=AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g== with 0 caps)
>
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring
>
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/
>
> Running command: chown -R ceph:ceph /dev/sda2
>
> Running command: chown -R ceph:ceph /dev/sda1
>
> Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore
> --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --key
>  --bluestore-block-wal-path
> /dev/sda2 --bluestore-block-db-path /dev/sda1 --osd-data
> /var/lib/ceph/osd/ceph-1/ --osd-uuid 9d7a103a-f590-4842-bd3d-e9da27c3fb09
> --setuser ceph --setgroup ceph
>
> stderr: 2018-04-06 19:41:44.519662 7f734f2e4d00 -1
> bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode
> label at offset 102: buffer::malformed_input: void
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past
> end of struct encoding
>
> stderr: 2018-04-06 19:41:44.520939 7f734f2e4d00 -1
> bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode
> label at offset 102: buffer::malformed_input: void
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past
> end of struct encoding
>
> stderr: 2018-04-06 19:41:44.521190 7f734f2e4d00 -1
> bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to decode
> label at offset 102: buffer::malformed_input: void
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past
> end of struct encoding
>
> stderr: 2018-04-06 19:41:44.521454 7f734f2e4d00 -1
> bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
>
> stderr: 2018-04-06 19:41:47.307648 7f734f2e4d00 -1 key
> AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g==
>
> stderr: 2018-04-06 19:41:48.068161 7f734f2e4d00 -1 created object store
> /var/lib/ceph/osd/ceph-1/ for osd.1 fsid
> 1ff50434-64ad-42bd-9a70-1968e4a9a813
>
>
>
>
>
> [root@osdhost osd]# ceph-bluestore-tool show-label --dev /dev/sda1
>
> {
>
> "/dev/sda1": {
>
> "osd_uuid": "9d7a103a-f590-4842-bd3d-e9da27c3fb09",
>
> "size": 200043171840,
>
> "btime": "2018-04-06 19:41:44.523894",
>
> "description": "bluefs db"
>
> }
>
> }
>
>
>
> [root@osdhost  osd]# ceph-volume lvm prepare --bluestore --data /dev/sdd
> --block.wal /dev/sda2 --block.db /dev/sda1
>
> Running command: sudo vgcreate --force --yes
> ceph-cc91203d-de5c-4d27-8c48-a58663075e67 /dev/sdd
>
> stdout: Physical volume "/dev/sdd" successfully created.
>
> stdout: Volume group 

Re: [ceph-users] Ceph scrub logs: _scan_snaps no head for $object?

2018-04-07 Thread Marc Roos

How do you resolve these issues?


Apr  7 22:39:21 c03 ceph-osd: 2018-04-07 22:39:21.928484 7f0826524700 -1 
osd.13 pg_epoch: 19008 pg[17.13( v 19008'6019891 
(19008'6018375,19008'6019891] local-lis/les=18980/18981 n=3825 
ec=3636/3636 lis/c 18980/18980 les/c/f 18981/18982/0 18980/18980/18903) 
[4,13,0] r=1 lpr=18980 luod=0'0 crt=19008'6019891 lcod 19008'6019890 
active] _scan_snaps no head for 
17:cbf61056:::rbd_data.239f5274b0dc51.0ff2:15 (have MIN)
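
In case it helps with triage, a sketch for mapping the rbd_data prefix in
that log line back to the image it belongs to (the pool name 'rbd' below is
a placeholder):

    for img in $(rbd ls rbd); do
        rbd info rbd/$img | grep -q 'block_name_prefix: rbd_data.239f5274b0dc51' \
            && echo "$img"
    done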


[ceph-users] Fwd: Separate --block.wal --block.db bluestore not working as expected.

2018-04-07 Thread Gary Verhulp


I’m trying to create bluestore osds with separate --block.wal --block.db 
devices on a write intensive SSD


I’ve split the SSD (/dev/sda) into two partitions, sda1 and sda2, for db
and wal


It seems to me the osd uuid is getting changed and I’m only able to start
the last OSD


Do I need to create a new partition or logical volume on the SSD for 
each OSD?


I’m sure this is a simple fail in my understanding of how it is supposed 
to be provisioned.


Any advice would be appreciated.

Thanks,

Gary

[root@osdhost osd]# ceph-volume lvm prepare --bluestore --data /dev/sdc 
--block.wal /dev/sda2 --block.db /dev/sda1


Running command: sudo vgcreate --force --yes 
ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034 /dev/sdc


stdout: Physical volume "/dev/sdc" successfully created.

stdout: Volume group "ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034" 
successfully created


Running command: sudo lvcreate --yes -l 100%FREE -n 
osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09 
ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034


stdout: Logical volume "osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09" 
created.


Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-1

Running command: chown -R ceph:ceph /dev/dm-2

Running command: sudo ln -s 
/dev/ceph-5a6b8ab6-ca12-4855-9a5a-a3a54c249034/osd-block-9d7a103a-f590-4842-bd3d-e9da27c3fb09 
/var/lib/ceph/osd/ceph-1/block


Running command: sudo ceph --cluster ceph --name client.bootstrap-osd 
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o 
/var/lib/ceph/osd/ceph-1/activate.monmap


stderr: got monmap epoch 1

Running command: ceph-authtool /var/lib/ceph/osd/ceph-1/keyring 
--create-keyring --name osd.1 --add-key 
AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g==


stdout: creating /var/lib/ceph/osd/ceph-1/keyring

stdout: added entity osd.1 auth auth(auid = 18446744073709551615 
key=AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g== with 0 caps)


Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring

Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/

Running command: chown -R ceph:ceph /dev/sda2

Running command: chown -R ceph:ceph /dev/sda1

Running command: sudo ceph-osd --cluster ceph --osd-objectstore 
bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap 
--key  
--bluestore-block-wal-path /dev/sda2 --bluestore-block-db-path /dev/sda1 
--osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 
9d7a103a-f590-4842-bd3d-e9da27c3fb09 --setuser ceph --setgroup ceph


stderr: 2018-04-06 19:41:44.519662 7f734f2e4d00 -1 
bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to 
decode label at offset 102: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode 
past end of struct encoding


stderr: 2018-04-06 19:41:44.520939 7f734f2e4d00 -1 
bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to 
decode label at offset 102: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode 
past end of struct encoding


stderr: 2018-04-06 19:41:44.521190 7f734f2e4d00 -1 
bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label unable to 
decode label at offset 102: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode 
past end of struct encoding


stderr: 2018-04-06 19:41:44.521454 7f734f2e4d00 -1 
bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid


stderr: 2018-04-06 19:41:47.307648 7f734f2e4d00 -1 key 
AQDjL8haKmzYOhAAM7ehRUUgF/n4x/Ybu7VR/g==


stderr: 2018-04-06 19:41:48.068161 7f734f2e4d00 -1 created object store 
/var/lib/ceph/osd/ceph-1/ for osd.1 fsid 
1ff50434-64ad-42bd-9a70-1968e4a9a813


[root@osdhost osd]# ceph-bluestore-tool show-label --dev /dev/sda1

{

"/dev/sda1": {

"osd_uuid": "9d7a103a-f590-4842-bd3d-e9da27c3fb09",

"size": 200043171840,

"btime": "2018-04-06 19:41:44.523894",

"description": "bluefs db"

    }

}

[root@osdhost  osd]# ceph-volume lvm prepare --bluestore --data /dev/sdd 
--block.wal /dev/sda2 --block.db /dev/sda1


Running command: sudo vgcreate --force --yes 
ceph-cc91203d-de5c-4d27-8c48-a58663075e67 /dev/sdd


stdout: Physical volume "/dev/sdd" successfully created.

stdout: Volume group "ceph-cc91203d-de5c-4d27-8c48-a58663075e67" 
successfully created


Running command: sudo lvcreate --yes -l 100%FREE -n 
osd-block-eb6cbcb3-f644-4973-b745-0e4389ef719c 
ceph-cc91203d-de5c-4d27-8c48-a58663075e67


stdout: Logical volume "osd-block-eb6cbcb3-f644-4973-b745-0e4389ef719c" 
created.


Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-6

Running command: chown -R ceph:ceph /dev/dm-8

Running command: sudo ln -s 
/dev/ceph-cc91203d-de5c-4d27-8c48-a58663075e67/osd-block-eb6cbcb3-f644-4973-b745-0e4389ef719c 
/var/lib/ceph/osd/ceph-6/block


Running command: sudo ceph --cluster ceph --name client.bootstrap-osd 
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o 

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-04-07 Thread Damian Dabrowski
ok, now I understand. Thanks for all these helpful answers!

On Sat, Apr 7, 2018, 15:26 David Turner  wrote:

> I'm seconding what Greg is saying. There is no reason to set nobackfill
> and norecover just for restarting OSDs. That will only cause the problems
> you're seeing without giving you any benefit. There are reasons to use
> norecover and nobackfill but unless you're manually editing the crush map,
> having osds consistently segfault, or for any other reason you really just
> need to stop the io from recovery, then they aren't the flags for you. Even
> at that, nobackfill is most likely what you need and norecover is still
> probably not helpful.
>
> On Wed, Apr 4, 2018, 6:59 PM Gregory Farnum  wrote:
>
>> On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski 
>> wrote:
>>
>>> Greg, thanks for your reply!
>>>
>>> I think your idea makes sense. I've done tests and it's quite hard for me
>>> to understand. I'll try to explain my situation in a few steps
>>> below.
>>> I think that ceph shows progress in recovery, but it can only resolve
>>> objects which haven't really changed. It won't try to repair objects
>>> which are really degraded, because of the norecovery flag. Am I right?
>>> After a while I see blocked requests (as you can see below).
>>>
>>
>> Yeah, so the implementation of this is a bit funky. Basically, when the
>> OSD gets a map specifying norecovery, it will prevent any new recovery ops
>> from starting once it processes that map. But it doesn't change the state
>> of the PGs out of recovery; they just won't queue up more work.
>>
>> So probably the existing recovery IO was from OSDs that weren't
>> up-to-date yet. Or maybe there's a bug in the norecover implementation; it
>> definitely looks a bit fragile.
>>
>> But really I just wouldn't use that command. It's an expert flag that you
>> shouldn't use except in some extreme wonky cluster situations (and even
>> those may no longer exist in modern Ceph). For the use case you shared in
>> your first email, I'd just stick with noout.
>> -Greg
>>
>>
>>>
>>> - FEW SEC AFTER START OSD -
>>> # ceph status
>>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>  health HEALTH_WARN
>>> 140 pgs degraded
>>> 1 pgs recovering
>>> 92 pgs recovery_wait
>>> 140 pgs stuck unclean
>>> recovery 942/5772119 objects degraded (0.016%)
>>> noout,nobackfill,norecover flag(s) set
>>>  monmap e10: 3 mons at
>>> {node-19=
>>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>  osdmap e18727: 36 osds: 36 up, 30 in
>>> flags noout,nobackfill,norecover
>>>   pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
>>> 25204 GB used, 17124 GB / 42329 GB avail
>>> 942/5772119 objects degraded (0.016%)
>>> 1332 active+clean
>>>   92 active+recovery_wait+degraded
>>>   47 active+degraded
>>>1 active+recovering+degraded
>>> recovery io 31608 kB/s, 4 objects/s
>>>   client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s
>>>
>>> - 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -
>>> # ceph status
>>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>  health HEALTH_WARN
>>> 140 pgs degraded
>>> 1 pgs recovering
>>> 109 pgs recovery_wait
>>> 140 pgs stuck unclean
>>> 80 requests are blocked > 32 sec
>>> recovery 847/5775929 objects degraded
>>> (0.015%)
>>> noout,nobackfill,norecover flag(s) set
>>>  monmap e10: 3 mons at
>>> {node-19=
>>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>>  osdmap e18727: 36 osds: 36 up, 30 in
>>> flags noout,nobackfill,norecover
>>>   pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
>>> 25234 GB used, 17094 GB / 42329 GB avail
>>> 847/5775929 objects degraded (0.015%)
>>> 1332 active+clean
>>>  109 active+recovery_wait+degraded
>>>   30 active+degraded < degraded objects count got
>>> stuck
>>>1 active+recovering+degraded
>>> recovery io 3743 kB/s, 0 objects/s < depending on when the command is
>>> run, this line shows 0 objects/s or is missing entirely
>>>   client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s
>>>
>>> - FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVERY, NOBACKFILL
>>> -
>>> # ceph status
>>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>>  health HEALTH_WARN
>>> 134 pgs degraded
>>> 134 pgs recovery_wait
>>> 134 pgs stuck degraded
>>> 134 pgs stuck 

Re: [ceph-users] Ceph recovery kill VM's even with the smallest priority

2018-04-07 Thread David Turner
I'm seconding what Greg is saying. There is no reason to set nobackfill and
norecover just for restarting OSDs. That will only cause the problems
you're seeing without giving you any benefit. There are reasons to use
norecover and nobackfill but unless you're manually editing the crush map,
having osds consistently segfault, or for any other reason you really just
need to stop the io from recovery, then they aren't the flags for you. Even
at that, nobackfill is most likely what you need and norecover is still
probably not helpful.
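
In other words, for a routine OSD restart the whole procedure (sketched here
with osd.13 as a stand-in id; adjust the restart command to your init system)
is just:

    ceph osd set noout              # keep CRUSH from marking the OSD out while it's down
    systemctl restart ceph-osd@13   # restart the OSD(s) you need to bounce
    ceph -s                         # wait for PGs to return to active+clean
    ceph osd unset noout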

On Wed, Apr 4, 2018, 6:59 PM Gregory Farnum  wrote:

> On Thu, Mar 29, 2018 at 3:17 PM Damian Dabrowski 
> wrote:
>
>> Greg, thanks for your reply!
>>
>> I think your idea makes sense. I've done tests and it's quite hard for me
>> to understand. I'll try to explain my situation in a few steps
>> below.
>> I think that ceph shows progress in recovery, but it can only resolve
>> objects which haven't really changed. It won't try to repair objects
>> which are really degraded, because of the norecovery flag. Am I right?
>> After a while I see blocked requests (as you can see below).
>>
>
> Yeah, so the implementation of this is a bit funky. Basically, when the
> OSD gets a map specifying norecovery, it will prevent any new recovery ops
> from starting once it processes that map. But it doesn't change the state
> of the PGs out of recovery; they just won't queue up more work.
>
> So probably the existing recovery IO was from OSDs that weren't up-to-date
> yet. Or maybe there's a bug in the norecover implementation; it definitely
> looks a bit fragile.
>
> But really I just wouldn't use that command. It's an expert flag that you
> shouldn't use except in some extreme wonky cluster situations (and even
> those may no longer exist in modern Ceph). For the use case you shared in
> your first email, I'd just stick with noout.
> -Greg
>
>
>>
>> - FEW SEC AFTER START OSD -
>> # ceph status
>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>  health HEALTH_WARN
>> 140 pgs degraded
>> 1 pgs recovering
>> 92 pgs recovery_wait
>> 140 pgs stuck unclean
>> recovery 942/5772119 objects degraded (0.016%)
>> noout,nobackfill,norecover flag(s) set
>>  monmap e10: 3 mons at
>> {node-19=
>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>  osdmap e18727: 36 osds: 36 up, 30 in
>> flags noout,nobackfill,norecover
>>   pgmap v20851644: 1472 pgs, 7 pools, 8510 GB data, 1880 kobjects
>> 25204 GB used, 17124 GB / 42329 GB avail
>> 942/5772119 objects degraded (0.016%)
>> 1332 active+clean
>>   92 active+recovery_wait+degraded
>>   47 active+degraded
>>1 active+recovering+degraded
>> recovery io 31608 kB/s, 4 objects/s
>>   client io 73399 kB/s rd, 80233 kB/s wr, 1218 op/s
>>
>> - 1 MIN AFTER OSD START, RECOVERY STUCK, BLOCKED REQUESTS -
>> # ceph status
>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>  health HEALTH_WARN
>> 140 pgs degraded
>> 1 pgs recovering
>> 109 pgs recovery_wait
>> 140 pgs stuck unclean
>> 80 requests are blocked > 32 sec
>> recovery 847/5775929 objects degraded
>> (0.015%)
>> noout,nobackfill,norecover flag(s) set
>>  monmap e10: 3 mons at
>> {node-19=
>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>> election epoch 724, quorum 0,1,2 node-19,node-21,node-20
>>  osdmap e18727: 36 osds: 36 up, 30 in
>> flags noout,nobackfill,norecover
>>   pgmap v20851812: 1472 pgs, 7 pools, 8520 GB data, 1881 kobjects
>> 25234 GB used, 17094 GB / 42329 GB avail
>> 847/5775929 objects degraded (0.015%)
>> 1332 active+clean
>>  109 active+recovery_wait+degraded
>>   30 active+degraded < degraded objects count got
>> stuck
>>1 active+recovering+degraded
>> recovery io 3743 kB/s, 0 objects/s < depending on when the command is
>> run, this line shows 0 objects/s or is missing entirely
>>   client io 26521 kB/s rd, 64211 kB/s wr, 1212 op/s
>>
>> - FEW SECONDS AFTER UNSETTING FLAGS NOOUT, NORECOVERY, NOBACKFILL
>> -
>> # ceph status
>> cluster 848b340a-be27-45cb-ab66-3151d877a5a0
>>  health HEALTH_WARN
>> 134 pgs degraded
>> 134 pgs recovery_wait
>> 134 pgs stuck degraded
>> 134 pgs stuck unclean
>> recovery 591/5778179 objects degraded (0.010%)
>>  monmap e10: 3 mons at
>> {node-19=
>> 172.31.0.2:6789/0,node-20=172.31.0.8:6789/0,node-21=172.31.0.6:6789/0}
>> election epoch 724, quorum 0,1,2 

Re: [ceph-users] jewel ceph has PG mapped always to the same OSD's

2018-04-07 Thread Konstantin Danilov
Deep scrub doesn't help.
After some steps (not sure of the exact list)
ceph does remap this pg to other osds, but the PG doesn't actually move:
# ceph pg map 11.206
osdmap e176314 pg 11.206 (11.206) -> up [955,198,801] acting [787,697]

It hangs in this state forever, and 'ceph pg 11.206 query' hangs as well.
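
For completeness, a sketch of how the mapping checks mentioned in the quoted
message below can be run against a saved osdmap (file paths are placeholders):

    ceph osd getmap -o /tmp/osdmap
    osdmaptool /tmp/osdmap --test-map-pgs-dump --pool 11 | grep '^11\.206'

    # test a modified crush map against the same osdmap without touching the cluster
    ceph osd getcrushmap -o /tmp/crush.bin
    crushtool -d /tmp/crush.bin -o /tmp/crush.txt
    # ... edit /tmp/crush.txt ...
    crushtool -c /tmp/crush.txt -o /tmp/crush.new
    osdmaptool /tmp/osdmap --import-crush /tmp/crush.new --test-map-pgs-dump --pool 11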

On Sat, Apr 7, 2018 at 12:42 AM, Konstantin Danilov
 wrote:
> David,
>
>> What happens when you deep-scrub this PG?
> We haven't tried to deep-scrub it yet, but we will try.
>
>> What do the OSD logs show for any lines involving the problem PGs?
> Nothing special was logged about this particular osd, except that it's
> degraded.
> Yet the osd spends quite a large portion of its CPU time in the
> snappy/leveldb/jemalloc libs.
> In the logs there are a lot of messages from leveldb about moving data between
> levels.
> Needless to say, this PG is from the RGW index bucket, so it's metadata
> only
> and gets a relatively high load. Yet now we have 3 PGs with the same
> behavior from the rgw data pool (the cluster holds almost all of its data in RGW).
>
>> Was anything happening on your cluster just before this started happening
>> at first?
> The cluster got many updates in the week before the issue, but nothing
> particularly noticeable.
> SSD OSDs were split in two, about 10% of the OSDs were removed, and some
> networking issues
> appeared.
>
> Thanks
>
> On Fri, Apr 6, 2018 at 10:07 PM, David Turner  wrote:
>>
>> What happens when you deep-scrub this PG?  What do the OSD logs show for
>> any lines involving the problem PGs?  Was anything happening on your cluster
>> just before this started happening at first?
>>
>> On Fri, Apr 6, 2018 at 2:29 PM Konstantin Danilov 
>> wrote:
>>>
>>> Hi all, we have a strange issue on one cluster.
>>>
>>> One PG is mapped to a particular set of OSDs, say X, Y and Z, no matter
>>> how
>>> we change the crush map.
>>> The whole picture is as follows:
>>>
>>> * This is ceph version 10.2.7; all monitors and osds have the same
>>> version
>>> * One PG eventually got into the 'active+degraded+incomplete' state. It
>>> was active+clean for a long time
>>> and already has some data. We can't identify the event that led it
>>> to this state. It probably
>>> happened after some osd was removed from the cluster
>>> * This PG has all 3 required OSDs up and running, and all of them are
>>> online (pool_sz=3, min_pool_sz=2)
>>> * All requests to the pg get stuck forever; historic_ops shows they are
>>> waiting on "waiting_for_degraded_pg"
>>> * ceph pg query hangs forever
>>> * We can't copy data from another pool either - the copying process hangs
>>> and then fails with
>>> (34) Numerical result out of range
>>>  * We tried restarting osds, nodes and mons with no effect
>>> * Eventually we found that shutting down osd Z (not the primary) does solve
>>> the issue, but
>>> only until ceph marks this osd out. If we try to change the weight
>>> of this osd or remove it from the cluster, the problem appears again. The
>>> cluster is working only while osd Z is down but not out and has the default
>>> weight
>>> * Then we found that it doesn't matter what we do with the crushmap
>>> -
>>> osdmaptool --test-map-pgs-dump always puts this PG on the same set of
>>> osds - [X, Y] (in this osdmap Z is already down). We updated the crush map
>>> to remove the nodes with OSDs X, Y and Z completely, compiled it,
>>> imported it back into the osdmap, ran osdmaptool, and always got the same
>>> results
>>> * After several node restarts and setting osd Z down, but not out, we
>>> now have 3 more PGs with the same behaviour, but 'pinned' to other
>>> osds
>>> * We ran osdmaptool from luminous ceph to check whether the upmap
>>> extension had somehow gotten into this osd map - it has not.
>>>
>>> So this is where we are now. Has anyone seen something like this? Any
>>> ideas are welcome. Thanks
>>>
>>>
>>> --
>>> Kostiantyn Danilov
>
>
>
>
> --
> Kostiantyn Danilov aka koder.ua
> Principal software engineer, Mirantis
>
> skype:koder.ua
> http://koder-ua.blogspot.com/
> http://mirantis.com



-- 
Kostiantyn Danilov aka koder.ua
Principal software engineer, Mirantis

skype:koder.ua
http://koder-ua.blogspot.com/
http://mirantis.com