Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?
On 2/16/19 12:33 AM, David Turner wrote: The answer is probably going to be in how big your DB partition is vs how big your HDD is. From your output it looks like you have a 6TB HDD with a 28GB blocks.db partition. Even though the DB used size isn't currently full, I would guess that at some point since this OSD was created it did fill up, and what you're seeing is the part of the DB that spilled over to the data disk. This is why the official recommendation (quite cautious, because some use cases will use it all) for a blocks.db partition is 4% of the data drive. For your 6TB disks that's a recommendation of 240GB per DB partition. Of course the actual DB size needed depends on your use case, but pretty much every use case for a 6TB disk needs a bigger partition than 28GB. My current DB size for osd.33 is 7910457344 bytes, and for osd.73 it is 2013265920+4685037568 bytes: 7544 MB (24.56% of db_total_bytes) vs 6388 MB (6.69% of db_total_bytes). Why is osd.33 not using slow storage in this case? k ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
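For reference, the 4% guideline and the spillover check can be sketched as a quick calculation. The counter names below are the ones reported in the `bluefs` section of `ceph daemon osd.N perf dump`; the device sizes and usage numbers are taken from this thread:

```python
# Sketch: check BlueFS DB usage against the 4% sizing guideline.
# Counter names match the "bluefs" section of `ceph daemon osd.N perf dump`.

def recommended_db_bytes(data_bytes, pct=0.04):
    """Official (cautious) guideline: blocks.db ~= 4% of the data device."""
    return data_bytes * pct

def spilled(bluefs):
    """BlueFS reports spillover as bytes used on the slow (data) device."""
    return bluefs["slow_used_bytes"] > 0

hdd = 6_001_069_199_360          # 6 TB data device (from the osd metadata)
db = 30_064_771_072              # ~28 GiB blocks.db partition

print(recommended_db_bytes(hdd) / 1e9)   # ~240 GB recommended vs ~30 GB actual

osd73 = {"db_used_bytes": 2_013_265_920, "slow_used_bytes": 4_685_037_568}
osd33 = {"db_used_bytes": 7_910_457_344, "slow_used_bytes": 0}
print(spilled(osd73), spilled(osd33))    # True False
```

In other words, osd.73 has ~4.4 GiB sitting on the slow device while osd.33 has none, even though osd.33's DB is larger; current usage alone does not tell you whether a spill happened in the past.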
[ceph-users] PG_AVAILABILITY with one osd down?
Yesterday I saw this one.. it puzzles me: 2019-02-15 21:00:00.000126 mon.torsk1 mon.0 10.194.132.88:6789/0 604164 : cluster [INF] overall HEALTH_OK 2019-02-15 21:39:55.793934 mon.torsk1 mon.0 10.194.132.88:6789/0 604304 : cluster [WRN] Health check failed: 2 slow requests are blocked > 32 sec. Implicated osds 58 (REQUEST_SLOW) 2019-02-15 21:40:00.887766 mon.torsk1 mon.0 10.194.132.88:6789/0 604305 : cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec. Implicated osds 9,19,52,58,68 (REQUEST_SLOW) 2019-02-15 21:40:06.973901 mon.torsk1 mon.0 10.194.132.88:6789/0 604306 : cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec. Implicated osds 3,9,19,29,32,52,55,58,68,69 (REQUEST_SLOW) 2019-02-15 21:40:08.466266 mon.torsk1 mon.0 10.194.132.88:6789/0 604307 : cluster [INF] osd.29 failed (root=default,host=bison) (6 reporters from different host after 33.862482 >= grace 29.247323) 2019-02-15 21:40:08.473703 mon.torsk1 mon.0 10.194.132.88:6789/0 604308 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN) 2019-02-15 21:40:09.489494 mon.torsk1 mon.0 10.194.132.88:6789/0 604310 : cluster [WRN] Health check failed: Reduced data availability: 6 pgs peering (PG_AVAILABILITY) 2019-02-15 21:40:11.008906 mon.torsk1 mon.0 10.194.132.88:6789/0 604312 : cluster [WRN] Health check failed: Degraded data redundancy: 3828291/700353996 objects degraded (0.547%), 77 pgs degraded (PG_DEGRADED) 2019-02-15 21:40:13.474777 mon.torsk1 mon.0 10.194.132.88:6789/0 604313 : cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec. 
Implicated osds 3,9,32,55,58,69 (REQUEST_SLOW) 2019-02-15 21:40:15.060165 mon.torsk1 mon.0 10.194.132.88:6789/0 604314 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 17 pgs peering) 2019-02-15 21:40:17.128185 mon.torsk1 mon.0 10.194.132.88:6789/0 604315 : cluster [WRN] Health check update: Degraded data redundancy: 9897139/700354131 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED) 2019-02-15 21:40:17.128219 mon.torsk1 mon.0 10.194.132.88:6789/0 604316 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are blocked > 32 sec. Implicated osds 32,55) 2019-02-15 21:40:22.137090 mon.torsk1 mon.0 10.194.132.88:6789/0 604317 : cluster [WRN] Health check update: Degraded data redundancy: 9897140/700354194 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED) 2019-02-15 21:40:27.249354 mon.torsk1 mon.0 10.194.132.88:6789/0 604318 : cluster [WRN] Health check update: Degraded data redundancy: 9897142/700354287 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED) 2019-02-15 21:40:33.335147 mon.torsk1 mon.0 10.194.132.88:6789/0 604322 : cluster [WRN] Health check update: Degraded data redundancy: 9897143/700354356 objects degraded (1.413%), 200 pgs degraded (PG_DEGRADED) ... shortened .. 
2019-02-15 21:43:48.496536 mon.torsk1 mon.0 10.194.132.88:6789/0 604366 : cluster [WRN] Health check update: Degraded data redundancy: 9897168/700356693 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED) 2019-02-15 21:43:53.496924 mon.torsk1 mon.0 10.194.132.88:6789/0 604367 : cluster [WRN] Health check update: Degraded data redundancy: 9897170/700356804 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED) 2019-02-15 21:43:58.497313 mon.torsk1 mon.0 10.194.132.88:6789/0 604368 : cluster [WRN] Health check update: Degraded data redundancy: 9897172/700356879 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED) 2019-02-15 21:44:03.497696 mon.torsk1 mon.0 10.194.132.88:6789/0 604369 : cluster [WRN] Health check update: Degraded data redundancy: 9897174/700356996 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED) 2019-02-15 21:44:06.939331 mon.torsk1 mon.0 10.194.132.88:6789/0 604372 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down) 2019-02-15 21:44:06.965401 mon.torsk1 mon.0 10.194.132.88:6789/0 604373 : cluster [INF] osd.29 10.194.133.58:6844/305358 boot 2019-02-15 21:44:08.498060 mon.torsk1 mon.0 10.194.132.88:6789/0 604376 : cluster [WRN] Health check update: Degraded data redundancy: 9897174/700357056 objects degraded (1.413%), 200 pgs degraded, 201 pgs undersized (PG_DEGRADED) 2019-02-15 21:44:08.996099 mon.torsk1 mon.0 10.194.132.88:6789/0 604377 : cluster [WRN] Health check failed: Reduced data availability: 12 pgs peering (PG_AVAILABILITY) 2019-02-15 21:44:13.498472 mon.torsk1 mon.0 10.194.132.88:6789/0 604378 : cluster [WRN] Health check update: Degraded data redundancy: 55/700357161 objects degraded (0.000%), 33 pgs degraded (PG_DEGRADED) 2019-02-15 21:44:15.081437 mon.torsk1 mon.0 10.194.132.88:6789/0 604379 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 12 pgs peering) 2019-02-15 21:44:18.498808 
mon.torsk1 mon.0 10.194.132.88:6789/0 604380 : cluster [WRN] Health check update: Degraded data redundancy: 14/700357230 objects degraded
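For reference, the degraded percentages in these health lines are simply degraded object instances over total object instances; e.g. for the first PG_DEGRADED message in the log above:

```python
# Sketch: reproduce the degraded percentage the monitor logs, e.g.
# "3828291/700353996 objects degraded (0.547%)".
def degraded_pct(degraded, total):
    return 100.0 * degraded / total

print(f"{degraded_pct(3_828_291, 700_353_996):.3f}%")   # 0.547%
```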
[ceph-users] Openstack RBD EC pool
Hi, I tried to add an "archive" storage class to our Openstack environment by introducing a second storage backend offering RBD volumes that keep their data in an erasure-coded pool. As I have to specify a data pool, I tried it as follows:

### keyring files:
ceph.client.cinder.keyring
ceph.client.cinder-ec.keyring

### ceph.conf
[global]
fsid = b5e30221-a214-353c-b66b-8c37b4349123
mon host = ceph-mon.service.i.ewcs.ch
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

### ceph.ec.conf
[global]
fsid = b5e30221-a214-353c-b66b-8c37b4349123
mon host = ceph-mon.service.i..
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
[client.cinder-ec]
rbd default data pool = ewos1-prod_cinder_ec

### cinder-volume.conf
...
[ceph1-rp3-1]
volume_backend_name = ceph1-rp3-1
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = xxxcc8b-xx-ae16xx
rbd_pool = cinder
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 5
rbd_store_chunk_size = 4
rados_connect_timeout = -1
report_discard_supported = true
rbd_exclusive_cinder_pool = true
enable_deferred_deletion = true
deferred_deletion_delay = 259200
deferred_deletion_purge_interval = 3600

[ceph1-ec-1]
volume_backend_name = ceph1-ec-1
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf = /etc/ceph/ceph.ec.conf
rbd_user = cinder-ec
rbd_secret_uuid = xxcc8b-xx-ae16xx
rbd_pool = cinder_ec_metadata
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 3
rbd_store_chunk_size = 4
rados_connect_timeout = -1
report_discard_supported = true
rbd_exclusive_cinder_pool = true
enable_deferred_deletion = true
deferred_deletion_delay = 259200
deferred_deletion_purge_interval = 3600

I created three pools (for cinder) like:
ceph osd pool create cinder 512 512 replicated rack_replicated_rule
ceph osd pool create cinder_ec_metadata 6 6 replicated rack_replicated_rule
ceph osd
pool create cinder_ec 512 512 erasure ec32
ceph osd pool set cinder_ec allow_ec_overwrites true

I am able to use backend ceph1-rp3-1 without any errors (create, attach, delete, snapshot). I am also able to create volumes via:
openstack volume create --size 100 --type ec1 myvolume_ec
but I am not able to attach the volume to any instance. I get errors like:
==> libvirtd.log <== 2019-02-15 22:23:01.771+: 27895: error : qemuMonitorJSONCheckError:392 : internal error: unable to execute QEMU command 'device_add': Property 'scsi-hd.drive' can't find value 'drive-scsi0-0-0-3'
My instance has three disks (root, swap and one cinder replicated volume); its libvirt domain XML references the RBD disks nova/6d41c54b-753a-46c7-a573-bedf8822fbf5_disk, nova/6d41c54b-753a-46c7-a573-bedf8822fbf5_disk.swap and cinder/volume-01e8cb68-1f86-4142-958c-fdd1c301833a. Any ideas? All the best, Florian
[ceph-users] CephFS - read latency.
Hi. I've got a bunch of "small" files moved onto CephFS as archive/bulk storage and now I have the backup (to tape) to spool over them. A sample of the single-threaded backup client delivers this very consistent pattern: $ sudo strace -T -p 7307 2>&1 | grep -A 7 -B 3 open write(111, "\377\377\377\377", 4) = 4 <0.11> openat(AT_FDCWD, "/ceph/cluster/rsyncbackups/fileshare.txt", O_RDONLY) = 38 <0.30> write(111, "\0\0\0\021197418 2 67201568", 21) = 21 <0.36> read(38, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.049733> write(111, "\0\1\0\0CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0"..., 65540) = 65540 <0.37> read(38, " $$ $$\16\33\16 \16\33"..., 65536) = 65536 <0.000199> write(111, "\0\1\0\0 $$ $$\16\33\16 $$"..., 65540) = 65540 <0.26> read(38, "$ \33 \16\33\25 \33\33\33 \33\33\33 \25\0\26\2\16NVDOLOVB"..., 65536) = 65536 <0.35> write(111, "\0\1\0\0$ \33 \16\33\25 \33\33\33 \33\33\33 \25\0\26\2\16NVDO"..., 65540) = 65540 <0.24> The pattern is very consistent, thus it is not one PG or one OSD being contended.
$ sudo strace -T -p 7307 2>&1 | grep -A 3 open |grep read read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 11968 <0.070917> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 23232 <0.039789> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0P\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.053598> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 28240 <0.105046> read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.061966> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.050943> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.031217> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 7392 <0.052612> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 288 <0.075930> read(41, "1316919290-DASPHYNBAAPe2218b"..., 65536) = 940 <0.040609> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 22400 <0.038423> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 11984 <0.039051> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 9040 <0.054161> read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.040654> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 22352 <0.031236> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 <0.123424> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 49984 <0.052249> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 28176 <0.052742> read(41, "CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 288 <0.092039> Or to sum: sudo strace -T -p 23748 2>&1 | grep -A 3 open | grep read | perl 
-ane'/<(\d+\.\d+)>/; print $1 . "\n";' | head -n 1000 | ministat
N Min Max Median Avg Stddev
x 1000 3.2e-05 2.141551 0.054313 0.065834359 0.091480339
As can be seen, the "initial" read averages 65.8 ms, which, if the file size is say 1 MB and the rest of the time is close to zero, caps read performance at roughly 20 MB/s. At that pace, the journey through double-digit TBs is long even with 72 OSDs backing the pool. Spec: Ceph Luminous 12.2.5 - Bluestore, 6 OSD nodes, 10TB HDDs, 4+2 EC pool, 10GbitE. Locally the drives deliver latencies of approximately 6-8 ms for a random read. Any suggestion on how to find out where the remaining ~50 ms is being spent would be truly helpful. Large files "just work", as read-ahead does a nice job of getting performance up. -- Jesper
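For reference, the per-stream cap implied by those numbers works out like this (a quick sketch; the 1 MB file size is the hypothetical from the text above, the latency is the ministat average):

```python
# Sketch: single-threaded throughput cap when the first read after
# open() costs ~65.8 ms and the rest of the file is effectively free.
first_read_latency = 0.065834     # seconds (avg from the ministat output)
file_size = 1_000_000             # ~1 MB sample file (hypothetical)

throughput = file_size / first_read_latency   # bytes/s for one stream
print(f"{throughput / 1e6:.1f} MB/s")         # 15.2 MB/s
```

So a single stream lands in the ~15-20 MB/s range regardless of how fast the OSDs are; the cost is dominated by per-file open/first-read latency, not bandwidth.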
Re: [ceph-users] Ceph Nautilus Release T-shirt Design
On Fri, Feb 15, 2019 at 1:39 AM Ilya Dryomov wrote: > On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: > > > > Hi Marc, > > > > You can see previous designs on the Ceph store: > > > > https://www.proforma.com/sdscommunitystore > > Hi Mike, > > This site stopped working during DevConf and hasn't been working since. > I think Greg has contacted some folks about this, but it would be great > if you could follow up because it's been a couple of weeks now... That’s odd because we thought this was resolved by Monday, but I do see from the time stamps I was back in the USA when testing it. It must be geographical as Dan says... :/ > > > Thanks, > > Ilya > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host
Actually I think I misread what this was doing, sorry. Can you do a “ceph osd tree”? It’s hard to see the structure via the text dumps. On Wed, Feb 13, 2019 at 10:49 AM Gregory Farnum wrote: > Your CRUSH rule for EC spools is forcing that behavior with the line > > step chooseleaf indep 1 type ctnr > > If you want different behavior, you’ll need a different crush rule. > > On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2 wrote: > >> Hi, cephers >> >> >> I am building a ceph EC cluster.when a disk is error,I out it.But its all >> PGs remap to the osds in the same host,which I think they should remap to >> other hosts in the same rack. >> test process is: >> >> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2 >> site1_sata_erasure_ruleset 4 >> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1 >> /etc/init.d/ceph stop osd.2 >> ceph osd out 2 >> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2 >> diff /tmp/1 /tmp/2 -y --suppress-common-lines >> >> 0 1.0 1.0 118 osd.0 | 0 1.0 1.0 126 osd.0 >> 1 1.0 1.0 123 osd.1 | 1 1.0 1.0 139 osd.1 >> 2 1.0 1.0 122 osd.2 | 2 1.0 0 0 osd.2 >> 3 1.0 1.0 113 osd.3 | 3 1.0 1.0 131 osd.3 >> 4 1.0 1.0 122 osd.4 | 4 1.0 1.0 136 osd.4 >> 5 1.0 1.0 112 osd.5 | 5 1.0 1.0 127 osd.5 >> 6 1.0 1.0 114 osd.6 | 6 1.0 1.0 128 osd.6 >> 7 1.0 1.0 124 osd.7 | 7 1.0 1.0 136 osd.7 >> 8 1.0 1.0 95 osd.8 | 8 1.0 1.0 113 osd.8 >> 9 1.0 1.0 112 osd.9 | 9 1.0 1.0 119 osd.9 >> TOTAL 3073T 197G | TOTAL 3065T 197G >> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52 >> >> >> some config info: (detail configs see: >> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125) >> jewel 10.2.11 filestore+rocksdb >> >> ceph osd erasure-code-profile get ISA-4-2 >> k=4 >> m=2 >> plugin=isa >> ruleset-failure-domain=ctnr >> ruleset-root=site1-sata >> technique=reed_sol_van >> >> part of ceph.conf is: >> >> [global] >> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900 >> auth cluster required = cephx >> auth service required = 
cephx >> auth client required = cephx >> pid file = /home/ceph/var/run/$name.pid >> log file = /home/ceph/log/$cluster-$name.log >> mon osd nearfull ratio = 0.85 >> mon osd full ratio = 0.95 >> admin socket = /home/ceph/var/run/$cluster-$name.asok >> osd pool default size = 3 >> osd pool default min size = 1 >> osd objectstore = filestore >> filestore merge threshold = -10 >> >> [mon] >> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring >> mon data = /home/ceph/var/lib/$type/$cluster-$id >> mon cluster log file = /home/ceph/log/$cluster.log >> [osd] >> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring >> osd data = /home/ceph/var/lib/$type/$cluster-$id >> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal >> osd journal size = 1 >> osd mkfs type = xfs >> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k >> osd backfill full ratio = 0.92 >> osd failsafe full ratio = 0.95 >> osd failsafe nearfull ratio = 0.85 >> osd max backfills = 1 >> osd crush update on start = false >> osd op thread timeout = 60 >> filestore split multiple = 8 >> filestore max sync interval = 15 >> filestore min sync interval = 5 >> [osd.0] >> host = cld-osd1-56 >> addr = X >> user = ceph >> devs = /disk/link/osd-0/data >> osd journal = /disk/link/osd-0/journal >> ……. 
>> [osd.503] >> host = cld-osd42-56 >> addr = 10.108.87.52 >> user = ceph >> devs = /disk/link/osd-503/data >> osd journal = /disk/link/osd-503/journal >> >> >> crushmap is below: >> >> # begin crush map >> tunable choose_local_tries 0 >> tunable choose_local_fallback_tries 0 >> tunable choose_total_tries 50 >> tunable chooseleaf_descend_once 1 >> tunable chooseleaf_vary_r 1 >> tunable straw_calc_version 1 >> tunable allowed_bucket_algs 54 >> >> # devices >> device 0 osd.0 >> device 1 osd.1 >> device 2 osd.2 >> 。。。 >> device 502 osd.502 >> device 503 osd.503 >> >> # types >> type 0 osd # osd >> type 1 ctnr # sata/ssd group by node, -101~1xx/-201~2xx >> type 2 media# sata/ssd group by rack, -11~1x/-21~2x >> type 3 mediagroup # sata/ssd group by site, -5/-6 >> type 4 unit # site, -2 >> type 5 root # root, -1 >> >> # buckets >> ctnr cld-osd1-56-sata { >> id -101 # do not change unnecessarily >> # weight 10.000 >> alg straw2 >> hash 0 # rjenkins1 >> item osd.0 weight 1.000 >> item osd.1 weight 1.000 >> item osd.2 weight 1.000 >> item osd.3 weight 1.000 >> item osd.4 weight 1.000 >> item osd.5 weight 1.000 >> item osd.6 weight 1.000 >> item osd.7 weight 1.000 >> item osd.8 weight 1.000 >> item osd.9 weight 1.000 >> } >> ctnr cld-osd1-56-ssd { >> id -201 # do not change unnecessarily >> # weight 2.000 >> alg straw2 >> hash 0 # rjenkins1 >>
Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?
The answer is probably going to be in how big your DB partition is vs how big your HDD disk is. From your output it looks like you have a 6TB HDD with a 28GB Blocks.DB partition. Even though the DB used size isn't currently full, I would guess that at some point since this OSD was created that it did fill up and what you're seeing is the part of the DB that spilled over to the data disk. This is why the official recommendation (that is quite cautious, but cautious because some use cases will use this up) for a blocks.db partition is 4% of the data drive. For your 6TB disks that's a recommendation of 240GB per DB partition. Of course the actual size of the DB needed is dependent on your use case. But pretty much every use case for a 6TB disk needs a bigger partition than 28GB. On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin wrote: > Wrong metadata paste of osd.73 in previous message. > > > { > > "id": 73, > "arch": "x86_64", > "back_addr": "10.10.10.6:6804/175338", > "back_iface": "vlan3", > "bluefs": "1", > "bluefs_db_access_mode": "blk", > "bluefs_db_block_size": "4096", > "bluefs_db_dev": "259:22", > "bluefs_db_dev_node": "nvme2n1", > "bluefs_db_driver": "KernelDevice", > "bluefs_db_model": "INTEL SSDPEDMD400G4 ", > "bluefs_db_partition_path": "/dev/nvme2n1p11", > "bluefs_db_rotational": "0", > "bluefs_db_serial": "CVFT4324002Q400BGN ", > "bluefs_db_size": "30064771072", > "bluefs_db_type": "nvme", > "bluefs_single_shared_device": "0", > "bluefs_slow_access_mode": "blk", > "bluefs_slow_block_size": "4096", > "bluefs_slow_dev": "8:176", > "bluefs_slow_dev_node": "sdl", > "bluefs_slow_driver": "KernelDevice", > "bluefs_slow_model": "TOSHIBA HDWE160 ", > "bluefs_slow_partition_path": "/dev/sdl2", > "bluefs_slow_rotational": "1", > "bluefs_slow_size": "6001069199360", > "bluefs_slow_type": "hdd", > "bluefs_wal_access_mode": "blk", > "bluefs_wal_block_size": "4096", > "bluefs_wal_dev": "259:22", > "bluefs_wal_dev_node": "nvme2n1", > "bluefs_wal_driver": 
"KernelDevice", > "bluefs_wal_model": "INTEL SSDPEDMD400G4 ", > "bluefs_wal_partition_path": "/dev/nvme2n1p12", > "bluefs_wal_rotational": "0", > "bluefs_wal_serial": "CVFT4324002Q400BGN ", > "bluefs_wal_size": "1073741824", > "bluefs_wal_type": "nvme", > "bluestore_bdev_access_mode": "blk", > "bluestore_bdev_block_size": "4096", > "bluestore_bdev_dev": "8:176", > "bluestore_bdev_dev_node": "sdl", > "bluestore_bdev_driver": "KernelDevice", > "bluestore_bdev_model": "TOSHIBA HDWE160 ", > "bluestore_bdev_partition_path": "/dev/sdl2", > "bluestore_bdev_rotational": "1", > "bluestore_bdev_size": "6001069199360", > "bluestore_bdev_type": "hdd", > "ceph_version": "ceph version 12.2.10 > (177915764b752804194937482a39e95e0ca3de94) luminous (stable)", > "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz", > "default_device_class": "hdd", > "distro": "centos", > "distro_description": "CentOS Linux 7 (Core)", > "distro_version": "7", > "front_addr": "172.16.16.16:6803/175338", > "front_iface": "vlan4", > "hb_back_addr": "10.10.10.6:6805/175338", > "hb_front_addr": "172.16.16.16:6805/175338", > "hostname": "ceph-osd5", > "journal_rotational": "0", > "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018", > "kernel_version": "3.10.0-862.11.6.el7.x86_64", > "mem_swap_kb": "0", > "mem_total_kb": "65724256", > "os": "Linux", > "osd_data": "/var/lib/ceph/osd/ceph-73", > "osd_objectstore": "bluestore", > "rotational": "1" > } > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
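One commonly cited explanation for spillover on DB partitions of this size is RocksDB's level sizing: with the default `max_bytes_for_level_base` of 256 MB and a level multiplier of 10, BlueFS keeps a level on the fast device only if the whole level fits, so useful DB sizes cluster near ~3 GB, ~30 GB and ~300 GB, and compaction needs transient extra space on top. A sketch of that arithmetic (the defaults are an assumption; this cluster's RocksDB options are not shown in the thread):

```python
# Sketch: cumulative RocksDB level sizes under assumed default settings
# (max_bytes_for_level_base=256MB, max_bytes_for_level_multiplier=10),
# compared with the ~28 GiB blocks.db partition from this thread.
base = 256 * 1024**2
levels = [base * 10**i for i in range(4)]     # L1..L4 sizes in bytes

db_partition = 30_064_771_072                 # bluefs_db_size from the metadata
cum = 0
for i, size in enumerate(levels, start=1):
    cum += size
    fits = cum <= db_partition
    print(f"L{i}: {size/1e9:.2f} GB, cumulative {cum/1e9:.2f} GB, fits={fits}")
```

L1 through L3 only just fit in this partition (~29.8 GB cumulative vs ~30.1 GB available), which may explain why one OSD stays on the fast device while another, after compaction pressure, spills to the slow one.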
Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host
I'm leaving the response on the CRUSH rule to Gregory, but you have another problem that is causing more of this data to stay on this node than you intend. While you `out` the OSD, it still contributes to the host's weight, so the host is still set to receive that amount of data and distribute it among its remaining disks. This is the default behavior (even if you `destroy` the OSD) to minimize data movement for losing the disk, and again for adding it back into the cluster after you replace the device. If you are really strapped for space, though, you might consider fully purging the OSD, which will reduce the host's weight to the sum of the other OSDs. However, if you do have a problem in your CRUSH rule, doing this won't change anything for you. On Thu, Feb 14, 2019 at 11:15 PM hnuzhoulin2 wrote: > Thanks. I read your reply in > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg48717.html > so using indep will cause fewer data remaps when an osd fails. > using firstn: 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6 , 60% data remap > using indep : 1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5, 25% data remap > > Am I right? > If so, what do you recommend when a disk fails and the total available > size of the remaining disks in the machine is not enough (the failed disk > cannot be replaced immediately)? Or should I reserve more available space > in the EC situation? > > On 02/14/2019 02:49, Gregory Farnum > wrote: > > Your CRUSH rule for EC pools is forcing that behavior with the line > > step chooseleaf indep 1 type ctnr > > If you want different behavior, you’ll need a different crush rule. > > On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2 wrote: > >> Hi, cephers >> >> >> I am building a ceph EC cluster. When a disk fails, I mark it out, but all of its >> PGs remap to OSDs in the same host, where I think they should remap to >> other hosts in the same rack. 
>> test process is: >> >> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2 >> site1_sata_erasure_ruleset 4 >> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1 >> /etc/init.d/ceph stop osd.2 >> ceph osd out 2 >> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2 >> diff /tmp/1 /tmp/2 -y --suppress-common-lines >> >> 0 1.0 1.0 118 osd.0 | 0 1.0 1.0 126 osd.0 >> 1 1.0 1.0 123 osd.1 | 1 1.0 1.0 139 osd.1 >> 2 1.0 1.0 122 osd.2 | 2 1.0 0 0 osd.2 >> 3 1.0 1.0 113 osd.3 | 3 1.0 1.0 131 osd.3 >> 4 1.0 1.0 122 osd.4 | 4 1.0 1.0 136 osd.4 >> 5 1.0 1.0 112 osd.5 | 5 1.0 1.0 127 osd.5 >> 6 1.0 1.0 114 osd.6 | 6 1.0 1.0 128 osd.6 >> 7 1.0 1.0 124 osd.7 | 7 1.0 1.0 136 osd.7 >> 8 1.0 1.0 95 osd.8 | 8 1.0 1.0 113 osd.8 >> 9 1.0 1.0 112 osd.9 | 9 1.0 1.0 119 osd.9 >> TOTAL 3073T 197G | TOTAL 3065T 197G >> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52 >> >> >> some config info: (detail configs see: >> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125) >> jewel 10.2.11 filestore+rocksdb >> >> ceph osd erasure-code-profile get ISA-4-2 >> k=4 >> m=2 >> plugin=isa >> ruleset-failure-domain=ctnr >> ruleset-root=site1-sata >> technique=reed_sol_van >> >> part of ceph.conf is: >> >> [global] >> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900 >> auth cluster required = cephx >> auth service required = cephx >> auth client required = cephx >> pid file = /home/ceph/var/run/$name.pid >> log file = /home/ceph/log/$cluster-$name.log >> mon osd nearfull ratio = 0.85 >> mon osd full ratio = 0.95 >> admin socket = /home/ceph/var/run/$cluster-$name.asok >> osd pool default size = 3 >> osd pool default min size = 1 >> osd objectstore = filestore >> filestore merge threshold = -10 >> >> [mon] >> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring >> mon data = /home/ceph/var/lib/$type/$cluster-$id >> mon cluster log file = /home/ceph/log/$cluster.log >> [osd] >> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring >> osd data = 
/home/ceph/var/lib/$type/$cluster-$id >> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal >> osd journal size = 1 >> osd mkfs type = xfs >> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k >> osd backfill full ratio = 0.92 >> osd failsafe full ratio = 0.95 >> osd failsafe nearfull ratio = 0.85 >> osd max backfills = 1 >> osd crush update on start = false >> osd op thread timeout = 60 >> filestore split multiple = 8 >> filestore max sync interval = 15 >> filestore min sync interval = 5 >> [osd.0] >> host = cld-osd1-56 >> addr = X >> user = ceph >> devs = /disk/link/osd-0/data >> osd journal = /disk/link/osd-0/journal >> ……. >> [osd.503] >> host = cld-osd42-56 >> addr = 10.108.87.52 >> user = ceph >> devs = /disk/link/osd-503/data >> osd journal = /disk/link/osd-503/journal >> >> >> crushmap is below: >> >> # begin crush map >>
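The firstn-vs-indep difference quoted earlier in this thread can be illustrated with a toy comparison of acting sets (this is not real CRUSH, just position-by-position counting; for EC pools position matters, so only slots whose OSD changed have to move data):

```python
# Toy model of the firstn vs. indep remap difference from the thread.
def moved_fraction(before, after):
    """Fraction of acting-set positions whose OSD changed."""
    moved = sum(1 for a, b in zip(before, after) if a != b)
    return moved / len(before)

# firstn: survivors shift left to fill the hole left by osd.3
print(moved_fraction([1, 2, 3, 4, 5], [1, 2, 4, 5, 6]))   # 0.6
# indep: the replacement goes into the failed slot, others stay put
print(moved_fraction([1, 2, 3, 4, 5], [1, 2, 6, 4, 5]))   # 0.2
```

In this 5-slot example firstn moves 60% of the positions and indep only one slot in five, which is why indep is the right choice for EC pools.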
Re: [ceph-users] Problems with osd creation in Ubuntu 18.04, ceph 13.2.4-1bionic
I have found that running a zap before all prepare/create commands with ceph-volume helps things run more smoothly. Zap is specifically there to clear everything on a disk away to make the disk ready to be used as an OSD. Your wipefs command is still fine, but I would then lvm zap the disk before continuing. I would run the commands like this [1]. I also prefer the single command lvm create as opposed to lvm prepare and lvm activate. Try that out and see if you still run into the problems creating the BlueStore filesystem.

[1]
ceph-volume lvm zap /dev/sdg
ceph-volume lvm prepare --bluestore --data /dev/sdg

On Thu, Feb 14, 2019 at 10:25 AM Rainer Krienke wrote: > Hi, > > I am quite new to ceph and am just trying to set up a ceph cluster. Initially > I used ceph-deploy for this, but when I tried to create a BlueStore osd, > ceph-deploy failed. Next I tried the direct way on one of the OSD nodes > using ceph-volume to create the osd, but this also fails. Below you can > see what ceph-volume says. > > I ensured that there was no left-over LVM VG and LV on the disk sdg > before I started the osd creation for this disk. The very same error > happens also on other disks, not just /dev/sdg. All the disks are 4TB > in size, the linux system is Ubuntu 18.04, and ceph is > installed in version 13.2.4-1bionic from this repo: > https://download.ceph.com/debian-mimic. > > There is a VG and two LVs on the system for the Ubuntu system itself, > which is installed on two separate disks configured as software RAID1 with > LVM on top of the raid. But I cannot imagine that this might do any harm > to ceph's osd creation. > > Does anyone have an idea what might be wrong? 
> > Thanks for hints > Rainer > > root@ceph1:~# wipefs -fa /dev/sdg > root@ceph1:~# ceph-volume lvm prepare --bluestore --data /dev/sdg > Running command: /usr/bin/ceph-authtool --gen-print-key > Running command: /usr/bin/ceph --cluster ceph --name > client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring > -i - osd new 14d041d6-0beb-4056-8df2-3920e2febce0 > Running command: /sbin/vgcreate --force --yes > ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b /dev/sdg > stdout: Physical volume "/dev/sdg" successfully created. > stdout: Volume group "ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b" > successfully created > Running command: /sbin/lvcreate --yes -l 100%FREE -n > osd-block-14d041d6-0beb-4056-8df2-3920e2febce0 > ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b > stdout: Logical volume "osd-block-14d041d6-0beb-4056-8df2-3920e2febce0" > created. > Running command: /usr/bin/ceph-authtool --gen-print-key > Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0 > --> Absolute path not found for executable: restorecon > --> Ensure $PATH environment variable contains common executable locations > Running command: /bin/chown -h ceph:ceph > > /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0 > Running command: /bin/chown -R ceph:ceph /dev/dm-8 > Running command: /bin/ln -s > > /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0 > /var/lib/ceph/osd/ceph-0/block > Running command: /usr/bin/ceph --cluster ceph --name > client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring > mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap > stderr: got monmap epoch 1 > Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-0/keyring > --create-keyring --name osd.0 --add-key > AQAAY2VcU968HxAAvYWMaJZmriUc4H9bCCp8XQ== > stdout: creating /var/lib/ceph/osd/ceph-0/keyring > added entity osd.0 auth auth(auid = 18446744073709551615 > 
key=AQAAY2VcU968HxAAvYWMaJZmriUc4H9bCCp8XQ== with 0 caps) > Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring > Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/ > Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore > bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap > --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid > 14d041d6-0beb-4056-8df2-3920e2febce0 --setuser ceph --setgroup ceph > stderr: 2019-02-14 13:45:54.788 7f3fcecb3240 -1 > bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid > stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: In > function 'virtual int KernelDevice::read(uint64_t, uint64_t, > ceph::bufferlist*, IOContext*, bool)' thread 7f3fcecb3240 time > 2019-02-14 13:45:54.841130 > stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: 821: > FAILED assert((uint64_t)r == len) > stderr: ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) > mimic (stable) > stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, > char const*)+0x102) [0x7f3fc60d33e2] > stderr: 2: (()+0x26d5a7) [0x7f3fc60d35a7] > stderr: 3: (KernelDevice::read(unsigned long, unsigned long, > ceph::buffer::list*, IOContext*, bool)+0x4a7) [0x561371346817] > stderr: 4:
[ceph-users] Second radosgw install
Hi, I want to install a second radosgw on another server for my existing ceph cluster (mimic). Should I create it like the first one, with 'ceph-deploy rgw create'? I don't want to mess with the existing rgw system pools. Thanks.
Re: [ceph-users] ceph osd commit latency increase over time, until restart
On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: >>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe >>> OSDs as well. Over time their latency increased until we started to >>> notice I/O-wait inside VMs. > > I'm also noticing it in the VMs. BTW, what is your NVMe disk size? Samsung PM983 3.84TB SSDs in both clusters. > > >>> A restart fixed it. We also increased memory target from 4G to 6G on >>> these OSDs as the memory would allow it. > > I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. > (my last test was 8gb with 1osd of 6TB, but that didn't help) There are 10 OSDs in these systems with 96GB of memory in total. We are running with the memory target at 6G right now to make sure there is no leakage. If this runs fine for a longer period we will go to 8GB per OSD so it will max out at 80GB, leaving 16GB as spare. As these OSDs were all restarted earlier this week I can't tell how it will hold up over a longer period. Monitoring (Zabbix) shows the latency is fine at the moment. Wido > > > - Mail original - > De: "Wido den Hollander" > À: "Alexandre Derumier" , "Igor Fedotov" > > Cc: "ceph-users" , "ceph-devel" > > Envoyé: Vendredi 15 Février 2019 14:50:34 > Objet: Re: [ceph-users] ceph osd commit latency increase over time, until > restart > > On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: >> Thanks Igor. >> >> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is >> different. >> >> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't >> see this latency problem. >> >> > > Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe > OSDs as well. Over time their latency increased until we started to > notice I/O-wait inside VMs. > > A restart fixed it. We also increased memory target from 4G to 6G on > these OSDs as the memory would allow it. > > But we noticed this on two different 12.2.10/11 clusters. > > A restart made the latency drop. 
Not only the numbers, but the > real-world latency as experienced by a VM as well. > > Wido > >> >> >> >> >> >> - Mail original - >> De: "Igor Fedotov" >> Cc: "ceph-users" , "ceph-devel" >> >> Envoyé: Vendredi 15 Février 2019 13:47:57 >> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until >> restart >> >> Hi Alexander, >> >> I've read through your reports, nothing obvious so far. >> >> I can only see several times average latency increase for OSD write ops >> (in seconds) >> 0.002040060 (first hour) vs. >> >> 0.002483516 (last 24 hours) vs. >> 0.008382087 (last hour) >> >> subop_w_latency: >> 0.000478934 (first hour) vs. >> 0.000537956 (last 24 hours) vs. >> 0.003073475 (last hour) >> >> and OSD read ops, osd_r_latency: >> >> 0.000408595 (first hour) >> 0.000709031 (24 hours) >> 0.004979540 (last hour) >> >> What's interesting is that such latency differences aren't observed at >> neither BlueStore level (any _lat params under "bluestore" section) nor >> rocksdb one. >> >> Which probably means that the issue is rather somewhere above BlueStore. >> >> Suggest to proceed with perf dumps collection to see if the picture >> stays the same. >> >> W.r.t. memory usage you observed I see nothing suspicious so far - No >> decrease in RSS report is a known artifact that seems to be safe. >> >> Thanks, >> Igor >> >> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: >>> Hi Igor, >>> >>> Thanks again for helping ! >>> >>> >>> >>> I have upgrade to last mimic this weekend, and with new autotune memory, >>> I have setup osd_memory_target to 8G. 
(my nvme are 6TB) >>> >>> >>> I have done a lot of perf dump and mempool dump and ps of process to >> see rss memory at different hours, >>> here the reports for osd.0: >>> >>> http://odisoweb1.odiso.net/perfanalysis/ >>> >>> >>> osd has been started the 12-02-2019 at 08:00 >>> >>> first report after 1h running >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt >>> >> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt >> >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt >>> >>> >>> >>> report after 24 before counter resets >>> >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt >>> >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt >> >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt >>> >>> report 1h after counter reset >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt >>> >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt >> >>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt >>> >>> >>> >>> >>> I'm seeing the bluestore buffer bytes memory increasing up to 4G >> around 12-02-2019 at 14:00 >>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png >>> Then after that, slowly
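Wido's memory budget above is simple arithmetic worth making explicit, with the caveat that osd_memory_target is a best-effort target rather than a hard cap, so leaving headroom matters. A tiny sketch of the numbers from his message:

```python
# Memory budget from Wido's message: 10 OSDs in a 96 GB host,
# planning to raise osd_memory_target from 6 GB to 8 GB per OSD.
osds = 10
mem_total_gb = 96
target_gb = 8

used = osds * target_gb          # worst-case OSD memory at the target
spare = mem_total_gb - used      # headroom for kernel, page cache, spikes

print(used, spare)  # 80 GB for OSDs, 16 GB spare
```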
Re: [ceph-users] ceph osd commit latency increase over time, until restart
>>Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe >>OSDs as well. Over time their latency increased until we started to >>notice I/O-wait inside VMs. I'm also noticing it in the VMs. BTW, what is your NVMe disk size? >>A restart fixed it. We also increased memory target from 4G to 6G on >>these OSDs as the memory would allow it. I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. (my last test was 8gb with 1osd of 6TB, but that didn't help) - Mail original - De: "Wido den Hollander" À: "Alexandre Derumier" , "Igor Fedotov" Cc: "ceph-users" , "ceph-devel" Envoyé: Vendredi 15 Février 2019 14:50:34 Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: > Thanks Igor. > > I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is > different. > > I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't > see this latency problem. > > Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe OSDs as well. Over time their latency increased until we started to notice I/O-wait inside VMs. A restart fixed it. We also increased memory target from 4G to 6G on these OSDs as the memory would allow it. But we noticed this on two different 12.2.10/11 clusters. A restart made the latency drop. Not only the numbers, but the real-world latency as experienced by a VM as well. Wido > > > > > > - Mail original - > De: "Igor Fedotov" > Cc: "ceph-users" , "ceph-devel" > > Envoyé: Vendredi 15 Février 2019 13:47:57 > Objet: Re: [ceph-users] ceph osd commit latency increase over time, until > restart > > Hi Alexander, > > I've read through your reports, nothing obvious so far. > > I can only see several times average latency increase for OSD write ops > (in seconds) > 0.002040060 (first hour) vs. > > 0.002483516 (last 24 hours) vs. > 0.008382087 (last hour) > > subop_w_latency: > 0.000478934 (first hour) vs. 
> 0.000537956 (last 24 hours) vs. > 0.003073475 (last hour) > > and OSD read ops, osd_r_latency: > > 0.000408595 (first hour) > 0.000709031 (24 hours) > 0.004979540 (last hour) > > What's interesting is that such latency differences aren't observed at > neither BlueStore level (any _lat params under "bluestore" section) nor > rocksdb one. > > Which probably means that the issue is rather somewhere above BlueStore. > > Suggest to proceed with perf dumps collection to see if the picture > stays the same. > > W.r.t. memory usage you observed I see nothing suspicious so far - No > decrease in RSS report is a known artifact that seems to be safe. > > Thanks, > Igor > > On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: >> Hi Igor, >> >> Thanks again for helping ! >> >> >> >> I have upgrade to last mimic this weekend, and with new autotune memory, >> I have setup osd_memory_target to 8G. (my nvme are 6TB) >> >> >> I have done a lot of perf dump and mempool dump and ps of process to > see rss memory at different hours, >> here the reports for osd.0: >> >> http://odisoweb1.odiso.net/perfanalysis/ >> >> >> osd has been started the 12-02-2019 at 08:00 >> >> first report after 1h running >> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt >> > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt > >> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt >> >> >> >> report after 24 before counter resets >> >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt >> > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt > >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt >> >> report 1h after counter reset >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt >> > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt > >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt >> >> >> 
>> >> I'm seeing the bluestore buffer bytes memory increasing up to 4G > around 12-02-2019 at 14:00 >> http://odisoweb1.odiso.net/perfanalysis/graphs2.png >> Then after that, slowly decreasing. >> >> >> Another strange thing, >> I'm seeing total bytes at 5G at 12-02-2018.13:30 >> > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt > >> Then is decreasing over time (around 3,7G this morning), but RSS is > still at 8G >> >> >> I'm graphing mempools counters too since yesterday, so I'll able to > track them over time. >> >> - Mail original - >> De: "Igor Fedotov" >> À: "Alexandre Derumier" >> Cc: "Sage Weil" , "ceph-users" > , "ceph-devel" >> Envoyé: Lundi 11 Février 2019 12:03:17 >> Objet: Re: [ceph-users] ceph osd commit latency increase over time, > until restart >> >> On 2/8/2019 6:57
[ceph-users] mount.ceph replacement in Python
Hi! I've created a mount.ceph.c replacement in Python which also utilizes the kernel keyring and does name resolution. You can mount a CephFS without installing Ceph that way (and without using the legacy secret= mount option). https://github.com/SFTtech/ceph-mount When you place the script (or a symlink) in /sbin/mount.ceph, you can mount CephFS with systemd .mount units. I hope it's useful for somebody here someday :) Currently it's not optimized for proper packaging (no setup.py yet). If things don't work or you want to change something, please just open bugs or pull requests. -- Jonas
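For anyone curious what such a replacement has to do under the hood: the kernel CephFS client wants a mount source of the form "mon1:6789,mon2:6789:/path", so the script has to resolve monitor names before calling mount. A minimal sketch of that step (hypothetical helper, not code from the linked repo):

```python
import socket

def build_mount_source(monitors, path="/", port=6789):
    # Resolve each monitor name to an IPv4 address and join them into
    # the source string the kernel CephFS client expects:
    # "a.b.c.d:6789,e.f.g.h:6789:/path"
    addrs = ["%s:%d" % (socket.gethostbyname(m), port) for m in monitors]
    return ",".join(addrs) + ":" + path

# "localhost" stands in for a real monitor hostname here.
print(build_mount_source(["localhost"], path="/volumes"))  # 127.0.0.1:6789:/volumes
```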
Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph
Bingo! I changed the disk to SCSI and the adapter to virtio, and it is working perfectly. Thank you Mark! Regards, Gesiel On Fri, 15 Feb 2019 at 10:21, Marc Roos wrote: > > Use scsi disk and virtio adapter? I think that is recommended also for > use with ceph rbd. > > > > -Original Message- > From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] > Sent: 15 February 2019 13:16 > To: Marc Roos > Cc: ceph-users > Subject: Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph > > Hi Marc, > > I tried this and the problem continues :-( > > > On Fri, 15 Feb 2019 at 10:04, Marc Roos > wrote: > > > > > And then in the windows vm > cmd > diskpart > Rescan > > Linux vm > echo 1 > /sys/class/scsi_device/2\:0\:0\:0/device/rescan (sda) > echo 1 > /sys/class/scsi_device/2\:0\:3\:0/device/rescan (sdd) > > > > I have this too, and have to do this too: > > virsh qemu-monitor-command vps-test2 --hmp "info block" > virsh qemu-monitor-command vps-test2 --hmp "block_resize > drive-scsi0-0-0-0 12G" > > > > > > -Original Message- > From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] > Sent: 15 February 2019 12:59 > To: ceph-users@lists.ceph.com > Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph > > Hi, > > I'm building an environment for VMs with qemu/kvm and Ceph using RBD, > and > I have the following problem: the guest VM does not recognize a disk > resize > (increase). The scenario is: > > Host: > Centos 7.6 > Libvirt 4.5 > Ceph 13.2.4 > > I follow these steps to increase the disk (ex: disk 10Gb > to > 20Gb): > > > # rbd resize --size 20480 mypool/vm_test # virsh blockresize > --domain > vm_test --path vda --size 20G > > But after these steps, the disk in the VM keeps its original size. > To > apply the change, a VM reboot is necessary. > If I use a local datastore instead of Ceph, the VM recognizes the new size > immediately. > > Does anyone have this? Is this expected? 
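One detail worth making explicit in the steps above: `rbd resize --size` takes megabytes by default, which is why a 20 GiB target is written as `--size 20480`. A tiny sketch of the conversion (helper name is made up):

```python
def rbd_size_mb(gib):
    # rbd's --size argument defaults to megabytes (MiB),
    # so convert a GiB target to the value the CLI expects.
    return gib * 1024

# The resize from the thread: the image grown to 20 GiB.
print("rbd resize --size %d mypool/vm_test" % rbd_size_mb(20))
```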
Re: [ceph-users] ceph osd commit latency increase over time, until restart
On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: > Thanks Igor. > > I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is > different. > > I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't > see this latency problem. > > Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe OSDs as well. Over time their latency increased until we started to notice I/O-wait inside VMs. A restart fixed it. We also increased memory target from 4G to 6G on these OSDs as the memory would allow it. But we noticed this on two different 12.2.10/11 clusters. A restart made the latency drop. Not only the numbers, but the real-world latency as experienced by a VM as well. Wido > > > > > > - Mail original - > De: "Igor Fedotov" > Cc: "ceph-users" , "ceph-devel" > > Envoyé: Vendredi 15 Février 2019 13:47:57 > Objet: Re: [ceph-users] ceph osd commit latency increase over time, until > restart > > Hi Alexander, > > I've read through your reports, nothing obvious so far. > > I can only see several times average latency increase for OSD write ops > (in seconds) > 0.002040060 (first hour) vs. > > 0.002483516 (last 24 hours) vs. > 0.008382087 (last hour) > > subop_w_latency: > 0.000478934 (first hour) vs. > 0.000537956 (last 24 hours) vs. > 0.003073475 (last hour) > > and OSD read ops, osd_r_latency: > > 0.000408595 (first hour) > 0.000709031 (24 hours) > 0.004979540 (last hour) > > What's interesting is that such latency differences aren't observed at > neither BlueStore level (any _lat params under "bluestore" section) nor > rocksdb one. > > Which probably means that the issue is rather somewhere above BlueStore. > > Suggest to proceed with perf dumps collection to see if the picture > stays the same. > > W.r.t. memory usage you observed I see nothing suspicious so far - No > decrease in RSS report is a known artifact that seems to be safe. 
> > Thanks, > Igor > > On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: >> Hi Igor, >> >> Thanks again for helping ! >> >> >> >> I have upgrade to last mimic this weekend, and with new autotune memory, >> I have setup osd_memory_target to 8G. (my nvme are 6TB) >> >> >> I have done a lot of perf dump and mempool dump and ps of process to > see rss memory at different hours, >> here the reports for osd.0: >> >> http://odisoweb1.odiso.net/perfanalysis/ >> >> >> osd has been started the 12-02-2019 at 08:00 >> >> first report after 1h running >> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt >> > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt > >> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt >> >> >> >> report after 24 before counter resets >> >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt >> > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt > >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt >> >> report 1h after counter reset >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt >> > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt > >> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt >> >> >> >> >> I'm seeing the bluestore buffer bytes memory increasing up to 4G > around 12-02-2019 at 14:00 >> http://odisoweb1.odiso.net/perfanalysis/graphs2.png >> Then after that, slowly decreasing. >> >> >> Another strange thing, >> I'm seeing total bytes at 5G at 12-02-2018.13:30 >> > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt > >> Then is decreasing over time (around 3,7G this morning), but RSS is > still at 8G >> >> >> I'm graphing mempools counters too since yesterday, so I'll able to > track them over time. 
>> >> - Mail original - >> De: "Igor Fedotov" >> À: "Alexandre Derumier" >> Cc: "Sage Weil" , "ceph-users" > , "ceph-devel" >> Envoyé: Lundi 11 Février 2019 12:03:17 >> Objet: Re: [ceph-users] ceph osd commit latency increase over time, > until restart >> >> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: >>> another mempool dump after 1h run. (latency ok) >>> >>> Biggest difference: >>> >>> before restart >>> - >>> "bluestore_cache_other": { >>> "items": 48661920, >>> "bytes": 1539544228 >>> }, >>> "bluestore_cache_data": { >>> "items": 54, >>> "bytes": 643072 >>> }, >>> (other caches seem to be quite low too, like bluestore_cache_other > take all the memory) >>> >>> >>> After restart >>> - >>> "bluestore_cache_other": { >>> "items": 12432298, >>> "bytes": 500834899 >>> }, >>> "bluestore_cache_data": { >>> "items": 40084, >>> "bytes": 1056235520 >>> }, >>> >> This is fine as cache is warming after restart and some rebalancing >> between data and metadata might occur. >> >> What relates to allocator and most
[ceph-users] Files in CephFS data pool
Is there any way to find out which files are stored in a CephFS data pool? I know you can reference the extended attributes, but those are only relevant for files created after the ceph.dir.layout.pool or ceph.file.layout.pool attributes are set - I need to know about all the files in a pool. Thanks! -TJ Ragan
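One approach, assuming CephFS's usual data-object naming scheme of "&lt;inode in hex&gt;.&lt;stripe index&gt;": list the objects with `rados -p <datapool> ls`, fold the names down to unique inode numbers, then map inodes back to paths with `find <mountpoint> -inum <n>`. A rough sketch of the name-folding step:

```python
def inodes_from_objects(object_names):
    # CephFS data objects are named "<inode in hex>.<stripe index>",
    # e.g. "10000000abc.00000000"; collapse them to unique inode numbers.
    inodes = set()
    for name in object_names:
        ino_hex = name.partition(".")[0]
        inodes.add(int(ino_hex, 16))
    return sorted(inodes)

objs = ["10000000abc.00000000", "10000000abc.00000001", "10000000abd.00000000"]
print(inodes_from_objects(objs))  # two distinct inodes
```

Note this is slow on large pools (one `find -inum` per inode), but it covers every file regardless of when layout attributes were set.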
Re: [ceph-users] ceph osd commit latency increase over time, until restart
Thanks Igor. I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is different. I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't see this latency problem. - Mail original - De: "Igor Fedotov" Cc: "ceph-users" , "ceph-devel" Envoyé: Vendredi 15 Février 2019 13:47:57 Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart Hi Alexander, I've read through your reports, nothing obvious so far. I can only see several times average latency increase for OSD write ops (in seconds) 0.002040060 (first hour) vs. 0.002483516 (last 24 hours) vs. 0.008382087 (last hour) subop_w_latency: 0.000478934 (first hour) vs. 0.000537956 (last 24 hours) vs. 0.003073475 (last hour) and OSD read ops, osd_r_latency: 0.000408595 (first hour) 0.000709031 (24 hours) 0.004979540 (last hour) What's interesting is that such latency differences aren't observed at neither BlueStore level (any _lat params under "bluestore" section) nor rocksdb one. Which probably means that the issue is rather somewhere above BlueStore. Suggest to proceed with perf dumps collection to see if the picture stays the same. W.r.t. memory usage you observed I see nothing suspicious so far - No decrease in RSS report is a known artifact that seems to be safe. Thanks, Igor On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: > Hi Igor, > > Thanks again for helping ! > > > > I have upgrade to last mimic this weekend, and with new autotune memory, > I have setup osd_memory_target to 8G. 
(my nvme are 6TB) > > > I have done a lot of perf dump and mempool dump and ps of process to see rss memory at different hours, > here the reports for osd.0: > > http://odisoweb1.odiso.net/perfanalysis/ > > > osd has been started the 12-02-2019 at 08:00 > > first report after 1h running > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt > > > > report after 24 before counter resets > > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt > > report 1h after counter reset > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt > http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt > > > > > I'm seeing the bluestore buffer bytes memory increasing up to 4G around 12-02-2019 at 14:00 > http://odisoweb1.odiso.net/perfanalysis/graphs2.png > Then after that, slowly decreasing. > > > Another strange thing, > I'm seeing total bytes at 5G at 12-02-2018.13:30 > http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt > Then is decreasing over time (around 3,7G this morning), but RSS is still at 8G > > > I'm graphing mempools counters too since yesterday, so I'll able to track them over time. > > - Mail original - > De: "Igor Fedotov" > À: "Alexandre Derumier" > Cc: "Sage Weil" , "ceph-users" , "ceph-devel" > Envoyé: Lundi 11 Février 2019 12:03:17 > Objet: Re: [ceph-users] ceph osd commit latency increase over time, until restart > > On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: >> another mempool dump after 1h run. 
(latency ok) >> >> Biggest difference: >> >> before restart >> - >> "bluestore_cache_other": { >> "items": 48661920, >> "bytes": 1539544228 >> }, >> "bluestore_cache_data": { >> "items": 54, >> "bytes": 643072 >> }, >> (other caches seem to be quite low too, like bluestore_cache_other take all the memory) >> >> >> After restart >> - >> "bluestore_cache_other": { >> "items": 12432298, >> "bytes": 500834899 >> }, >> "bluestore_cache_data": { >> "items": 40084, >> "bytes": 1056235520 >> }, >> > This is fine as cache is warming after restart and some rebalancing > between data and metadata might occur. > > What relates to allocator and most probably to fragmentation growth is : > > "bluestore_alloc": { > "items": 165053952, > "bytes": 165053952 > }, > > which had been higher before the reset (if I got these dumps' order > properly) > > "bluestore_alloc": { > "items": 210243456, > "bytes": 210243456 > }, > > But as I mentioned - I'm not 100% sure this might cause such a huge > latency increase... > > Do you have perf counters dump after the restart? > > Could you collect some more dumps - for both mempool and perf counters? > > So ideally I'd like to have: > > 1) mempool/perf counters dumps after the restart (1hour is OK) > > 2) mempool/perf counters dumps in 24+ hours after restart > > 3) reset perf counters
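Igor's first-hour vs last-hour averages come from `ceph daemon osd.N perf dump` snapshots: each latency counter carries an "avgcount" and a "sum" (in seconds), so the average over an interval is the delta of the sums divided by the delta of the counts. A sketch of that arithmetic, with made-up snapshot values:

```python
def interval_latency(earlier, later):
    # Each snapshot is {"avgcount": N, "sum": seconds}, as in the
    # perf dump JSON.  The average latency between two snapshots is
    # (sum2 - sum1) / (avgcount2 - avgcount1).
    dc = later["avgcount"] - earlier["avgcount"]
    ds = later["sum"] - earlier["sum"]
    return ds / dc if dc else 0.0

t0 = {"avgcount": 1000, "sum": 2.0}    # e.g. op_w_latency at counter reset
t1 = {"avgcount": 6000, "sum": 42.0}   # one hour later
print("%.6f" % interval_latency(t0, t1))  # 0.008000 seconds per write op
```

Resetting the counters (as Igor asks for in step 3) makes the first snapshot effectively zero, which is why post-reset dumps isolate the most recent hour.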
Re: [ceph-users] single OSDs cause cluster hickups
Yeah. I've been monitoring such issue reports for a while and it looks like something is definitely wrong with response times under certain circumstances. Not sure if all these reports have the same root cause though. Scrubbing seems to be one of the triggers. Perhaps we need more low-level detection/warning for high response times from HW and/or DB. Planning to look shortly at how feasible such warning means are. Thanks, Igor On 2/15/2019 3:24 PM, Denny Kreische wrote: Hi Igor, Thanks for your reply. I can verify, discard is disabled in our cluster: 10:03 root@node106b [fra]:~# ceph daemon osd.417 config show | grep discard "bdev_async_discard": "false", "bdev_enable_discard": "false", [...] So there must be something else causing the problems. Thanks, Denny On 15.02.2019 at 12:41, Igor Fedotov wrote: Hi Denny, I do not remember exactly when discards appeared in BlueStore, but they are disabled by default: see the bdev_enable_discard option. Thanks, Igor On 2/15/2019 2:12 PM, Denny Kreische wrote: Hi, two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to mimic 13.2.4; the cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs. Somehow we have seen strange behaviour since then. Single OSDs seem to block for around 5 minutes and this causes the whole cluster and connected applications to hang. This happened 5 times during the last 10 days at irregular times; it didn't happen before the upgrade. The OSD log shows something like this (more log here: https://pastebin.com/6BYam5r4): [...] 2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516) 2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516) [...] In this example osd.417 seems to have a problem. 
I can see the same log line in other OSD logs for placement groups related to osd.417. I assume that all placement groups related to osd.417 are hanging or blocked when osd.417 is blocked. How can I see in detail what might cause a certain OSD to stop working? The cluster consists of 3 different SSD vendors (Micron, Samsung, Intel), but only the Micron disks have been affected until now. We had problems earlier with Micron SSDs and filestore (xfs): it was fstrim that caused single OSDs to block for several minutes. We migrated to bluestore about a year ago. Just in case, is there any kind of SSD trim/discard happening in bluestore since mimic? Thanks, Denny
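To see which OSDs are implicated and how the slow-op counts grow, the get_health_metrics lines can be scraped out of the OSD logs; a small sketch, with the regex written against the sample lines above:

```python
import re

SLOW_RE = re.compile(r"osd\.(\d+) \d+ get_health_metrics reporting (\d+) slow ops")

def parse_slow_ops(line):
    # Returns (osd id, number of slow ops) or None for non-matching lines.
    m = SLOW_RE.search(line)
    return (int(m.group(1)), int(m.group(2))) if m else None

sample = ("2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 "
          "get_health_metrics reporting 7 slow ops, oldest is osd_op(...)")
print(parse_slow_ops(sample))  # (417, 7)
```

Grepping all OSD logs this way and grouping by timestamp makes it easy to confirm whether the other implicated OSDs merely hold PGs shared with the one that is actually blocked.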
Re: [ceph-users] ceph osd commit latency increase over time, until restart
Hi Alexander, I've read through your reports, nothing obvious so far. I can only see several times average latency increase for OSD write ops (in seconds) 0.002040060 (first hour) vs. 0.002483516 (last 24 hours) vs. 0.008382087 (last hour) subop_w_latency: 0.000478934 (first hour) vs. 0.000537956 (last 24 hours) vs. 0.003073475 (last hour) and OSD read ops, osd_r_latency: 0.000408595 (first hour) 0.000709031 (24 hours) 0.004979540 (last hour) What's interesting is that such latency differences aren't observed at neither BlueStore level (any _lat params under "bluestore" section) nor rocksdb one. Which probably means that the issue is rather somewhere above BlueStore. Suggest to proceed with perf dumps collection to see if the picture stays the same. W.r.t. memory usage you observed I see nothing suspicious so far - No decrease in RSS report is a known artifact that seems to be safe. Thanks, Igor On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: Hi Igor, Thanks again for helping ! I have upgrade to last mimic this weekend, and with new autotune memory, I have setup osd_memory_target to 8G. 
(my NVMes are 6TB) I have collected a lot of perf dumps, mempool dumps, and ps output of the process to see RSS memory at different hours; here are the reports for osd.0: http://odisoweb1.odiso.net/perfanalysis/ The OSD was started on 12-02-2019 at 08:00. First report after 1h running: http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt Report after 24h, before the counter reset: http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt Report 1h after the counter reset: http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt I'm seeing the bluestore buffer bytes memory increase up to 4G around 12-02-2019 at 14:00: http://odisoweb1.odiso.net/perfanalysis/graphs2.png Then, after that, it slowly decreases. Another strange thing: I'm seeing total bytes at 5G at 12-02-2018.13:30 http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt It then decreases over time (around 3.7G this morning), but RSS is still at 8G. I've also been graphing the mempool counters since yesterday, so I'll be able to track them over time. - Original Message - From: "Igor Fedotov" To: "Alexandre Derumier" Cc: "Sage Weil" , "ceph-users" , "ceph-devel" Sent: Monday 11 February 2019 12:03:17 Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: another mempool dump after 1h run. 
(latency ok) Biggest difference: before restart - "bluestore_cache_other": { "items": 48661920, "bytes": 1539544228 }, "bluestore_cache_data": { "items": 54, "bytes": 643072 }, (other caches seem to be quite low too, like bluestore_cache_other take all the memory) After restart - "bluestore_cache_other": { "items": 12432298, "bytes": 500834899 }, "bluestore_cache_data": { "items": 40084, "bytes": 1056235520 }, This is fine as cache is warming after restart and some rebalancing between data and metadata might occur. What relates to allocator and most probably to fragmentation growth is : "bluestore_alloc": { "items": 165053952, "bytes": 165053952 }, which had been higher before the reset (if I got these dumps' order properly) "bluestore_alloc": { "items": 210243456, "bytes": 210243456 }, But as I mentioned - I'm not 100% sure this might cause such a huge latency increase... Do you have perf counters dump after the restart? Could you collect some more dumps - for both mempool and perf counters? So ideally I'd like to have: 1) mempool/perf counters dumps after the restart (1hour is OK) 2) mempool/perf counters dumps in 24+ hours after restart 3) reset perf counters after 2), wait for 1 hour (and without OSD restart) and dump mempool/perf counters again. So we'll be able to learn both allocator mem usage growth and operation latency distribution for the following periods: a) 1st hour after restart b) 25th hour. Thanks, Igor full mempool dump after restart --- { "mempool": { "by_pool": { "bloom_filter": { "items": 0, "bytes": 0 }, "bluestore_alloc": { "items": 165053952, "bytes": 165053952 }, "bluestore_cache_data": { "items": 40084, "bytes": 1056235520 }, "bluestore_cache_onode": { "items": 5, "bytes": 14935200 }, "bluestore_cache_other": { "items": 12432298, "bytes": 500834899 }, "bluestore_fsck": { "items": 0, "bytes": 0 }, "bluestore_txc": { "items": 11, "bytes": 8184 },
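Comparing Igor's two figures amounts to diffing `bluestore_alloc` between successive `dump_mempools` outputs. Below is a minimal sketch of that comparison, assuming the dumps are available as parsed JSON dicts in the `{"mempool": {"by_pool": ...}}` shape shown above; the helper names are made up for illustration, this is not an existing Ceph tool.

```python
def pool_bytes(dump, pool):
    """Pull the byte count for one pool out of a dump_mempools-style dict."""
    return dump["mempool"]["by_pool"][pool]["bytes"]

def compare_dumps(before, after, pool="bluestore_alloc"):
    """Return (before_bytes, after_bytes, delta) for the given mempool."""
    b = pool_bytes(before, pool)
    a = pool_bytes(after, pool)
    return b, a, a - b

# Figures from the thread: bluestore_alloc before vs. after the OSD restart.
# (A real script would json.load() the saved dump files instead.)
before = {"mempool": {"by_pool": {"bluestore_alloc": {"items": 210243456, "bytes": 210243456}}}}
after = {"mempool": {"by_pool": {"bluestore_alloc": {"items": 165053952, "bytes": 165053952}}}}

print(compare_dumps(before, after))  # (210243456, 165053952, -45189504)
```

Running the same diff over the other pools (bluestore_cache_data, bluestore_cache_other, ...) gives the cache-rebalancing picture Igor describes.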
Re: [ceph-users] single OSDs cause cluster hiccups
Hi Igor, Thanks for your reply. I can verify that discard is disabled in our cluster: 10:03 root@node106b [fra]:~# ceph daemon osd.417 config show | grep discard "bdev_async_discard": "false", "bdev_enable_discard": "false", [...] So there must be something else causing the problems. Thanks, Denny > On 15.02.2019 at 12:41, Igor Fedotov wrote: > > Hi Denny, > > Do not remember exactly when discards appeared in BlueStore but they are > disabled by default: > > See bdev_enable_discard option. > > > Thanks, > > Igor > > On 2/15/2019 2:12 PM, Denny Kreische wrote: >> Hi, >> >> two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to >> mimic 13.2.4, cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs. >> somehow we see strange behaviour since then. Single OSDs seem to block for >> around 5 minutes and this causes the whole cluster and connected >> applications to hang. This happened 5 times during the last 10 days at >> irregular times, it didn't happen before the upgrade. >> >> OSD log shows something like this (more log here: >> https://pastebin.com/6BYam5r4): >> >> [...] >> 2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics >> reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff >> 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516) >> 2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics >> reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff >> 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516) >> [...] >> >> In this example osd.417 seems to have a problem. I can see same log line in >> other osd logs with placement groups related to osd.417. >> I assume that all placement groups related to osd.417 are hanging or blocked >> when osd.417 is blocked. >> >> How can I see in detail what might cause a certain OSD to stop working? 
>> >> The cluster consists of 3 different SSD vendors (micron, samsung, intel), >> but only micron disks are affected until now. we earlier had problems with >> micron SSDs with filestore (xfs), it was fstrim to cause single OSDs to >> block for several minutes. we migrated to bluestore about a year ago. just >> in case, is there any kind of ssd trim/discard happening in bluestore since >> mimic? >> >> Thanks, >> Denny >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Denny Kreische IT System Ingenieur und Consultant Am Teichdamm 20 04680 Colditz Telefon: 034381 55125 Mobil: 0176 2115 1457 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph
Use a SCSI disk and a virtio adapter? I think that is also recommended for use with Ceph RBD. -Original Message- From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] Sent: 15 February 2019 13:16 To: Marc Roos Cc: ceph-users Subject: Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph HI Marc, i tried this and the problem continue :-( On Fri, 15 Feb 2019 at 10:04, Marc Roos wrote: And then in the windows vm cmd diskpart Rescan Linux vm echo 1 > /sys/class/scsi_device/2\:0\:0\:0/device/rescan (sda) echo 1 > /sys/class/scsi_device/2\:0\:3\:0/device/rescan (sdd) I have this to, have to do this to: virsh qemu-monitor-command vps-test2 --hmp "info block" virsh qemu-monitor-command vps-test2 --hmp "block_resize drive-scsi0-0-0-0 12G" -Original Message- From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] Sent: 15 February 2019 12:59 To: ceph-users@lists.ceph.com Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph Hi, I'm making a environment for VMs with qemu/kvm and Ceph using RBD, and I'm with the follow problem: The guest VM not recognizes disk resize (increase). The cenario is: Host: Centos 7.6 Libvirt 4.5 Ceph 13.2.4 I follow the following steps to increase the disk (ex: disk 10Gb to 20Gb): # rbd resize --size 20480 mypool/vm_test # virsh blockresize --domain vm_test --path vda --size 20G But after this steps, the disk in VM continue with original size. For apply the change, is necessary reboot VM. If I use local datastore instead Ceph, the VM recognize new size imediatally. Does anyone have this? Is this expected? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
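A detail worth double-checking in the exchange above: `rbd resize --size` takes megabytes by default, while the `virsh blockresize` example uses a G suffix, so the two invocations must agree (20480 MB == 20G). A tiny hypothetical helper to keep the two in sync, assuming binary units (1G = 1024M):

```python
def rbd_mb_to_virsh_size(size_mb: int) -> str:
    """Render an `rbd resize --size` value (MiB) as a virsh-style size string."""
    if size_mb % 1024 == 0:
        return f"{size_mb // 1024}G"
    return f"{size_mb}M"

# The thread's example: rbd resize --size 20480  <->  virsh blockresize --size 20G
print(rbd_mb_to_virsh_size(20480))  # 20G
```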
Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph
Hi Marc, I tried this and the problem continues :-( On Fri, 15 Feb 2019 at 10:04, Marc Roos wrote: > > > And then in the windows vm > cmd > diskpart > Rescan > > Linux vm > echo 1 > /sys/class/scsi_device/2\:0\:0\:0/device/rescan (sda) > echo 1 > /sys/class/scsi_device/2\:0\:3\:0/device/rescan (sdd) > > > > I have this to, have to do this to: > > virsh qemu-monitor-command vps-test2 --hmp "info block" > virsh qemu-monitor-command vps-test2 --hmp "block_resize > drive-scsi0-0-0-0 12G" > > > > > > -Original Message- > From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] > Sent: 15 February 2019 12:59 > To: ceph-users@lists.ceph.com > Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph > > Hi, > > I'm making a environment for VMs with qemu/kvm and Ceph using RBD, and > I'm with the follow problem: The guest VM not recognizes disk resize > (increase). The cenario is: > > Host: > Centos 7.6 > Libvirt 4.5 > Ceph 13.2.4 > > I follow the following steps to increase the disk (ex: disk 10Gb to > 20Gb): > > > # rbd resize --size 20480 mypool/vm_test # virsh blockresize --domain > vm_test --path vda --size 20G > > But after this steps, the disk in VM continue with original size. For > apply the change, is necessary reboot VM. > If I use local datastore instead Ceph, the VM recognize new size > imediatally. > > Does anyone have this? Is this expected? > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph
And then in the Windows VM: cmd, diskpart, Rescan. In a Linux VM: echo 1 > /sys/class/scsi_device/2\:0\:0\:0/device/rescan (sda) echo 1 > /sys/class/scsi_device/2\:0\:3\:0/device/rescan (sdd) I have this too; I have to do this too: virsh qemu-monitor-command vps-test2 --hmp "info block" virsh qemu-monitor-command vps-test2 --hmp "block_resize drive-scsi0-0-0-0 12G" -Original Message- From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] Sent: 15 February 2019 12:59 To: ceph-users@lists.ceph.com Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph Hi, I'm making a environment for VMs with qemu/kvm and Ceph using RBD, and I'm with the follow problem: The guest VM not recognizes disk resize (increase). The cenario is: Host: Centos 7.6 Libvirt 4.5 Ceph 13.2.4 I follow the following steps to increase the disk (ex: disk 10Gb to 20Gb): # rbd resize --size 20480 mypool/vm_test # virsh blockresize --domain vm_test --path vda --size 20G But after this steps, the disk in VM continue with original size. For apply the change, is necessary reboot VM. If I use local datastore instead Ceph, the VM recognize new size imediatally. Does anyone have this? Is this expected? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph
I have this too; I have to do this too: virsh qemu-monitor-command vps-test2 --hmp "info block" virsh qemu-monitor-command vps-test2 --hmp "block_resize drive-scsi0-0-0-0 12G" -Original Message- From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] Sent: 15 February 2019 12:59 To: ceph-users@lists.ceph.com Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph Hi, I'm making a environment for VMs with qemu/kvm and Ceph using RBD, and I'm with the follow problem: The guest VM not recognizes disk resize (increase). The cenario is: Host: Centos 7.6 Libvirt 4.5 Ceph 13.2.4 I follow the following steps to increase the disk (ex: disk 10Gb to 20Gb): # rbd resize --size 20480 mypool/vm_test # virsh blockresize --domain vm_test --path vda --size 20G But after this steps, the disk in VM continue with original size. For apply the change, is necessary reboot VM. If I use local datastore instead Ceph, the VM recognize new size imediatally. Does anyone have this? Is this expected? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Online disk resize with Qemu/KVM and Ceph
Hi, I'm building an environment for VMs with qemu/kvm and Ceph using RBD, and I have the following problem: the guest VM does not recognize a disk resize (increase). The scenario is: Host: Centos 7.6 Libvirt 4.5 Ceph 13.2.4 I followed these steps to increase the disk (e.g. from 10GB to 20GB): # rbd resize --size 20480 mypool/vm_test # virsh blockresize --domain vm_test --path vda --size 20G But after these steps, the disk in the VM keeps its original size; to apply the change, it is necessary to reboot the VM. If I use a local datastore instead of Ceph, the VM recognizes the new size immediately. Has anyone seen this? Is this expected? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] single OSDs cause cluster hiccups
Hi Denny, Do not remember exactly when discards appeared in BlueStore but they are disabled by default: See bdev_enable_discard option. Thanks, Igor On 2/15/2019 2:12 PM, Denny Kreische wrote: Hi, two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to mimic 13.2.4, cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs. somehow we see strange behaviour since then. Single OSDs seem to block for around 5 minutes and this causes the whole cluster and connected applications to hang. This happened 5 times during the last 10 days at irregular times, it didn't happen before the upgrade. OSD log shows something like this (more log here: https://pastebin.com/6BYam5r4): [...] 2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516) 2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516) [...] In this example osd.417 seems to have a problem. I can see same log line in other osd logs with placement groups related to osd.417. I assume that all placement groups related to osd.417 are hanging or blocked when osd.417 is blocked. How can I see in detail what might cause a certain OSD to stop working? The cluster consists of 3 different SSD vendors (micron, samsung, intel), but only micron disks are affected until now. we earlier had problems with micron SSDs with filestore (xfs), it was fstrim to cause single OSDs to block for several minutes. we migrated to bluestore about a year ago. just in case, is there any kind of ssd trim/discard happening in bluestore since mimic? 
Thanks, Denny ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] single OSDs cause cluster hiccups
Hi, Two weeks ago we upgraded one of our Ceph clusters from Luminous 12.2.8 to Mimic 13.2.4; the cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs. Somehow we have seen strange behaviour since then: single OSDs seem to block for around 5 minutes, and this causes the whole cluster and connected applications to hang. This has happened 5 times during the last 10 days at irregular times; it didn't happen before the upgrade. The OSD log shows something like this (more log here: https://pastebin.com/6BYam5r4): [...] 2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516) 2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516) [...] In this example osd.417 seems to have a problem, and I can see the same log line in other OSD logs for placement groups related to osd.417. I assume that all placement groups related to osd.417 are hanging or blocked while osd.417 is blocked. How can I see in detail what might cause a certain OSD to stop working? The cluster consists of SSDs from 3 different vendors (Micron, Samsung, Intel), but only the Micron disks have been affected so far. We had problems with Micron SSDs earlier with filestore (xfs): it was fstrim that caused single OSDs to block for several minutes. We migrated to bluestore about a year ago. Just in case, is there any kind of SSD trim/discard happening in bluestore since Mimic? Thanks, Denny ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
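For triage, the `get_health_metrics` lines above can be tallied per OSD to see which OSDs report slow ops and how high the counts climb. A rough sketch, assuming log lines in exactly the quoted format (the regex and function are illustrative, not an existing Ceph tool):

```python
import re

# Matches lines like:
# "... osd.417 340516 get_health_metrics reporting 3 slow ops, oldest is osd_op(...)"
SLOW_RE = re.compile(r"osd\.(\d+) \d+ get_health_metrics reporting (\d+) slow ops")

def max_slow_ops(lines):
    """Return {osd_id: highest slow-op count seen} for a list of log lines."""
    worst = {}
    for line in lines:
        m = SLOW_RE.search(line)
        if m:
            osd, n = int(m.group(1)), int(m.group(2))
            worst[osd] = max(worst.get(osd, 0), n)
    return worst

log = [
    "2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)",
    "2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)",
]
print(max_slow_ops(log))  # {417: 7}
```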
Re: [ceph-users] Ceph Nautilus Release T-shirt Design
On Fri, Feb 15, 2019 at 12:01 PM Willem Jan Withagen wrote: > > On 15/02/2019 11:56, Dan van der Ster wrote: > > On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen > > wrote: > >> > >> On 15/02/2019 10:39, Ilya Dryomov wrote: > >>> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: > > Hi Marc, > > You can see previous designs on the Ceph store: > > https://www.proforma.com/sdscommunitystore > >>> > >>> Hi Mike, > >>> > >>> This site stopped working during DevConf and hasn't been working since. > >>> I think Greg has contacted some folks about this, but it would be great > >>> if you could follow up because it's been a couple of weeks now... > >> > >> Ilya, > >> > >> The site is working for me. > >> It only does not contain the Nautilus shirts (yet) > > > > I found in the past that the http redirection for www.proforma.com > > doesn't work from over here in Europe. > > If someone can post the redirection target then we can access it directly. > > Like: > > https://proformaprostores.com/Category > > > at least, that is where I get directed to. Exactly! That URL works here at CERN... www.proforma.com is stuck forever. -- dan > > --WjW > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Nautilus Release T-shirt Design
On 15/02/2019 11:56, Dan van der Ster wrote: On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen wrote: On 15/02/2019 10:39, Ilya Dryomov wrote: On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: Hi Marc, You can see previous designs on the Ceph store: https://www.proforma.com/sdscommunitystore Hi Mike, This site stopped working during DevConf and hasn't been working since. I think Greg has contacted some folks about this, but it would be great if you could follow up because it's been a couple of weeks now... Ilya, The site is working for me. It only does not contain the Nautilus shirts (yet) I found in the past that the http redirection for www.proforma.com doesn't work from over here in Europe. If someone can post the redirection target then we can access it directly. Like: https://proformaprostores.com/Category at least, that is where I get directed to. --WjW ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Nautilus Release T-shirt Design
I have no issues opening that site from Germany. Quoting Dan van der Ster: On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen wrote: On 15/02/2019 10:39, Ilya Dryomov wrote: > On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: >> >> Hi Marc, >> >> You can see previous designs on the Ceph store: >> >> https://www.proforma.com/sdscommunitystore > > Hi Mike, > > This site stopped working during DevConf and hasn't been working since. > I think Greg has contacted some folks about this, but it would be great > if you could follow up because it's been a couple of weeks now... Ilya, The site is working for me. It only does not contain the Nautilus shirts (yet) I found in the past that the http redirection for www.proforma.com doesn't work from over here in Europe. If someone can post the redirection target then we can access it directly. -- dan --WjW ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Nautilus Release T-shirt Design
On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen wrote: > > On 15/02/2019 10:39, Ilya Dryomov wrote: > > On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: > >> > >> Hi Marc, > >> > >> You can see previous designs on the Ceph store: > >> > >> https://www.proforma.com/sdscommunitystore > > > > Hi Mike, > > > > This site stopped working during DevConf and hasn't been working since. > > I think Greg has contacted some folks about this, but it would be great > > if you could follow up because it's been a couple of weeks now... > > Ilya, > > The site is working for me. > It only does not contain the Nautilus shirts (yet) I found in the past that the http redirection for www.proforma.com doesn't work from over here in Europe. If someone can post the redirection target then we can access it directly. -- dan > > --WjW > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Nautilus Release T-shirt Design
On 15/02/2019 10:39, Ilya Dryomov wrote: On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: Hi Marc, You can see previous designs on the Ceph store: https://www.proforma.com/sdscommunitystore Hi Mike, This site stopped working during DevConf and hasn't been working since. I think Greg has contacted some folks about this, but it would be great if you could follow up because it's been a couple of weeks now... Ilya, The site is working for me. It just doesn't contain the Nautilus shirts (yet). --WjW ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph mon_data_size_warn limits for large cluster
Today I hit the warning again, this time at 30G... On Thu, Feb 14, 2019 at 7:39 PM Sage Weil wrote: > > On Thu, 7 Feb 2019, Dan van der Ster wrote: > > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy > > wrote: > > > > > > Hi Dan, > > > >During backfilling scenarios, the mons keep old maps and grow quite > > > >quickly. So if you have balancing, pg splitting, etc. ongoing for > > > >awhile, the mon stores will eventually trigger that 15GB alarm. > > > >But the intended behavior is that once the PGs are all active+clean, > > > >the old maps should be trimmed and the disk space freed. > > > > > > old maps not trimmed after cluster reached to "all+clean" state for all > > > PGs. > > > Is there (known) bug here? > > > As the size of dB showing > 15G, do I need to run the compact commands > > > to do the trimming? > > > > Compaction isn't necessary -- you should only need to restart all > > peon's then the leader. A few minutes later the db's should start > > trimming. > > The next time someone sees this behavior, can you please > > - enable debug_mon = 20 on all mons (*before* restarting) >ceph tell mon.* injectargs '--debug-mon 20' > - wait for 10 minutes or so to generate some logs > - add 'debug mon = 20' to ceph.conf (on mons only) > - restart the monitors > - wait for them to start trimming > - remove 'debug mon = 20' from ceph.conf (on mons only) > - tar up the log files, ceph-post-file them, and share them with ticket > http://tracker.ceph.com/issues/38322 > > Thanks! > sage > > > > > > -- dan > > > > > > > > > > Thanks > > > Swami > > > > > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster > > > wrote: > > > > > > > > Hi, > > > > > > > > With HEALTH_OK a mon data dir should be under 2GB for even such a large > > > > cluster. > > > > > > > > During backfilling scenarios, the mons keep old maps and grow quite > > > > quickly. So if you have balancing, pg splitting, etc. ongoing for > > > > awhile, the mon stores will eventually trigger that 15GB alarm. 
> > > > But the intended behavior is that once the PGs are all active+clean, > > > > the old maps should be trimmed and the disk space freed. > > > > > > > > However, several people have noted that (at least in luminous > > > > releases) the old maps are not trimmed until after HEALTH_OK *and* all > > > > mons are restarted. This ticket seems related: > > > > http://tracker.ceph.com/issues/37875 > > > > > > > > (Over here we're restarting mons every ~2-3 weeks, resulting in the > > > > mon stores dropping from >15GB to ~700MB each time). > > > > > > > > -- Dan > > > > > > > > > > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil wrote: > > > > > > > > > > Hi Swami > > > > > > > > > > The limit is somewhat arbitrary, based on cluster sizes we had seen > > > > > when > > > > > we picked it. In your case it should be perfectly safe to increase > > > > > it. > > > > > > > > > > sage > > > > > > > > > > > > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote: > > > > > > > > > > > Hello - Are the any limits for mon_data_size for cluster with 2PB > > > > > > (with 2000+ OSDs)? > > > > > > > > > > > > Currently it set as 15G. What is logic behind this? Can we increase > > > > > > when we get the mon_data_size_warn messages? > > > > > > > > > > > > I am getting the mon_data_size_warn message even though there a > > > > > > ample > > > > > > of free space on the disk (around 300G free disk) > > > > > > > > > > > > Earlier thread on the same discusion: > > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html > > > > > > > > > > > > Thanks > > > > > > Swami > > > > > > > > > > > > > > > > > > > > > > > ___ > > > > > ceph-users mailing list > > > > > ceph-users@lists.ceph.com > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
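The trimming behaviour described in this thread reduces to a simple condition: old osdmaps only become trimmable once every PG is active+clean (and, per the luminous bug mentioned above, apparently only after a mon restart on top of that). As a toy illustration of the intended policy only, not the actual monitor code:

```python
def can_trim_osdmaps(pg_states):
    """Old osdmaps should be trimmable only once all PGs are active+clean
    (illustration of the policy described in the thread)."""
    return all(s == "active+clean" for s in pg_states)

print(can_trim_osdmaps(["active+clean", "active+clean"]))                   # True
print(can_trim_osdmaps(["active+clean", "active+remapped+backfill_wait"]))  # False
```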
Re: [ceph-users] Ceph Nautilus Release T-shirt Design
On Fri, Feb 15, 2019 at 12:05 AM Mike Perez wrote: > > Hi Marc, > > You can see previous designs on the Ceph store: > > https://www.proforma.com/sdscommunitystore Hi Mike, This site stopped working during DevConf and hasn't been working since. I think Greg has contacted some folks about this, but it would be great if you could follow up because it's been a couple of weeks now... Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fwd: NAS solution for CephFS
On Fri, 2019-02-15 at 15:34 +0800, Marvin Zhang wrote: > Thanks Jeff. > If I set Attr_Expiration_Time as zero in conf , deos it mean timeout > is zero? If so, every client will see the change immediately. Will it > decrease the performance hardly? > I seems that GlusterFS FSAL use UPCALL to invalidate the cache. How > about the CephFS FSAL? > We mostly suggest ganesha's attribute cache be disabled when exporting FSAL_CEPH. libcephfs caches attributes too, and it knows the status of those attributes better than ganesha can. A call into libcephfs from ganesha to retrieve cached attributes is mostly just in-memory copies within the same process, so any performance overhead there is pretty minimal. If we need to go to the network to get the attributes, then that was a case where the cache should have been invalidated anyway, and we avoid having to check the validity of the cache. > On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton wrote: > > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote: > > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40 > > > Will Client query 'change' attribute every time before reading to know > > > if the data has been changed? 
> > > > > > +-+++-+---+ > > > | Name| ID | Data Type | Acc | Defined in| > > > +-+++-+---+ > > > | supported_attrs | 0 | bitmap4| R | Section 5.8.1.1 | > > > | type| 1 | nfs_ftype4 | R | Section 5.8.1.2 | > > > | fh_expire_type | 2 | uint32_t | R | Section 5.8.1.3 | > > > | change | 3 | changeid4 | R | Section 5.8.1.4 | > > > | size| 4 | uint64_t | R W | Section 5.8.1.5 | > > > | link_support| 5 | bool | R | Section 5.8.1.6 | > > > | symlink_support | 6 | bool | R | Section 5.8.1.7 | > > > | named_attr | 7 | bool | R | Section 5.8.1.8 | > > > | fsid| 8 | fsid4 | R | Section 5.8.1.9 | > > > | unique_handles | 9 | bool | R | Section 5.8.1.10 | > > > | lease_time | 10 | nfs_lease4 | R | Section 5.8.1.11 | > > > | rdattr_error| 11 | nfsstat4 | R | Section 5.8.1.12 | > > > | filehandle | 19 | nfs_fh4| R | Section 5.8.1.13 | > > > +-+++-+---+ > > > > > > > Not every time -- only when the cache needs revalidation. > > > > In the absence of a delegation, that happens on a timeout (see the > > acregmin/acregmax settings in nfs(5)), though things like opens and file > > locking events also affect when the client revalidates. > > > > When the v4 client does revalidate the cache, it relies heavily on NFSv4 > > change attribute. Cephfs's change attribute is cluster-coherent too, so > > if the client does revalidate it should see changes made on other > > servers. > > > > > On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton > > > wrote: > > > > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote: > > > > > Hi Jeff, > > > > > Another question is about Client Caching when disabling delegation. > > > > > I set breakpoint on nfs4_op_read, which is OP_READ process function in > > > > > nfs-ganesha. Then I read a file, I found that it will hit only once on > > > > > the first time, which means latter reading operation on this file will > > > > > not trigger OP_READ. It will read the data from client side cache. Is > > > > > it right? > > > > > > > > Yes. 
In the absence of a delegation, the client will periodically query > > > > for the inode attributes, and will serve reads from the cache if it > > > > looks like the file hasn't changed. > > > > > > > > > I also checked the nfs client code in linux kernel. Only > > > > > cache_validity is NFS_INO_INVALID_DATA, it will send OP_READ again, > > > > > like this: > > > > > if (nfsi->cache_validity & NFS_INO_INVALID_DATA) { > > > > > ret = nfs_invalidate_mapping(inode, mapping); > > > > > } > > > > > This about this senario, client1 connect ganesha1 and client2 connect > > > > > ganesha2. I read /1.txt on client1 and client1 will cache the data. > > > > > Then I modify this file on client2. At that time, how client1 know the > > > > > file is modifed and how it will add NFS_INO_INVALID_DATA into > > > > > cache_validity? > > > > > > > > Once you modify the code on client2, ganesha2 will request the necessary > > > > caps from the ceph MDS, and client1 will have its caps revoked. It'll > > > > then make the change. > > > > > > > > When client1 reads again it will issue a GETATTR against the file [1]. > > > > ganesha1 will then request caps to do the getattr, which will end up > > > > revoking ganesha2's caps. client1 will then see the change in attributes > > > > (the change attribute and mtime, most likely) and will invalidate the > > > > mapping,
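For reference, the suggestion earlier in this thread to disable ganesha's attribute caching when exporting FSAL_CEPH would look roughly like the fragment below. This is only a sketch built around the Attr_Expiration_Time setting Marvin mentions; the export ID and paths are placeholders, and the exact option placement should be verified against the documentation for the ganesha version in use.

```
EXPORT
{
    # Placeholder export parameters.
    Export_ID = 100;
    Path = /;
    Pseudo = /cephfs;
    Protocols = 4;
    Transports = TCP;

    # Let libcephfs own attribute caching, as discussed above.
    Attr_Expiration_Time = 0;

    FSAL {
        Name = CEPH;
    }
}
```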