[ceph-users] Mimic Bluestore memory optimization
Hello Ceph! I am tracking down a performance issue with some of our Mimic 13.2.4 OSDs. It feels like a lack of memory, but I have no hard proof of the issue. I have used the memory profiler (the pprof tool) and the OSDs are staying within their 4GB allocated limit. My questions are:

1. How do you know whether the allocated memory is enough for an OSD? My 1TB disks and 12TB disks use the same amount of memory, and I wonder whether OSDs should have memory allocated in proportion to disk size?
2. In the past, SSDs needed three times the memory of HDDs and now they don't - why is that? (1GB RAM per HDD and 3GB RAM per SSD both became 4GB.)
3. I have read that the number of placement groups per OSD is a significant factor in memory usage. I generally have ~200 placement groups per OSD; this is at the higher end of the recommended range and I wonder if it is causing high memory usage?

For reference, the hosts are 1 x 6-core CPU, 72GB RAM, 14 OSDs, 2 x 10Gbit NICs; LSI CacheCade / writeback cache for the HDDs and LSI JBOD for the SSDs. There are 9 hosts in this cluster.

Kind regards,
Glen Baars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
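The 4GB figure discussed above corresponds to the osd_memory_target option that Mimic's BlueStore cache autotuner works against. It is a flat per-daemon target, independent of disk size, which matches the observation that 1TB and 12TB OSDs consume the same amount. A minimal ceph.conf sketch for raising it on memory-rich hosts (the value below is purely illustrative for a 72GB / 14-OSD host, not tuned advice):

```ini
# ceph.conf fragment - per-OSD memory autotuning (Mimic 13.2.x and later).
# 72 GB across 14 OSDs leaves roughly 5 GB per daemon; the target below
# is an example figure chosen to leave headroom for the OS.
[osd]
osd_memory_target = 4831838208       ; ~4.5 GiB soft target per OSD daemon
bluestore_cache_autotune = true      ; let BlueStore size its caches to fit the target
```

Note this is a soft target for the autotuner, not a hard limit; actual RSS can temporarily exceed it during recovery.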
[ceph-users] block.db linking to 2 disks
After a reboot of a node, I have one particular OSD that won't start (latest Mimic). When I run "ls -lsh" in /var/lib/ceph/osd/ceph-8 I get:

0 lrwxrwxrwx 1 root root 19 Feb 25 02:09 block.db -> '/dev/sda5 /dev/sdc5'

For some reason it is trying to link block.db to two disks. If I remove the block.db link and manually create the correct one, the OSD still fails to start because block.db is owned by root:root. If I run a chown it just reverts to root:root, and the following shows in the OSD log:

2019-02-25 02:03:21.738 7f574b2a1240 -1 bluestore(/var/lib/ceph/osd/ceph-8) _open_db /var/lib/ceph/osd/ceph-8/block.db symlink exists but target unusable: (13) Permission denied
2019-02-25 02:03:21.738 7f574b2a1240  1 bdev(0x55dbf0a56700 /var/lib/ceph/osd/ceph-8/block) close
2019-02-25 02:03:22.034 7f574b2a1240 -1 osd.8 0 OSD:init: unable to mount object store
2019-02-25 02:03:22.034 7f574b2a1240 -1 ** ERROR: osd init failed: (13) Permission denied

Thanks
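For what it's worth, that ls output is what a single symlink looks like when its target string contains a space, i.e. one link whose target is the literal text "/dev/sda5 /dev/sdc5", not two links. A sketch reproducing and repairing it in a scratch directory (on the real node the directory would be /var/lib/ceph/osd/ceph-8, and /dev/sdc5 as the correct DB partition is an assumption):

```shell
# Sketch only: demonstrate the double-target symlink and its repair in a
# throwaway directory, so nothing touches real devices.
tmp=$(mktemp -d)
cd "$tmp"

# A symlink whose target contains a space reproduces the ls output above.
ln -s '/dev/sda5 /dev/sdc5' block.db
readlink block.db            # prints: /dev/sda5 /dev/sdc5

# Recreate it atomically (-f replace, -n don't follow) with one target.
ln -sfn /dev/sdc5 block.db
readlink block.db            # prints: /dev/sdc5
```

As for the ownership reverting: a plain chown on the symlink path follows it to the device node, and device nodes are recreated at boot, so the usual persistent fix is to chown the partition itself to ceph:ceph via a udev rule rather than chown'ing the link.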
Re: [ceph-users] Experiences with the Samsung SM/PM883 disk?
That sounds more like the result I expected; maybe there's something wrong with my disk or server (other disks perform fine, though).

Paul

On Fri, Feb 22, 2019 at 8:25 PM Jacob DeGlopper wrote:
>
> What are you connecting it to? We just got the exact same drive for
> testing, and I'm seeing much higher performance, connected to a
> motherboard 6 Gb SATA port on a Supermicro X9 board.
>
> [root@centos7 jacob]# smartctl -a /dev/sda
>
> Device Model:     Samsung SSD 883 DCT 960GB
> Firmware Version: HXT7104Q
> SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
>
> [root@centos7 jacob]# fio --filename=/dev/sda --direct=1 --sync=1
> --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
> --group_reporting --name=journal-test
>
> write: IOPS=15.9k, BW=62.1MiB/s (65.1MB/s)(3728MiB/60001msec)
>
> 8 processes:
>
> write: IOPS=58.1k, BW=227MiB/s (238MB/s)(13.3GiB/60003msec)
>
> On 2/22/19 8:47 AM, Paul Emmerich wrote:
> > Hi,
> >
> > it looks like the beloved Samsung SM/PM863a is no longer available and
> > the replacement is the new SM/PM883.
> >
> > We got a 960GB PM883 (MZ7LH960HAJR-5) here and I ran the usual
> > fio benchmark... and got horrible results :(
> >
> > fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k
> > --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
> > --name=journal-test
> >
> > 1 thread   - 1150 iops
> > 4 threads  - 2305 iops
> > 8 threads  - 4200 iops
> > 16 threads - 7230 iops
> >
> > Now that's a factor of 15 or so slower than the PM863a.
> >
> > Someone here reports better results with an 883:
> > https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >
> > Maybe there's a difference in performance between the SM and PM
> > variants of these new disks? (This wasn't the case for the 863a.)
> >
> > Does anyone else have these new 883 disks yet?
> > Any experience reports?
> >
> > Paul
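At --iodepth=1 --numjobs=1, the fio result is just the reciprocal of the mean per-write sync latency, which makes the gap between the two reports concrete. A small sketch (not from the thread) using the numbers quoted above:

```python
def sync_write_latency_us(iops):
    """At iodepth=1 with a single job, each O_DSYNC write must complete
    before the next is issued, so mean latency is simply 1 / IOPS."""
    return 1e6 / iops

print(round(sync_write_latency_us(1150)))   # PM883 result above: ~870 us per write
print(round(sync_write_latency_us(15900)))  # 883 DCT on a direct SATA port: ~63 us
```

A ~14x latency difference on the same model of drive is the kind of gap that usually points at the path to the disk (RAID controller cache policy, expander, power-loss-protection handling) rather than the NAND itself.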
Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?
> Date: Fri, 22 Feb 2019 16:26:34 -0800
> From: solarflow99
>
> Aren't you undersized at only 30GB? I thought you should have 4% of your
> OSDs

The 4% guidance is new. Until relatively recently, the oft-suggested and default value was 1%.

> From: "Vitaliy Filippov"
>
> Numbers are easy to calculate from RocksDB parameters, however I also
> don't understand why it's 3 -> 30 -> 300...
>
> Default memtables are 256 MB, there are 4 of them, so L0 should be 1 GB,
> L1 should be 10 GB, and L2 should be 100 GB?

I'm very curious as well; one would think that in practice the size and usage of the OSD would be factors, as the docs imply. This is an area where we could really use more concrete guidance. Clusters using HDDs are often doing so for $/TB reasons, so economics and available drive slots constrain how much faster WAL+DB storage can be provisioned.
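Vitaliy's arithmetic can be written out directly. A sketch under the assumptions stated in the thread (256 MB memtables, 4 of them merged into L0, and RocksDB's default 10x per-level size multiplier); note that even the cumulative totals it produces (1 / 11 / 111 GB) don't line up with the oft-quoted 3 / 30 / 300 GB DB sizes, which is exactly the open question here:

```python
def level_capacities_gb(memtable_gb=0.25, memtables=4, multiplier=10, levels=3):
    """Per-level capacity under the thread's assumptions: L0 holds the
    merged memtables, and each deeper level is `multiplier` times larger."""
    l0 = memtable_gb * memtables
    return [l0 * multiplier ** i for i in range(levels)]

caps = level_capacities_gb()
print(caps)   # [1.0, 10.0, 100.0]

# A DB partition only avoids spillover to the slow device if it can hold
# every level up to some depth, i.e. the cumulative sums:
print([sum(caps[:i + 1]) for i in range(len(caps))])   # [1.0, 11.0, 111.0]
```

The practical consequence either way: a DB partition sized between two usable thresholds (e.g. the 30 GB mentioned above) buys nothing over the lower threshold, because the next whole level won't fit and lands on the slow device.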
Re: [ceph-users] Configuration about using nvme SSD
I've tried 4x OSD on fast SAS SSDs in a test setup with only 2 such drives in the cluster. It increased CPU consumption a lot, but total 4Kb random write iops (RBD) only went from ~11000 to ~22000. So it was a 2x increase, but at a huge cost.

> One thing that's worked for me to get more out of nvmes with Ceph is to
> create multiple partitions on the nvme with an osd on each partition.
> That way you get more osd processes and CPU per nvme device. I've heard
> of people using up to 4 partitions like this.

--
With best regards,
  Vitaliy Filippov
Re: [ceph-users] Usenix Vault 2019
Oh, you are so close, David, but I have to go to Tampa to a client site; otherwise I'd hop on a flight to Boston to say hi. Hope you are doing well. Are you going to Cephalocon in Barcelona?

--
Alex Gorbachev
Storcium

On Sun, Feb 24, 2019 at 10:40 AM David Turner wrote:
>
> There is a scheduled birds-of-a-feather for Ceph tomorrow night, but I also
> noticed that there are only trainings tomorrow. Unless you are paying more
> for those, you likely don't have much to do on Monday. That's the boat I'm
> in. Is anyone interested in getting together tomorrow in Boston during the
> training day?
[ceph-users] Usenix Vault 2019
There is a scheduled birds-of-a-feather for Ceph tomorrow night, but I also noticed that there are only trainings tomorrow. Unless you are paying more for those, you likely don't have much to do on Monday. That's the boat I'm in. Is anyone interested in getting together tomorrow in Boston during the training day?
Re: [ceph-users] Configuration about using nvme SSD
One thing that's worked for me to get more out of NVMes with Ceph is to create multiple partitions on the NVMe with an OSD on each partition. That way you get more OSD processes and CPU per NVMe device. I've heard of people using up to 4 partitions like this.

On Sun, Feb 24, 2019, 10:25 AM Vitaliy Filippov wrote:
> > We can get 513558 IOPS in 4K read per nvme by fio but only 45146 IOPS
> > per OSD by rados.
>
> Don't expect Ceph to fully utilize NVMes, it's software and it's slow :)
> Some colleagues tell me that SPDK works out of the box but almost doesn't
> increase performance, because the userland-kernel interaction isn't the
> bottleneck currently; it's the Ceph code itself. I also tried once, but I
> couldn't make it work. When I have some spare NVMes I'll make another
> attempt.
>
> So... try it and share your results here :) we're all interested.
>
> --
> With best regards,
>   Vitaliy Filippov
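On Mimic you don't even need to partition by hand for this: ceph-volume's batch mode can carve a device into several OSDs itself via LVM. A sketch, cannot be run outside a Ceph node, and the device name is an example:

```shell
# Sketch: put 4 OSDs on one NVMe so more OSD processes (and therefore
# more CPU) sit behind the single fast device. ceph-volume creates the
# LVs, bootstraps the OSDs, and registers them with the cluster.
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1

# Inspect what was created:
ceph-volume lvm list
```

The trade-off reported earlier in this thread applies: expect roughly 2x aggregate 4K write IOPS from the extra processes, at a considerable CPU cost per host.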
Re: [ceph-users] Configuration about using nvme SSD
> We can get 513558 IOPS in 4K read per nvme by fio but only 45146 IOPS
> per OSD by rados.

Don't expect Ceph to fully utilize NVMes, it's software and it's slow :) Some colleagues tell me that SPDK works out of the box but almost doesn't increase performance, because the userland-kernel interaction isn't the bottleneck currently; it's the Ceph code itself. I also tried once, but I couldn't make it work. When I have some spare NVMes I'll make another attempt.

So... try it and share your results here :) we're all interested.

--
With best regards,
  Vitaliy Filippov