[ceph-users] Mimic Bluestore memory optimization

2019-02-24 Thread Glen Baars
Hello Ceph!

I am tracking down a performance issue with some of our Mimic 13.2.4 OSDs. It 
feels like a lack of memory, but I have no real proof of the issue. I have used 
memory profiling (the pprof tool) and the OSDs are staying within their 4GB 
allocated limit.
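
In case it helps anyone reproduce what I'm looking at, these are roughly the 
commands I've been using on a single OSD (just a sketch; osd.14 is an example 
id):

# What the OSD believes its memory target is
ceph daemon osd.14 config get osd_memory_target
# BlueStore cache and other allocations as the OSD itself accounts for them
ceph daemon osd.14 dump_mempools
# tcmalloc's view of the heap - this is where I see the ~4GB figure
ceph tell osd.14 heap stats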

My questions are:

1. How do you know if the allocated memory is enough for an OSD? My 1TB disks 
and 12TB disks take the same amount of memory, and I wonder whether OSDs should 
have memory allocated based on the size of their disks.
2. In the past, SSD OSDs needed three times the memory of HDD OSDs and now they 
don't. Why is that? (1GB of RAM per HDD and 3GB per SSD both became 4GB.)
3. I have read that the number of placement groups per OSD is a significant 
factor in memory usage. I generally have ~200 placement groups per OSD; this is 
at the higher end of the recommended values and I wonder if it's causing the 
high memory usage.

For reference, the hosts are 1 x 6-core CPU, 72GB RAM, 14 OSDs, 2 x 10Gbit. LSI 
CacheCade / writeback cache for the HDDs and LSI JBOD for the SSDs. There are 9 
hosts in this cluster.
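
If memory does turn out to be the limit, my current plan is something like the 
sketch below: raise osd_memory_target only for the 12TB OSDs. The 6GB value and 
the osd.3 id are just examples I made up, not recommendations:

# persist a larger target for one of the big OSDs (6GB here is an example)
ceph config set osd.3 osd_memory_target 6442450944
# or apply it at runtime without a restart
ceph tell osd.3 injectargs '--osd_memory_target 6442450944'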

Kind regards,
Glen Baars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] block.db linking to 2 disks

2019-02-24 Thread Ashley Merrick
After a reboot of a node I have one particular OSD that won't start (latest
Mimic).

When I "/var/lib/ceph/osd/ceph-8 # ls -lsh"

I get "   0 lrwxrwxrwx 1 root root 19 Feb 25 02:09 block.db -> '/dev/sda5
/dev/sdc5'"

For some reason it is trying to link block.db to two disks. If I remove
the block.db link and manually create the correct link, the OSD still fails
to start because the ownership of the block.db target is root:root.

If I chown it to ceph:ceph, it just goes back to root:root and the following
shows in the OSD logs:

2019-02-25 02:03:21.738 7f574b2a1240 -1 bluestore(/var/lib/ceph/osd/ceph-8)
_open_db /var/lib/ceph/osd/ceph-8/block.db symlink exists but target
unusable: (13) Permission denied
2019-02-25 02:03:21.738 7f574b2a1240  1 bdev(0x55dbf0a56700
/var/lib/ceph/osd/ceph-8/block) close
2019-02-25 02:03:22.034 7f574b2a1240 -1 osd.8 0 OSD:init: unable to mount
object store
2019-02-25 02:03:22.034 7f574b2a1240 -1  ** ERROR: osd init failed: (13)
Permission denied
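
For reference, the fix I would expect to need is roughly the sketch below (sda5
being the DB partition in my case), but the device node ownership just reverts
to root:root:

# fix ownership of the symlink itself and of the DB partition, then retry
chown -h ceph:ceph /var/lib/ceph/osd/ceph-8/block.db
chown ceph:ceph /dev/sda5
systemctl start ceph-osd@8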

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Experiences with the Samsung SM/PM883 disk?

2019-02-24 Thread Paul Emmerich
That sounds more like the result I expected; maybe there's something
wrong with my disk or server (other disks perform fine, though).

Paul

On Fri, Feb 22, 2019 at 8:25 PM Jacob DeGlopper  wrote:
>
> What are you connecting it to?  We just got the exact same drive for
> testing, and I'm seeing much higher performance, connected to a
> motherboard 6 Gb SATA port on a Supermicro X9 board.
>
> [root@centos7 jacob]# smartctl -a /dev/sda
>
> Device Model: Samsung SSD 883 DCT 960GB
> Firmware Version: HXT7104Q
> SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
>
> [root@centos7 jacob]# fio --filename=/dev/sda --direct=1 --sync=1
> --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
> --group_reporting --name=journal-test
>
> write: IOPS=15.9k, BW=62.1MiB/s (65.1MB/s)(3728MiB/60001msec)
>
> 8 processes:
>
> write: IOPS=58.1k, BW=227MiB/s (238MB/s)(13.3GiB/60003msec)
>
>
> On 2/22/19 8:47 AM, Paul Emmerich wrote:
> > Hi,
> >
> > it looks like the beloved Samsung SM/PM863a is no longer available and
> > the replacement is the new SM/PM883.
> >
> > We got a 960GB PM883 (MZ7LH960HAJR-5) here and I ran the usual
> > fio benchmark... and got horrible results :(
> >
> > fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k
> > --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
> > --name=journal-test
> >
> >   1 thread  - 1150 iops
> >   4 threads - 2305 iops
> >   8 threads - 4200 iops
> > 16 threads - 7230 iops
> >
> > Now that's a factor of 15 or so slower than the PM863a.
> >
> > Someone here reports better results with a 883:
> > https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >
> > Maybe there's a difference between the SM and PM variant of these new
> > disks for performance? (This wasn't the case for the 863a)
> >
> > Does anyone else have these new 883 disks yet?
> > Any experience reports?
> >
> > Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-24 Thread Anthony D'Atri

> Date: Fri, 22 Feb 2019 16:26:34 -0800
> From: solarflow99 
> 
> 
> Aren't you undersized at only 30GB?  I thought you should have 4% of your
> OSDs

The 4% guidance is new.  Until relatively recently the oft-suggested and 
default value was 1%.

> From: "Vitaliy Filippov" 
> Numbers are easy to calculate from RocksDB parameters, however I also  
> don't understand why it's 3 -> 30 -> 300...
> 
> Default memtables are 256 MB, there are 4 of them, so L0 should be 1 GB,  
> L1 should be 10 GB, and L2 should be 100 GB?

I'm very curious as well; one would think that in practice the size and usage 
of the OSD would be factors, something the docs imply.

This is an area where we could really use more concrete guidance. Clusters 
using HDDs are often doing so for $/TB reasons, so economics and available 
slots constrain how much faster WAL+DB storage can be provisioned.
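
As a back-of-the-envelope sketch - and these are my assumptions, not anything 
from the docs: Ceph's default write_buffer_size and RocksDB's default 
max_bytes_for_level_base are both 256 MB, and the default level multiplier is 
10 - the level sizes would work out to:

# bash sketch of RocksDB level sizes under the assumptions above
base=256   # max_bytes_for_level_base in MB (assumed default)
mult=10    # max_bytes_for_level_multiplier (assumed default)
for level in 1 2 3 4; do
  echo "L${level}: $(( base * mult ** (level - 1) )) MB"
done
# prints L1: 256 MB, L2: 2560 MB, L3: 25600 MB, L4: 256000 MB

If only whole levels can live on the fast device, that would at least rhyme 
with the oft-quoted 3/30/300 GB steps once the WAL and L0 are added, but that 
is my reading rather than something I have verified.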

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuration about using nvme SSD

2019-02-24 Thread Vitaliy Filippov
I've tried 4 OSDs per fast SAS SSD in a test setup with only 2 such drives 
in the cluster - it increased CPU consumption a lot, but total 4KB random 
write IOPS (RBD) only went from ~11000 to ~22000. So it was a 2x increase, 
but at a huge cost.
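
For reference, numbers like that typically come from fio's rbd engine, roughly 
as in the sketch below - the pool and image names are placeholders, not my 
exact command:

fio --ioengine=rbd --pool=rbd --rbdname=testimg --direct=1 \
    --rw=randwrite --bs=4k --iodepth=128 --numjobs=1 \
    --runtime=60 --time_based --name=rbd-4k-randwrite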



> One thing that's worked for me to get more out of nvmes with Ceph is to
> create multiple partitions on the nvme with an osd on each partition. That
> way you get more osd processes and CPU per nvme device. I've heard of
> people using up to 4 partitions like this.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usenix Vault 2019

2019-02-24 Thread Alex Gorbachev
Oh, you are so close, David, but I have to go to Tampa to a client site;
otherwise I'd hop on a flight to Boston to say hi.

Hope you are doing well. Are you going to Cephalocon in Barcelona?

--
Alex Gorbachev
Storcium

On Sun, Feb 24, 2019 at 10:40 AM David Turner  wrote:
>
> There is a scheduled birds of a feather for Ceph tomorrow night, but I also 
> noticed that there are only trainings tomorrow. Unless you are paying more 
> for those, you likely don't have much to do on Monday. That's the boat I'm 
> in. Is anyone interested in getting together tomorrow in Boston during the 
> training day?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usenix Vault 2019

2019-02-24 Thread David Turner
There is a scheduled birds of a feather for Ceph tomorrow night, but I also
noticed that there are only trainings tomorrow. Unless you are paying more
for those, you likely don't have much to do on Monday. That's the boat I'm
in. Is anyone interested in getting together tomorrow in Boston during the
training day?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuration about using nvme SSD

2019-02-24 Thread David Turner
One thing that's worked for me to get more out of nvmes with Ceph is to
create multiple partitions on the nvme with an osd on each partition. That
way you get more osd processes and CPU per nvme device. I've heard of
people using up to 4 partitions like this.
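
If you want to try it, the general shape is something like this sketch (the
device name and the 4-way split are only examples, not the exact commands I
ran):

# carve the NVMe into 4 partitions (example device)
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart primary 0% 25%
parted -s /dev/nvme0n1 mkpart primary 25% 50%
parted -s /dev/nvme0n1 mkpart primary 50% 75%
parted -s /dev/nvme0n1 mkpart primary 75% 100%
# one OSD per partition
for p in 1 2 3 4; do ceph-volume lvm create --data /dev/nvme0n1p${p}; done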

On Sun, Feb 24, 2019, 10:25 AM Vitaliy Filippov  wrote:

> > We can get 513558 IOPS in 4K read per nvme by fio but only 45146 IOPS
> > per OSD by rados.
>
> Don't expect Ceph to fully utilize NVMes - it's software and it's slow :)
> Some colleagues say that SPDK works out of the box but barely increases
> performance, because the userland-kernel interaction isn't the current
> bottleneck; it's the Ceph code itself. I also tried it once, but couldn't
> make it work. When I have some spare NVMes I'll make another attempt.
>
> So... try it and share your results here :) we're all interested.
>
> --
> With best regards,
>Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuration about using nvme SSD

2019-02-24 Thread Vitaliy Filippov

> We can get 513558 IOPS in 4K read per nvme by fio but only 45146 IOPS
> per OSD by rados.


Don't expect Ceph to fully utilize NVMes - it's software and it's slow :) 
Some colleagues say that SPDK works out of the box but barely increases 
performance, because the userland-kernel interaction isn't the current 
bottleneck; it's the Ceph code itself. I also tried it once, but couldn't 
make it work. When I have some spare NVMes I'll make another attempt.


So... try it and share your results here :) we're all interested.

--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com