Re: [ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-08-06 Thread Anthony D'Atri
> However, I'm starting to think that the problem isn't with the number
> of threads that have work to do... the problem may just be that the
> OSD & PG code has enough thread locking happening that there is no
> possible way to have more than a few things happening on a single OSD
> (or perhaps a single placement group).
> 
> Has anyone thought about the problem from this angle?  This would help
> explain why multiple-OSDs-per-SSD is so effective (even though the
> thought of doing this in production is utterly terrifying).


When I researched this topic a few months back, the below is what I found -- HTH.
We’re planning to break up NVMe drives into multiple OSDs.  I don’t find this
terrifying so much as somewhat awkward: we’ll have to update our deployment and
troubleshooting/maintenance procedures accordingly.

Back in the day, conventional Ceph wisdom was to never put multiple OSDs on a
single device, but my sense is that this was an artifact of bottlenecked spinners.
The resultant seek traffic, I imagine, could be ugly -- but would it be worse than
what we already suffered with colo journals?  (*)  With a device that can handle
lots of IO depth without seeks, IMHO it’s not so bad, especially as Ceph has
evolved to cope better with larger numbers of OSDs.
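
FWIW, the carve-up we have in mind looks roughly like the following -- a sketch
only: the device path and OSD count are examples, and it assumes a ceph-volume
recent enough to support the batch --osds-per-device flag.

  # Carve one NVMe device into several LVs/OSDs in a single batch run
  # (example device and count -- adjust for your hardware).
  ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1

  # Confirm what was created before moving on to the next device.
  ceph-volume lvm list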


"per-osd session lock", "all AIO completions are fired from a single thread – 
so even if you are pumping data to the OSDs using 8 threads, you are only 
getting serialized completions”

https://apawel.me/ceph-creating-multiple-osds-on-nvme-devices-luminous/

https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en

https://www.spinics.net/lists/ceph-devel/msg41570.html

https://bugzilla.redhat.com/show_bug.cgi?id=1541415

http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning


> With block sizes 64K and lower the avgqu-sz value never went above 1
> under any workload, and I never saw the iostat util% much above 50%.


I’ve been told that iostat %util isn’t as meaningful with SSDs as it was with 
HDDs, but I don’t recall the rationale.  ymmv.




*  And oh did we suffer from them :-x



Re: [ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-08-06 Thread Mark Lehrer
Thanks, that looks quite useful.

I did a few tests and got basically a null result.  In fact, when I
put the RBDs in different pools on the same SSDs, or in pools on
different SSDs, performance was a few percent worse than leaving them
in the same pool.  I definitely wasn't expecting this!

It looks like the only way to get the queue depth up is with larger
block sizes.  The way this ends up working is that one of the layers
(lvm?) splits the writes into smaller blocks (looks like 32K or 64K
according to avgrq-sz) and adds all the smaller blocks to the queue,
increasing the effective queue depth to 10 or so (still pretty low
considering how much work the /dev/rbd* devices are trying to do).
With block sizes 64K and lower the avgqu-sz value never went above 1
under any workload, and I never saw the iostat util% much above 50%.

The effective queue depth on each RBD is pegged at 128 in this case
(which is the default rbd queue size... max is only 256).
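
For reference, the kind of run I've been doing looks roughly like this (pool
and image names are placeholders, the image is a scratch one, and bumping the
krbd queue depth at map time assumes a kernel recent enough to support the
queue_depth map option):

  # Map a scratch image with a larger krbd queue depth (placeholder names).
  rbd map testpool/testimg -o queue_depth=256

  # Drive a deep queue against the mapped device (destructive to the image)...
  fio --name=qd-test --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=128 --numjobs=4 --runtime=60 --time_based

  # ...while watching the effective queue depth on the client and OSD hosts.
  iostat -mtxy 1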

There is so much performance left on the table with these SSDs that it
is painful and depressing... there must be a way other than
multiple-OSDs-per-SSD to keep these things busy.

Thanks,
Mark


On Tue, Aug 6, 2019 at 10:56 AM Mark Nelson  wrote:
>
> You may be interested in using my wallclock profiler to look at lock
> contention:
>
>
> https://github.com/markhpc/gdbpmp
>
>
> It will greatly slow down the OSD but will show you where time is being
> spent and so far the results appear to at least be relatively
> informative.  I used it recently when refactoring the bluestore caches
> to trim on add (from multiple threads) and break the bluestore cache
> into separate onode/buffer caches with their own locks:
>
>
> https://github.com/ceph/ceph/pull/28597
>
>
> One of the things you'll notice is that we have a single kv sync
> thread.  Historically that has been one of the limiting factors in terms
> of write throughput, though these days I tend to see a mix of various
> factors (potentially the shardedopwq, optracker, kv sync, etc).
> Certainly lock contention plays a part here.
>
>
> Mark
>
>
> On 8/6/19 11:41 AM, Mark Lehrer wrote:
> > I have a few more cycles this week to dedicate to the problem of
> > making OSDs do more than maybe 5 simultaneous operations (as measured
> > by the iostat effective queue depth of the drive).
> >
> > However, I'm starting to think that the problem isn't with the number
> > of threads that have work to do... the problem may just be that the
> > OSD & PG code has enough thread locking happening that there is no
> > possible way to have more than a few things happening on a single OSD
> > (or perhaps a single placement group).
> >
> > Has anyone thought about the problem from this angle?  This would help
> > explain why multiple-OSDs-per-SSD is so effective (even though the
> > thought of doing this in production is utterly terrifying).
> >
> > For my next set of tests, I'll try some multi-pool testing and see if
> > isolating the placement groups helps with the thread limitations I'm
> > seeing.  Last time, I was testing multiple RBDs in the same pool.
> >
> > Thanks,
> > Mark
> >
> >
> >
> > On Sat, May 11, 2019 at 5:50 AM Maged Mokhtar  wrote:
> >>
> >> On 10/05/2019 19:54, Mark Lehrer wrote:
> >>> I'm setting up a new Ceph cluster with fast SSD drives, and there is
> >>> one problem I want to make sure to address straight away:
> >>> comically-low OSD queue depths.
> >>>
> >>> On the past several clusters I built, there was one major performance
> >>> problem that I never had time to really solve, which is this:
> >>> regardless of how much work the RBDs were being asked to do, the OSD
> >>> effective queue depth (as measured by iostat's "avgrq-sz" column)
> >>> never went above 3... even if I had multiple RBDs with queue depths in
> >>> the thousands.
> >>>
> >>> This made sense back in the old days of spinning drives.  However, for
> >>> example with these particular drives and a 4K or 16K block size you
> >>> don't see maximum read performance until the queue depth gets to 50+.
> >>> At a queue depth of 4 the bandwidth is less than 20% what it is at
> >>> 256.  The bottom line here is that Ceph performance is simply
> >>> embarrassing whenever the OSD effective queue depth is in single
> >>> digits.
> >>>
> >>> On my last cluster, I spent a week or two researching and trying OSD
> >>> config parameters trying to increase the queue depth.  So far, the
> >>> only effective method I have seen to increase the effective OSD queue
> >>> depth is a gross hack - using multiple partitions per SSD to create
> >>> multiple OSDs.
> >>>
> >>> My questions:
> >>>
> >>> 1) Is there anyone on this list who has solved this problem already?
> >>> On the performance articles I have seen, the authors don't show iostat
> >>> results (or any OSD effective queue depth numbers) so I can't really
> >>> tell.
> >>>
> >>> 2) If there isn't a good response to #1, is anyone else out there able
> >>> to do some experimentation to help figure this out? 

Re: [ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-08-06 Thread Mark Nelson
You may be interested in using my wallclock profiler to look at lock 
contention:



https://github.com/markhpc/gdbpmp


It will greatly slow down the OSD, but it will show you where time is
being spent, and so far the results appear to be at least relatively
informative.  I used it recently when refactoring the bluestore caches
to trim on add (from multiple threads) and break the bluestore cache
into separate onode/buffer caches with their own locks:



https://github.com/ceph/ceph/pull/28597
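
In case it saves someone a trip to the README, the invocation against a live
OSD looks roughly like this -- the flags are written from memory, so
double-check them against gdbpmp.py --help, and the PID lookup is just an
example:

  # Attach to one ceph-osd process and collect wallclock samples.
  # Expect the OSD to slow down noticeably while the profiler is attached.
  ./gdbpmp.py -p $(pidof -s ceph-osd) -n 1000 -o osd.gdbpmp

  # Later, load the collected samples and display the call tree.
  ./gdbpmp.py -i osd.gdbpmp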


One of the things you'll notice is that we have a single kv sync 
thread.  Historically that has been one of the limiting factors in terms 
of write throughput, though these days I tend to see a mix of various 
factors (potentially the shardedopwq, optracker, kv sync, etc).  
Certainly lock contention plays a part here.
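
If you want a quick sanity check on whether the kv sync path is where an OSD's
time is going, the perf counters from the OSD admin socket are cheap to look
at.  Exact counter names vary by release, so grep rather than trusting the
ones I remember:

  # Dump all perf counters from one OSD's admin socket and pick out the
  # kv-related bluestore latency counters.
  ceph daemon osd.0 perf dump | grep -i kv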



Mark


On 8/6/19 11:41 AM, Mark Lehrer wrote:

I have a few more cycles this week to dedicate to the problem of
making OSDs do more than maybe 5 simultaneous operations (as measured
by the iostat effective queue depth of the drive).

However, I'm starting to think that the problem isn't with the number
of threads that have work to do... the problem may just be that the
OSD & PG code has enough thread locking happening that there is no
possible way to have more than a few things happening on a single OSD
(or perhaps a single placement group).

Has anyone thought about the problem from this angle?  This would help
explain why multiple-OSDs-per-SSD is so effective (even though the
thought of doing this in production is utterly terrifying).

For my next set of tests, I'll try some multi-pool testing and see if
isolating the placement groups helps with the thread limitations I'm
seeing.  Last time, I was testing multiple RBDs in the same pool.

Thanks,
Mark



On Sat, May 11, 2019 at 5:50 AM Maged Mokhtar  wrote:


On 10/05/2019 19:54, Mark Lehrer wrote:

I'm setting up a new Ceph cluster with fast SSD drives, and there is
one problem I want to make sure to address straight away:
comically-low OSD queue depths.

On the past several clusters I built, there was one major performance
problem that I never had time to really solve, which is this:
regardless of how much work the RBDs were being asked to do, the OSD
effective queue depth (as measured by iostat's "avgrq-sz" column)
never went above 3... even if I had multiple RBDs with queue depths in
the thousands.

This made sense back in the old days of spinning drives.  However, for
example with these particular drives and a 4K or 16K block size you
don't see maximum read performance until the queue depth gets to 50+.
At a queue depth of 4 the bandwidth is less than 20% what it is at
256.  The bottom line here is that Ceph performance is simply
embarrassing whenever the OSD effective queue depth is in single
digits.

On my last cluster, I spent a week or two researching and trying OSD
config parameters trying to increase the queue depth.  So far, the
only effective method I have seen to increase the effective OSD queue
depth is a gross hack - using multiple partitions per SSD to create
multiple OSDs.

My questions:

1) Is there anyone on this list who has solved this problem already?
On the performance articles I have seen, the authors don't show iostat
results (or any OSD effective queue depth numbers) so I can't really
tell.

2) If there isn't a good response to #1, is anyone else out there able
to do some experimentation to help figure this out?  All you would
need to do to get started is collect the output of this command while
a high-QD rbd test is happening: "iostat -mtxy 1" -- you should
collect it on all of the OSD servers as well as the client (you will
want to attach an RBD and talk to it via /dev/rbd0 otherwise iostat
probably won't see it).

3) If there is any technical reason why this is impossible, please let
me know before I get to far down this road... but because the multiple
partitions trick works so well I expect it must be possible somehow.

Thanks,
Mark

i assume you mean avgqu-sz (queue size) rather than avgrq-sz (request
size). if so, what avgrq-sz do you get ? what kernel and io scheduler
being used ?

It is not uncommon if the system is not well tuned for your workload,
you may have a bottleneck in cpu running near 100% and your disks would
be single digit % busy, the faster your disks are and the more disks you
have, the less they will be busy if there is some cpu or network
bottleneck. If so the queue depth on them will be very low.

It is also possible the cluster has good performance but the bottleneck
is from the client(s) doing the test and is/are not fast enough to fully
stress your cluster, hence your disks.

To know more, we need more numbers:
-How many SSDs/OSDs do you have, what is their raw device random 4k
write sync iops ?
-How many hosts and cpu cores do you have ?

Re: [ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-08-06 Thread Mark Lehrer
I have a few more cycles this week to dedicate to the problem of
making OSDs do more than maybe 5 simultaneous operations (as measured
by the iostat effective queue depth of the drive).

However, I'm starting to think that the problem isn't with the number
of threads that have work to do... the problem may just be that the
OSD & PG code has enough thread locking happening that there is no
possible way to have more than a few things happening on a single OSD
(or perhaps a single placement group).

Has anyone thought about the problem from this angle?  This would help
explain why multiple-OSDs-per-SSD is so effective (even though the
thought of doing this in production is utterly terrifying).

For my next set of tests, I'll try some multi-pool testing and see if
isolating the placement groups helps with the thread limitations I'm
seeing.  Last time, I was testing multiple RBDs in the same pool.
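
Concretely, the setup for that test will look something like this (pool names,
PG counts, and image sizes are placeholders):

  # Two small pools so each RBD gets its own set of placement groups.
  ceph osd pool create qd-test-1 64 64
  ceph osd pool create qd-test-2 64 64
  rbd pool init qd-test-1
  rbd pool init qd-test-2

  # One image per pool; run the same workload against each and compare the
  # effective queue depth on the OSD devices.
  rbd create qd-test-1/img1 --size 100G
  rbd create qd-test-2/img2 --size 100G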

Thanks,
Mark



On Sat, May 11, 2019 at 5:50 AM Maged Mokhtar  wrote:
>
>
> On 10/05/2019 19:54, Mark Lehrer wrote:
> > I'm setting up a new Ceph cluster with fast SSD drives, and there is
> > one problem I want to make sure to address straight away:
> > comically-low OSD queue depths.
> >
> > On the past several clusters I built, there was one major performance
> > problem that I never had time to really solve, which is this:
> > regardless of how much work the RBDs were being asked to do, the OSD
> > effective queue depth (as measured by iostat's "avgrq-sz" column)
> > never went above 3... even if I had multiple RBDs with queue depths in
> > the thousands.
> >
> > This made sense back in the old days of spinning drives.  However, for
> > example with these particular drives and a 4K or 16K block size you
> > don't see maximum read performance until the queue depth gets to 50+.
> > At a queue depth of 4 the bandwidth is less than 20% what it is at
> > 256.  The bottom line here is that Ceph performance is simply
> > embarrassing whenever the OSD effective queue depth is in single
> > digits.
> >
> > On my last cluster, I spent a week or two researching and trying OSD
> > config parameters trying to increase the queue depth.  So far, the
> > only effective method I have seen to increase the effective OSD queue
> > depth is a gross hack - using multiple partitions per SSD to create
> > multiple OSDs.
> >
> > My questions:
> >
> > 1) Is there anyone on this list who has solved this problem already?
> > On the performance articles I have seen, the authors don't show iostat
> > results (or any OSD effective queue depth numbers) so I can't really
> > tell.
> >
> > 2) If there isn't a good response to #1, is anyone else out there able
> > to do some experimentation to help figure this out?  All you would
> > need to do to get started is collect the output of this command while
> > a high-QD rbd test is happening: "iostat -mtxy 1" -- you should
> > collect it on all of the OSD servers as well as the client (you will
> > want to attach an RBD and talk to it via /dev/rbd0 otherwise iostat
> > probably won't see it).
> >
> > 3) If there is any technical reason why this is impossible, please let
> > me know before I get to far down this road... but because the multiple
> > partitions trick works so well I expect it must be possible somehow.
> >
> > Thanks,
> > Mark
>
> i assume you mean avgqu-sz (queue size) rather than avgrq-sz (request
> size). if so, what avgrq-sz do you get ? what kernel and io scheduler
> being used ?
>
> It is not uncommon if the system is not well tuned for your workload,
> you may have a bottleneck in cpu running near 100% and your disks would
> be single digit % busy, the faster your disks are and the more disks you
> have, the less they will be busy if there is some cpu or network
> bottleneck. If so the queue depth on them will be very low.
>
> It is also possible the cluster has good performance but the bottleneck
> is from the client(s) doing the test and is/are not fast enough to fully
> stress your cluster, hence your disks.
>
> To know more, we need more numbers:
> -How many SSDs/OSDs do you have, what is their raw device random 4k
> write sync iops ?
> -How many hosts and cpu cores do you have ?
> -How many nics and their speed ?
> -What total iops do you get ? What params did you use for the 4k test ?
> is it random or sequential ?
> -Do you use enough threads/queue depth to stress all your OSDs in
> parallel ?
> -Run atop during the test, what cpu and disk % busy do you see on all
> hosts including clients ?
> -How many clients do you use ? For a fast cluster you may need many
> clients to stress it, keep increasing clients until your numbers saturate.
>
> /Maged


Re: [ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-05-11 Thread Maged Mokhtar



On 10/05/2019 19:54, Mark Lehrer wrote:

I'm setting up a new Ceph cluster with fast SSD drives, and there is
one problem I want to make sure to address straight away:
comically-low OSD queue depths.

On the past several clusters I built, there was one major performance
problem that I never had time to really solve, which is this:
regardless of how much work the RBDs were being asked to do, the OSD
effective queue depth (as measured by iostat's "avgrq-sz" column)
never went above 3... even if I had multiple RBDs with queue depths in
the thousands.

This made sense back in the old days of spinning drives.  However, for
example with these particular drives and a 4K or 16K block size you
don't see maximum read performance until the queue depth gets to 50+.
At a queue depth of 4 the bandwidth is less than 20% what it is at
256.  The bottom line here is that Ceph performance is simply
embarrassing whenever the OSD effective queue depth is in single
digits.

On my last cluster, I spent a week or two researching and trying OSD
config parameters trying to increase the queue depth.  So far, the
only effective method I have seen to increase the effective OSD queue
depth is a gross hack - using multiple partitions per SSD to create
multiple OSDs.

My questions:

1) Is there anyone on this list who has solved this problem already?
On the performance articles I have seen, the authors don't show iostat
results (or any OSD effective queue depth numbers) so I can't really
tell.

2) If there isn't a good response to #1, is anyone else out there able
to do some experimentation to help figure this out?  All you would
need to do to get started is collect the output of this command while
a high-QD rbd test is happening: "iostat -mtxy 1" -- you should
collect it on all of the OSD servers as well as the client (you will
want to attach an RBD and talk to it via /dev/rbd0 otherwise iostat
probably won't see it).

3) If there is any technical reason why this is impossible, please let
me know before I get to far down this road... but because the multiple
partitions trick works so well I expect it must be possible somehow.

Thanks,
Mark


I assume you mean avgqu-sz (queue size) rather than avgrq-sz (request
size).  If so, what avgrq-sz do you get?  What kernel and I/O scheduler
are being used?


It is not uncommon, if the system is not well tuned for your workload,
to have a CPU bottleneck running near 100% while your disks sit at
single-digit % busy.  The faster your disks are and the more disks you
have, the less busy they will be if there is some CPU or network
bottleneck; if so, the queue depth on them will be very low.


It is also possible that the cluster itself performs well but the
bottleneck is the client(s) driving the test, which are not fast enough
to fully stress your cluster, and hence your disks.


To know more, we need more numbers:
- How many SSDs/OSDs do you have, and what is their raw device random 4k
sync write iops?  (A sample fio command for this follows below this list.)
- How many hosts and CPU cores do you have?
- How many NICs, and at what speed?
- What total iops do you get?  What params did you use for the 4k test?
Is it random or sequential?
- Do you use enough threads/queue depth to stress all your OSDs in
parallel?
- Run atop during the test: what cpu and disk % busy do you see on all
hosts, including clients?
- How many clients do you use?  For a fast cluster you may need many
clients to stress it; keep increasing clients until your numbers saturate.
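
As a rough example of what I mean by raw device random 4k sync write iops --
destructive, so run it only against a scratch device you can wipe, and the
device name is a placeholder:

  # QD=1, O_DIRECT + O_SYNC random 4k writes against the raw device; a common
  # way to approximate the per-write latency an OSD will see from the device.
  fio --name=raw-4k-sync --filename=/dev/nvme0n1 --ioengine=libaio \
      --direct=1 --sync=1 --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
      --runtime=60 --time_based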


/Maged


[ceph-users] How to maximize the OSD effective queue depth in Ceph?

2019-05-10 Thread Mark Lehrer
I'm setting up a new Ceph cluster with fast SSD drives, and there is
one problem I want to make sure to address straight away:
comically-low OSD queue depths.

On the past several clusters I built, there was one major performance
problem that I never had time to really solve, which is this:
regardless of how much work the RBDs were being asked to do, the OSD
effective queue depth (as measured by iostat's "avgrq-sz" column)
never went above 3... even if I had multiple RBDs with queue depths in
the thousands.

This made sense back in the old days of spinning drives.  However, with
these particular drives, for example, and a 4K or 16K block size you
don't see maximum read performance until the queue depth gets to 50+.
At a queue depth of 4 the bandwidth is less than 20% of what it is at
256.  The bottom line here is that Ceph performance is simply
embarrassing whenever the OSD effective queue depth is in single
digits.

On my last cluster, I spent a week or two researching and trying OSD
config parameters to increase the queue depth.  So far, the
only effective method I have seen to increase the effective OSD queue
depth is a gross hack - using multiple partitions per SSD to create
multiple OSDs.

My questions:

1) Is there anyone on this list who has solved this problem already?
On the performance articles I have seen, the authors don't show iostat
results (or any OSD effective queue depth numbers) so I can't really
tell.

2) If there isn't a good response to #1, is anyone else out there able
to do some experimentation to help figure this out?  All you would
need to do to get started is collect the output of this command while
a high-QD rbd test is happening: "iostat -mtxy 1" -- you should
collect it on all of the OSD servers as well as the client (you will
want to attach an RBD and talk to it via /dev/rbd0, otherwise iostat
probably won't see it).  A rough sketch of this collection follows just
below this list.

3) If there is any technical reason why this is impossible, please let
me know before I get too far down this road... but because the multiple
partitions trick works so well, I expect it must be possible somehow.
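
To make question 2 concrete, the collection I have in mind is roughly this
(pool and image names are placeholders):

  # On the client: map the image so it shows up as a kernel block device.
  rbd map testpool/testimg        # appears as e.g. /dev/rbd0

  # On the client and on every OSD server, capture iostat for the whole run.
  iostat -mtxy 1 > iostat.$(hostname).log &

  # On the client: drive a high queue depth against the mapped device
  # (randread, so the image contents are left alone).
  fio --name=highqd --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=256 --runtime=120 --time_based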

Thanks,
Mark