I wanted to report an update.
We added more Ceph storage nodes, so we can take the problem OSDs out.
Speeds are faster.
I found a way to monitor OSD latency in Ceph, using "ceph pg dump osds".
The commit latency (fs_perf_stat/commit_latency_ms) is always "0" for us,
but the apply latency does show values.
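As a sketch of pulling those per-OSD numbers out (this assumes jq is installed and that the JSON field layout matches the fs_perf_stat path above; it is untested here, as it needs a live cluster):

```shell
# Dump per-OSD commit/apply latency from the cluster's pg stats.
# Assumes the fs_perf_stat field layout referenced above and that jq
# is installed.
ceph pg dump osds -f json 2>/dev/null | jq -r \
  '.[] | "osd.\(.osd)  commit=\(.fs_perf_stat.commit_latency_ms)ms  apply=\(.fs_perf_stat.apply_latency_ms)ms"'
```

If I recall correctly, "ceph osd perf" prints the same two latency columns directly, without any JSON parsing.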
Here are some random samples I recorded in the past 30 minutes.
11 K blocks 10542 kB/s 909 op/s
12 K blocks 15397 kB/s 1247 op/s
26 K blocks 34306 kB/s 1307 op/s
33 K blocks 48509 kB/s 1465 op/s
59 K blocks 59333 kB/s 999 op/s
172 K blocks 101939 kB/s 590 op/s
104 K
At 4M block sizes, you would only need 22.5 iops.
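The arithmetic behind that figure, as a quick check (assuming the ~90MB/s write load mentioned elsewhere in the thread):

```shell
# iops needed = throughput / block size: 90 MB/s written in 4M blocks
awk 'BEGIN { printf "%.1f iops\n", 90 / 4 }'
```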
On 2017-12-08 09:59, Maged Mokhtar wrote:
Hi Russell,
It is probably due to the difference in block sizes used in the test vs
your cluster load. You have a latency problem which is limiting your max
write iops to around 2.5K. For large block sizes you do not need that
many iops; for example, if you write in 4M block sizes you will only
need 22.5 iops.
I want to provide an update to my interesting situation.
(New storage nodes were purchased and are going into the cluster soon)
I have been monitoring the Ceph storage nodes with atop, and read/write
throughput with ceph-dash, for the last month.
I am regularly seeing 80-90MB/s of write throughput.
Yes, several have recommended the fio test now.
I cannot perform a fio test at this time, because the post referred to
directs us to write the fio test data directly to the disk device, e.g.
/dev/sdj. I'd have to take an OSD completely out in order to perform the
test, and I am not ready to do that.
I would be interested in seeing the results from the post mentioned by an
earlier contributor:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
Test an "old" M500 and a "new" M500 and see if the performance is A)
acceptable and B)
Yes, all the M500s we use are both journal and OSD, even the older ones.
We have a 3-year lifecycle and move older nodes from one Ceph cluster to
another.
On old systems with 3-year-old M500s, they run as RAID0, and run faster
than our current problem system with 1-year-old M500s, run as Non-RAID.
@Russell, are your "older Crucial M500"s being used as journals?
Crucial M500s are not to be used as a Ceph journal, in my experience with
them. They make good OSDs with an NVMe in front of them perhaps, but
not much else.
Ceph uses O_DSYNC for journal writes, and these drives do not handle
those writes well.
We have older Crucial M500 disks operating without such problems, so I
have to believe it is a hardware or firmware issue.
And it's peculiar seeing performance improve slightly, even 24 hours later,
when I stop and then start the OSDs.
Our actual writes are low, as most of our Ceph-cluster-based images
It is quite likely related; things are pointing to bad disks. Probably
the best thing is to plan for disk replacement, the sooner the better, as
it could get worse.
On 2017-10-27 02:22, Christian Wuerdig wrote:
Hm, not necessarily directly related to your performance problem,
however: these SSDs have a listed endurance of 72TB total data written.
Over a 5-year period that's 40GB a day, or approx 0.04 DWPD. Given
that you run the journal for each OSD on the same disk, that's
effectively at most 0.02 DWPD.
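The endurance figures work out as follows (the 1000GB drive size is taken from the 1TB drives described elsewhere in the thread):

```shell
# 72 TB of rated endurance spread over 5 years
awk 'BEGIN { printf "%.1f GB/day\n", 72000 / (5 * 365) }'   # ~40 GB/day
# Expressed as drive writes per day for a 1000 GB drive
awk 'BEGIN { printf "%.2f DWPD\n", 72000 / (5 * 365) / 1000 }'
```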
Would be nice to see your output of:
rados bench -p rbd 60 write --no-cleanup -t 56 -b 4096 -o 1M
Total time run: 60.005452
Total writes made: 438295
Write size: 4096
Object size: 1048576
Bandwidth (MB/sec): 28.5322
Stddev Bandwidth: 0.514721
Max
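Those numbers cross-check: writes divided by runtime gives the iops, and iops times the 4096-byte write size reproduces the reported bandwidth line (rados bench reports MiB/s):

```shell
# 438295 writes over 60.005452 s
awk 'BEGIN { printf "%.0f iops\n", 438295 / 60.005452 }'
# iops * 4096 bytes, in MiB/s, matches the Bandwidth line above
awk 'BEGIN { printf "%.4f MB/sec\n", 438295 * 4096 / 60.005452 / 1048576 }'
```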
I hope the firmware update fixes things for you.
Regarding monitoring: if your tool is able to record disk busy%, iops, and
throughput, then you do not need to run atop.
I still highly recommend you run the fio SSD test for sync writes:
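For reference, the sync-write test from that blog post looks roughly like this. It is destructive (it writes directly to the raw device), so /dev/sdX must be a spare disk with no OSD or data on it; the exact flags are from memory of the post and may differ slightly:

```shell
# Queue-depth-1 sync write test, per the blog post linked earlier.
# WARNING: destroys data on /dev/sdX. Use a spare disk only.
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
```

A journal-suitable SSD should sustain thousands of iops here; drives that cannot handle sync writes collapse to a few hundred or less.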
On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar wrote:
It depends on what stage you are in:
in production, probably the best thing is to set up a monitoring tool
(collectd/graphite/prometheus/grafana) to monitor both Ceph stats and
resource load. This will, among other things, show you if you have
slowing disks.
Before production you should
Thanks to all.
I took the OSDs down on the problem host, without shutting down the machine.
As predicted, our MB/s roughly doubled.
Using this bench/atop procedure, I found two other OSDs on another host
that are the next bottlenecks.
Is this the only good way to really test the performance of the
The two newest machines have the LSI MegaRAID SAS-3 3008 [Fury]. The first
one performs the best of the four. The second one is the problem host. The
Non-RAID option just takes RAID configuration out of the picture so Ceph
can have direct access to the disk. We need that to have Ceph's support of
Hello,
On Fri, 20 Oct 2017 13:35:55 -0500 Russell Glaue wrote:
On the machine in question, the 2nd newest, we are using the LSI MegaRAID
SAS-3 3008 [Fury], which allows us a "Non-RAID" option, and has no battery.
The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported
earlier, each single drive configured as RAID0.
Thanks for everyone's help.
I can attest that the battery in the RAID controller is a thing. I'm used
to using LSI controllers, but my current position has HP RAID controllers,
and we just tracked down 10 of our nodes that had >100ms await pretty much
always; they were the only 10 nodes in the cluster with failed batteries on
their controllers.
Hello,
On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
> That is a good idea.
> However, a previous rebalancing processes has brought performance of our
> Guest VMs to a slow drag.
>
Never mind that I'm not sure that these SSDs are particularly well suited
for Ceph; your problem is
That is a good idea.
However, a previous rebalancing process has brought performance of our
Guest VMs to a slow drag.
On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez wrote:
> Hi Russell,
>
> as you have 4 servers, assuming you are not doing EC pools, just stop all
>
I'm better off trying to solve the first hurdle.
This ceph cluster is in production serving 186 guest VMs.
-RG
On Thu, Oct 19, 2017 at 3:52 PM, David Turner wrote:
> Assuming the problem with swapping out hardware is having spare
> hardware... you could always switch
No, I have not ruled out the disk controller and backplane making the disks
slower.
Is there a way I could test that theory, other than swapping out hardware?
-RG
On Thu, Oct 19, 2017 at 3:44 PM, David Turner wrote:
Have you ruled out the disk controller and backplane in the server running
slower?
On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue wrote:
I ran the test on the Ceph pool, and ran atop on all 4 storage servers, as
suggested.
Out of the 4 servers:
3 of them performed with 17% to 30% disk %busy, and 11% CPU wait,
momentarily spiking up to 50% on one server, and 80% on another.
The 2nd newest server was almost averaging 90% disk %busy.
Just run the same 32-threaded rados test as you did before, and this time
run atop while the test is running, looking for %busy of CPU/disks. It
should give an idea of whether there is a bottleneck in them.
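Concretely, that procedure might look like this (pool name, duration, and block size mirror the bench invocation quoted elsewhere in the thread):

```shell
# On a client node: 32-thread 4K write load against the rbd pool
rados bench -p rbd 60 write -t 32 -b 4096 --no-cleanup

# Meanwhile, on each storage node: watch per-disk and per-CPU %busy
atop 2
```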
On 2017-10-18 21:35, Russell Glaue wrote:
I cannot run the write test reviewed at
the ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The
tests write directly to the raw disk device.
Reading an infile (created with urandom) on one SSD, and writing the
outfile to another OSD, yields about 17MB/s.
But isn't this write speed
Measuring resource load as outlined earlier will show if the drives are
performing well or not. Also, how many OSDs do you have?
On 2017-10-18 19:26, Russell Glaue wrote:
36 OSDs.
Each of the 4 storage servers has 9 1TB SSD drives, each drive as 1 OSD (no
RAID) == 36 OSDs.
Each drive is one LVM volume group, with two volumes: one volume for the
OSD, one volume for the journal.
Each OSD is formatted with xfs.
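A sketch of that per-drive layout (the device name and journal size here are hypothetical):

```shell
# One LVM volume group per SSD, split into a journal LV and an OSD LV
pvcreate /dev/sdb
vgcreate ceph-sdb /dev/sdb
lvcreate -L 10G -n journal ceph-sdb      # journal volume
lvcreate -l 100%FREE -n osd ceph-sdb     # remainder for the OSD data
mkfs.xfs /dev/ceph-sdb/osd               # OSD volume formatted with xfs
```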
On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar wrote:
The SSD drives are Crucial M500.
A Ceph user did some benchmarks and found it had good performance:
https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
However, a user comment from 3 years ago on the blog post you linked to
says to avoid the Crucial M500.
Yet, this
Check out the following link: some SSDs perform badly in Ceph due to sync
writes to the journal:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
Another thing that can help is to re-run the rados 32-thread test as a
stress load and view resource usage.
In my previous post, in one of my points, I was wondering if the request
size would increase if I enabled jumbo packets. Currently it is disabled.
@jdillama: The qemu settings for both of these two guest machines, with
RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing
the
First, a general comment: local RAID will be faster than Ceph for a
single-threaded (queue depth=1) IO operation test. A single-thread Ceph
client will see, at best, the same disk speed for reads, and writes 4-6
times slower than a single disk. Not to mention the latency of local disks
will be much better.
Take this with a grain of salt, but you could try passing
"min_io_size=,opt_io_size=" as part of QEMU's HD device parameters to see
if the OS picks up the larger IO defaults and actually uses them:
$ qemu <...snip...> -device driver=scsi-hd,<...snip...>,min_io_size=32768,opt_io_size=4194304