I'm using bcache for all 12 HDDs on the 2 SSDs (since around mid-December;
before that you'll see much higher await), with NVMe for the journals. (A
few months ago I also replaced all the 2TB disks with 6TB ones and added
ceph4 and ceph5.)
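For context, that layout is roughly what you'd get from something like the
following (illustrative sketch only, not my exact commands; device names
are placeholders, and the cset UUID comes from bcache-super-show on the
cache device):

```shell
# SSD becomes a bcache cache set; each HDD a backing device (placeholder names):
make-bcache -C /dev/sdb
make-bcache -B /dev/sdc /dev/sdd

# attach each backing device to the cache set by the cache set's UUID:
CSET_UUID=$(bcache-super-show /dev/sdb | awk '/cset.uuid/ {print $2}')
echo "$CSET_UUID" > /sys/block/bcache0/bcache/attach
```

The OSD filestore then sits on /dev/bcache0 etc., while the journal
partition lives directly on the NVMe.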
Here's my iostat in ganglia:
raw per-disk await:
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?title[]=ceph.*[]=sd[a-z]_await=line=show=1
per-host max await:
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?title[]=ceph.*[]=max_await=line=show=1
strangely aggregated data (my metric is the max across the disks on a
host, but ganglia then averages it across disks/hosts or something, so
it's not really a max):
http://www.brockmann-consult.de/ganglia/graph_all_periods.php?c=ceph=network_report=week=by%20name=4=2=1501155678=disk_wait_report=large
Or, to explore and make your own graphs, start from here:
http://www.brockmann-consult.de/ganglia/
I didn't find any ganglia plugins for this, so I wrote one that takes a
30s average from iostat every minute and stores it. So when you see a
number like 400 in my data, it could have been a steady 400 for the whole
30 seconds, or 4000 for 3 seconds and then 0 for 27 seconds averaged
together; also, 30s of every minute is missing from the data.
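The sampling idea is roughly this (a sketch only, not my actual plugin;
the awk field number assumes the iostat -x column layout shown in the
quoted output below, where await is the 10th column):

```shell
#!/bin/sh
# Sketch: extract per-disk await from extended iostat output.
# "iostat -dx 30 2" prints a since-boot report first, then a 30s average;
# you'd feed the second report into this and push the values to ganglia.
parse_await() {
  # keep only sdX device lines; await is field 10 in this column layout
  awk '$1 ~ /^sd[a-z]+$/ { print $1, "await=" $10 }'
}

# usage (second report = the 30s average):
# iostat -dx 30 2 | parse_await
```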
In my data, sda,b,c on ceph1,2,3 are probably always the SSDs, and sdm,n
on ceph4,5 are currently the SSDs (and were possibly sda,b at some point;
rebooting sometimes shuffles the device names). Not ideal, but I'm not
sure how to fix it... maybe a udev rule to name the SSDs differently.
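Something along these lines might work (untested sketch; the rule file
name is my invention, and it keys on the kernel's rotational flag plus the
drive serial so the name survives reboots):

```
# /etc/udev/rules.d/61-ssd-symlink.rules (hypothetical)
# Give every non-rotational SCSI disk a stable symlink under /dev/ssd/,
# keyed on its serial, regardless of which sdX letter it gets.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd?", \
  ATTR{queue/rotational}=="0", SYMLINK+="ssd/$env{ID_SERIAL}"
```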
Also note that I found the deadline scheduler gives much lower iowait and
latency than CFQ, though not necessarily more throughput or IOPS... you
could test that. But be aware that not using CFQ may disable some Ceph
priority settings (or maybe that's no longer relevant since Jewel?).
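If you want to try it, switching is a one-liner per disk (sketch, with
sdh as a placeholder; it takes effect immediately but does not survive a
reboot unless you persist it, e.g. via a udev rule or the elevator=
kernel parameter):

```shell
# show the available schedulers; the bracketed one is active:
cat /sys/block/sdh/queue/scheduler
# switch to deadline (needs root):
echo deadline > /sys/block/sdh/queue/scheduler
```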
PS: use a fixed-width font for your iostat output and it's much more
readable in HTML-capable mail clients... see below, where I reformatted it.
On 07/27/17 05:48, John Petrini wrote:
> Hello list,
>
> Just curious if anyone has ever seen this behavior and might have some
> ideas on how to troubleshoot it.
>
> We're seeing very high iowait in iostat across all OSDs on a
> single OSD host. It's very spiky - dropping to zero and then shooting
> up to as high as 400 in some cases. Despite this it does not seem to
> be having a major impact on the cluster's performance as a whole.
>
> Some more details:
> 3x OSD nodes - Dell R730's: 24 cores @ 2.6GHz, 256GB RAM, 20x 1.2TB
> 10K SAS OSDs per node.
>
> We're running ceph hammer.
>
> Here's the output of iostat. Note that this is from a period when the
> cluster is not very busy but you can still see high spikes on a few
> OSD's. It's much worse during high load.
>
> Device:  rrqm/s  wrqm/s     r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda        0.00    0.00    0.00    0.50      0.00      6.00    24.00     0.00    8.00    0.00    8.00   8.00   0.40
> sdb        0.00    0.00    0.00   60.00      0.00    808.00    26.93     0.00    0.07    0.00    0.07   0.03   0.20
> sdc        0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdd        0.00    0.00    0.00   67.00      0.00   1010.00    30.15     0.01    0.09    0.00    0.09   0.09   0.60
> sde        0.00    0.00    0.00   93.00      0.00    868.00    18.67     0.00    0.04    0.00    0.04   0.04   0.40
> sdf        0.00    0.00    0.00   57.50      0.00    572.00    19.90     0.00    0.03    0.00    0.03   0.03   0.20
> sdg        0.00    1.00    0.00    3.50      0.00     22.00    12.57     0.75   16.00    0.00   16.00   2.86   1.00
> sdh        0.00    0.00    1.50   25.50      6.00    458.50    34.41     2.03   75.26    0.00   79.69   3.04   8.20
> sdi        0.00    0.00    0.00   30.50      0.00    384.50    25.21     2.36   77.51    0.00   77.51   3.28  10.00
> sdj        0.00    1.00    1.50  105.00      6.00    925.75    17.50    10.85  101.84    8.00  103.18   2.35  25.00
> sdl        0.00    0.00    2.00    0.00    320.00      0.00   320.00     0.01    3.00    3.00    0.00   2.00   0.40
> sdk        0.00    1.00    0.00   55.00      0.00    334.50    12.16     7.92  136.91    0.00  136.91   2.51  13.80
> sdm        0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdn        0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdo        0.00    0.00    1.00    0.00      4.00      0.00     8.00     0.00    4.00    4.00    0.00   4.00   0.40
> sdp        0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdq        0.50    0.00  756.00    0.00  93288.00      0.00   246.79     1.47    1.95    1.95    0.00   1.17  88.60
> sdr        0.00    0.00    1.00    0.00      4.00