Re: [ceph-users] Huge latency spikes

2019-01-06 Thread Xavier Trilla
My recommendation would be to definitely use battery-backed RAID controllers for spinning disks -just a RAID 0 per disk- and IT-mode controllers for SSDs. We did plenty of testing, and spinning disks REALLY benefit from the RAID controller's write cache -in write-back mode-, but SSDs are faster
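
(For anyone wanting to translate that recommendation into commands: a rough sketch with MegaCli, assuming a MegaRAID/PERC adapter at index 0 and a hypothetical spinner at enclosure:slot 32:0 - adjust the IDs to whatever your own PDList output shows.)

  # list physical drives to find the enclosure:slot pairs
  /opt/MegaRAID/MegaCli/MegaCli64 -PDList -a0 | grep -E "Enclosure Device ID|Slot Number"
  # one single-drive RAID 0 per spinning disk: write-back, read-ahead,
  # and no write cache if the BBU is bad
  /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0[32:0] WB RA Direct NoCachedBadBBU -a0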

Re: [ceph-users] Huge latency spikes

2018-12-31 Thread Marc Schöchlin
Hi, our Dell servers contain "PERC H730P Mini" raid controllers with 2GB of battery-backed cache memory. All of our Ceph OSD disks (typically 12 * 8TB spinners or 16 * 1-2 TB SSDs per node) are used directly, without using the raid functionality. We deactivated the cache of the controller for the

Re: [ceph-users] Huge latency spikes

2018-11-20 Thread Ashley Merrick
Quite a few others and I have had high random-latency issues with disk cache enabled.

,Ash

On Tue, 20 Nov 2018 at 9:09 PM, Alex Litvak wrote:
> John,
>
> If I go with write through, shouldn't disk cache be enabled?
>
> On 11/20/2018 6:12 AM, John Petrini wrote:
> > I would disable cache on

Re: [ceph-users] Huge latency spikes

2018-11-20 Thread Alex Litvak
John, If I go with write through, shouldn't disk cache be enabled? On 11/20/2018 6:12 AM, John Petrini wrote: I would disable cache on the controller for your journals. Use write through and no read ahead. Did you make sure the disk cache is disabled? On Tuesday, November 20, 2018, Alex

Re: [ceph-users] Huge latency spikes

2018-11-20 Thread John Petrini
I would disable cache on the controller for your journals. Use write through and no read ahead. Did you make sure the disk cache is disabled? On Tuesday, November 20, 2018, Alex Litvak wrote: > I went through raid controller firmware update. I replaced a pair of SSDs with new ones. Nothing
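
(A minimal sketch of what that would look like on the MegaRAID/PERC controllers discussed in this thread, assuming adapter 0; -Lall hits every logical drive, so use -L<n> if you only want to touch the journal VDs.)

  # set the controller cache to write-through for the logical drives
  /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp WT -Lall -a0
  # disable read-ahead
  /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp NORA -Lall -a0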

Re: [ceph-users] Huge latency spikes

2018-11-19 Thread Alex Litvak
I went through a raid controller firmware update. I replaced a pair of SSDs with new ones. Nothing has changed. Per the controller card utility, no patrol read is happening and the battery backup is in good shape. Cache policy is WriteBack. I am aware of the bad-battery effect, but it

Re: [ceph-users] Huge latency spikes

2018-11-19 Thread Brendan Moloney
Hi,

> Raid card for journal disks is Perc H730 (Megaraid), RAID 1, battery back cache is on
>
> Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
> Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
>
> I have 2 other nodes

Re: [ceph-users] Huge latency spikes

2018-11-18 Thread Ashley Merrick
Ah yes, sorry - that will be because you're behind a raid card. You'll need to check the raid config. I know on an HP card, for example, you have an option called enable disk cache. This is separate from enabling the raid card cache; the config should be per drive (it is on HP), so it's worth checking the config outputs for

Re: [ceph-users] Huge latency spikes

2018-11-18 Thread Alex Litvak
Hmm, on all nodes:

hdparm -W /dev/sdb

/dev/sdb:
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 write-caching = not supported

On 11/18/2018 10:30 AM, Ashley Merrick wrote: hdparm -W /dev/xxx should show
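
(That SG_IO error is typical when the drive sits behind a MegaRAID controller: hdparm only sees the virtual disk, not the physical SSD. One possible workaround is to query the physical drive through the controller's passthrough with smartctl; the number after megaraid, is the "Device Id" from PDList, and 0 / /dev/sdb below are placeholders.)

  # find the physical drive's Device Id
  /opt/MegaRAID/MegaCli/MegaCli64 -PDList -a0 | grep "Device Id"
  # query that drive through the controller
  smartctl -a -d megaraid,0 /dev/sdb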

Re: [ceph-users] Huge latency spikes

2018-11-18 Thread Ashley Merrick
hdparm -W /dev/xxx should show you

On Mon, 19 Nov 2018 at 12:28 AM, Alex Litvak wrote:
> All machines state the same.
>
> /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -Lall -a0
>
> Adapter 0-VD 0(target id: 0): Disk Write Cache : Disk's Default
> Adapter 0-VD 1(target id: 1): Disk Write

Re: [ceph-users] Huge latency spikes

2018-11-18 Thread Alex Litvak
All machines state the same.

/opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -Lall -a0

Adapter 0-VD 0(target id: 0): Disk Write Cache : Disk's Default
Adapter 0-VD 1(target id: 1): Disk Write Cache : Disk's Default

I assume they are all on which is actually bad based on common sense.
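
("Disk's Default" means the controller defers to whatever the drive firmware ships with, so the on-drive cache may well be on. If you want it unambiguously off, a sketch with the same MegaCli, assuming adapter 0:)

  # force the on-drive write cache off for all logical drives
  /opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp -DisDskCache -Lall -a0
  # verify - the output should no longer say "Disk's Default"
  /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -Lall -a0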

Re: [ceph-users] Huge latency spikes

2018-11-18 Thread Serkan Çoban
I am not talking about the controller cache; you should check the SSD disk caches.

On Sun, Nov 18, 2018 at 11:40 AM Alex Litvak wrote:
>
> All 3 nodes have this status for SSD mirror. Controller cache is on for all 3.
>
> Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
>

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
All 3 nodes have this status for SSD mirror. Controller cache is on for all 3.

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

On 11/18/2018 12:45 AM, Serkan Çoban wrote: Does

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Serkan Çoban
Is the write cache on the SSDs enabled on all three servers? Can you check them?

On Sun, Nov 18, 2018 at 9:05 AM Alex Litvak wrote:
>
> Raid card for journal disks is Perc H730 (Megaraid), RAID 1, battery back cache is on
>
> Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
Raid card for journal disks is Perc H730 (Megaraid), RAID 1, battery back cache is on

Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU

I have 2 other nodes with older Perc H710 and
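
(Those policy lines are what MegaCli prints per logical drive; if anyone wants to pull them on their own nodes, something like the following should do it, assuming adapter 0.)

  /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0 | grep -E "Cache Policy|Disk Cache"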

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Serkan Çoban
>10ms w_await for an SSD is too much. How is that SSD connected to the system? Is any raid card installed on this system? What is the raid mode?

On Sun, Nov 18, 2018 at 8:25 AM Alex Litvak wrote:
>
> Here is another snapshot. I wonder if this write io wait is too big
> Device: rrqm/s

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
Here is another snapshot. I wonder if this write io wait is too big

Device:   rrqm/s  wrqm/s    r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
dm-14       0.00    0.00   0.00   23.00     0.00   336.00    29.22     0.34  14.74    0.00
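
(For context, columns such as w_await and avgqu-sz come from sysstat's extended device statistics; a sketch of how they are typically captured, plus one way to map a dm-14 style name back to a real device, assuming sysstat is installed.)

  iostat -x 1            # extended per-device stats at 1-second intervals
  ls -l /dev/mapper      # the symlink targets show which mapper device is dm-14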

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
I stand corrected, I looked at the device iostat, but it was partitioned. Here is a more correct picture of what is going on now.

Device:   rrqm/s  wrqm/s    r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
dm-14       0.00    0.00   0.00

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread John Petrini
The iostat isn't very helpful because there are not many writes. I'd recommend disabling cstates entirely; I'm not sure it's your problem, but it's good practice, and if your cluster goes as idle as your iostat suggests it could be the culprit.
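
(A hedged sketch of what "disabling cstates entirely" usually means in practice - kernel boot parameters, so a reboot is needed, and the exact values are a judgment call:)

  # append to GRUB_CMDLINE_LINUX in /etc/default/grub, regenerate the grub config, reboot
  intel_idle.max_cstate=0 processor.max_cstate=1

A tuned profile such as latency-performance achieves much the same thing at runtime by holding /dev/cpu_dma_latency low, which keeps cores out of deep C-states without a reboot.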

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
The plot thickens: I checked c-states and apparently I am operating in c1 with all CPUs on. Apparently the servers were tuned to use latency-performance.

tuned-adm active
Current active profile: latency-performance

turbostat shows

Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread John Petrini
You can check if cstates are enabled with cat /proc/acpi/processor/info. Look for power management: yes/no. If they are enabled, you can then check the current cstate of each core. 0 is the CPU's normal operating range; any other state means the processor is in a power-saving mode. cat
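
(On newer kernels the /proc/acpi interface may be missing; the same information is available through sysfs or the cpupower tool, for example:)

  # available idle states for one CPU
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
  # supported and currently allowed C-states (kernel-tools / linux-tools package)
  cpupower idle-info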

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
John, Thank you for the suggestions. I looked into the journal SSDs. It is close to 3 years old, showing 5.17% of wear (352941 GB written to disk, against a 3.6 PB endurance spec over 5 years). It could be that SMART is not telling all, but that is what I see. Vendor Specific SMART Attributes with Thresholds:
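
(Wear and total-bytes-written figures like these come from smartctl's attribute table; a sketch, noting that the relevant attribute names vary by vendor - e.g. Media_Wearout_Indicator, Wear_Leveling_Count, Total_LBAs_Written - and that drives behind a PERC need the megaraid passthrough mentioned earlier in the thread. /dev/sdb and N are placeholders.)

  smartctl -A /dev/sdb
  # or, behind the controller (N = Device Id from PDList):
  smartctl -A -d megaraid,N /dev/sdb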

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread John Petrini
I'd take a look at cstates if it's only happening during periods of low activity. If your journals are on SSD you should also check their health. They may have exceeded their write endurance - high apply latency is a telltale sign of this, and you'd see high iowait on those disks.

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
I am evaluating bluestore on a separate cluster. Unfortunately, upgrading this one is out of the question at the moment for multiple reasons. That is why I am trying to find a possible root cause. On 11/17/2018 2:14 PM, Paul Emmerich wrote: Are you running FileStore? (The config options

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Paul Emmerich
Are you running FileStore? (The config options you are using look like a FileStore config.) Try out BlueStore; we've found that it reduces random latency spikes due to filesystem weirdness a lot. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io
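
(If it's not obvious from the config which objectstore the OSDs are actually running, the OSD metadata reports it; a quick sketch, assuming a reasonably recent ceph CLI on the admin node and OSD id 0 as an example.)

  ceph osd metadata 0 | grep osd_objectstore     # one OSD
  # or count FileStore OSDs across the whole cluster
  ceph osd metadata | grep -c '"osd_objectstore": "filestore"'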

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
I am using libvirt for block devices (OpenStack, Proxmox KVM VMs). I am also mounting CephFS inside of VMs and on bare-metal hosts; in that case it would be a kernel-based client. From what I can see based on pool stats, CephFS pools have higher utilization compared to block pools during the

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Kees Meijs
Hi Alex,

What kind of clients do you use? Is it KVM (QEMU) using NBD driver, kernel, or...?

Regards,
Kees

On 17-11-18 20:17, Alex Litvak wrote:
> Hello everyone,
>
> I am trying to troubleshoot cluster exhibiting huge spikes of latency.
> I cannot quite catch it because it happens during the

[ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
Hello everyone, I am trying to troubleshoot a cluster exhibiting huge spikes of latency. I cannot quite catch it because it happens during light activity and randomly affects one OSD node out of 3 in the pool. This is a FileStore setup. I see some OSDs exhibit applied latency of 400 ms, 1
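
(The applied/commit latency figures referred to here are the per-OSD numbers the cluster itself reports; a sketch of how they are typically pulled while a spike is happening.)

  ceph osd perf          # per-OSD commit/apply latency in ms
  ceph health detail     # surfaces slow/blocked requests when they occur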