Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
All 3 nodes have this status for the SSD mirror, and controller cache is on for all 3. Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU. Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU. On 11/18/2018 12:45 AM, Serkan Çoban wrote: Does

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Serkan Çoban
Is write cache enabled on the SSDs on all three servers? Can you check them? On Sun, Nov 18, 2018 at 9:05 AM Alex Litvak wrote: > > RAID card for journal disks is a Perc H730 (MegaRAID), RAID 1, battery-backed > cache is on > > Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad
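A quick way to answer Serkan's question on each node is to read the drive's own write-cache flag. A minimal sketch (assuming a SATA/SAS device visible to the OS; disks hidden behind a MegaRAID virtual disk may need MegaCli/storcli instead of hdparm):

```shell
# Report on/off for the on-disk write cache, given `hdparm -W /dev/sdX`
# output on stdin.
cache_enabled() {
  grep -q 'write-caching *= *1' && echo on || echo off
}

# On a live host you would run:
#   hdparm -W /dev/sda | cache_enabled
# Here we feed a captured sample line instead:
printf ' write-caching =  1 (on)\n' | cache_enabled   # prints: on
```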

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
The RAID card for the journal disks is a Perc H730 (MegaRAID), RAID 1, with battery-backed cache on. Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU. Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU. I have 2 other nodes with an older Perc H710 and

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Serkan Çoban
>10ms w_await for an SSD is too much. How is that SSD connected to the system? Is any RAID card installed on this system? What is the RAID mode? On Sun, Nov 18, 2018 at 8:25 AM Alex Litvak wrote: > > Here is another snapshot. I wonder if this write io wait is too big > Device: rrqm/s

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
Here is another snapshot. I wonder if this write io wait is too big. Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util dm-14 0.00 0.00 0.00 23.00 0.00 336.00 29.22 0.34 14.74 0.00
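To spot spikes like the one above without eyeballing full `iostat -x` dumps, a small filter over its output helps. A sketch, assuming the sysstat ~10.x column layout used in this snapshot (w_await is field 12):

```shell
# Print device name and w_await for any device whose write wait
# exceeds 10 ms; skips the header line.
flag_slow() {
  awk 'NR > 1 && $12 + 0 > 10 { printf "%s w_await=%sms\n", $1, $12 }'
}

# Live use would be: iostat -x 5 | flag_slow
# Sample run against captured lines:
flag_slow <<'EOF'
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
dm-14 0.00 0.00 0.00 23.00 0.00 336.00 29.22 0.34 14.74 0.00 14.74 0.26 0.60
sda 0.00 0.00 0.00 5.00 0.00 40.00 16.00 0.01 1.20 0.00 1.20 0.20 0.10
EOF
# prints: dm-14 w_await=14.74ms
```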

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
I stand corrected, I looked at the device iostat, but it was partitioned. Here is a more correct picture of what is going on now. Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util dm-14 0.00 0.00 0.00

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread John Petrini
The iostat isn't very helpful because there are not many writes. I'd recommend disabling C-states entirely; I'm not sure it's your problem, but it's good practice, and if your cluster goes as idle as your iostat suggests, it could be the culprit.
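One common way to act on this advice on an Intel box with GRUB2 is to cap the deepest C-state at boot. A sketch of a config fragment (paths and flags assume a RHEL/CentOS-style setup, adjust for your distro):

```shell
# In /etc/default/grub, add to the existing kernel command line:
GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=1 processor.max_cstate=1"
# then regenerate the config and reboot:
#   grub2-mkconfig -o /boot/grub2/grub.cfg && reboot
```

A runtime alternative is holding `/dev/cpu_dma_latency` open at a low value, which is essentially what the latency-performance tuned profile does.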

[ceph-users] Use SSDs for metadata or for a pool cache?

2018-11-17 Thread Gesiel Galvão Bernardes
Hello, I am building a new cluster with 4 hosts, which have the following configuration: 128 GB RAM, 12 SATA HDs (8 TB, 7.2k rpm), 2 SSDs (240 GB), 2x10Gb network. I will use the cluster to store RBD images of VMs; I plan to use 2x replication if it does not get too slow. My question is: Using

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
The plot thickens: I checked C-states, and apparently I am operating in C1 with all CPUs on. The servers were tuned to use the latency-performance profile (tuned-adm active: Current active profile: latency-performance). turbostat shows Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread John Petrini
You can check whether C-states are enabled with cat /proc/acpi/processor/info; look for "power management: yes/no". If they are enabled, then you can check the current C-state of each core. 0 is the CPU's normal operating range; any other state means the processor is in a power-saving mode. cat
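Note that modern kernels dropped the /proc/acpi/processor interface; the equivalent information now lives in the cpuidle sysfs tree. A sketch (availability depends on the kernel and idle driver):

```shell
# List the C-states the idle driver exposes for cpu0, with usage counts.
cpuidle=/sys/devices/system/cpu/cpu0/cpuidle
if [ -d "$cpuidle" ]; then
  for d in "$cpuidle"/state*; do
    echo "$(basename "$d"): $(cat "$d/name") usage=$(cat "$d/usage")"
  done
else
  echo "no cpuidle sysfs here (driver disabled or not exposed)"
fi
```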

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
John, thank you for the suggestions. I looked into the journal SSDs. They are close to 3 years old, showing 5.17% wear (352,941 GB written, against a 3.6 PB endurance spec over 5 years). It could be that SMART isn't telling the whole story, but that is what I see. Vendor Specific SMART Attributes with Thresholds:
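For anyone repeating this check, the wear-related counters can be pulled straight out of `smartctl -A` output. A sketch; attribute names vary by vendor (these two are common on Intel DC-series drives), and the sample values below are made up:

```shell
# Print attribute name and raw value for the wear-related counters
# found in `smartctl -A /dev/sdX` output on stdin.
wear_report() {
  awk '/Media_Wearout_Indicator|Total_LBAs_Written/ { print $2, $NF }'
}

# Live use: smartctl -A /dev/sda | wear_report
wear_report <<'EOF'
233 Media_Wearout_Indicator 0x0032 097 097 000 Old_age Always - 0
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 740086426575
EOF
```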

Re: [ceph-users] PG auto repair with BlueStore

2018-11-17 Thread Paul Emmerich
> Wido den Hollander : > On 11/15/18 7:51 PM, koukou73gr wrote: > > Are there any means to notify the administrator that an auto-repair has > > taken place? > > I don't think so. You'll see the cluster go to HEALTH_ERR for a while > before it turns to HEALTH_OK again after the PG has been

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread John Petrini
I'd take a look at C-states if it's only happening during periods of low activity. If your journals are on SSD you should also check their health; they may have exceeded their write endurance. High apply latency is a telltale sign of this, and you'd see high iowait on those disks.

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
I am evaluating bluestore on the separate cluster. Unfortunately upgrading this one is out of the question at the moment for multiple reasons. That is why I am trying to find a possible root cause. On 11/17/2018 2:14 PM, Paul Emmerich wrote: Are you running FileStore? (The config options

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Paul Emmerich
Are you running FileStore? (The config options you are using look like a FileStore config.) Try out BlueStore; we've found that it reduces random latency spikes due to filesystem weirdness a lot. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
I am using libvirt for block devices (OpenStack, Proxmox KVM VMs). I am also mounting CephFS inside the VMs and on bare-metal hosts; in that case it is a kernel-based client. From what I can see based on pool stats, the CephFS pools have higher utilization compared to the block pools during the

Re: [ceph-users] PG auto repair with BlueStore

2018-11-17 Thread Paul Emmerich
While I also believe it to be perfectly safe on a bluestore cluster (especially since there's osd_scrub_auto_repair_num_errors if there's more wrong than your usual bit rot), we also don't run any cluster with this option at the moment. We had it enabled for some time before we backported the
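The two knobs this thread keeps referring to look like this in ceph.conf. A sketch of a config fragment; option names match the Luminous-era releases discussed here, and the values shown are the documented defaults except for the first, which is what enables the behavior:

```ini
[osd]
; repair inconsistencies found by scrub automatically...
osd scrub auto repair = true
; ...but bail out if a PG has more errors than this threshold
osd scrub auto repair num errors = 5
```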

Re: [ceph-users] Huge latency spikes

2018-11-17 Thread Kees Meijs
Hi Alex, What kind of clients do you use? Is it KVM (QEMU) using NBD driver, kernel, or...? Regards, Kees On 17-11-18 20:17, Alex Litvak wrote: > Hello everyone, > > I am trying to troubleshoot cluster exhibiting huge spikes of latency. > I cannot quite catch it because it happens during the

[ceph-users] Huge latency spikes

2018-11-17 Thread Alex Litvak
Hello everyone, I am trying to troubleshoot a cluster exhibiting huge spikes of latency. I cannot quite catch it because it happens during light activity and randomly affects one OSD node out of 3 in the pool. This is a FileStore cluster. I see some OSDs exhibit apply latency of 400 ms, 1
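One way to catch intermittent spikes like this is to sample `ceph osd perf` (which reports per-OSD commit and apply latency in milliseconds) and keep only the worst offenders. A sketch, written as a filter so the ranking logic is separate from the live command:

```shell
# Print the N OSDs with the highest apply latency from `ceph osd perf`
# output on stdin (columns: osd, commit_latency(ms), apply_latency(ms)).
worst_apply() {
  awk 'NR > 1' | sort -k3,3nr | head -n "${1:-5}"
}

# Live use, sampling every 5 seconds:
#   while sleep 5; do date +%T; ceph osd perf | worst_apply 5; done
# Sample run against captured output:
worst_apply 2 <<'EOF'
osd commit_latency(ms) apply_latency(ms)
0 1 2
7 3 412
12 2 38
EOF
# prints osd 7 (412 ms) then osd 12 (38 ms)
```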