All 3 nodes show this status for the SSD mirror. Controller cache is on for all 3.
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
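For reference, a sketch of how such output can be pulled, assuming the standard
MegaCli install path (it varies by distro):

  # Logical drive cache policy (WriteBack / BBU behavior):
  /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LAll -aAll | grep -i 'cache policy'
  # Per-drive (SSD) write cache setting, which is separate from the controller cache:
  /opt/MegaRAID/MegaCli/MegaCli64 -LDGetProp -DskCache -LAll -aAll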
On 11/18/2018 12:45 AM, Serkan Çoban wrote:
Is write cache enabled on the SSDs on all three servers? Can you check them?
On Sun, Nov 18, 2018 at 9:05 AM Alex Litvak wrote:
>
> Raid card for journal disks is Perc H730 (Megaraid), RAID 1, battery-backed
> cache is on
>
> Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Raid card for journal disks is Perc H730 (Megaraid), RAID 1, battery-backed
cache is on.
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
I have 2 other nodes with older Perc H710 and
>10ms w_await for an SSD is too much. How is that SSD connected to the system?
Is there a RAID card installed in this system? What is the RAID mode?
On Sun, Nov 18, 2018 at 8:25 AM Alex Litvak wrote:
>
> Here is another snapshot. I wonder if this write I/O wait is too big.
> Device: rrqm/s
Here is another snapshot. I wonder if this write I/O wait is too big.
Device:  rrqm/s  wrqm/s  r/s   w/s    rkB/s  wkB/s   avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
dm-14    0.00    0.00    0.00  23.00  0.00   336.00  29.22     0.34      14.74  0.00
I stand corrected: I looked at iostat for the device, but it was partitioned.
Here is a more accurate picture of what is going on now.
Device:  rrqm/s  wrqm/s  r/s   w/s    rkB/s  wkB/s   avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
dm-14    0.00    0.00    0.00
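For reference, a sketch of collecting these numbers against the whole device
rather than a partition (interval and count are arbitrary):

  # Extended per-device stats in kB, sampled every 5 seconds:
  iostat -xk 5
  # Map dm-14 back to a friendly LVM/multipath name:
  ls -l /dev/mapper/ | grep -w dm-14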
The iostat isn't very helpful because there are not many writes. I'd
recommend disabling C-states entirely; I'm not sure it's your problem, but
it's good practice, and if your cluster goes as idle as your iostat suggests,
it could be the culprit.
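A sketch of one common way to pin C-states, assuming an Intel machine booted
via GRUB (these are kernel parameters, so a reboot is required):

  # /etc/default/grub -- cap C-states via kernel parameters:
  #   GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=0 processor.max_cstate=1"
  grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL-family; Debian uses update-grub

  # Runtime alternative: hold /dev/cpu_dma_latency open with a 0 written to it;
  # C-states come back as soon as the descriptor is closed:
  exec 3<> /dev/cpu_dma_latency; echo -n 0 >&3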
Hello,
I am building a new cluster with 4 hosts, which have the following
configuration:
128 GB RAM
12 SATA HDDs, 8 TB, 7.2k rpm
2 SSDs, 240 GB
2x 10 Gb network
I will use the cluster to store RBD images of VMs; I plan to use 2x
replication if it does not get too slow.
My question is: Using
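For the stated 2x-replica RBD plan, a minimal pool sketch; the pool name and
PG count are placeholders to be sized for the actual OSD count:

  # Replicated pool for RBD images; 1024 PGs is only an example:
  ceph osd pool create rbd-vms 1024 1024 replicated
  ceph osd pool set rbd-vms size 2
  # min_size 1 keeps I/O flowing with one replica down, at some consistency risk:
  ceph osd pool set rbd-vms min_size 1
  ceph osd pool application enable rbd-vms rbd   # Luminous and later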
The plot thickens:
I checked C-states and apparently I am operating in C1 with all CPUs on.
The servers were tuned to use the latency-performance profile:
tuned-adm active
Current active profile: latency-performance
turbostat shows
Package  Core  CPU  Avg_MHz  %Busy  Bzy_MHz  TSC_MHz  SMI
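For reference, turbostat can sample a fixed window by wrapping a command:

  # One summary over a 10-second window:
  turbostat sleep 10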
You can check whether C-states are enabled with cat /proc/acpi/processor/info.
Look for "power management: yes/no".
If they are enabled, then you can check the current C-state of each core. 0
is the CPU's normal operating range; any other state means the processor is
in a power-saving mode. cat
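On newer kernels /proc/acpi/processor is often absent; the same information
lives in the cpuidle sysfs tree, e.g.:

  # C-states the kernel may enter on CPU 0, with entry counts:
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage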
John,
Thank you for the suggestions.
I looked into the journal SSDs. The drive is close to 3 years old, showing
5.17% wear (352941 GB written to disk, against a 3.6 PB endurance spec over
5 years). It could be that SMART is not telling the whole story, but that is
what I see.
Vendor Specific SMART Attributes with Thresholds:
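For drives behind a PERC/MegaRAID controller, smartctl needs the controller's
device ID; a sketch, with N taken from smartctl --scan:

  # SMART attributes of a drive behind the RAID controller:
  smartctl -d megaraid,N -a /dev/sda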
> Wido den Hollander :
> On 11/15/18 7:51 PM, koukou73gr wrote:
> > Are there any means to notify the administrator that an auto-repair has
> > taken place?
>
> I don't think so. You'll see the cluster go to HEALTH_ERR for a while
> before it turns to HEALTH_OK again after the PG has been repaired.
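Nothing is built in for this; a minimal polling sketch, assuming the ceph CLI
on the host and a working mail(1), that could be cron'd to notify the admin:

  #!/bin/sh
  # Alert whenever the cluster leaves HEALTH_OK (run periodically from cron).
  status=$(ceph health)
  case "$status" in
    HEALTH_OK*) : ;;                                   # quiet when healthy
    *) echo "$status" | mail -s "ceph: $status" admin@example.com ;;
  esac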
I'd take a look at C-states if it's only happening during periods of
low activity. If your journals are on SSDs, you should also check their
health. They may have exceeded their write endurance; high apply
latency is a tell-tale sign of this, and you'd see high iowait on those
disks.
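Apply and commit latency can be spot-checked per OSD from any node with admin
access, e.g.:

  # FileStore apply/commit latency in ms, one row per OSD:
  ceph osd perf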
I am evaluating BlueStore on a separate cluster. Unfortunately,
upgrading this one is out of the question at the moment for multiple
reasons. That is why I am trying to find a possible root cause.
On 11/17/2018 2:14 PM, Paul Emmerich wrote:
Are you running FileStore? (The config options you are using look
like a FileStore config.)
Try out BlueStore; we've found that it greatly reduces random latency
spikes caused by filesystem weirdness.
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
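For reference, a sketch of deploying a single BlueStore OSD with ceph-volume
(Luminous and later); the device paths are placeholders and the command wipes
whatever is on them:

  # Data on an HDD, RocksDB on an SSD partition (example paths):
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdk1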
I am using libvirt for block devices (OpenStack and Proxmox KVM VMs).
I am also mounting CephFS inside the VMs and on bare-metal hosts; in
that case it is a kernel-based client.
From what I can see based on pool stats, the CephFS pools have higher
utilization compared to the block pools during the
While I also believe it to be perfectly safe on a BlueStore cluster
(especially since there's osd_scrub_auto_repair_num_errors if there's
more wrong than your usual bit rot), we also don't run any cluster
with this option at the moment. We had it enabled for some time before
we backported the
Hi Alex,
What kind of clients do you use? Is it KVM (QEMU) using NBD driver,
kernel, or...?
Regards,
Kees
On 17-11-18 20:17, Alex Litvak wrote:
> Hello everyone,
>
> I am trying to troubleshoot a cluster exhibiting huge spikes of latency.
> I cannot quite catch it because it happens during the
Hello everyone,
I am trying to troubleshoot a cluster exhibiting huge spikes of latency.
I cannot quite catch it because it happens during light activity and
randomly affects one OSD node out of the 3 in the pool.
This is a FileStore cluster.
I see some OSDs exhibit an applied latency of 400 ms, 1