On 5/20/20 2:50 PM, Ulrich Windl wrote:
Hi!
I have a performance question regarding delay for reading blocks in a PV Xen VM.
First, a little background: originally, to monitor NFS outages, I developed a tool
"iotwatch" (short: IOTW) that reads the first block of a block device or file
(or anything you can open() and read() with Direct I/O). The tool samples the target at a
rather high rate (e.g. every 5s), keeping statistics that are queried at a lower rate
(e.g. every 5 min).
A wrapper around the tool is used as a monitoring plugin, and the output looks
like this:
/dev/sys/var: alpha=0.01, count=75(120/120), last=0.0011, avg=0.00423/0.00264/0\
.00427, min=0.00052(0.00052/0.00084), max=0.02465(0.02465/0.02062), variance=0.\
00005(0.00003)|last=0.0011;;;0 exp_avg=0.00427;;;0 emin=0.00084;;;0 emax=0.0206\
2;;;0 davg=0.00264;;;0 dstd_dev=0.00617;;;0
A short explanation of what these numbers mean:
"alpha" is the weight used for exponential averaging (e.g. for "exp_avg"). "count" is the number of samples since the last read and the
number of samples in the sampling queue (e.g. 120 valid samples out of a maximum of 120). "avg" is the average, "min" is the
minimum, "max" is the maximum, "variance" is what it says, and "last" is the last sampling value.
In the text output there are three numbers instead of just one, meaning (the
indicated value, the average of the value within the sampling queue, and the
exponentially averaged value). This is mostly for debugging. The performance
data output has just one of those values, selectable via command-line option.
Also, the statistics can be reset after being read (in this case they are), so
min and max will start anew...
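For reference, the exponential averaging that the "alpha" weight implies is the standard EWMA update (a sketch; the function name is mine, not iotwatch's):

```c
/* Exponentially weighted moving average update, as implied by "alpha":
 * the new sample contributes a fraction alpha, the history the rest.
 * With alpha = 0.01, a single 20ms spike moves a 4ms exp_avg by only
 * about 0.16ms, which is why exp_avg reacts slowly to outliers. */
double ewma_update(double exp_avg, double sample, double alpha)
{
    return alpha * sample + (1.0 - alpha) * exp_avg;
}
```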
OK, that was a rather long story before presenting the details:
A VM has its root disk on a mirrored LV (cLVM) presented as "phy:", and inside
the VM the disk is partitioned like this:
Device Boot Start End Sectors Size Id Type
/dev/xvdb1 * 2048 411647 409600 200M 83 Linux
/dev/xvdb2 411648 83886079 83474432 39.8G 5 Extended
/dev/xvdb5 413696 83886079 83472384 39.8G 8e Linux LVM
xvdb5 is a PV for the sys VG, like this:
opt sys -wi-ao---- 4.00g
root sys -wi-ao---- 8.00g
srv sys -wi-ao---- 4.00g
swap sys -wi-ao---- 2.00g
tmp sys -wi-ao---- 512.00m
var sys -wi-ao---- 6.00g
LV var is mounted on /var as ext3 (acl,user_xattr). The timing thread runs
with prio -80 (nice 0) under SCHED_RR, so I guess other processes won't disturb
the measurements much. I see no other threads using a real-time scheduling
policy in the VM; system tasks seem to run at prio 0 with some negative nice
value instead...
(On the Xen host, corosync, DLM and OCFS2 run with prio -2.)
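For context, pinning the timing thread to SCHED_RR at real-time priority 80 (ps reports real-time priorities as negative values, hence "prio -80") looks roughly like this on Linux (a sketch under those assumptions; requires CAP_SYS_NICE or root):

```c
#include <sched.h>

/* Put the calling process/thread under SCHED_RR at the given real-time
 * priority (1..99 on Linux). Returns 0 on success, -1 on error. */
int make_realtime(int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    /* pid 0 means "the caller"; needs CAP_SYS_NICE */
    return sched_setscheduler(0, SCHED_RR, &sp);
}
```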
Now the story: The performance of the root disk inside the VM (IOTW-PV) has a
typical read delay of less than 2ms with peaks below 40ms (a comparable local
disk on bare metal would have less than 0.2ms delay with peaks below 7ms).
However, when timing the var LV (IOTW_FS), the average is below 4ms with peaks
up to 80ms.
The storage system behind it is an FC-based 3PAR StoreServ with all SSDs, and the
service time for reads is (according to the storage system's own performance
monitor (SSMC)) significantly below 0.25ms over the same time interval.
So I wonder: How can LVM in the VM add another 40ms peak to the base timing?
The other thing that puzzles me is this: While the timing for the root disk is
basically good with very few peaks, the timing of the LV mainly shows three
levels: the first and most common level is good performance; the next level is
about 20ms more; and the third level consists of peaks another 20 or 40ms above that.
Is there any explanation for this? The VM is SLES12 SP5, while the Xen Host is
still SLES11 SP4.
At the moment I'm thinking about how to implement VM disks in a way that is efficient
while supporting live migration of VMs.
In the past we were using filesystem images stored on OCFS2, which itself was
put on a mirrored cLVM LV. Performance was rather poor, so I dropped the OCFS2
layer and created a separate LV for each VM. Unfortunately, mirroring all VM
images to different storage systems is an absolute requirement.
I'd be glad to get some insights.
Regards,
Ulrich

Hi Ulrich,
In your use case, the (clustered) LVM2 mirroring layer is a known
performance-sensitive concern. OCFS2 and a SLES12 SP5 VM should not be
performance concerns in your stack.
I do see an improvement in upgrading your host to SLES12 SP5 if possible. Then
you can move from clustered LVM2 mirroring to clustered MD RAID1, which is
intended to resolve the LVM2 mirroring performance concern. You can follow the
migration doc below; it should apply to SLES12 SP5 too.
https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-clvm.html#sec-ha-clvm-migrate
Cheers,
Roger
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/