On 5/20/20 2:50 PM, Ulrich Windl wrote:
Hi!

I have a performance question regarding delay for reading blocks in a PV Xen VM.
First, a little background: Originally to monitor NFS outages, I developed a tool 
"iotwatch" (short: IOTW) that reads the first block of a block device or file 
(or anything you can open() and read() with Direct I/O). The tool samples the target at a 
rather high rate (e.g. every 5 s), keeping statistics that are queried at a lower rate 
(e.g. every 5 min).
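Such a probe could be sketched roughly like this in Python (this is not the actual iotwatch code; the function name and the buffered-read fallback for filesystems that reject O_DIRECT are my assumptions):

```python
import mmap
import os
import time


def probe_latency(path, block_size=4096):
    """Time one direct-I/O read of the first block of `path`.

    O_DIRECT bypasses the page cache, so each call measures real device
    latency. Falls back to a buffered read where O_DIRECT is unsupported
    (e.g. tmpfs).
    """
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError:
        fd = os.open(path, os.O_RDONLY)  # fallback: cached read
    try:
        # O_DIRECT requires an aligned buffer; anonymous mmap regions
        # are page-aligned, which satisfies the usual 512/4096-byte rule.
        buf = mmap.mmap(-1, block_size)
        t0 = time.monotonic()
        os.preadv(fd, [buf], 0)  # read block 0 without moving the offset
        t1 = time.monotonic()
        return t1 - t0
    finally:
        os.close(fd)
```

In a sampler, this would be called every few seconds and the result fed into the running statistics.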

A wrapper around the tool is used as a monitoring plugin, and the output looks 
like this:
/dev/sys/var: alpha=0.01, count=75(120/120), last=0.0011, avg=0.00423/0.00264/0\
.00427, min=0.00052(0.00052/0.00084), max=0.02465(0.02465/0.02062), variance=0.\
00005(0.00003)|last=0.0011;;;0 exp_avg=0.00427;;;0 emin=0.00084;;;0 emax=0.0206\
2;;;0 davg=0.00264;;;0 dstd_dev=0.00617;;;0

A short explanation of what these numbers mean:
"alpha" is the weight used for exponential averaging (e.g. for "exp_avg"). "count" is the number of samples since the last read and the 
number of samples in the sampling queue (e.g. 120 valid samples out of a maximum of 120). The value "avg" is the average, "min" is the 
minimum, "max" is the maximum, "variance" is what it says, and "last" is the last sample value.
In the text output there are three numbers instead of just one, meaning (the 
indicated value, the average of the values within the sampling queue, and the 
exponentially averaged value). This is mostly for debugging. The performance 
data output has just one of those values, selectable via command-line option. 
Also, the statistics can be (and in this case are) reset after being read, so 
min and max will start anew...
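A minimal sketch of how such statistics could be kept (my own class and field names, not iotwatch's internals; the variance bookkeeping is omitted for brevity). Note that the exponential average deliberately survives a reset, while count/avg/min/max start anew, matching the behaviour described above:

```python
class ExpStats:
    """Running I/O-latency statistics with a resettable window.

    exp_avg follows the usual recurrence
        exp_avg = alpha * sample + (1 - alpha) * exp_avg
    and persists across read()/reset; avg/min/max/count do not.
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.exp_avg = None  # long-running, never reset
        self.reset()

    def reset(self):
        self.count = 0
        self.total = 0.0
        self.min = None
        self.max = None

    def add(self, sample):
        self.count += 1
        self.total += sample
        self.min = sample if self.min is None else min(self.min, sample)
        self.max = sample if self.max is None else max(self.max, sample)
        if self.exp_avg is None:
            self.exp_avg = sample  # seed with the first sample
        else:
            self.exp_avg = self.alpha * sample + (1 - self.alpha) * self.exp_avg

    def read(self):
        """Return current statistics and reset the windowed values."""
        avg = self.total / self.count if self.count else None
        out = {"count": self.count, "avg": avg, "min": self.min,
               "max": self.max, "exp_avg": self.exp_avg}
        self.reset()
        return out
```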

OK, that was a rather long story before presenting the details:

A VM has its root disk on a mirrored LV (cLVM) presented as "phy:", and inside 
the VM the disk is partitioned like this:
Device     Boot  Start      End  Sectors  Size Id Type
/dev/xvdb1 *      2048   411647   409600  200M 83 Linux
/dev/xvdb2      411648 83886079 83474432 39.8G  5 Extended
/dev/xvdb5      413696 83886079 83472384 39.8G 8e Linux LVM

xvdb5 is a PV for the sys VG, like this:
   opt  sys -wi-ao----   4.00g
   root sys -wi-ao----   8.00g
   srv  sys -wi-ao----   4.00g
   swap sys -wi-ao----   2.00g
   tmp  sys -wi-ao---- 512.00m
   var  sys -wi-ao----   6.00g

LV var is mounted on /var as ext3 (acl,user_xattr). The timing thread runs 
with prio -80 (nice 0) at SCHED_RR, so I guess other processes won't disturb 
the measurements much. I see no other threads using a real-time scheduling 
policy in the VM; system tasks seem to run at prio 0 with some negative nice 
value instead...
(On the Xen host, corosync, DLM and OCFS2 run with prio -2.)
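For reference, putting a measurement thread on SCHED_RR as described above could look roughly like this (a sketch; the "prio -80" shown by ps corresponds to a real-time priority of 80, and the call needs CAP_SYS_NICE or root, hence the fallback):

```python
import os


def make_realtime(rtprio=80):
    """Try to switch the calling thread to SCHED_RR at `rtprio` so that
    ordinary (SCHED_OTHER) processes cannot preempt the timing loop.
    Returns True on success, False if we lack the privilege."""
    # Clamp to what the kernel allows for SCHED_RR (usually 1..99).
    rtprio = min(rtprio, os.sched_get_priority_max(os.SCHED_RR))
    try:
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(rtprio))
        return True
    except PermissionError:
        return False
```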

Now the story: The root disk inside the VM (IOTW-PV) has a 
typical read delay of less than 2 ms with peaks below 40 ms (a comparable 
bare-metal local disk would have less than 0.2 ms delay with peaks below 7 ms). 
However, when timing the var LV (IOTW_FS), the average is below 4 ms with peaks 
up to 80 ms.

The storage system behind it is an FC-based 3PAR StoreServ with all SSDs, and the 
service time for reads is (according to the storage system's own performance 
monitor (SSMC)) significantly below 0.25 ms over the same time interval.

So I wonder: How can LVM in the VM add another 40 ms peak on top of the base timing? 
The other thing that puzzles me is this: While the timing for the root disk is 
basically good with very few peaks, the timing of the LV has mainly three 
levels: The first, most common level is good performance; the next level is about 
20 ms more; and the third level shows peaks of another 20 or 40 ms.

Is there any explanation for this? The VM is SLES12 SP5, while the Xen Host is 
still SLES11 SP4.

At the moment I'm thinking about how to implement VM disks in a way that is efficient 
while supporting live migration of VMs.
In the past we were using filesystem images stored in OCFS2, which itself was 
put in a mirrored cLVM LV. Performance was rather poor, so I dropped the OCFS2 
layer and created a separate LV for each VM. Unfortunately, mirroring all VM 
images to different storage systems is an absolute requirement.


Hi Ulrich,

In your use case, the (clustered) LVM2 mirroring layer is a known performance concern. OCFS2 and the SLES12 SP5 VM should not be a performance concern in your stack.

I do see a benefit in upgrading your host to SLES12 SP5 if possible. Then you can move from clustered LVM2 mirroring to clustered MD RAID1, which is intended to resolve the LVM2 mirroring performance concern. You can try the following migration doc. It should apply to SLES12 SP5 too.

https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-clvm.html#sec-ha-clvm-migrate

Cheers,
Roger



I'd be glad to get some insights.

Regards,
Ulrich




_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
