Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 06/09/2018 18:23, Hans van Kranenburg wrote:
> Anyway, I think the future-proof solution here is to have clear
> documentation about how to configure related settings, instead of trying
> to find values that suit all users and that are not ridiculously high.

I just assisted a user in #xen on freenode with this exact issue again. The user had already been through three maintenance windows in which they tried to upgrade a domU with quite a number of big disks and vcpus from Jessie to Stretch, failing every time with random symptoms: disks don't work, the network does not ping. They had already spent quite some hours searching for solutions.

This reminded me of something else, which is better error logging when the issue happens. That is an upstream thing to fix, I guess, if possible. As soon as there's a useful error message in the logs or on the console of the domU, the user has something specific to search for on ze interwebz.

Hans
Bug#880554: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 02/28/2018 08:54 AM, Valentin Vidic wrote:
> On Tue, Feb 27, 2018 at 08:22:50PM +0100, Valentin Vidic wrote:
>> Since I can't reproduce it easily anymore I suspect something was
>> fixed in the meanwhile. My original report was for 4.9.30-2+deb9u2
>> and since then there seems to be a number of fixes that could be
>> related to this:
>
> Just rebooted both dom0 and domU with 4.9.30-2+deb9u2 and the
> postgresql domU is having problems right away after boot:
>
> domid=1: nr_frames=32, max_nr_frames=32
>
> [  242.652100] INFO: task kworker/u90:0:6 blocked for more than 120 seconds.
>
> Upgrading the kernels, I can't get it above 11 anymore:
>
> domid=1: nr_frames=11, max_nr_frames=32
>
> So some of those many kernel fixes did the trick and things just
> work fine with the newer kernels without raising gnttab_max_frames.

During my testing I also couldn't quickly cause the nr_frames exhaustion to happen with block devices, but I still can with a decent number of network interfaces inside the domU.

Anyway, I think the future-proof solution here is to have clear documentation about how to configure the related settings, instead of trying to find values that suit all users and that are not ridiculously high.

In Xen 4.10/4.11 the settings changed, by the way. The default for the dom0 is 64 now, and the default for domUs can be set in xl.conf (where it is still 32); I have it at max_grant_frames=64 currently. It can also be set per domU, but I prefer setting it system-wide. There's still a Xen command line option for this, which sets the dom0 value and, iirc, determines the upper limit for the xl.conf option. Oh, and the setting for a domU can also be changed while it's running. Mind blown.

So yeah, it's a bit complicated: something like 4 to 6 knobs which all need to be turned in the right direction, instead of just the single old option. I only don't know where to put the info pointing the user at the right places to configure this. NEWS.Debian? Somewhere else?
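For concreteness, a sketch of where the xl-side knobs mentioned above live on a Xen 4.10+ system. Treat the exact file paths as assumptions and check xl.conf(5) and xl.cfg(5) before relying on them:

```shell
# System-wide default for new domUs, in /etc/xen/xl.conf (Xen 4.10+):
#   max_grant_frames=64
#
# Per-domU override, in the individual domU's xl config file:
#   max_grant_frames=64
```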
There is reference documentation about this in the man pages, but I don't think there's a tutorial/howto kind of documentation. Hans
Bug#880554: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Tue, Feb 27, 2018 at 08:22:50PM +0100, Valentin Vidic wrote:
> Since I can't reproduce it easily anymore I suspect something was
> fixed in the meanwhile. My original report was for 4.9.30-2+deb9u2
> and since then there seems to be a number of fixes that could be
> related to this:

Just rebooted both dom0 and domU with 4.9.30-2+deb9u2 and the postgresql domU is having problems right away after boot:

domid=1: nr_frames=32, max_nr_frames=32

[  242.652100] INFO: task kworker/u90:0:6 blocked for more than 120 seconds.

Upgrading the kernels, I can't get it above 11 anymore:

domid=1: nr_frames=11, max_nr_frames=32

So some of those many kernel fixes did the trick, and things just work fine with the newer kernels without raising gnttab_max_frames.

-- Valentin
Bug#880554: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
I much appreciate the effort you all put in, and I like the idea of shipping the xen-diag tool, plus maybe a hint somewhere about the issues that occurred and the possible solution of raising max_nr_frames.

On 27.02.2018 17:05, Hans van Kranenburg wrote:
> ad 1. Christian, Valentin, can you give more specific info that can help
> someone else to set up a test environment to trigger > 32 values.

As this isn't my own system, but a production system of one of my customers, I'm really reluctant to use it for invasive testing. Just for a recap: the issue hit me with kernel 4.9.51-1.

hardware:
- xeon E5-2620 v4
- board supermicro X10SRi-F
- 32gb ecc ram
- two 10tb server disks
- two I350 network adapters (onboard)

dom0:
- debian stretch (up to date), kernel 4.9.51-1, xen-hypervisor 4.8.1-1+deb9u3
- the two network adapters as a bond in a bridge
- the disks: gpt, 4 partitions (1M, 256M esp, 256M md mirror for boot, rest as md mirror for lvm)

domU:
- memory: 8192, 2 vcpus
- uses a network interface on the bridge
- 16 lvm volumes as phys devices
- debian stretch
- issue with both kernel versions: 4.9.30-2+deb9u5 and 4.9.51-1
- system runs mostly some smb, some web services, cal/card dav, psql, ldap, postfix, cyrus ...

In my early tests, before the issue was discussed here, I tried linux-image-4.13.0-0.bpo.1-amd64 and the system stayed stable for a week.

Oh, and it's worth mentioning that I tried thin lvm in the beginning, but dropped it due to (write) performance and boot issues (the thinpool was always inactive after boot and took about 5-10 minutes to activate once there were about 4TB of data in it).

Currently the system is running stable with max_nr_frames=256 (I wanted to be on the safe side) and kernel 4.9.65-3+deb9u2. Maybe I can try to get some values with the xen-diag Valentin provided to see the current state of the system, but I'm really busy at the moment, job-wise and privately; I hope next week gets better (had some bad luck with our water installation - much mopping).
Bug#880554: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Tue, Feb 27, 2018 at 05:05:06PM +0100, Hans van Kranenburg wrote:
> ad 1. Christian, Valentin, can you give more specific info that can help
> someone else to set up a test environment to trigger > 32 values.

I can't touch the original VM that had this issue, and I tried to reproduce on another host with recent stretch kernels, but without success. The maximum number I can get now is nr_frames=11.

Another detail I forgot to mention before is that my VMs were using DRBD disks. Since DRBD acts like a slow disk, it could cause IO requests to pile up and hit the limit faster.

Since I can't reproduce it easily anymore, I suspect something was fixed in the meanwhile. My original report was for 4.9.30-2+deb9u2, and since then there have been a number of fixes that could be related to this:

linux (4.9.65-3) stretch; urgency=medium
  * xen/time: do not decrease steal time after live migration on xen

linux (4.9.65-1) stretch; urgency=medium
  - swiotlb-xen: implement xen_swiotlb_dma_mmap callback
  - xen-netback: Use GFP_ATOMIC to allocate hash
  - xen/gntdev: avoid out of bounds access in case of partial gntdev_mmap()
  - xen/manage: correct return value check on xenbus_scanf()
  - xen: don't print error message in case of missing Xenstore entry
  - xen/netback: set default upper limit of tx/rx queues to 8

linux (4.9.47-1) stretch; urgency=medium
  - nvme: use blk_mq_start_hw_queues() in nvme_kill_queues()
  - nvme: avoid to use blk_mq_abort_requeue_list()
  - efi: Don't issue error message when booted under Xen
  - xen/privcmd: Support correctly 64KB page granularity when mapping memory
  - xen/blkback: fix disconnect while I/Os in flight
  - xen/blkback: don't use xen_blkif_get() in xen-blkback kthread
  - xen/blkback: don't free be structure too early
  - xen-netback: fix memory leaks on XenBus disconnect
  - xen-netback: protect resource cleaning on XenBus disconnect
  - swiotlb-xen: update dev_addr after swapping pages
  - xen-netfront: Fix Rx stall during network stress and OOM
  - [x86] mm: Fix flush_tlb_page() on Xen
  - xen-netfront: Rework the fix for Rx stall during OOM and network stress
  - xen/scsiback: Fix a TMR related use-after-free
  - [x86] xen: allow userspace access during hypercalls
  - [armhf] Xen: Zero reserved fields of xatp before making hypervisor call
  - xen-netback: correctly schedule rate-limited queues
  - nbd: blk_mq_init_queue returns an error code on failure, not NULL
  - xen: fix bio vec merging (CVE-2017-12134) (Closes: #866511)
  - blk-mq-pci: add a fallback when pci_irq_get_affinity returns NULL
  - xen-blkfront: use a right index when checking requests

linux (4.9.30-2+deb9u4) stretch-security; urgency=high
  * xen: fix bio vec merging (CVE-2017-12134) (Closes: #866511)

linux (4.9.30-2+deb9u3) stretch-security; urgency=high
  * xen-blkback: don't leak stack data via response ring (CVE-2017-10911)
  * mqueue: fix a use-after-free in sys_mq_notify() (CVE-2017-11176)

In fact, the original big VM with this problem runs happily with:

domid=1: nr_frames=11, max_nr_frames=256

so it is quite possible that raising the limit is not needed anymore with the latest stretch kernels. If no-one else can reproduce this anymore, I suggest you close the issue but include the xen-diag tool in the updated package. That way, if someone reports the problem again, it should be easy to detect.

-- Valentin
Bug#880554: [Pkg-xen-devel] Bug#880554: Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 02/27/2018 05:05 PM, Hans van Kranenburg wrote:
> [...]
>
> ...I doubt if it's useful (priority wise) to keep spending a lot of time
> on this, since the work is really time consuming.

It is, but it's also an interesting problem.

An idle, just-started domU starts at nr_frames=6 or 7 in all cases.

Same test as before, 64 vcpus, 10 disks, trying to hit as many vcpu/disk combinations as possible:

1. With the new modprobe limits applied:

3.16.51-3+deb8u1  -> nr_frames=25
4.9.30-2+deb9u5   -> nr_frames=24
4.9.51-1          -> nr_frames=25
4.14.13-1~bpo9+1  -> nr_frames=23

2. Rebooting dom0, removing the limits:

3.16.51-3+deb8u1  -> nr_frames=25
4.9.30-2+deb9u5   -> nr_frames=25
4.9.51-1          -> nr_frames=24
4.14.13-1~bpo9+1  -> nr_frames=46  <-- Well, there it is.

However, I can not, I repeat, not, see a difference between 4.9.30-2+deb9u5 and 4.9.51-1, the versions the very first message in this bug was reported with.

1. If you're running into the problem with a 4.9 stretch domU kernel, you're likely hitting the limits in the same way I already hit them some 10 years ago, simply by having quite a number of vcpus, vbds or vifs.

2. If you're upgrading a domU to the stretch-backports kernel, you're suddenly much more likely to bump into the limit.

So: For 1. the solution is for the user to change the boot parameter, or for us to reconsider patching DEFAULT_MAX_NR_GRANT_FRAMES 32 to something else (xen/include/xen/grant_table.h), but that would require another round of testing to see if it does what we think it does. I vote no.

To accommodate 2., the better option is to ship the modprobe config for 4.8, since running stretch-backports is a valid 'normal' use case. I vote yes.

Ian, up to you to make a final decision.

kthxbye,
Hans
Bug#880554: [Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 02/27/2018 12:40 AM, Hans van Kranenburg wrote:
> [...]
>
> But, the main thing I wanted to test is if the change would result in a
> much lower total amount of grants, which is not the case.

So,
* I couldn't reproduce a number > 32
* The proposed fix doesn't help.

There are two scenarios which could be happening:
1. Bug reporters are running a really exceptional sizing and workload.
2. "It's on fire and we don't know how big the fire is" (quote Ian)

ad 1. Christian, Valentin, can you give more specific info that can help someone else set up a test environment that triggers > 32 values?

ad 2. e.g. how many users run into this and do not report it, don't understand it, switch to KVM and tell their friends that Xen is just unstable and crashes?

OTOH: Since...
* this problem has already been fixed in newer Xen in a different way
* there's a sufficient workaround now (setting max frames)
...I doubt it's useful (priority-wise) to keep spending a lot of time on this, since the work is really time consuming.

Hans
Bug#880554: [Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 02/26/2018 07:35 PM, Hans van Kranenburg wrote:
> On 02/26/2018 03:52 PM, Ian Jackson wrote:
>> Christian Schwamborn writes ("Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64"):
>>> I can try, but the only system I can really test this on is a production
>>> system, as this one 'reliably' shows the issue (and I don't want to crash
>>> it on purpose on a regular basis). Since I set gnttab_max_frame to a
>>> higher value it runs smoothly. If you're confident this will work I can
>>> try this in the evening, when all users are logged off.
>>
>> Thanks. I understand your reluctance. I don't want to mislead you.
>> I think the odds of it working are probably ~75%.
>>
>> Unless you want to tolerate that risk, it might be better for us to
>> try to come up with a better way to test it.
>
> I can try this.
>
> I can run a dom0 with Xen 4.8 and a 4.9 domU, I already have the xen-diag
> for it (so confirmed the patch in this bug report builds ok, we should
> include it for stretch, it's really useful).
>
> I think it's mainly trying to get a domU running with various
> combinations of domU kernel, number of disks and vcpus, and then look at
> the output of xen-diag.

Ok, I spent some time trying things.

Xen: 4.8.3+comet2+shim4.10.0+comet3-1+deb9u4.1
dom0 kernel: 4.9.65-3+deb9u2
domU (PV) kernel: 4.9.82-1+deb9u2

Observation so far: nr_frames increases as soon as a combination of disk+vcpu has actually been doing disk activity, and then never decreases.

I ended up with a 64-vcpu domU with 10 additional 1GiB disks (xvdc, xvdd, etc). I created an ext4 fs on the disks and mounted them. I used fio to throw some IO at the disks, trying to hit as many combinations of vcpu and disk as possible:

[things]
rw=randwrite
rwmixread=75
size=8M
directory=/mnt/xvdBLAH
ioengine=libaio
direct=1
iodepth=16
numjobs=64

with BLAH replaced by c, d, e, f etc...
-# rm */things*; for i in c d e f g h i j k l; do fio fio-xvd$i; done
-# while true; do /usr/lib/xen-4.8/bin/xen-diag gnttab_query_size 2; sleep 10; done
domid=2: nr_frames=6, max_nr_frames=128
domid=2: nr_frames=7, max_nr_frames=128
domid=2: nr_frames=7, max_nr_frames=128
domid=2: nr_frames=10, max_nr_frames=128
domid=2: nr_frames=10, max_nr_frames=128
domid=2: nr_frames=11, max_nr_frames=128
domid=2: nr_frames=13, max_nr_frames=128
domid=2: nr_frames=14, max_nr_frames=128
domid=2: nr_frames=15, max_nr_frames=128
domid=2: nr_frames=16, max_nr_frames=128
domid=2: nr_frames=18, max_nr_frames=128
domid=2: nr_frames=18, max_nr_frames=128
domid=2: nr_frames=19, max_nr_frames=128
domid=2: nr_frames=21, max_nr_frames=128
domid=2: nr_frames=21, max_nr_frames=128
domid=2: nr_frames=23, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128

So I can push it up to about 24 when doing this.

-# grep . /sys/module/xen_blkback/parameters/*
/sys/module/xen_blkback/parameters/log_stats:0
/sys/module/xen_blkback/parameters/max_buffer_pages:1024
/sys/module/xen_blkback/parameters/max_persistent_grants:1056
/sys/module/xen_blkback/parameters/max_queues:4
/sys/module/xen_blkback/parameters/max_ring_page_order:4

Now, I rebooted my test dom0 and put the modprobe file in place. (Note: the filename has to end in .conf!)

-# grep . /sys/module/xen_blkback/parameters/*
/sys/module/xen_blkback/parameters/log_stats:0
/sys/module/xen_blkback/parameters/max_buffer_pages:1024
/sys/module/xen_blkback/parameters/max_persistent_grants:1056
/sys/module/xen_blkback/parameters/max_queues:1
/sys/module/xen_blkback/parameters/max_ring_page_order:0

After doing the same tests, the result ends up being exactly 24 again. So, the modprobe settings don't seem to make any difference.
-# tree /sys/block/xvda/mq
/sys/block/xvda/mq
└── 0
    ├── active
    ├── cpu0
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
    ├── cpu1
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
[...]
    ├── cpu63
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
[...]
    ├── cpu_list
    ├── dispatched
    ├── io_poll
    ├── pending
    ├── queued
    ├── run
    └── tags

65 directories, 264 files

Mwooop mwooop mwoop mwo (failure trombone).

It obviously didn't involve network traffic yet. And, it's all stretch kernels etc., which are reported to already be problematic. But the main thing I wanted to test was whether the change would result in a much lower total amount of grants, which is not the case.

So, does anyone have a better idea, or should we just add some clear documentation for the max frames setting in the grub config example?

Hans
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 02/26/2018 03:52 PM, Ian Jackson wrote:
> Christian Schwamborn writes ("Re: Bug#880554: xen domu freezes with kernel
> linux-image-4.9.0-4-amd64"):
>> I can try, but the only system I can really test this on is a production
>> system, as this one 'reliably' shows the issue (and I don't want to crash
>> it on purpose on a regular basis). Since I set gnttab_max_frame to a
>> higher value it runs smoothly. If you're confident this will work I can
>> try this in the evening, when all users are logged off.
>
> Thanks. I understand your reluctance. I don't want to mislead you.
> I think the odds of it working are probably ~75%.
>
> Unless you want to tolerate that risk, it might be better for us to
> try to come up with a better way to test it.

I can try this.

I can run a dom0 with Xen 4.8 and a 4.9 domU; I already have the xen-diag for it (so I confirmed the patch in this bug report builds ok; we should include it for stretch, it's really useful).

I think it's mainly a matter of getting a domU running with various combinations of domU kernel, number of disks and vcpus, and then looking at the output of xen-diag.

Hans
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Christian Schwamborn writes ("Re: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64"):
> I can try, but the only system I can really test this on is a production
> system, as this one 'reliably' shows the issue (and I don't want to crash
> it on purpose on a regular basis). Since I set gnttab_max_frame to a
> higher value it runs smoothly. If you're confident this will work I can
> try this in the evening, when all users are logged off.

Thanks. I understand your reluctance. I don't want to mislead you. I think the odds of it working are probably ~75%.

Unless you want to tolerate that risk, it might be better for us to try to come up with a better way to test it.

Ian.
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi Hans,

I can try, but the only system I can really test this on is a production system, as this one 'reliably' shows the issue (and I don't want to crash it on purpose on a regular basis). Since I set gnttab_max_frame to a higher value it runs smoothly. If you're confident this will work I can try this in the evening, when all users are logged off.

Best regards,
Christian

On 23.02.2018 16:18, Hans van Kranenburg wrote:
> Hi Valentin, Christian,
>
> Finally getting back to you about the max grant frames issue. We
> discussed this with upstream Xen developers, and a different fix was
> proposed. I would really appreciate it if you could test it and confirm
> that it also solves the issue. Testing does not involve recompiling the
> hypervisor with patches etc. The deadline for changes for the 9.4
> Stretch point release is the end of next week, so we aim to get it in
> then.
>
> The cause of the problem is, as discussed earlier, the "blkback
> multipage ring" changes a.k.a. the "multi-queue xen blk driver", which
> eats grant frame resources way too fast. As shown in the reports, this
> issue already exists when using the normal stretch kernel (not only
> newer backports) in combination with Xen 4.8.
>
> The upstream change we found earlier that doubles the max number to 64
> is part of a bigger change that touches more of the inner workings,
> making Xen better able to handle the domU kernel behavior. This whole
> change is not going to be backported to Xen 4.8.
>
> Can you please test the following, instead of setting the
> gnttab_max_frames value:
>
> Create the file /etc/modprobe.d/xen-blkback-fewer-gnttab-frames with
> contents...
>
> # apropos of #880554
> # workaround is not required for Xen 4.9 and later
> options xen_blkback max_ring_page_order=0
> options xen_blkback max_queues=1
>
> ...and reboot. This will cause the domU kernels to behave more in a way
> that Xen 4.8 can cope with.
>
> Regards,
> Hans
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi Valentin, Christian,

Finally getting back to you about the max grant frames issue. We discussed this with upstream Xen developers, and a different fix was proposed. I would really appreciate it if you could test it and confirm that it also solves the issue. Testing does not involve recompiling the hypervisor with patches etc. The deadline for changes for the 9.4 Stretch point release is the end of next week, so we aim to get it in then.

The cause of the problem is, as discussed earlier, the "blkback multipage ring" changes a.k.a. the "multi-queue xen blk driver", which eats grant frame resources way too fast. As shown in the reports, this issue already exists when using the normal stretch kernel (not only newer backports) in combination with Xen 4.8.

The upstream change we found earlier that doubles the max number to 64 is part of a bigger change that touches more of the inner workings, making Xen better able to handle the domU kernel behavior. This whole change is not going to be backported to Xen 4.8.

Can you please test the following, instead of setting the gnttab_max_frames value:

Create the file /etc/modprobe.d/xen-blkback-fewer-gnttab-frames with contents...

# apropos of #880554
# workaround is not required for Xen 4.9 and later
options xen_blkback max_ring_page_order=0
options xen_blkback max_queues=1

...and reboot. This will cause the domU kernels to behave more in a way that Xen 4.8 can cope with.

Regards,
Hans
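A minimal shell sketch of the step above. It writes to a temp directory by default so it can run anywhere; on a real dom0 the target would be /etc/modprobe.d. Note that modprobe only reads files whose name ends in .conf, a pitfall that came up later in this thread:

```shell
# Hypothetical helper: create the workaround file under $dest.
# On a real dom0, set dest=/etc/modprobe.d before running.
dest="${dest:-$(mktemp -d)}"
cat > "$dest/xen-blkback-fewer-gnttab-frames.conf" <<'EOF'
# apropos of #880554
# workaround is not required for Xen 4.9 and later
options xen_blkback max_ring_page_order=0
options xen_blkback max_queues=1
EOF
# Count the option lines we just wrote, as a sanity check.
grep -c '^options xen_blkback' "$dest/xen-blkback-fewer-gnttab-frames.conf"  # → 2
```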
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Mon, Jan 15, 2018 at 11:12:03AM +0100, Christian Schwamborn wrote:
> Is there an easy way to get/monitor the used 'grants' frames? As I understand
> it, the xen-diag tool you mentioned doesn't compile in xen 4.8?

Here is a status from another host:

domid=0: nr_frames=4, max_nr_frames=256
domid=487: nr_frames=6, max_nr_frames=256
domid=488: nr_frames=5, max_nr_frames=256
domid=489: nr_frames=4, max_nr_frames=256
domid=490: nr_frames=6, max_nr_frames=256
domid=491: nr_frames=7, max_nr_frames=256
domid=492: nr_frames=4, max_nr_frames=256
domid=493: nr_frames=4, max_nr_frames=256
domid=494: nr_frames=29, max_nr_frames=256
domid=495: nr_frames=4, max_nr_frames=256
domid=496: nr_frames=4, max_nr_frames=256
domid=497: nr_frames=5, max_nr_frames=256
domid=498: nr_frames=4, max_nr_frames=256
domid=499: nr_frames=4, max_nr_frames=256
domid=500: nr_frames=4, max_nr_frames=256
domid=501: nr_frames=4, max_nr_frames=256
domid=503: nr_frames=5, max_nr_frames=256
domid=572: nr_frames=13, max_nr_frames=256
domid=575: nr_frames=7, max_nr_frames=256

Most of the hosts have older kernels and nr_frames < 10. And then 494 has a stretch kernel and only 4 vcpus, but is quite close to the current default of 32. Maybe it just depends on the amount of disk IO?

-- Valentin
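A status dump in this format lends itself to a small filter that flags domains approaching their limit. A hedged sketch: the awk field splitting assumes exactly the "domid=N: nr_frames=X, max_nr_frames=Y" output shown in this thread, and the warning threshold of 2 frames is an arbitrary choice:

```shell
# Flag any domain whose nr_frames is within 2 of max_nr_frames.
# Pipe in xen-diag gnttab_query_size output, one line per domain.
check_frames() {
  awk -F'[=, ]+' '{
    dom = $2; sub(/:$/, "", dom)        # "494:" -> "494"
    nr = $4 + 0; max = $6 + 0
    if (nr >= max - 2)
      printf "WARNING: domid=%s at %d/%d grant frames\n", dom, nr, max
  }'
}

check_frames <<'EOF'
domid=494: nr_frames=29, max_nr_frames=256
domid=1: nr_frames=32, max_nr_frames=32
EOF
# prints: WARNING: domid=1 at 32/32 grant frames
```

On a dom0 this could run from cron over all domains, e.g. feeding it one xen-diag gnttab_query_size call per domid from xl list.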
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Mon, Jan 15, 2018 at 11:12:03AM +0100, Christian Schwamborn wrote:
> Is there an easy way to get/monitor the used 'grants' frames? As I understand
> it, the xen-diag tool you mentioned doesn't compile in xen 4.8?

I just gave it another try, and after modifying xen-diag.c a bit to work with 4.8, here is what I get:

# ./xen-diag gnttab_query_size 0
domid=0: nr_frames=4, max_nr_frames=256
# ./xen-diag gnttab_query_size 1
domid=1: nr_frames=11, max_nr_frames=256
# ./xen-diag gnttab_query_size 0
domid=0: nr_frames=4, max_nr_frames=256
# ./xen-diag gnttab_query_size 1
domid=1: nr_frames=11, max_nr_frames=256
# ./xen-diag gnttab_query_size 5
domid=5: nr_frames=11, max_nr_frames=256

so currently at 11, not high at all. Attaching a patch for the stretch xen package if you want to check your hosts.

-- Valentin

--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -31,6 +31,7 @@
 INSTALL_SBIN += xenpm
 INSTALL_SBIN += xenwatchdogd
 INSTALL_SBIN += xen-livepatch
+INSTALL_SBIN += xen-diag
 INSTALL_SBIN += $(INSTALL_SBIN-y)

 # Everything to be installed in a private bin/
@@ -98,6 +99,9 @@
 xen-livepatch: xen-livepatch.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)

+xen-diag: xen-diag.o
+	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
+
 xen-lowmemd: xen-lowmemd.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenevtchn) $(LDLIBS_libxenctrl) $(LDLIBS_libxenstore) $(APPEND_LDFLAGS)

--- /dev/null
+++ b/tools/misc/xen-diag.c
@@ -0,0 +1,129 @@
+/*
+ * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+static xc_interface *xch;
+
+#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))
+
+void show_help(void)
+{
+    fprintf(stderr,
+            "xen-diag: xen diagnostic utility\n"
+            "Usage: xen-diag command [args]\n"
+            "Commands:\n"
+            "  help                display this help\n"
+            "  gnttab_query_size   dump the current and max grant frames for <domid>\n");
+}
+
+/* wrapper function */
+static int help_func(int argc, char *argv[])
+{
+    show_help();
+    return 0;
+}
+
+static int gnttab_query_size_func(int argc, char *argv[])
+{
+    int domid, rc = 1;
+    struct gnttab_query_size query;
+
+    if ( argc != 1 )
+    {
+        show_help();
+        return rc;
+    }
+
+    domid = strtol(argv[0], NULL, 10);
+    query.dom = domid;
+    rc = xc_gnttab_op(xch, GNTTABOP_query_size, &query, sizeof(query), 1);
+
+    if ( rc == 0 && (query.status == GNTST_okay) )
+        printf("domid=%d: nr_frames=%d, max_nr_frames=%d\n",
+               query.dom, query.nr_frames, query.max_nr_frames);
+
+    return rc == 0 && (query.status == GNTST_okay) ? 0 : 1;
+}
+
+struct {
+    const char *name;
+    int (*function)(int argc, char *argv[]);
+} main_options[] = {
+    { "help", help_func },
+    { "gnttab_query_size", gnttab_query_size_func},
+};
+
+int main(int argc, char *argv[])
+{
+    int ret, i;
+
+    /*
+     * Set stdout to be unbuffered to avoid having to fflush when
+     * printing without a newline.
+     */
+    setvbuf(stdout, NULL, _IONBF, 0);
+
+    if ( argc <= 1 )
+    {
+        show_help();
+        return 0;
+    }
+
+    for ( i = 0; i < ARRAY_SIZE(main_options); i++ )
+        if ( !strncmp(main_options[i].name, argv[1], strlen(argv[1])) )
+            break;
+
+    if ( i == ARRAY_SIZE(main_options) )
+    {
+        show_help();
+        return 0;
+    }
+    else
+    {
+        xch = xc_interface_open(0, 0, 0);
+        if ( !xch )
+        {
+            fprintf(stderr, "failed to get the handler\n");
+            return 0;
+        }
+
+        ret = main_options[i].function(argc - 2, argv + 2);
+
+        xc_interface_close(xch);
+    }
+
+    /*
+     * Exitcode 0 for success.
+     * Exitcode 1 for an error.
+     * Exitcode 2 if the operation should be retried for any reason (e.g. a
+     * timeout or because another operation was in progress).
+     */
+
+#define EXIT_TIMEOUT (EXIT_FAILURE + 1)
+
+    BUILD_BUG_ON(EXIT_SUCCESS != 0);
+    BUILD_BUG_ON(EXIT_FAILURE != 1);
+    BUILD_BUG_ON(EXIT_TIMEOUT != 2);
+
+    switch ( ret )
+    {
+    case 0:
+        return EXIT_SUCCESS;
+    case EAGAIN:
+    case EBUSY:
+        return EXIT_TIMEOUT;
+    default:
+        return EXIT_FAILURE;
+    }
+}
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi Hans and Valentin,

first of all: thanks for your help and explanations, that is very helpful. I was on vacation last week and couldn't answer right away.

On 07.01.2018 19:36, Hans van Kranenburg wrote:
> If this is something users are going to run into while not doing more
> unusual things like having dozens of vcpus or network interfaces, then
> changing the default could prevent hours of frustration and debugging
> for them.

As a reference: dom0 is stretch.

0 root@zero:~# xl list
Name            ID   Mem VCPUs  State  Time(s)
Domain-0         0  1961     2  r-     407972.8
xaver-jessie    10  2048     2  -b     177520.8
ustrich-jessie  12  2048     2  -b       8555.9
ourea-stretch   14  8192     2  -b     167352.7
arriba          17  4096     2  -b       5108.3

All domUs have one network interface on a bridge.

xaver-jessie has 5 block devices (phys, lvm)
ustrich-jessie has 4 block devices (phys, lvm)
ourea-stretch has 16 block devices (phys, lvm)
arriba has just one (phys, lvm) and is an hvm windows system

As you can see, nothing crazy with lots of vcpus or network interfaces. The crashing (freezing) domU was ourea-stretch, which is the one with the most load (smb, some web services, cal/card dav, psql, ldap, postfix, cyrus ...). As mentioned, the freezes stopped after using the backports kernel; nothing else changed. I was desperate at that time to get this newly installed system to work, and frankly stopped all planned updates to stretch on other systems at that point until I know what is going on.

Is there an easy way to get/monitor the used 'grant' frames? As I understand it, the xen-diag tool you mentioned doesn't compile on xen 4.8?

Christian
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 01/12/2018 12:43 PM, Valentin Vidic wrote: > On Fri, Jan 12, 2018 at 01:34:10AM +0100, Hans van Kranenburg wrote: >> Is the 59 your lots-o-vcpu-monster? > > Yes, that is the one with a larger vcpu count. Check. >> I just finished with the initial preparation of a Xen 4.10 package for >> unstable and have it running in my test environment. > > Unrelated to this issue, but can you tell me if there is a way to > mitigate Meltdown with the Xen 4.8 dom0/domU(PV) running stretch? There are no updates for the hypervisor itself yet that we can distribute in Debian. This is your starting point for information: https://xenbits.xen.org/xsa/advisory-254.html https://blog.xenproject.org/2018/01/04/xen-project-spectremeltdown-faq/ So, 64-bit PV guests can attack the hypervisor and other guests. If you have untrusted PV guests the short term choices are to 1) convert them to HVM or 2) shield your hypervisor from them by following the instructions for the 'PV-in-PVH/HVM shim approach' (where currently for Xen 4.8 only PV-in-HVM is relevant). There's still a pending security update for Stretch to address the previous XSA (up to 251), and it seems best to piggyback on that put some guidance and information for users in there as well. If you use IRC, you can also join #debian-xen on OFTC if you want, to discuss things. There's a bunch of people there sharing information and strategies about what to do with their debian systems. >> Since this has been reported multiple times already, and upstream has >> bumped it to 64, my verdict would be: >> >> * Bump default to 64 already like upstream did in a later version. >> * Properly document this issue in NEWS.Debian and also mention the >> option with documentation in the template grub config file, so there's a >> bigger chance users who run unusual big numbers of disks/nics/cpus/etc >> will find it. 
>>
>> ...so we also better accommodate users who are using newer kernels in the
>> domU with blk-mq, and prevent them from wasting too much time and
>> getting frustrated for no reason.
>>
>> I wouldn't be comfortable with bumping it above the current latest
>> greatest upstream default, since it would mean we would need to keep a
>> patch in later versions.
>>
>> I'll prepare a patch to bump the default to 64 in 4.8, taking changes
>> from the upstream patch. I probably have to ask upstream (Juergen Gross)
>> why the commit that was referenced earlier bumps the default without
>> mentioning it in the commit message.
>
> Thanks, 64 should be a good start. If there are still problems
> reported with that it can be reconsidered.

Hans
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Fri, Jan 12, 2018 at 01:34:10AM +0100, Hans van Kranenburg wrote:
> Is the 59 your lots-o-vcpu-monster?

Yes, that is the one with a larger vcpu count.

> I just finished with the initial preparation of a Xen 4.10 package for
> unstable and have it running in my test environment.

Unrelated to this issue, but can you tell me if there is a way to mitigate Meltdown with the Xen 4.8 dom0/domU(PV) running stretch?

> Since this has been reported multiple times already, and upstream has
> bumped it to 64, my verdict would be:
>
> * Bump default to 64 already like upstream did in a later version.
> * Properly document this issue in NEWS.Debian and also mention the
> option with documentation in the template grub config file, so there's a
> bigger chance users who run unusually big numbers of disks/nics/cpus/etc
> will find it.
>
> ...so we also better accommodate users who are using newer kernels in the
> domU with blk-mq, and prevent them from wasting too much time and
> getting frustrated for no reason.
>
> I wouldn't be comfortable with bumping it above the current latest
> greatest upstream default, since it would mean we would need to keep a
> patch in later versions.
>
> I'll prepare a patch to bump the default to 64 in 4.8, taking changes
> from the upstream patch. I probably have to ask upstream (Juergen Gross)
> why the commit that was referenced earlier bumps the default without
> mentioning it in the commit message.

Thanks, 64 should be a good start. If there are still problems reported with that it can be reconsidered.

-- 
Valentin
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi,

On 08/01/2018 13:38, Valentin Vidic wrote:
> On Sun, Jan 07, 2018 at 07:36:40PM +0100, Hans van Kranenburg wrote:
>> Recently a tool was added to "dump guest grant table info". You could
>> see if it compiles on the 4.8 source and see if it works? Would be
>> interesting to get some idea about how high or low these numbers are in
>> different scenarios. I mean, I'm using 128, you 256, and we don't even
>> know if the actual value is maybe just above 32? :]
>>
>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=df36d82e3fc91bee2ff1681fd438c815fa324b6a
>
> The diag tool does not build inside xen-4.8:
>
> xen-diag.c: In function ‘gnttab_query_size_func’:
> xen-diag.c:50:10: error: implicit declaration of function
> ‘xc_gnttab_query_size’ [-Werror=implicit-function-declaration]
>      rc = xc_gnttab_query_size(xch, );
>           ^~~~

Too bad. :|

> but I think the same info is available in the thread on xen-devel:
>
> https://www.mail-archive.com/xen-devel@lists.xen.org/msg116910.html

Ah, great, didn't see that one yet.

> When the domU hangs crash reports nr_grant_frames=32. After increasing
> the gnttab_max_frames=256 the domU reports using nr_grant_frames=59.
>
> So the new default of gnttab_max_frames=64 might be a bit close to 59,
> but I suppose 128 would be just as safe as the 256 I currently use (if
> you prefer 128).

Is the 59 your lots-o-vcpu-monster?

I just finished with the initial preparation of a Xen 4.10 package for unstable and have it running in my test environment. So, yay, I have xen-diag now.

# /usr/lib/xen-4.10/bin/xen-diag
xen-diag: xen diagnostic utility
Usage: xen-diag command [args]
Commands:
  help                     display this help
  gnttab_query_size        dump the current and max grant frames for 

# /usr/lib/xen-4.10/bin/xen-diag gnttab_query_size 0
domid=0: nr_frames=1, max_nr_frames=64

That's a 10vcpu PVHv2 domU with two disks attached, running a 4.14 guest kernel, which has only been booted up and is idling now.

So, at least, nice to have some extra tooling available to help.
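Wrapping that query in a small script makes it easy to check every running domain at once. The following is only a sketch: the xen-diag path matches the 4.10 layout shown above, the 80% warning threshold is an arbitrary choice, and the parsing assumes the exact `domid=N: nr_frames=X, max_nr_frames=Y` output format.

```shell
#!/bin/sh
# Sketch: report grant table usage for all running domains and warn when
# a domain approaches its limit. The xen-diag path and the 80% threshold
# are assumptions; adjust them for your installation.
XEN_DIAG=${XEN_DIAG:-/usr/lib/xen-4.10/bin/xen-diag}

# Parse one "domid=N: nr_frames=X, max_nr_frames=Y" line and flag
# domains at or above 80% of their maximum.
check_line() {
    line=$1
    cur=${line#*nr_frames=}; cur=${cur%%,*}
    max=${line##*=}
    if [ $((cur * 100)) -ge $((max * 80)) ]; then
        printf 'WARNING %s\n' "$line"
    else
        printf 'ok      %s\n' "$line"
    fi
}

# Query every domain xl knows about (column 2 of `xl list` is the ID).
if [ -x "$XEN_DIAG" ]; then
    xl list | awk 'NR > 1 { print $2 }' | while read -r domid; do
        check_line "$("$XEN_DIAG" gnttab_query_size "$domid")"
    done
fi
```

Run from cron or a monitoring agent, something along these lines could point at the grant table limit before the first hung-task message shows up.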
>> If this is something users are going to run into while not doing more
>> unusual things like having dozens of vcpus or network interfaces, then
>> changing the default could prevent hours of frustration and debugging
>> for them.
>
> Yes, the failure case is quite nasty, as the domU just hangs without
> even suggesting grant frames might be the problem. Not sure if domU
> can detect this situation at all?

I can't comment on that, since I don't know. Anyone who does, please chime in.

> Anyway, if the value cannot be increased, the situation should at least
> be mentioned in the NEWS.Debian of the xen package.

Since this has been reported multiple times already, and upstream has bumped it to 64, my verdict would be:

* Bump default to 64 already like upstream did in a later version.
* Properly document this issue in NEWS.Debian and also mention the option with documentation in the template grub config file, so there's a bigger chance users who run unusually big numbers of disks/nics/cpus/etc will find it.

...so we also better accommodate users who are using newer kernels in the domU with blk-mq, and prevent them from wasting too much time and getting frustrated for no reason.

I wouldn't be comfortable with bumping it above the current latest greatest upstream default, since it would mean we would need to keep a patch in later versions.

I'll prepare a patch to bump the default to 64 in 4.8, taking changes from the upstream patch. I probably have to ask upstream (Juergen Gross) why the commit that was referenced earlier bumps the default without mentioning it in the commit message.

Since I just joined the Debian Xen team, I'll run anything I can come up with through the team to get it approved. We'll target the next Stretch stable update to get it in.

Thanks,
Hans
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Sun, Jan 07, 2018 at 07:36:40PM +0100, Hans van Kranenburg wrote:
> Recently a tool was added to "dump guest grant table info". You could
> see if it compiles on the 4.8 source and see if it works? Would be
> interesting to get some idea about how high or low these numbers are in
> different scenarios. I mean, I'm using 128, you 256, and we don't even
> know if the actual value is maybe just above 32? :]
>
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=df36d82e3fc91bee2ff1681fd438c815fa324b6a

The diag tool does not build inside xen-4.8:

xen-diag.c: In function ‘gnttab_query_size_func’:
xen-diag.c:50:10: error: implicit declaration of function ‘xc_gnttab_query_size’ [-Werror=implicit-function-declaration]
     rc = xc_gnttab_query_size(xch, );
          ^~~~

but I think the same info is available in the thread on xen-devel:

https://www.mail-archive.com/xen-devel@lists.xen.org/msg116910.html

When the domU hangs, crash reports nr_grant_frames=32. After increasing gnttab_max_frames to 256, the domU reports using nr_grant_frames=59.

So the new default of gnttab_max_frames=64 might be a bit close to 59, but I suppose 128 would be just as safe as the 256 I currently use (if you prefer 128).

> If this is something users are going to run into while not doing more
> unusual things like having dozens of vcpus or network interfaces, then
> changing the default could prevent hours of frustration and debugging
> for them.

Yes, the failure case is quite nasty, as the domU just hangs without even suggesting grant frames might be the problem. Not sure if the domU can detect this situation at all?

Anyway, if the value cannot be increased, the situation should at least be mentioned in the NEWS.Debian of the xen package.

-- 
Valentin
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 01/07/2018 10:05 AM, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 11:17:00PM +0100, Hans van Kranenburg wrote:
>> I agree that the upstream default, 32, is quite low. This is indeed a
>> configuration issue. I myself ran into this years ago with a growing
>> number of domUs and network interfaces in use. We have been using
>> gnttab_max_nr_frames=128 for a long time already instead.
>>
>> I was tempted to reassign src:xen, but in the meantime, this option has
>> already been removed again, so this bug does not apply to unstable
>> (well, as soon as we get something new in there) any more (as far as I
>> can see quickly now).
>>
>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30
>
> It does not seem to be removed but increased the default from 32 to 64?

Ehm, yes, you are correct. I was misreading and mixing up things. Let's try again...

The referenced commit is talking about removal of the obsolete gnttab_max_nr_frames from the documentation, so it's not related.

>> Including a better default for gnttab_max_nr_frames in the grub config
>> in the debian xen package in stable sounds reasonable from a best
>> practices point of view.

So, that's gnttab_max_frames, not gnttab_max_nr_frames... I was reading out loud from my old Jessie dom0 grub config.

>> But, I would be interested in learning more about the relation with
>> block mq though. Does using newer linux kernels (like from
>> stretch-backports) for the domU always put a bigger strain on this? Or,
>> is it just related to the overall number of network devices and block
>> devices you are adding to your domUs in your specific own situation, and
>> did you just trip over the default limit?
>
> After upgrading the domU and dom0 from jessie to stretch on a big postgresql
> database server (50 VCPUs, 200GB RAM) it started freezing very soon
> after boot, as posted here:
>
> https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html
>
> It did not have these problems while running jessie versions of the
> hypervisor and the kernels. The problem seems to be related to the
> number of CPUs used, as smaller domUs with a few VCPUs did not hang
> like this. Could it be that a large number of VCPUs -> more queues in
> the Xen mq driver -> faster exhaustion of allocated pages?

That exactly seems to be the case, yes. Maybe this is also one of the reasons that the default max was increased in Xen.

"xen/blkback: make pool of persistent grants and free pages per-queue"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4bf0065b7251afb723a29b2fd58f7c38f8ce297

Recently a tool was added to "dump guest grant table info". You could see if it compiles on the 4.8 source and see if it works? Would be interesting to get some idea about how high or low these numbers are in different scenarios. I mean, I'm using 128, you 256, and we don't even know if the actual value is maybe just above 32? :]

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=df36d82e3fc91bee2ff1681fd438c815fa324b6a

If this is something users are going to run into while not doing more unusual things like having dozens of vcpus or network interfaces, then changing the default could prevent hours of frustration and debugging for them.

The least invasive option is to add the option to the documentation of GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub.d/xen.cfg, like "If you have more than xyz disks or network interfaces in a domU, use this, blah blah."

Actually setting the option there is not a good idea, because people can still have GRUB_CMDLINE_XEN_DEFAULT set in e.g. /etc/default/grub, so that would override and damage things.
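A sketch of what that documentation-only hint in /etc/default/grub.d/xen.cfg could look like; the wording and the 64 value are illustrative, not the actual Debian template:

```shell
# Sketch for /etc/default/grub.d/xen.cfg (illustrative wording).
# The option is left commented out on purpose, so it cannot clash with
# a GRUB_CMDLINE_XEN_DEFAULT already set in /etc/default/grub.
#
# If a domU with many vcpus, disks or network interfaces hangs with
# "INFO: task ... blocked for more than 120 seconds" messages, it may
# have run out of grant table frames. Raise the limit and run
# update-grub afterwards:
#
#GRUB_CMDLINE_XEN_DEFAULT="gnttab_max_frames=64"
```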
The other option is to add a patch bumping the defaults in the upstream code from 32 to 64, including documentation etc.

Sorry for the earlier confusion,
Hans
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Sat, Jan 06, 2018 at 11:17:00PM +0100, Hans van Kranenburg wrote:
> I agree that the upstream default, 32, is quite low. This is indeed a
> configuration issue. I myself ran into this years ago with a growing
> number of domUs and network interfaces in use. We have been using
> gnttab_max_nr_frames=128 for a long time already instead.
>
> I was tempted to reassign src:xen, but in the meantime, this option has
> already been removed again, so this bug does not apply to unstable
> (well, as soon as we get something new in there) any more (as far as I
> can see quickly now).
>
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30

It does not seem to be removed, but increased the default from 32 to 64?

> Including a better default for gnttab_max_nr_frames in the grub config
> in the debian xen package in stable sounds reasonable from a best
> practices point of view.
>
> But, I would be interested in learning more about the relation with
> block mq though. Does using newer linux kernels (like from
> stretch-backports) for the domU always put a bigger strain on this? Or,
> is it just related to the overall number of network devices and block
> devices you are adding to your domUs in your specific own situation, and
> did you just trip over the default limit?

After upgrading the domU and dom0 from jessie to stretch on a big postgresql database server (50 VCPUs, 200GB RAM) it started freezing very soon after boot, as posted here:

https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html

It did not have these problems while running jessie versions of the hypervisor and the kernels. The problem seems to be related to the number of CPUs used, as smaller domUs with a few VCPUs did not hang like this. Could it be that a large number of VCPUs -> more queues in the Xen mq driver -> faster exhaustion of allocated pages?

-- 
Valentin
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi Christian and everyone else,

Ack on reassign to Xen.

On 01/06/2018 04:11 PM, Yves-Alexis Perez wrote:
> control: reassign -1 xen-hypervisor-4.8-amd64
>
> On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
>> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
>>> According to that link, the fix seems to be configuration rather than
>>> code. Does this mean this bug against the kernel should be closed?
>>
>> Yes, the problem seems to be in the Xen hypervisor and not the Linux
>> kernel itself. The default value for the gnttab_max_frames parameter
>> needs to be increased to avoid domU disk IO hangs, for example:
>>
>> GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
>>
>> So either close the bug or reassign it to the xen-hypervisor package so
>> they can increase the default value for this parameter in the
>> hypervisor code.
>
> Ok, I'll reassign and let the Xen maintainers handle that (maybe in a stable
> update).
>
> @Xen maintainers: see the complete bug log for more information, but basically
> it seems that a domU freeze happens with the “new” multi-queue xen blk
> driver, and the fix is to increase a configuration value. Valentin suggests
> adding that to the default.

The dom0 gnttab_max_frames boot setting is about how many pages are allocated to fill with 'grants'. The grant concept is related to sharing information between the dom0 and domU. It allows memory pages to be shared back and forth, so that e.g. a domU can fill a page with outgoing network packets or disk writes. Then the dom0 can take over ownership of the memory page, read the contents and do its trick with it. In this way, zero-copy IO is implemented.

When running Xen domUs, the total number of network interfaces and block devices attached to all of the domUs that are running (and, apparently, how heavily they are used) causes the usage of these grants to increase. At some point you run out of grants because all of the pages are filled.
I agree that the upstream default, 32, is quite low. This is indeed a configuration issue. I myself ran into this years ago with a growing number of domUs and network interfaces in use. We have been using gnttab_max_nr_frames=128 for a long time already instead.

I was tempted to reassign src:xen, but in the meantime, this option has already been removed again, so this bug does not apply to unstable (well, as soon as we get something new in there) any more (as far as I can see quickly now).

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30

Including a better default for gnttab_max_nr_frames in the grub config in the debian xen package in stable sounds reasonable from a best practices point of view.

But, I would be interested in learning more about the relation with block mq though. Does using newer linux kernels (like from stretch-backports) for the domU always put a bigger strain on this? Or, is it just related to the overall number of network devices and block devices you are adding to your domUs in your specific own situation, and did you just trip over the default limit?

In any case, the grub option thing is a conffile, so any user upgrading has to accept/merge the change, so we won't cause a stable user to just run out of memory because of a few extra kilobytes of memory usage without notice.

Hans van Kranenburg

P.S. Debian Xen team is in the process of being "rebooted" while the current shitstorm about meltdown/spectre is going on, so don't hold your breath. :)
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
control: reassign -1 xen-hypervisor-4.8-amd64

On Sat, 2018-01-06 at 15:23 +0100, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> > According to that link, the fix seems to be configuration rather than
> > code. Does this mean this bug against the kernel should be closed?
>
> Yes, the problem seems to be in the Xen hypervisor and not the Linux
> kernel itself. The default value for the gnttab_max_frames parameter
> needs to be increased to avoid domU disk IO hangs, for example:
>
> GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
>
> So either close the bug or reassign it to the xen-hypervisor package so
> they can increase the default value for this parameter in the
> hypervisor code.

Ok, I'll reassign and let the Xen maintainers handle that (maybe in a stable update).

@Xen maintainers: see the complete bug log for more information, but basically it seems that a domU freeze happens with the “new” multi-queue xen blk driver, and the fix is to increase a configuration value. Valentin suggests adding that to the default.

Regards,
-- 
Yves-Alexis
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Sat, Jan 06, 2018 at 03:08:26PM +0100, Yves-Alexis Perez wrote:
> According to that link, the fix seems to be configuration rather than code.
> Does this mean this bug against the kernel should be closed?

Yes, the problem seems to be in the Xen hypervisor and not the Linux kernel itself. The default value for the gnttab_max_frames parameter needs to be increased to avoid domU disk IO hangs, for example:

GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"

So either close the bug or reassign it to the xen-hypervisor package so they can increase the default value for this parameter in the hypervisor code.

-- 
Valentin
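For reference, applying the workaround quoted above on a Stretch dom0 goes through the grub configuration rather than any domU config. A sketch, where the dom0_mem and 256 values are taken from the example above and are site-specific choices, not recommendations:

```shell
# 1. In /etc/default/grub (or the Xen snippet under /etc/default/grub.d/),
#    add the option to the hypervisor command line:
#
#      GRUB_CMDLINE_XEN="dom0_mem=10240M gnttab_max_frames=256"
#
# 2. Regenerate the boot configuration and reboot the host. The limit
#    lives in the hypervisor, so restarting individual domUs is not
#    enough:
#
#      update-grub && reboot
#
# 3. After the reboot, verify that the option is active:
#
#      xl info | grep xen_commandline
```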
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On Fri, 2017-11-17 at 07:39 +0100, Valentin Vidic wrote:
> Hi,
>
> The problem seems to be caused by the new multi-queue xen blk driver
> and I was advised by the Xen devs to increase the gnttab_max_frames=256
> parameter for the hypervisor. This has solved the blocking issue
> for me and it has been running without problems for a few months now.

I'm not really fluent in Xen, but does this relate to the kernel in dom0 or one of the domUs then?

> I/O to LUNs hang / stall under high load when using xen-blkfront
> https://www.novell.com/support/kb/doc.php?id=7018590

According to that link, the fix seems to be configuration rather than code. Does this mean this bug against the kernel should be closed?

Regards,
-- 
Yves-Alexis
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Hi, The problem seems to be caused by the new multi-queue xen blk driver and I was advised by the Xen devs to increase the gnttab_max_frames=256 parameter for the hypervisor. This has solved the blocking issue for me and it has been running without problems for a few months now. I/O to LUNs hang / stall under high load when using xen-blkfront https://www.novell.com/support/kb/doc.php?id=7018590 -- Valentin
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
We're having the same problem here. For some reason, only 2 domUs are affected (the dom0 has a total of 22 domUs, 14 of those are running Debian stretch, and 13 of those are running Linux 4.9.51-1).

The `xl console` output of the first domU (according to our monitoring it hangs since yesterday 14:06):

[ 3746.780086] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3746.780094]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3746.780223] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
[ 3746.780228]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780233] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3746.780304] INFO: task rsync:8188 blocked for more than 120 seconds.
[ 3746.780308]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3746.780311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612083] INFO: task jbd2/xvda1-8:203 blocked for more than 120 seconds.
[ 3867.612091]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612148] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3867.612150]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612152] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612238] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
[ 3867.612242]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3867.612287] INFO: task rsync:8188 blocked for more than 120 seconds.
[ 3867.612291]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3867.612294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444071] INFO: task jbd2/xvda1-8:203 blocked for more than 120 seconds.
[ 3988.444080]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444084] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444154] INFO: task ntpd:670 blocked for more than 120 seconds.
[ 3988.444159]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3988.444266] INFO: task kworker/2:0:1533 blocked for more than 120 seconds.
[ 3988.444271]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[ 3988.444274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The other domU had a similar error message before a coworker downgraded the kernel to 3.16 to get it working again:

INFO: task jbd2/xvda1-8:191 blocked for more than 120 seconds.
[  605.148107]       Not tainted 4.9.0-4-amd64 #1 Debian 4.9.51-1
[  605.148111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first domU is a backup machine; it mainly uses rsync --link-dest to pull backups from other machines, and is therefore rather IO intensive. The other domU is a firewall/router and shouldn't be IO intensive at all.

-- 
Kind regards,

Martin v. Wittich
IServ GmbH
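Since those hung-task reports are currently the only visible symptom, a quick way to spot an affected domU is to count them in the kernel log. A trivial sketch (the pattern matches the messages above; what count makes a domU "affected" is left to the caller):

```shell
#!/bin/sh
# Count kernel hung-task reports like the ones above. Reads stdin so it
# can be fed from `dmesg` or from a saved `xl console` log.
count_hung_tasks() {
    grep -c 'blocked for more than [0-9]* seconds'
}

# Typical use on a live system (not run here):
#   dmesg | count_hung_tasks
```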
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Update:

First of all: Forget my observation about the 'system boot time'. I mixed up something; the dom0 boot time was increased, but this happened probably due to the not properly handled LVM thin activation during system boot.

One last thing I pulled from the domU with the original kernel (4.9.51-1) was this top output:

top - 20:41:03 up 6:18, 2 users, load average: 17.03, 6.98, 2.62
Tasks: 231 total, 1 running, 230 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni, 0.0 id, 100.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.3 sy, 0.0 ni, 0.0 id, 99.7 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8212616 total, 1907568 free, 1485276 used, 4819772 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 6558984 avail Mem

At this point, the system is more or less unusable; everything depending on IO is dead.

Currently my production system domU is running for over a week with the latest backports kernel (linux-image-4.13.0-0.bpo.1-amd64). dom0 is still on the current stretch kernel (4.9.51-1) and it seems stable for now. My guess would be some issue with the xen blkfront driver.

Around the end of last year I experienced something similar with jessie. After some kernel updates those issues got better. They are not completely gone; some jessie domUs need a reboot from time to time due to rising iowait (wa), but the system is still responsive then, it's just getting slower and slower by the minute.
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Update:

Sadly my production system froze again in the early afternoon today, with the older kernel as well (4.9.30-2+deb9u5), so that wasn't a workaround after all. Paradoxically nothing showed up on the xl console (within a screen) at dom0. No errors, nothing, the vm just stopped responding.

As I was monitoring the system, there were still two open shell connections. Some basic stuff still worked, but as soon as I tried to open a file, the shell got unresponsive. I tried a shutdown on the other shell, but that didn't get very far.

Searching the net for that issue I found this post at the xen project mailing list:

https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html

which sounds similar. He got some traces out of it, but no answer on the mailing list.

Some information about my setup:

hardware:
  xeon E5-2620 v4
  board supermicro X10SRi-F
  32gb ecc ram
  two 10tb server disks
  two I350 network adapters (onboard)

dom0:
  debian stretch (up to date), kernel 4.9.51-1, xen-hypervisor 4.8.1-1+deb9u3
  the two network adapters as a bond in a bridge
  the discs: gpt, 4 part (1M, 256M esp, 256M md mirror with boot, rest as md mirror for lvm)

domu:
  memory: 8192, 2 vcpus
  uses a network interface on the bridge
  several (thin)lvm volumes as phys devices
  debian stretch (up to date)
  issue with both kernel versions: 4.9.30-2+deb9u5 and 4.9.51-1

Some other domUs (wheezy, jessie and a windows 7) seem to run fine.

Next I'll try some newer kernels for the domU, starting with the stretch backports kernels.
Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
Package: linux-image-4.9.0-4-amd64
Version: 4.9.51-1
Severity: critical

As far as I can tell right now, the domU system simply freezes. The logs simply end at some point until the new reboot stuff comes up. Sometimes it's still possible to log on to the system, but nothing really works. It is like all IO to the virtual block devices is suspended indefinitely. Until this happens, the system seems to work without issues.

As the new kernel isn't out that long, I can't tell how often this happens. The first time was the day before yesterday, and yesterday afternoon it happened twice within two hours.

Something like 'ls' on a directory listed before still gets a result, but everything 'new', i.e. 'vim somefile', will cause the shell to stall. Sadly there is no visible error; services just fail to answer one by one (maybe when they try to read/write something new to the disk, then they simply wait for IO to happen).

For testing I installed the older kernel (last linux-image-4.9.0-3-amd64 from security - 4.9.30-2+deb9u5) and realized immediately that the system boot time is a fraction with the old kernel compared to the new one. For the time being, I'm staying with that on the production system.

To see if anything will be dumped on the console, I started one within a screen on a test machine. Now I have to generate some activity and IO and see if something happens there.

I haven't had the time to test the impact of the dom0 kernel yet; as far as I observed, the dom0 seems to be unaffected by the kernel update.