Bug#1051862: (Debian) Bug#1051862: server flooded with xen_mc_flush warnings with xen 4.17 + linux 6.1

2023-09-13 Thread Juergen Gross

Hi Hans,

On 13.09.23 23:38, Hans van Kranenburg wrote:

Hi Radoslav,

Thanks for your report...

Hi Juergen, Boris and xen-devel,

At Debian, we got the report below. (Also at
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1051862)

This hardware, with only Xen and Dom0 running, is hitting the failed
multicall warning and logging in arch/x86/xen/multicalls.c. Can you
advise what we can do to further debug this issue?

Since this looks like pretty low level Xen/hardware stuff, I'd rather
ask upstream for directions first. If needed the Debian Xen Team can
assist the end user with the debugging process.

Thanks,

More reply inline...

On 9/13/23 20:12, Radoslav Bodó wrote:

Package: xen-system-amd64
Version: 4.17.1+2-gb773c48e36-1
Severity: important

Hello,

after the upgrade from Bullseye to Bookworm, one of our dom0s
became unusable because the logs are continuously flooded
with warnings from arch/x86/xen/multicalls.c:102 xen_mc_flush.

The issue starts at some point while system services are coming up,
but nothing very special runs on that box (dom0, nftables, fail2ban,
prometheus-node-exporter, 3x domU). We have tried disabling all domUs
and fail2ban, as the name of the process would suggest, but the issue is
still present. We have also tried several other things, but none of
them has helped so far:

* the issue arises when xen 4.17 + linux >= 6.1 is booted
* xen + bookworm-backports linux-image-6.4.0-0.deb12.2-amd64 has the same issue
* without the xen hypervisor, linux 6.1 runs just fine
* booting a systemrescue cd and running xfs_repair on the rootfs did not help
* memtest seems to be fine after running for hours


Thanks for already trying out all these combinations.


As a workaround we have booted xen 4.17 + linux 5.10.0-25 (5.10.191-1)
and the system is running fine, as it has for the last few months.

Hardware:
* Dell PowerEdge R750xs
* 2x Intel Xeon Silver 4310 2.1G
* 256GB RAM
* PERC H755 Adapter, 12x 18TB HDDs


I have a few quick additional questions already:

1. For clarification.. From your text, I understand that only this one
single server is showing the problem after the Debian version upgrade.
Does this mean that this is the only server you have running with
exactly this combination of hardware (and BIOS version, CPU microcode
etc etc)? Or, is there another one with same hardware which does not
show the problem?

2. Can you reply with the output of 'xl dmesg' when the problem happens?
Or, if the system becomes unusable too quickly, do you have a serial
console connection to capture the output?

3. To confirm... I understand that there are many of these messages.
Since you pasted only one, does that mean that all of them look exactly
the same, with "1 of 1 multicall(s) failed: cpu 10" "call  1: op=1
arg=[a1a9eb10] result=-22"? Or are there variations? If so, can
you reply with a few different ones?
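
For anyone collecting these messages, here is a minimal Python sketch that
groups the per-call lines so that variations stand out. It assumes the lines
keep the "call  N: op=... arg=[...] result=..." layout quoted above. The
function name and the sample lines are illustrative only, and newer kernels
may append extra fields that the pattern simply ignores. For reference,
result=-22 corresponds to -EINVAL.

```python
import re
from collections import Counter

# Matches the per-call line quoted in the report, e.g.
#   call  1: op=1 arg=[a1a9eb10] result=-22
CALL_RE = re.compile(
    r"call\s+\d+:\s+op=(\d+)\s+arg=\[([0-9a-f]+)\]\s+result=(-?\d+)"
)


def summarize_multicall_failures(lines):
    """Count failed multicall entries, grouped by (op, result)."""
    counts = Counter()
    for line in lines:
        match = CALL_RE.search(line)
        if match:
            op, _arg, result = match.groups()
            counts[(int(op), int(result))] += 1
    return counts


# Illustrative sample lines (timestamps and the second arg are made up).
sample = [
    "kernel: [   99.1] 1 of 1 multicall(s) failed: cpu 10",
    "kernel: [   99.2]   call  1: op=1 arg=[a1a9eb10] result=-22",
    "kernel: [  100.2]   call  1: op=1 arg=[a1a9eb18] result=-22",
]
for (op, result), n in summarize_multicall_failures(sample).most_common():
    print(f"op={op} result={result}: {n} occurrence(s)")
```

If every line falls into a single (op, result) group, the messages are
probably all of the same kind and one representative paste is enough.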

Since this very much looks like an issue of Xen related code where the
Xen hypervisor, dom0 kernel and hardware have to work together correctly
(and not a Debian packaging problem), I'm already asking upstream for
advice about what we should/could do next, instead of trying to make a
guess myself.

Thanks,
Hans


Any help, advice or bug confirmation would be appreciated

Best regards
bodik


(log also in attachment)

```
kernel: [   99.762402] WARNING: CPU: 10 PID: 1301 at
arch/x86/xen/multicalls.c:102 xen_mc_flush+0x196/0x220
kernel: [   99.762598] Modules linked in: nvme_fabrics nvme_core bridge
xen_acpi_processor xen_gntdev stp llc xen_evtchn xenfs xen_privcmd
binfmt_misc intel_rapl_msr ext4 intel_rapl_common crc16
intel_uncore_frequency_common mbcache ipmi_ssif jbd2 nfit libnvdimm
ghash_clmulni_intel sha512_ssse3 sha512_generic aesni_intel acpi_ipmi
nft_ct crypto_simd cryptd mei_me mgag200 ipmi_si iTCO_wdt intel_pmc_bxt
ipmi_devintf drm_shmem_helper dell_smbios nft_masq iTCO_vendor_support
isst_if_mbox_pci drm_kms_helper isst_if_mmio dcdbas mei intel_vsec
isst_if_common dell_wmi_descriptor wmi_bmof watchdog pcspkr
intel_pch_thermal ipmi_msghandler i2c_algo_bit acpi_power_meter button
nft_nat joydev evdev sg nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 nf_tables nfnetlink drm fuse loop efi_pstore configfs
ip_tables x_tables autofs4 xfs libcrc32c crc32c_generic hid_generic
usbhid hid dm_mod sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif
crct10dif_generic ahci libahci xhci_pci libata xhci_hcd
kernel: [   99.762633]  megaraid_sas tg3 crct10dif_pclmul
crct10dif_common crc32_pclmul crc32c_intel bnxt_en usbcore scsi_mod
i2c_i801 libphy i2c_smbus usb_common scsi_common wmi
kernel: [   99.764765] CPU: 10 PID: 1301 Comm: python3 Tainted: G
W  6.1.0-12-amd64 #1  Debian 6.1.52-1
kernel: [   99.764989] Hardware name: Dell Inc. PowerEdge R750xs/0441XG,
BIOS 1.8.2 09/14/2022
kernel: [   99.765214] RIP: e030:xen_mc_flush+0x196/0x220
kernel: [   99.765436] Code: e2 06 48 01 da 85 c0 0f 84 23 ff ff ff 48
8b 43 18 48 83 c3 40 48 c1 e8 3f 41 01 c5 48 39 d3 75 ec 45 85 ed 0f 84
06
```

Bug#850425: mpt3sas "swiotlb buffer is full" problem only under Xen

2019-03-12 Thread Juergen Gross
On 12/03/2019 17:41, Hans van Kranenburg wrote:
> On 3/12/19 5:04 PM, Juergen Gross wrote:
>> On 11/03/2019 20:50, Hans van Kranenburg wrote:
>>> On 3/11/19 7:34 AM, Juergen Gross wrote:
>>>>
>>>> I'm not sure. Patch 3 of this series is basically already there (see
>>>> commit c6d4381220a0087ce19dbf6984d92c451bd6b364). So maybe all we need
>>>> is patch 4, which should really be easy to do?
>>>>
>>>> Hans, could you give it a try? You'd need to use a 4.20 kernel at least.
>>>> I can do the official patch posting in case you confirm it working.
>>>
>>> Ehm ok, well... This is interesting.
>>>
>>> I just built a 4.20.13 (without the patch), and I did it from the Debian
>>> kernel team repo, because then I just get all latest config options like
>>> I would get them in Debian.
>>>
>>> I rebooted the HP Z820 with it (with Xen 4.11) and I don't see any
>>> errors similar to the ones I pasted earlier.
>>>
>>> I haven't been running any domU on it yet (just installed it), but this
>>> is not what I expected.
>>
>> Well, commit c6d4381220a0087ce19dbf6984d92c451bd6b364 is part of a
>> rather large series making the dma interface cleaner and using it more
>> correctly where appropriate. Maybe your use case is covered by this
>> series already.
> 
> It seems so. That's good, of course, but it also means that I cannot be
> of any use here any more to test the additional proposed change. ;]

I don't think the change is needed any longer.

Christoph's series was meant to fix stuff like that and it did that very
well.


Juergen



Bug#850425: mpt3sas "swiotlb buffer is full" problem only under Xen

2019-03-12 Thread Juergen Gross
On 11/03/2019 20:50, Hans van Kranenburg wrote:
> On 3/11/19 7:34 AM, Juergen Gross wrote:
>>
>> I'm not sure. Patch 3 of this series is basically already there (see
>> commit c6d4381220a0087ce19dbf6984d92c451bd6b364). So maybe all we need
>> is patch 4, which should really be easy to do?
>>
>> Hans, could you give it a try? You'd need to use a 4.20 kernel at least.
>> I can do the official patch posting in case you confirm it working.
> 
> Ehm ok, well... This is interesting.
> 
> I just built a 4.20.13 (without the patch), and I did it from the Debian
> kernel team repo, because then I just get all latest config options like
> I would get them in Debian.
> 
> I rebooted the HP Z820 with it (with Xen 4.11) and I don't see any
> errors similar to the ones I pasted earlier.
> 
> I haven't been running any domU on it yet (just installed it), but this
> is not what I expected.

Well, commit c6d4381220a0087ce19dbf6984d92c451bd6b364 is part of a
rather large series making the dma interface cleaner and using it more
correctly where appropriate. Maybe your use case is covered by this
series already.


Juergen



Bug#850425: mpt3sas "swiotlb buffer is full" problem only under Xen

2019-03-11 Thread Juergen Gross
On 10/03/2019 23:03, Andrew Cooper wrote:
> On 10/03/2019 21:35, Hans van Kranenburg wrote:
>> found -1 4.19.20-1
>> thanks
>>
>> Hi,
>>
>> Reviving a thing from Jan 2017 here. I don't have this thread in my
>> mailbox, so no inline quotes.
>>
>> I just installed some HP z820 workstation and rebooted it into Xen
>> 4.11.1+26-g87f51bf366-3 with linux 4.19.20-1 as dom0 kernel.
>>
>> During boot I'm greeted by a long list of...
>>
>> [   14.518793] mpt3sas :02:00.0: swiotlb buffer is full (sz: 65536
>> bytes)
>> [   14.518899] mpt3sas :02:00.0: swiotlb buffer is full
>> [   14.518956] mpt3sas :02:00.0: swiotlb buffer is full (sz: 65536
>> bytes)
>> [   14.518988] sd 6:0:3:0: pci_map_sg failed: request for 786432 bytes!
>> [   14.519081] mpt3sas :02:00.0: swiotlb buffer is full
>> [   14.519309] sd 6:0:1:0: pci_map_sg failed: request for 1310720 bytes!
>> [   14.524611] mpt3sas :02:00.0: swiotlb buffer is full (sz: 65536
>> bytes)
>> [   14.527309] mpt3sas :02:00.0: swiotlb buffer is full
>> [   14.527405] sd 6:0:3:0: pci_map_sg failed: request for 786432 bytes!
>> [...]
>>
>> ...and some hangs here and there. This indeed did not happen when
>> booting just Linux, without Xen.
>>
>> Some searching brought me to this Debian bug. So, thanks for writing
>> down all kinds of research here already. Even if it's not fixed upstream
>> yet, this helps a lot. :-)
>>
>> Using dom0_mem=2GiB,max:4GiB instead of dom0_mem=2GiB,max:2GiB (which I
>> started with) makes the errors go away, so workaround confirmed.
>>
>> I can try any of the linked patches, but I see that in message 54,
>>   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850425#54
>> Andrew says: "IIRC, they were essentially rejected,". Next message, Ian
>> asks "Do you have a reference ?", but I don't see any fup on that.
>>
>> I think I'm fine with this workaround.
>>
>> If someone will ever work on the upstream patches, then this is just to
>> let you know that I might be able to help testing. However, I only have one
>> of this type of box and it's gonna be installed as server at some
>> non-profit organization without OOB access, replacing even older donated
>> hardware, so, it will be kinda limited... :)
> 
> I think
> https://lists.xen.org/archives/html/xen-devel/2014-12/msg00699.html is
> the last attempt David made to upstream the fixes.

Attached is a rebase of the last part missing.

Should apply on top of 5.0 kernel, 4.20 should be okay, too. Earlier
kernels will miss some prerequisites.


Juergen

From: David Vrabel 
Date: Mon, 11 Mar 2019 14:40:00 +0100
Subject: [PATCH] x86/xen: assume a 64-bit DMA mask is required

On systems where DMA addresses and physical addresses are not 1:1
(such as Xen PV guests), the generic dma_get_required_mask() does
not return the correct mask (since it uses max_pfn).

Some device drivers (such as mptsas, mpt2sas) use
dma_get_required_mask() to set the device's DMA mask to allow them to
use only 32-bit DMA addresses in hardware structures.  This results in
unnecessary use of the SWIOTLB if DMA addresses are more than 32 bits,
impacting performance significantly.

We could base the DMA mask on the maximum MFN but:

a) The hypercall op to get the maximum MFN (XENMEM_maximum_ram_page)
will truncate the result to an int in 32-bit guests.

b) Future uses of the IOMMU in Xen may map frames at bus addresses
above the end of RAM.

So, just assume a 64-bit DMA mask is always required.

Signed-off-by: David Vrabel 
Reviewed-by: Juergen Gross 
---
 drivers/xen/swiotlb-xen.c |6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index bb7888429be6..75e6e440d982 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -680,6 +680,11 @@ xen_swiotlb_get_sgtable(struct device *dev, struct sg_table *sgt,
 	return dma_common_get_sgtable(dev, sgt, cpu_addr, handle, size, attrs);
 }
 
+static u64 xen_swiotlb_get_required_mask(struct device *dev)
+{
+	return DMA_BIT_MASK(64);
+}
+
 const struct dma_map_ops xen_swiotlb_dma_ops = {
 	.alloc = xen_swiotlb_alloc_coherent,
 	.free = xen_swiotlb_free_coherent,
@@ -694,4 +699,5 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
 	.dma_supported = xen_swiotlb_dma_supported,
 	.mmap = xen_swiotlb_dma_mmap,
 	.get_sgtable = xen_swiotlb_get_sgtable,
+	.get_required_mask = xen_swiotlb_get_required_mask,
 };
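
To illustrate the reasoning in the commit message, here is a small userspace
sketch in Python. It is only loosely modelled on a max_pfn-based calculation
and is not the kernel's actual code. The 2 GiB dom0 size matches the
dom0_mem setting mentioned in this thread, while the machine address is a
hypothetical value chosen purely for illustration.

```python
PAGE_SHIFT = 12  # 4 KiB pages on x86


def dma_bit_mask(bits):
    """Userspace equivalent of the kernel's DMA_BIT_MASK(n) macro."""
    return (1 << 64) - 1 if bits == 64 else (1 << bits) - 1


def required_mask_from_max_pfn(max_pfn):
    """Smallest all-ones mask covering the highest guest-physical address."""
    highest_addr = (max_pfn - 1) << PAGE_SHIFT
    return (1 << highest_addr.bit_length()) - 1


def fits(addr, mask):
    """True if addr can be expressed within the given DMA mask."""
    return addr & ~mask == 0


dom0_bytes = 2 << 30                   # dom0 sized at 2 GiB
max_pfn = dom0_bytes >> PAGE_SHIFT     # guest-physical page count
mask = required_mask_from_max_pfn(max_pfn)
machine_addr = 0x2_4000_0000           # hypothetical machine address above 4 GiB

print(hex(mask))                             # 0x7fffffff
print(fits(machine_addr, mask))              # False
print(fits(machine_addr, dma_bit_mask(64)))  # True
```

Under these assumptions the max_pfn-based mask would let a driver settle on
32-bit DMA even though the real bus addresses can lie above 4 GiB, so every
such transfer has to bounce through the SWIOTLB. Always reporting
DMA_BIT_MASK(64) sidesteps that.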


Bug#850425: mpt3sas "swiotlb buffer is full" problem only under Xen

2019-03-11 Thread Juergen Gross
On 10/03/2019 23:03, Andrew Cooper wrote:
> On 10/03/2019 21:35, Hans van Kranenburg wrote:
>> found -1 4.19.20-1
>> thanks
>>
>> Hi,
>>
>> Reviving a thing from Jan 2017 here. I don't have this thread in my
>> mailbox, so no inline quotes.
>>
>> I just installed some HP z820 workstation and rebooted it into Xen
>> 4.11.1+26-g87f51bf366-3 with linux 4.19.20-1 as dom0 kernel.
>>
>> During boot I'm greeted by a long list of...
>>
>> [   14.518793] mpt3sas :02:00.0: swiotlb buffer is full (sz: 65536
>> bytes)
>> [   14.518899] mpt3sas :02:00.0: swiotlb buffer is full
>> [   14.518956] mpt3sas :02:00.0: swiotlb buffer is full (sz: 65536
>> bytes)
>> [   14.518988] sd 6:0:3:0: pci_map_sg failed: request for 786432 bytes!
>> [   14.519081] mpt3sas :02:00.0: swiotlb buffer is full
>> [   14.519309] sd 6:0:1:0: pci_map_sg failed: request for 1310720 bytes!
>> [   14.524611] mpt3sas :02:00.0: swiotlb buffer is full (sz: 65536
>> bytes)
>> [   14.527309] mpt3sas :02:00.0: swiotlb buffer is full
>> [   14.527405] sd 6:0:3:0: pci_map_sg failed: request for 786432 bytes!
>> [...]
>>
>> ...and some hangs here and there. This indeed did not happen when
>> booting just Linux, without Xen.
>>
>> Some searching brought me to this Debian bug. So, thanks for writing
>> down all kinds of research here already. Even if it's not fixed upstream
>> yet, this helps a lot. :-)
>>
>> Using dom0_mem=2GiB,max:4GiB instead of dom0_mem=2GiB,max:2GiB (which I
>> started with) makes the errors go away, so workaround confirmed.
>>
>> I can try any of the linked patches, but I see that in message 54,
>>   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850425#54
>> Andrew says: "IIRC, they were essentially rejected,". Next message, Ian
>> asks "Do you have a reference ?", but I don't see any fup on that.
>>
>> I think I'm fine with this workaround.
>>
>> If someone will ever work on the upstream patches, then this is just to
>> let you know that I might be able to help testing. However, I only have one
>> of this type of box and it's gonna be installed as server at some
>> non-profit organization without OOB access, replacing even older donated
>> hardware, so, it will be kinda limited... :)
> 
> I think
> https://lists.xen.org/archives/html/xen-devel/2014-12/msg00699.html is
> the last attempt David made to upstream the fixes.
> 
> Linux is still broken, and these fixes are still necessary.
> 
> Boris/Juergen: Any chance you could look into these patches?  I have no
> idea what state they're in against master, but it's also likely that
> this is now more complicated with the host max mfn calculations which
> have gone in more recently.

I'm not sure. Patch 3 of this series is basically already there (see
commit c6d4381220a0087ce19dbf6984d92c451bd6b364). So maybe all we need
is patch 4, which should really be easy to do?

Hans, could you give it a try? You'd need to use a 4.20 kernel at least.
I can do the official patch posting in case you confirm it working.

Adding Konrad as the swiotlb-xen maintainer.


Juergen



Bug#908154: Any progress?

2018-10-14 Thread Juergen Gross
This bug is now more than one month old without any activity from the gcc
maintainers.

Up to now this bug has been observed on Debian only. It has been
observed in multiple gcc versions.

This bug is blocking Linux kernel patches from going in!


Juergen



Bug#908154: gcc-6: gcc either hangs or throws error about impossible asm constraints

2018-09-06 Thread Juergen Gross
Package: gcc-6
Version: 6.3.0-18+deb9u1
Severity: important

Dear Maintainer,

when trying to compile the Linux kernel for arch=i386 some configurations
of the kernel result in an error. See:

https://lore.kernel.org/lkml/alpine.lrh.2.21.1808262340040.22...@math.ut.ee/
https://lists.01.org/pipermail/kbuild-all/2018-September/052162.html

I have modified the asm statement in question with the following patch:

--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -245,7 +245,7 @@ extern void __add_wrong_size(void)
asm volatile(pfx "cmpxchg%c4b %2; sete %0"  \
 : "=a" (__ret), "+d" (__old2), \
   "+m" (*(p1)), "+m" (*(p2))   \
-: "i" (2 * sizeof(long)), "a" (__old1),\
+: "i" (2 * sizeof(long)), "0" (__old1),\
   "b" (__new1), "c" (__new2)); \
__ret;  \
 })

With that kernel patch in place the error is gone, but gcc then enters an
endless loop when compiling mm/slub.c.
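
To tell the two failure modes apart when trying different configurations,
a small wrapper along these lines may help. This is a sketch only: the
function name and the timeout are arbitrary choices, and the timeout kills
only the directly started process, so compiler children started by a build
tool may need to be cleaned up separately.

```python
import subprocess


def classify_compile(cmd, timeout_s):
    """Run a build command and report 'ok', 'error', or 'hang'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time; treat it as the endless loop.
        return "hang"
    return "ok" if proc.returncode == 0 else "error"
```

It could be called, for example, as
classify_compile(["make", "ARCH=i386", "mm/slub.o"], timeout_s=600)
from a configured kernel tree. The 600 second limit is a guess and should
be adjusted to the build machine.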


-- System Information:
Debian Release: 9.5
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.0-7-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), 
LANGUAGE=en_US:en (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages gcc-6 depends on:
ii  binutils      2.28-5
ii  cpp-6         6.3.0-18+deb9u1
ii  gcc-6-base    6.3.0-18+deb9u1
ii  libc6         2.24-11+deb9u3
ii  libcc1-0      6.3.0-18+deb9u1
ii  libgcc-6-dev  6.3.0-18+deb9u1
ii  libgcc1       1:6.3.0-18+deb9u1
ii  libgmp10      2:6.1.2+dfsg-1
ii  libisl15      0.18-1
ii  libmpc3       1.0.3-1+b2
ii  libmpfr4      3.1.5-1
ii  libstdc++6    6.3.0-18+deb9u1
ii  zlib1g        1:1.2.8.dfsg-5

Versions of packages gcc-6 recommends:
ii  libc6-dev  2.24-11+deb9u3

Versions of packages gcc-6 suggests:
pn  gcc-6-doc 
pn  gcc-6-locales 
pn  gcc-6-multilib
pn  libasan3-dbg  
pn  libatomic1-dbg
pn  libcilkrts5-dbg   
pn  libgcc1-dbg   
pn  libgomp1-dbg  
pn  libitm1-dbg   
pn  liblsan0-dbg  
pn  libmpx2-dbg   
pn  libquadmath0-dbg  
pn  libtsan0-dbg  
pn  libubsan0-dbg 

-- no debconf information