Re: Serious AMD-Vi(?) issue
On Wed, May 15, 2024 at 02:40:31PM +0100, Kelly Choi wrote:
> As explained previously, we are happy to help resolve issues and provide advice where necessary. However, to do this, our developers need the relevant information to provide accurate resolutions. Given that our developers have repeatedly voiced their concerns, and are debugging this out of interest, please help us by providing all the necessary information.
>
> Until we have this information, it will be very difficult to help you further. Should anything change, we would be glad to assist you.

Usually private submission of logs (PGP) is acceptable.

Note, I am not claiming Xen's `dmesg` contains truly concerning information. The issue is there is enough data for problematic information to unintentionally leak in. Alternatively, no piece would be individually concerning, yet taken together they may leak information. Hopefully neither ACPI table addresses nor table order is affected by the motherboard serial number, yet those could readily leak information.

So far this is acting like a major bug. The paucity of reports is likely due to few people using RAID1 with flash (most people were relying on flash's greater reliability even before the first large studies came out).

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
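For reference, the private-submission path described above can be as simple as encrypting the log to a developer's public key; a minimal sketch, assuming the reviewer's key has already been imported (the key ID below is a placeholder, not any particular developer's key):

    xl dmesg > xen-dmesg.txt
    gpg --armor --encrypt --recipient 0x12345678 xen-dmesg.txt
    # then mail the resulting xen-dmesg.txt.asc off-list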
Re: Serious AMD-Vi(?) issue
Hello Elliott,

Most of our developers are based in the EU timezone; however, we are a worldwide community. The Xen Project is an open source community that everyone contributes to, and we do not divide how we provide help based on location.

As explained previously, we are happy to help resolve issues and provide advice where necessary. However, to do this, our developers need the relevant information to provide accurate resolutions. Given that our developers have repeatedly voiced their concerns, and are debugging this out of interest, please help us by providing all the necessary information.

Until we have this information, it will be very difficult to help you further. Should anything change, we would be glad to assist you.

Many thanks,
Kelly Choi

Community Manager
Xen Project

On Tue, May 14, 2024 at 9:51 PM Elliott Mitchell wrote:
> On Tue, May 14, 2024 at 10:22:51AM +0200, Jan Beulich wrote:
> > On 13.05.2024 22:11, Elliott Mitchell wrote:
> > > On Mon, May 13, 2024 at 10:44:59AM +0200, Roger Pau Monné wrote:
> > >> Why do you mask the device SBDF in the above snippet? I would really like to understand what's so privacy relevant in a PCI SBDF number.
> > >
> > > I doubt it reveals much. Simply seems unlikely to help debugging and therefore I prefer to mask it.
> >
> > SBDF in one place may be matchable against a memory address in another place. _Any_ hiding of information is hindering analysis. Please can you finally accept that it needs to be the person doing the analysis to judge what is or is not relevant to them?
>
> Not going to happen as I'd accepted this long ago. The usual approach is all developers have PGP keys (needed for security issues anyway) and you don't require all logs to be public.
>
> I've noticed the core of the Xen project appears centered in the EU. Yet you're not catering to data privacy at all? Or is this a service exclusively provided to people who prove they're EU citizens?
>
> --
> (\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
> \BS (| ehem+sig...@m5p.com PGP 87145445 |) /
> \_CS\ | _ -O #include O- _ | / _/
> 8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On Tue, May 14, 2024 at 10:22:51AM +0200, Jan Beulich wrote: > On 13.05.2024 22:11, Elliott Mitchell wrote: > > On Mon, May 13, 2024 at 10:44:59AM +0200, Roger Pau Monné wrote: > >> Why do you mask the device SBDF in the above snippet? I would really > >> like to understand what's so privacy relevant in a PCI SBDF number. > > > > I doubt it reveals much. Simply seems unlikely to help debugging and > > therefore I prefer to mask it. > > SBDF in one place may be matchable against a memory address in another > place. _Any_ hiding of information is hindering analysis. Please can > you finally accept that it needs to be the person doing the analysis > to judge what is or is not relevant to them? Not going to happen as I'd accepted this long ago. The usual approach is all developers have PGP keys (needed for security issues anyway) and you don't require all logs to be public. I've noticed the core of the Xen project appears centered in the EU. Yet you're not catering to data privacy at all? Or is this a service exclusively provided to people who prove they're EU citizens? -- (\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/) \BS (| ehem+sig...@m5p.com PGP 87145445 |) / \_CS\ | _ -O #include O- _ | / _/ 8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On 13.05.2024 22:11, Elliott Mitchell wrote: > On Mon, May 13, 2024 at 10:44:59AM +0200, Roger Pau Monné wrote: >> Why do you mask the device SBDF in the above snippet? I would really >> like to understand what's so privacy relevant in a PCI SBDF number. > > I doubt it reveals much. Simply seems unlikely to help debugging and > therefore I prefer to mask it. SBDF in one place may be matchable against a memory address in another place. _Any_ hiding of information is hindering analysis. Please can you finally accept that it needs to be the person doing the analysis to judge what is or is not relevant to them? Jan
Re: Serious AMD-Vi(?) issue
On 13.05.2024 10:44, Roger Pau Monné wrote: > On Fri, May 10, 2024 at 09:09:54PM -0700, Elliott Mitchell wrote: >> On Thu, Apr 18, 2024 at 09:33:31PM -0700, Elliott Mitchell wrote: >>> >>> I suspect this is a case of there is some step which is missing from >>> Xen's IOMMU handling. Perhaps something which Linux does during an early >>> DMA setup stage, but the current Xen implementation does lazily? >>> Alternatively some flag setting or missing step? >>> >>> I should be able to do another test approach in a few weeks, but I would >>> love if something could be found sooner. >> >> Turned out to be disturbingly easy to get the first entry when it >> happened. Didn't even need `dbench`, it simply showed once the OS was >> fully loaded. I did get some additional data points. >> >> Appears this requires an AMD IOMMUv2. A test system with known >> functioning AMD IOMMUv1 didn't display the issue at all. >> >> (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr fffdf800 flags 0x8 >> I > > I would expect the address field to contain more information about the > fault, but I'm not finding any information on the AMD-Vi specification > apart from that it contains the DVA, which makes no sense when the > fault is caused by an interrupt. Isn't the address above in the "magic" HT range (and hence still meaningful as an address)? Jan
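For readers following along: the "magic" range Jan refers to is the HyperTransport reserved region at the top of the 40-bit address space. Going by my reading of the HT documentation (so treat the exact bounds as assumptions to verify), FD_0000_0000-FF_FFFF_FFFF is reserved for special cycles, with FD_F800_0000-FD_F8FF_FFFF carrying interrupt/EOI messages, which would fit a fault flagged I still denoting a meaningful address. A quick check in C:

    #include <stdbool.h>
    #include <stdint.h>

    /* HyperTransport reserved region and its interrupt/EOI window.
     * Bounds are my reading of the HT documentation -- verify before use. */
    static bool in_ht_special(uint64_t addr)
    {
        return addr >= 0xFD00000000ULL && addr <= 0xFFFFFFFFFFULL;
    }

    static bool in_ht_interrupt(uint64_t addr)
    {
        return addr >= 0xFDF8000000ULL && addr <= 0xFDF8FFFFFFULL;
    }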
Re: Serious AMD-Vi(?) issue
On Mon, May 13, 2024 at 10:44:59AM +0200, Roger Pau Monné wrote:
> On Fri, May 10, 2024 at 09:09:54PM -0700, Elliott Mitchell wrote:
> > On Thu, Apr 18, 2024 at 09:33:31PM -0700, Elliott Mitchell wrote:
> > >
> > > I suspect this is a case of there is some step which is missing from Xen's IOMMU handling. Perhaps something which Linux does during an early DMA setup stage, but the current Xen implementation does lazily? Alternatively some flag setting or missing step?
> > >
> > > I should be able to do another test approach in a few weeks, but I would love if something could be found sooner.
> >
> > Turned out to be disturbingly easy to get the first entry when it happened. Didn't even need `dbench`, it simply showed once the OS was fully loaded. I did get some additional data points.
> >
> > Appears this requires an AMD IOMMUv2. A test system with known functioning AMD IOMMUv1 didn't display the issue at all.
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr fffdf800 flags 0x8 I
>
> I would expect the address field to contain more information about the fault, but I'm not finding any information on the AMD-Vi specification apart from that it contains the DVA, which makes no sense when the fault is caused by an interrupt.
>
> > (XEN) :bb:dd.f root @ 83b5f5 (3 levels) dfn=fffdf8000
> > (XEN) L3[1f7] = 0 np
>
> Attempting to print the page table walk for an Interrupt remapping fault is useless, we should likely avoid that when the I flag is set.
>
> > I find it surprising this required "iommu=debug" to get this level of detail. This amount of output seems more appropriate for "verbose".
>
> "verbose" should also print this information.

Mostly I've noticed Xen's dmesg seems a bit sparse at default settings. Confirming the IOMMU was recognized and operational had been a challenge. On the flip side this does mean less potentially sensitive data gets in.

> > I strongly prefer to provide snippets. There is a fair bit of output, I'm unsure which portion is most pertinent.
>
> I've already voiced my concern that I think what you are doing is not fair. We are debugging this out of interest, and hence you refusing to provide all information just hampers our ability to debug, and makes us spend more time than required just thinking what snippets we need to ask for.
>
> I will ask again, what's there in the Xen or the Linux dmesgs that you are so worried about leaking? Please provide a specific example.

I cannot point to specific data in Xen's dmesg which is known to be sensitive. On the flip side, all the addresses could readily function as a subliminal channel. Might only be kernels from certain vendors, but hardware serial numbers frequently make it into Linux's dmesg. All the data coming from ACPI tables could readily hide something. Worse, data which seems harmless now might later turn out to reveal things.

The usual approach is everyone has PGP keys and logs are kept private on request.

> Why do you mask the device SBDF in the above snippet? I would really like to understand what's so privacy relevant in a PCI SBDF number.

I doubt it reveals much. Simply seems unlikely to help debugging and therefore I prefer to mask it.

One more Xen dmesg line:

(XEN) AMD-Vi: Setup I/O page table: device id = 0xbbdd, type = 0x1, root table = 0xADDRADDR, domain = 0, paging mode = 3

> Does booting with `iommu=no-intremap` lead to any issues being reported?

I'll try that next time I restart the system. Another viable approach.

I imagine one or more of the Xen developers have computers with AMD processors. I could send a pair of SATA devices which are known to exhibit the behavior to someone.

The known reproductions have featured ASUS motherboards. I doubt this is a requirement, but if one of the main developers has such a system that is a better target. I also note these are plugged into motherboard SATA ports. It is possible add-on card SATA ports might not exhibit the behavior.

Then you may discover not much log data is being provided simply due to not much log data being generated.

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
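For anyone wanting to try Roger's `iommu=no-intremap` suggestion: it is a Xen command-line option, so on a Debian-style install with GRUB the change would look roughly like the following (the variable name assumes Debian's Xen GRUB integration; other distributions may differ):

    # /etc/default/grub
    GRUB_CMDLINE_XEN_DEFAULT="iommu=debug,no-intremap"

    # then regenerate the configuration and reboot:
    update-grub && reboot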
Re: Serious AMD-Vi(?) issue
On Fri, May 10, 2024 at 09:09:54PM -0700, Elliott Mitchell wrote:
> On Thu, Apr 18, 2024 at 09:33:31PM -0700, Elliott Mitchell wrote:
> >
> > I suspect this is a case of there is some step which is missing from Xen's IOMMU handling. Perhaps something which Linux does during an early DMA setup stage, but the current Xen implementation does lazily? Alternatively some flag setting or missing step?
> >
> > I should be able to do another test approach in a few weeks, but I would love if something could be found sooner.
>
> Turned out to be disturbingly easy to get the first entry when it happened. Didn't even need `dbench`, it simply showed once the OS was fully loaded. I did get some additional data points.
>
> Appears this requires an AMD IOMMUv2. A test system with known functioning AMD IOMMUv1 didn't display the issue at all.
>
> (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr fffdf800 flags 0x8 I

I would expect the address field to contain more information about the fault, but I'm not finding any information on the AMD-Vi specification apart from that it contains the DVA, which makes no sense when the fault is caused by an interrupt.

> (XEN) :bb:dd.f root @ 83b5f5 (3 levels) dfn=fffdf8000
> (XEN) L3[1f7] = 0 np

Attempting to print the page table walk for an Interrupt remapping fault is useless, we should likely avoid that when the I flag is set.

> I find it surprising this required "iommu=debug" to get this level of detail. This amount of output seems more appropriate for "verbose".

"verbose" should also print this information.

> I strongly prefer to provide snippets. There is a fair bit of output, I'm unsure which portion is most pertinent.

I've already voiced my concern that I think what you are doing is not fair. We are debugging this out of interest, and hence you refusing to provide all information just hampers our ability to debug, and makes us spend more time than required just thinking what snippets we need to ask for.

I will ask again, what's there in the Xen or the Linux dmesgs that you are so worried about leaking? Please provide a specific example.

Why do you mask the device SBDF in the above snippet? I would really like to understand what's so privacy relevant in a PCI SBDF number.

Does booting with `iommu=no-intremap` lead to any issues being reported?

Regards, Roger.
Re: Serious AMD-Vi(?) issue
On Thu, Apr 18, 2024 at 09:33:31PM -0700, Elliott Mitchell wrote:
>
> I suspect this is a case of there is some step which is missing from Xen's IOMMU handling. Perhaps something which Linux does during an early DMA setup stage, but the current Xen implementation does lazily? Alternatively some flag setting or missing step?
>
> I should be able to do another test approach in a few weeks, but I would love if something could be found sooner.

Turned out to be disturbingly easy to get the first entry when it happened. Didn't even need `dbench`, it simply showed once the OS was fully loaded. I did get some additional data points.

Appears this requires an AMD IOMMUv2. A test system with known functioning AMD IOMMUv1 didn't display the issue at all.

(XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr fffdf800 flags 0x8 I
(XEN) :bb:dd.f root @ 83b5f5 (3 levels) dfn=fffdf8000
(XEN) L3[1f7] = 0 np

I find it surprising this required "iommu=debug" to get this level of detail. This amount of output seems more appropriate for "verbose".

I strongly prefer to provide snippets. There is a fair bit of output, I'm unsure which portion is most pertinent.

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
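To help interpret the flags field above: going by the IO_PAGE_FAULT event-log format in the AMD IOMMU specification (the bit positions below are my reading of that format and of Xen's parser, so double-check them against the spec), flags 0x8 sets only the I bit, i.e. the faulting transaction was an interrupt request rather than a memory read/write, which also explains why the page-table walk printed above is meaningless. A throwaway decoder:

    #include <stdio.h>

    int main(void)
    {
        /* IO_PAGE_FAULT event flag names, lowest bit first; positions are
         * an assumption based on the AMD IOMMU spec's event-log format. */
        static const char *const names[] = {
            "GN", "NX", "US", "I", "PR", "RW", "PE", "RZ", "TR",
        };
        unsigned int flags = 0x8;   /* value from the log line above */
        unsigned int i;

        printf("flags %#x:", flags);
        for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
            if (flags & (1u << i))
                printf(" %s", names[i]);
        printf("\n");               /* prints: flags 0x8: I */
        return 0;
    }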
Re: Serious AMD-Vi(?) issue
On Thu, Apr 18, 2024 at 09:09:51AM +0200, Jan Beulich wrote: > On 18.04.2024 08:45, Elliott Mitchell wrote: > > On Wed, Apr 17, 2024 at 02:40:09PM +0200, Jan Beulich wrote: > >> On 11.04.2024 04:41, Elliott Mitchell wrote: > >>> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote: > On 27.03.2024 18:27, Elliott Mitchell wrote: > > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote: > >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote: > >>> > >>> In fact when running into trouble, the usual course of action would > >>> be to > >>> increase verbosity in both hypervisor and kernel, just to make sure no > >>> potentially relevant message is missed. > >> > >> More/better information might have been obtained if I'd been engaged > >> earlier. > > > > This is still true, things are in full mitigation mode and I'll be > > quite unhappy to go back with experiments at this point. > > Well, it very likely won't work without further experimenting by someone > able to observe the bad behavior. Recall we're on xen-devel here; it is > kind of expected that without clear (and practical) repro instructions > experimenting as well as info collection will remain with the reporter. > >>> > >>> After looking at the situation and considering the issues, I /may/ be > >>> able to setup for doing more testing. I guess I should confirm, which of > >>> those criteria do you think currently provided information fails at? > >>> > >>> AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) + > >>> dbench; seems a pretty specific setup. > >> > >> Indeed. If that's the only way to observe the issue, it suggests to me > >> that it'll need to be mainly you to do further testing, and perhaps even > >> debugging. Which isn't to say we're not available to help, but from all > >> I have gathered so far we're pretty much in the dark even as to which > >> component(s) may be to blame. As can still be seen at the top in reply > >> context, some suggestions were given as to obtaining possible further > >> information (or confirming the absence thereof). > > > > There may be other ways which haven't yet been found. > > > > I've been left with the suspicion AMD was to some degree sponsoring > > work to ensure Xen works on their hardware. Given the severity of this > > problem I would kind of expect them not want to gain a reputation for > > having data loss issues. Assuming a suitable pair of devices weren't > > already on-hand, I would kind of expect this to be well within their > > budget. > > You've got to talk to AMD then. Plus I assume it's clear to you that > even if the (presumably) necessary hardware was available, it still > would require respective setup, leaving open whether the issue then > could indeed be reproduced. I had a vain hope your links to AMD would allow you to say "we've got a major problem in need of addressing ASAP". I suspect it will reproduce readily. The sparsity of reports is likely due to few people using RAID1 for flash. Yet even though the initial surveys suggest flash has a rather lower initial failure rate, they're still pointing to rather non-zero failures in the first 5 years. > >> I'd also like to come back to the vague theory you did voice, in that > >> you're suspecting flushes to take too long. I continue to have trouble > >> with this, and I would therefore like to ask that you put this down in > >> more technical terms, making connections to actual actions taken by > >> software / hardware. > > > > I'm trying to figure out a pattern. 
> > > > Nominally all the devices are roughly on par (only a very cheap flash > > device will be unable to overwhelm SATA's bandwidth). Yet why did the > > Crucial SATA device /seem/ not to have the issue? Why did a Crucial NVMe > > device demonstrate the issue. > > > > My guess is the flash controllers Samsung uses may be able to start > > executing commands faster than the ones Crucial uses. Meanwhile NVMe > > is lower overhead and latency than SATA (SATA's overhead isn't an issue > > for actual disks). Perhaps the IOMMU is still flushing its TLB, or > > hasn't loaded the new tables. > > Which would be an IOMMU issue then, that software at best may be able to > work around. Yet even if uses of RAID1 with flash are uncommon or rare, I would expect this to have already manifested on Linux without Xen. In turn this would suggest Linux likely already has some sort of workaround. I suspect this is a case of there is some step which is missing from Xen's IOMMU handling. Perhaps something which Linux does during an early DMA setup stage, but the current Xen implementation does lazily? Alternatively some flag setting or missing step? I should be able to do another test approach in a few weeks, but I would love if something could be found sooner. -- (\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/) \BS (|
Re: Serious AMD-Vi(?) issue
On 18.04.2024 08:45, Elliott Mitchell wrote: > On Wed, Apr 17, 2024 at 02:40:09PM +0200, Jan Beulich wrote: >> On 11.04.2024 04:41, Elliott Mitchell wrote: >>> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote: On 27.03.2024 18:27, Elliott Mitchell wrote: > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote: >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote: >>> >>> In fact when running into trouble, the usual course of action would be >>> to >>> increase verbosity in both hypervisor and kernel, just to make sure no >>> potentially relevant message is missed. >> >> More/better information might have been obtained if I'd been engaged >> earlier. > > This is still true, things are in full mitigation mode and I'll be > quite unhappy to go back with experiments at this point. Well, it very likely won't work without further experimenting by someone able to observe the bad behavior. Recall we're on xen-devel here; it is kind of expected that without clear (and practical) repro instructions experimenting as well as info collection will remain with the reporter. >>> >>> After looking at the situation and considering the issues, I /may/ be >>> able to setup for doing more testing. I guess I should confirm, which of >>> those criteria do you think currently provided information fails at? >>> >>> AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) + >>> dbench; seems a pretty specific setup. >> >> Indeed. If that's the only way to observe the issue, it suggests to me >> that it'll need to be mainly you to do further testing, and perhaps even >> debugging. Which isn't to say we're not available to help, but from all >> I have gathered so far we're pretty much in the dark even as to which >> component(s) may be to blame. As can still be seen at the top in reply >> context, some suggestions were given as to obtaining possible further >> information (or confirming the absence thereof). > > There may be other ways which haven't yet been found. > > I've been left with the suspicion AMD was to some degree sponsoring > work to ensure Xen works on their hardware. Given the severity of this > problem I would kind of expect them not want to gain a reputation for > having data loss issues. Assuming a suitable pair of devices weren't > already on-hand, I would kind of expect this to be well within their > budget. You've got to talk to AMD then. Plus I assume it's clear to you that even if the (presumably) necessary hardware was available, it still would require respective setup, leaving open whether the issue then could indeed be reproduced. >> I'd also like to come back to the vague theory you did voice, in that >> you're suspecting flushes to take too long. I continue to have trouble >> with this, and I would therefore like to ask that you put this down in >> more technical terms, making connections to actual actions taken by >> software / hardware. > > I'm trying to figure out a pattern. > > Nominally all the devices are roughly on par (only a very cheap flash > device will be unable to overwhelm SATA's bandwidth). Yet why did the > Crucial SATA device /seem/ not to have the issue? Why did a Crucial NVMe > device demonstrate the issue. > > My guess is the flash controllers Samsung uses may be able to start > executing commands faster than the ones Crucial uses. Meanwhile NVMe > is lower overhead and latency than SATA (SATA's overhead isn't an issue > for actual disks). Perhaps the IOMMU is still flushing its TLB, or > hasn't loaded the new tables. 
Which would be an IOMMU issue then, that software at best may be able to work around. Jan > I suspect when the MD-RAID1 issues block requests to a pair of devices, > it likely sends the block to one device and then reuses most/all of the > structures for the second device. As a result the second request would > likely get a command to the device rather faster than the first request. > > Perhaps look into what structures the MD-RAID1 subsystem reuses are. > Then see whether doing early setup of those structures triggers the > issue? > > (okay I'm deep into speculation here, but this seems the simplest > explanation for what could be occuring) > >
Re: Serious AMD-Vi(?) issue
On Wed, Apr 17, 2024 at 02:40:09PM +0200, Jan Beulich wrote:
> On 11.04.2024 04:41, Elliott Mitchell wrote:
> > On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> >> On 27.03.2024 18:27, Elliott Mitchell wrote:
> >>> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> >>>> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> >>>>>
> >>>>> In fact when running into trouble, the usual course of action would be to increase verbosity in both hypervisor and kernel, just to make sure no potentially relevant message is missed.
> >>>>
> >>>> More/better information might have been obtained if I'd been engaged earlier.
> >>>
> >>> This is still true, things are in full mitigation mode and I'll be quite unhappy to go back with experiments at this point.
> >>
> >> Well, it very likely won't work without further experimenting by someone able to observe the bad behavior. Recall we're on xen-devel here; it is kind of expected that without clear (and practical) repro instructions experimenting as well as info collection will remain with the reporter.
> >
> > After looking at the situation and considering the issues, I /may/ be able to setup for doing more testing. I guess I should confirm: which of those criteria do you think the currently provided information fails at?
> >
> > AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) + dbench; seems a pretty specific setup.
>
> Indeed. If that's the only way to observe the issue, it suggests to me that it'll need to be mainly you to do further testing, and perhaps even debugging. Which isn't to say we're not available to help, but from all I have gathered so far we're pretty much in the dark even as to which component(s) may be to blame. As can still be seen at the top in reply context, some suggestions were given as to obtaining possible further information (or confirming the absence thereof).

There may be other ways which haven't yet been found.

I've been left with the suspicion AMD was to some degree sponsoring work to ensure Xen works on their hardware. Given the severity of this problem I would kind of expect them not to want to gain a reputation for having data loss issues. Assuming a suitable pair of devices weren't already on-hand, I would kind of expect this to be well within their budget.

> I'd also like to come back to the vague theory you did voice, in that you're suspecting flushes to take too long. I continue to have trouble with this, and I would therefore like to ask that you put this down in more technical terms, making connections to actual actions taken by software / hardware.

I'm trying to figure out a pattern.

Nominally all the devices are roughly on par (only a very cheap flash device will be unable to overwhelm SATA's bandwidth). Yet why did the Crucial SATA device /seem/ not to have the issue? Why did a Crucial NVMe device demonstrate the issue?

My guess is the flash controllers Samsung uses may be able to start executing commands faster than the ones Crucial uses. Meanwhile NVMe is lower overhead and latency than SATA (SATA's overhead isn't an issue for actual disks). Perhaps the IOMMU is still flushing its TLB, or hasn't loaded the new tables.

I suspect when the MD-RAID1 issues block requests to a pair of devices, it likely sends the block to one device and then reuses most/all of the structures for the second device. As a result the second request would likely get a command to the device rather faster than the first request.

Perhaps look into which structures the MD-RAID1 subsystem reuses. Then see whether doing early setup of those structures triggers the issue?

(okay I'm deep into speculation here, but this seems the simplest explanation for what could be occurring)

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On 11.04.2024 04:41, Elliott Mitchell wrote: > On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote: >> On 27.03.2024 18:27, Elliott Mitchell wrote: >>> On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote: On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote: > > In fact when running into trouble, the usual course of action would be to > increase verbosity in both hypervisor and kernel, just to make sure no > potentially relevant message is missed. More/better information might have been obtained if I'd been engaged earlier. >>> >>> This is still true, things are in full mitigation mode and I'll be >>> quite unhappy to go back with experiments at this point. >> >> Well, it very likely won't work without further experimenting by someone >> able to observe the bad behavior. Recall we're on xen-devel here; it is >> kind of expected that without clear (and practical) repro instructions >> experimenting as well as info collection will remain with the reporter. > > After looking at the situation and considering the issues, I /may/ be > able to setup for doing more testing. I guess I should confirm, which of > those criteria do you think currently provided information fails at? > > AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) + > dbench; seems a pretty specific setup. Indeed. If that's the only way to observe the issue, it suggests to me that it'll need to be mainly you to do further testing, and perhaps even debugging. Which isn't to say we're not available to help, but from all I have gathered so far we're pretty much in the dark even as to which component(s) may be to blame. As can still be seen at the top in reply context, some suggestions were given as to obtaining possible further information (or confirming the absence thereof). I'd also like to come back to the vague theory you did voice, in that you're suspecting flushes to take too long. I continue to have trouble with this, and I would therefore like to ask that you put this down in more technical terms, making connections to actual actions taken by software / hardware. Jan > I could see this being criticised as impractical if /new/ devices were > required, but the confirmed flash devices are several generations old. > Difficulty is cheaper candidate devices are being recycled for their > precious metal content, rather than resold as used. > >
Re: Serious AMD-Vi(?) issue
On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> On 27.03.2024 18:27, Elliott Mitchell wrote:
> > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> >>>
> >>> In fact when running into trouble, the usual course of action would be to increase verbosity in both hypervisor and kernel, just to make sure no potentially relevant message is missed.
> >>
> >> More/better information might have been obtained if I'd been engaged earlier.
> >
> > This is still true, things are in full mitigation mode and I'll be quite unhappy to go back with experiments at this point.
>
> Well, it very likely won't work without further experimenting by someone able to observe the bad behavior. Recall we're on xen-devel here; it is kind of expected that without clear (and practical) repro instructions experimenting as well as info collection will remain with the reporter.

After looking at the situation and considering the issues, I /may/ be able to setup for doing more testing. I guess I should confirm: which of those criteria do you think the currently provided information fails at?

AMD-IOMMU + Linux MD RAID1 + dual Samsung SATA (or various NVMe) + dbench; seems a pretty specific setup.

I could see this being criticised as impractical if /new/ devices were required, but the confirmed flash devices are several generations old. Difficulty is cheaper candidate devices are being recycled for their precious metal content, rather than resold as used.

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
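Spelled out, the recipe above amounts to roughly the following; a sketch of the reported setup (device names are examples), not a verified minimal reproducer:

    # dom0 on an AMD system with AMD-Vi active; two flash devices
    # (Samsung SATA on on-chipset ports, or NVMe):
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt
    dbench -D /mnt 8    # then watch `xl dmesg` for AMD-Vi: IO_PAGE_FAULT lines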
Re: Serious AMD-Vi(?) issue
On Thu, Mar 28, 2024 at 08:22:31AM -0700, Elliott Mitchell wrote:
> On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> > On 27.03.2024 18:27, Elliott Mitchell wrote:
> > > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> > >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> > >>>
> > >>> In fact when running into trouble, the usual course of action would be to increase verbosity in both hypervisor and kernel, just to make sure no potentially relevant message is missed.
> > >>
> > >> More/better information might have been obtained if I'd been engaged earlier.
> > >
> > > This is still true, things are in full mitigation mode and I'll be quite unhappy to go back with experiments at this point.
> >
> > Well, it very likely won't work without further experimenting by someone able to observe the bad behavior. Recall we're on xen-devel here; it is kind of expected that without clear (and practical) repro instructions experimenting as well as info collection will remain with the reporter.
>
> The first reporter: https://bugs.debian.org/988477 gave pretty specific details about their setups.
>
> While the exact border isn't very well defined, that seems enough to give a pretty good start. We don't know whether all Samsung SATA devices are affected, but most of the recent ones (<5 years old) are. This requires a pair of devices in software RAID1. Likely reproduces better with AMD AM4/AM5 processors, but almost certainly needs a fully operational IOMMU.
>
> (ASUS motherboards tend to have well setup IOMMUs)
>
> I would be surprised if you don't have all of the hardware on-hand. Only issue would be finding an appropriate pair of SATA devices, since those tend to remain in service. I would look for older devices which were removed from service due to being too small (128GB 840 PRO from the first report), or were pulled from service due to having had too many writes.

Come to think of it, one more possible ingredient to this. Similar to the first report, when the problem occurred, the SATA device was plugged into an on-chipset SATA port, not the extra controller this motherboard has. I don't know whether the performance difference of an off-main-chip controller would influence this, but it might.

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On Thu, Mar 28, 2024 at 07:25:02AM +0100, Jan Beulich wrote:
> On 27.03.2024 18:27, Elliott Mitchell wrote:
> > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> >>> On 22.03.2024 20:22, Elliott Mitchell wrote:
> >>>> On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote:
> >>>>>
> >>>>> I can see you've recently engaged with our community with some issues you'd like help with.
> >>>>> We love the fact you are participating in our project, however, our developers aren't able to help if you do not provide the specific details.
> >>>>
> >>>> Please point to specific details which have been omitted. Fairly little data has been provided as fairly little data is available. The primary observation is large numbers of:
> >>>>
> >>>> (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I
> >>>>
> >>>> Lines in Xen's ring buffer.
> >>>
> >>> Yet this is (part of) the problem: By providing only the messages that appear relevant to you, you imply that you know that no other message is in any way relevant. That's judgement you'd better leave to people actually trying to investigate. Unless of course you were proposing an actual code change, with suitable justification.
> >>
> >> Honestly, I forgot about the very small number of messages from the SATA subsystem. The question of whether the current mitigation actions are effective right now was a bigger issue. As such monitoring `xl dmesg` was a priority to looking at SATA messages which failed to reliably indicate status.
> >>
> >> I *thought* I would be able to retrieve those via other slow means, but a different and possibly overlapping issue has shown up. Unfortunately this means those are no longer retrievable. :-(
> >
> > With some persistence I was able to retrieve them. There are other pieces of software with worse UIs than Xen.
> >
> >>> In fact when running into trouble, the usual course of action would be to increase verbosity in both hypervisor and kernel, just to make sure no potentially relevant message is missed.
> >>
> >> More/better information might have been obtained if I'd been engaged earlier.
> >
> > This is still true, things are in full mitigation mode and I'll be quite unhappy to go back with experiments at this point.
>
> Well, it very likely won't work without further experimenting by someone able to observe the bad behavior. Recall we're on xen-devel here; it is kind of expected that without clear (and practical) repro instructions experimenting as well as info collection will remain with the reporter.

The first reporter: https://bugs.debian.org/988477 gave pretty specific details about their setups.

While the exact border isn't very well defined, that seems enough to give a pretty good start. We don't know whether all Samsung SATA devices are affected, but most of the recent ones (<5 years old) are. This requires a pair of devices in software RAID1. Likely reproduces better with AMD AM4/AM5 processors, but almost certainly needs a fully operational IOMMU.

(ASUS motherboards tend to have well setup IOMMUs)

I would be surprised if you don't have all of the hardware on-hand. Only issue would be finding an appropriate pair of SATA devices, since those tend to remain in service. I would look for older devices which were removed from service due to being too small (128GB 840 PRO from the first report), or were pulled from service due to having had too many writes.

> > I now see why I left those out. The messages from the SATA subsystem were from a kernel which a bad patch had leaked into an LTS branch. Looks like the SATA subsystem was significantly broken and I'm unsure whether any useful information could be retrieved. Notably there is quite a bit of noise from SATA devices not affected by this issue.
> >
> > Some of the messages /might/ be useful, but the amount of noise is quite high. Do messages from a broken kernel interest you?
>
> If this was a less vague (in terms of possible root causes) issue, I'd probably have answered "yes". But in the case here I'm afraid such might further confuse things rather than clarifying them.

Okay.

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On 27.03.2024 18:27, Elliott Mitchell wrote: > On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote: >> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote: >>> On 22.03.2024 20:22, Elliott Mitchell wrote: On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote: > > I can see you've recently engaged with our community with some issues > you'd > like help with. > We love the fact you are participating in our project, however, our > developers aren't able to help if you do not provide the specific details. Please point to specific details which have been omitted. Fairly little data has been provided as fairly little data is available. The primary observation is large numbers of: (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I Lines in Xen's ring buffer. >>> >>> Yet this is (part of) the problem: By providing only the messages that >>> appear >>> relevant to you, you imply that you know that no other message is in any way >>> relevant. That's judgement you'd better leave to people actually trying to >>> investigate. Unless of course you were proposing an actual code change, with >>> suitable justification. >> >> Honestly, I forgot about the very small number of messages from the SATA >> subsystem. The question of whether the current mitigation actions are >> effective right now was a bigger issue. As such monitoring `xl dmesg` >> was a priority to looking at SATA messages which failed to reliably >> indicate status. >> >> I *thought* I would be able to retrieve those via other slow means, but a >> different and possibly overlapping issue has shown up. Unfortunately >> this means those are no longer retrievable. :-( > > With some persistence I was able to retrieve them. There are other > pieces of software with worse UIs than Xen. > >>> In fact when running into trouble, the usual course of action would be to >>> increase verbosity in both hypervisor and kernel, just to make sure no >>> potentially relevant message is missed. >> >> More/better information might have been obtained if I'd been engaged >> earlier. > > This is still true, things are in full mitigation mode and I'll be > quite unhappy to go back with experiments at this point. Well, it very likely won't work without further experimenting by someone able to observe the bad behavior. Recall we're on xen-devel here; it is kind of expected that without clear (and practical) repro instructions experimenting as well as info collection will remain with the reporter. > I now see why I left those out. The messages from the SATA subsystem > were from a kernel which a bad patch had leaked into a LTS branch. Looks > like the SATA subsystem was significantly broken and I'm unsure whether > any useful information could be retrieved. Notably there is quite a bit > of noise from SATA devices not effected by this issue. > > Some of the messages /might/ be useful, but the amount of noise is quite > high. Do messages from a broken kernel interest you? If this was a less vague (in terms of possible root causes) issue, I'd probably have answered "yes". But in the case here I'm afraid such might further confuse things rather than clarifying them. Jan
Re: Serious AMD-Vi(?) issue
On Mon, Mar 25, 2024 at 02:43:44PM -0700, Elliott Mitchell wrote:
> On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> > On 22.03.2024 20:22, Elliott Mitchell wrote:
> > > On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote:
> > >>
> > >> I can see you've recently engaged with our community with some issues you'd like help with.
> > >> We love the fact you are participating in our project, however, our developers aren't able to help if you do not provide the specific details.
> > >
> > > Please point to specific details which have been omitted. Fairly little data has been provided as fairly little data is available. The primary observation is large numbers of:
> > >
> > > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I
> > >
> > > Lines in Xen's ring buffer.
> >
> > Yet this is (part of) the problem: By providing only the messages that appear relevant to you, you imply that you know that no other message is in any way relevant. That's judgement you'd better leave to people actually trying to investigate. Unless of course you were proposing an actual code change, with suitable justification.
>
> Honestly, I forgot about the very small number of messages from the SATA subsystem. The question of whether the current mitigation actions are effective right now was a bigger issue. As such monitoring `xl dmesg` was a priority to looking at SATA messages which failed to reliably indicate status.
>
> I *thought* I would be able to retrieve those via other slow means, but a different and possibly overlapping issue has shown up. Unfortunately this means those are no longer retrievable. :-(

With some persistence I was able to retrieve them. There are other pieces of software with worse UIs than Xen.

> > In fact when running into trouble, the usual course of action would be to increase verbosity in both hypervisor and kernel, just to make sure no potentially relevant message is missed.
>
> More/better information might have been obtained if I'd been engaged earlier.

This is still true, things are in full mitigation mode and I'll be quite unhappy to go back with experiments at this point.

I now see why I left those out. The messages from the SATA subsystem were from a kernel which a bad patch had leaked into an LTS branch. Looks like the SATA subsystem was significantly broken and I'm unsure whether any useful information could be retrieved. Notably there is quite a bit of noise from SATA devices not affected by this issue.

Some of the messages /might/ be useful, but the amount of noise is quite high. Do messages from a broken kernel interest you?

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On Mon, Mar 25, 2024 at 08:55:56AM +0100, Jan Beulich wrote:
> On 22.03.2024 20:22, Elliott Mitchell wrote:
> > On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote:
> >>
> >> I can see you've recently engaged with our community with some issues you'd like help with.
> >> We love the fact you are participating in our project, however, our developers aren't able to help if you do not provide the specific details.
> >
> > Please point to specific details which have been omitted. Fairly little data has been provided as fairly little data is available. The primary observation is large numbers of:
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I
> >
> > Lines in Xen's ring buffer.
>
> Yet this is (part of) the problem: By providing only the messages that appear relevant to you, you imply that you know that no other message is in any way relevant. That's judgement you'd better leave to people actually trying to investigate. Unless of course you were proposing an actual code change, with suitable justification.

Honestly, I forgot about the very small number of messages from the SATA subsystem. The question of whether the current mitigation actions are effective right now was a bigger issue. As such monitoring `xl dmesg` was a priority to looking at SATA messages which failed to reliably indicate status.

I *thought* I would be able to retrieve those via other slow means, but a different and possibly overlapping issue has shown up. Unfortunately this means those are no longer retrievable. :-(

> In fact when running into trouble, the usual course of action would be to increase verbosity in both hypervisor and kernel, just to make sure no potentially relevant message is missed.

More/better information might have been obtained if I'd been engaged earlier.

> > The most overt sign was telling the Linux kernel to scan for inconsistencies and the kernel finding some. The domain didn't otherwise appear to notice trouble.
> >
> > This is from memory, it would take some time to discover whether any messages were missed. Present mitigation action is inhibiting the messages, but the trouble is certainly still lurking.
>
> Iirc you were considering whether any of this might be a timing issue. Yet beyond voicing that suspicion, you didn't provide any technical details as to why you think so. Such technical details would include taking into account how IOMMU mappings and associated IOMMU TLB flushing are carried out. Right now, to me at least, your speculation in this regard fails basic sanity checking. Therefore the scenario that you're thinking of would need better describing, imo.

True. Mostly I'm analyzing the known information and considering what the patterns suggest.

Presently I'm aware of two reports (Imre Szőllősi and mine). Both of these feature AMD processor machines. Could be people with AMD processors are less trustful of flash storage, or it could be an AMD-only IOMMU issue. Ideally someone would test and confirm there is no issue with Linux software RAID1 on flash on an Intel machine.

Both reports feature two flash storage devices being run through Linux MD RAID1. Could be the MD RAID1 subsystem is abusing the DMA interface in some fashion. While Imre Szőllősi reported this not occurring with a single device, the report does not explicitly state whether that was a degenerate RAID1 versus non-RAID. I'm unaware of any testing with 3x devices in RAID1.

Both reports feature Samsung SATA flash devices. My case also includes a Crucial NVMe device. My case also features a Crucial SATA flash device for which the problem did NOT occur.

So the question becomes, why did the problem not occur for this Crucial SATA device? According to the specifications, the Crucial SATA device is roughly on par with the Samsung SATA devices in terms of read/write speeds. The NVMe device's specifications are massively better. What comes to mind is the Crucial SATA device might have higher latency before executing commands. Specifications don't mention command execution latency, so it isn't possible to know whether this is the issue.

Yes, latency/timing is speculation. It does seem a good fit for the pattern though.

This could be a Linux MD RAID1 bug or a Xen bug. Unfortunately data loss is a very serious type of bug, so I'm highly reluctant to let go of mitigations without hope for progress.

--
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi(?) issue
On 22.03.2024 20:22, Elliott Mitchell wrote: > On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote: >> >> I can see you've recently engaged with our community with some issues you'd >> like help with. >> We love the fact you are participating in our project, however, our >> developers aren't able to help if you do not provide the specific details. > > Please point to specific details which have been omitted. Fairly little > data has been provided as fairly little data is available. The primary > observation is large numbers of: > > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I > > Lines in Xen's ring buffer. Yet this is (part of) the problem: By providing only the messages that appear relevant to you, you imply that you know that no other message is in any way relevant. That's judgement you'd better leave to people actually trying to investigate. Unless of course you were proposing an actual code change, with suitable justification. In fact when running into trouble, the usual course of action would be to increase verbosity in both hypervisor and kernel, just to make sure no potentially relevant message is missed. > I recall spotting 3 messages from Linux's > SATA driver (which weren't saved due to other causes being suspected), > which would likely be associated with hundreds or thousands of the above > log messages. I never observed any messages from the NVMe subsystem > during that phase. > > The most overt sign was telling the Linux kernel to scan for > inconsistencies and the kernel finding some. The domain didn't otherwise > appear to notice trouble. > > This is from memory, it would take some time to discover whether any > messages were missed. Present mitigation action is inhibiting the > messages, but the trouble is certainly still lurking. Iirc you were considering whether any of this might be a timing issue. Yet beyond voicing that suspicion, you didn't provide any technical details as to why you think so. Such technical details would include taking into account how IOMMU mappings and associated IOMMU TLB flushing are carried out. Right now, to me at least, your speculation in this regard fails basic sanity checking. Therefore the scenario that you're thinking of would need better describing, imo. Jan
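For concreteness, the verbosity increase Jan describes would normally mean boot options along these lines (option names taken from the Xen and Linux command-line documentation; adjust to taste):

    # Xen command line:
    loglvl=all guest_loglvl=all iommu=verbose,debug

    # dom0 Linux command line:
    loglevel=8 ignore_loglevel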
Re: Serious AMD-Vi(?) issue
On Fri, Mar 22, 2024 at 04:41:45PM +, Kelly Choi wrote: > > I can see you've recently engaged with our community with some issues you'd > like help with. > We love the fact you are participating in our project, however, our > developers aren't able to help if you do not provide the specific details. Please point to specific details which have been omitted. Fairly little data has been provided as fairly little data is available. The primary observation is large numbers of: (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I Lines in Xen's ring buffer. I recall spotting 3 messages from Linux's SATA driver (which weren't saved due to other causes being suspected), which would likely be associated with hundreds or thousands of the above log messages. I never observed any messages from the NVMe subsystem during that phase. The most overt sign was telling the Linux kernel to scan for inconsistencies and the kernel finding some. The domain didn't otherwise appear to notice trouble. This is from memory, it would take some time to discover whether any messages were missed. Present mitigation action is inhibiting the messages, but the trouble is certainly still lurking. -- (\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/) \BS (| ehem+sig...@m5p.com PGP 87145445 |) / \_CS\ | _ -O #include O- _ | / _/ 8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
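The inconsistency scan mentioned above is presumably MD's consistency checker, which is driven through sysfs roughly as follows (paths per the kernel's md documentation; substitute the actual array name):

    echo check > /sys/block/md0/md/sync_action
    # once the check has finished:
    cat /sys/block/md0/md/mismatch_cnt   # non-zero means the mirrors disagree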
Re: Serious AMD-Vi(?) issue
Hi Elliott,

I hope you're well. I'm Kelly, the community manager at the Xen
Project.

I can see you've recently engaged with our community with some issues
you'd like help with. We love the fact you are participating in our
project, however, our developers aren't able to help if you do not
provide the specific details.

As an open-source project, our developers are committed to helping and
contributing as much as possible. We welcome you to continue
participating, however, please refrain from requesting help without
providing the necessary details, as this takes up a lot of our
community's time to analyze what is possible and what assistance you
might need.

I'd recommend providing logs or specific information so the community
can help you further.

If you'd like to chat more, let me know.

Many thanks,
Kelly Choi

Community Manager
Xen Project

On Mon, Mar 18, 2024 at 7:42 PM Elliott Mitchell wrote:
> I sent a ping on this about 2 weeks ago. Since the plan is to move
> x86 IOMMU under general x86, the other x86 maintainers should be
> aware of this:
>
> On Mon, Feb 12, 2024 at 03:23:00PM -0800, Elliott Mitchell wrote:
> > On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > > Apparently this was first noticed with 4.14, but more recently
> > > I've been able to reproduce the issue:
> > >
> > > https://bugs.debian.org/988477
> > >
> > > The original observation features MD-RAID1 using a pair of
> > > Samsung SATA-attached flash devices. The main line shows up in
> > > `xl dmesg`:
> > >
> > > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I
> > >
> > > Where the device points at the SATA controller. I've ended up
> > > reproducing this with some noticeable differences.
> > >
> > > A major goal of RAID is to have different devices fail at
> > > different times. Hence my initial run had a Samsung device plus
> > > a device from another reputable flash manufacturer.
> > >
> > > I initially noticed this due to messages in domain 0's dmesg
> > > about errors from the SATA device. It wasn't until rather later
> > > that I noticed the IOMMU warnings in Xen's dmesg (perhaps
> > > post-domain 0 messages should be duplicated into domain 0's
> > > dmesg?).
> > >
> > > All of the failures consistently pointed at the Samsung device.
> > > Due to the expectation it would fail first (lower quality
> > > offering with lesser guarantees), I proceeded to replace it with
> > > an NVMe device.
> > >
> > > With some monitoring I discovered the NVMe device was now
> > > triggering IOMMU errors, though not nearly as many as the
> > > Samsung SATA device did. As such, AMD-Vi plus MD-RAID1 appears
> > > to be exposing some sort of IOMMU issue with Xen.
> > >
> > >
> > > All I can do is offer speculation about the underlying cause.
> > > There does seem to be a pattern of higher-performance flash
> > > storage devices being more severely affected.
> > >
> > > I was speculating about the issue being the MD-RAID1 driver
> > > abusing Linux's DMA infrastructure in some fashion.
> > >
> > > Upon further consideration, I'm wondering if this is perhaps a
> > > latency issue. I imagine there is some sort of flush after the
> > > IOMMU tables are modified. Perhaps the Samsung SATA (and all
> > > NVMe) devices were trying to execute commands before the reload
> > > of the IOMMU tables was complete.
> >
> > Ping!
> >
> > The recipe seems to be Linux MD RAID1, plus Samsung SATA or any
> > NVMe.
> >
> > To make it explicit: when I tried Crucial SATA + Samsung SATA, the
> > IOMMU errors matched the Samsung SATA (a number of times the SATA
> > driver complained).
> >
> > As stated, I'm speculating that lower-latency devices are starting
> > to execute commands before the IOMMU tables have finished
> > reloading. When this was originally implemented, fast flash
> > devices were rare.
>
> Both reproductions of this issue I'm aware of were on systems with
> AMD processors. I'm doubtful that distrust of flash storage hardware
> is unique to owners of AMD systems. As a result, while this /could/
> also affect Intel systems, the lack of reports /suggests/ otherwise.
>
> I've noticed two things when glancing at the original report. LVM is
> not in use there, so that doesn't seem to affect the problem. The
> Phenom II which the original reporter found not to have the issue
> might have lacked proper BIOS support, hence the IOMMU not being
> functional.
>
> This being a latency issue is *speculation*, but would explain the
> pattern of devices being affected.
>
> This is rather serious as it can lead to data loss (phew! glad I
> just barely dodged this outcome).
>
>
> -- 
> (\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
> \BS (| ehem+sig...@m5p.com PGP 87145445 |) /
> \_CS\ | _ -O #include O- _ | / _/
> 8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
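As for verifying whether the IOMMU was actually functional on a given
box (the Phenom II question above), Xen's boot messages are one place
to look. A rough sketch, assuming typical AMD-Vi initialization lines
in the ring buffer:

    # AMD-Vi initialization messages appear in Xen's ring buffer at
    # boot; their absence suggests the IOMMU was disabled or lacked
    # BIOS/ACPI support.
    xl dmesg | grep -i 'AMD-Vi'

    # The ACPI IVRS table, which advertises the AMD IOMMU to software,
    # can also be checked for from dom0.
    ls /sys/firmware/acpi/tables/ | grep -i ivrs

A missing IVRS table would be consistent with the BIOS never exposing
the IOMMU, which would explain a machine not reproducing the issue.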
Re: Serious AMD-Vi(?) issue
I sent a ping on this about 2 weeks ago. Since the plan is to move x86
IOMMU under general x86, the other x86 maintainers should be aware of
this:

On Mon, Feb 12, 2024 at 03:23:00PM -0800, Elliott Mitchell wrote:
> On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > Apparently this was first noticed with 4.14, but more recently
> > I've been able to reproduce the issue:
> >
> > https://bugs.debian.org/988477
> >
> > The original observation features MD-RAID1 using a pair of Samsung
> > SATA-attached flash devices. The main line shows up in `xl dmesg`:
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I
> >
> > Where the device points at the SATA controller. I've ended up
> > reproducing this with some noticeable differences.
> >
> > A major goal of RAID is to have different devices fail at
> > different times. Hence my initial run had a Samsung device plus a
> > device from another reputable flash manufacturer.
> >
> > I initially noticed this due to messages in domain 0's dmesg about
> > errors from the SATA device. It wasn't until rather later that I
> > noticed the IOMMU warnings in Xen's dmesg (perhaps post-domain 0
> > messages should be duplicated into domain 0's dmesg?).
> >
> > All of the failures consistently pointed at the Samsung device.
> > Due to the expectation it would fail first (lower quality offering
> > with lesser guarantees), I proceeded to replace it with an NVMe
> > device.
> >
> > With some monitoring I discovered the NVMe device was now
> > triggering IOMMU errors, though not nearly as many as the Samsung
> > SATA device did. As such, AMD-Vi plus MD-RAID1 appears to be
> > exposing some sort of IOMMU issue with Xen.
> >
> >
> > All I can do is offer speculation about the underlying cause.
> > There does seem to be a pattern of higher-performance flash
> > storage devices being more severely affected.
> >
> > I was speculating about the issue being the MD-RAID1 driver
> > abusing Linux's DMA infrastructure in some fashion.
> >
> > Upon further consideration, I'm wondering if this is perhaps a
> > latency issue. I imagine there is some sort of flush after the
> > IOMMU tables are modified. Perhaps the Samsung SATA (and all NVMe)
> > devices were trying to execute commands before the reload of the
> > IOMMU tables was complete.
>
> Ping!
>
> The recipe seems to be Linux MD RAID1, plus Samsung SATA or any
> NVMe.
>
> To make it explicit: when I tried Crucial SATA + Samsung SATA, the
> IOMMU errors matched the Samsung SATA (a number of times the SATA
> driver complained).
>
> As stated, I'm speculating that lower-latency devices are starting
> to execute commands before the IOMMU tables have finished reloading.
> When this was originally implemented, fast flash devices were rare.

Both reproductions of this issue I'm aware of were on systems with AMD
processors. I'm doubtful that distrust of flash storage hardware is
unique to owners of AMD systems. As a result, while this /could/ also
affect Intel systems, the lack of reports /suggests/ otherwise.

I've noticed two things when glancing at the original report. LVM is
not in use there, so that doesn't seem to affect the problem. The
Phenom II which the original reporter found not to have the issue
might have lacked proper BIOS support, hence the IOMMU not being
functional.

This being a latency issue is *speculation*, but would explain the
pattern of devices being affected.

This is rather serious as it can lead to data loss (phew! glad I just
barely dodged this outcome).
-- 
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
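One way to quantify the per-device pattern described above is to tally
the faulting SBDFs straight out of Xen's ring buffer. A rough sketch,
assuming the message format quoted in this thread (the awk field
position may need adjusting for other Xen versions):

    # Count AMD-Vi page faults per device. In
    #   (XEN) AMD-Vi: IO_PAGE_FAULT: <sbdf> d0 addr ... flags ...
    # the SBDF is the 4th whitespace-separated field, so per-device
    # fault counts fall out of a simple sort | uniq -c.
    xl dmesg | grep 'AMD-Vi: IO_PAGE_FAULT' \
        | awk '{ print $4 }' | sort | uniq -c | sort -rn

Run on a system exhibiting the problem, this would show whether the
SATA controller or the NVMe device accounts for the bulk of the
faults.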
Re: Serious AMD-Vi issue
On Mon, Feb 12, 2024 at 03:23:00PM -0800, Elliott Mitchell wrote:
> On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> > Apparently this was first noticed with 4.14, but more recently
> > I've been able to reproduce the issue:
> >
> > https://bugs.debian.org/988477
> >
> > The original observation features MD-RAID1 using a pair of Samsung
> > SATA-attached flash devices. The main line shows up in `xl dmesg`:
> >
> > (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I
> >
> > Where the device points at the SATA controller. I've ended up
> > reproducing this with some noticeable differences.
> >
> > A major goal of RAID is to have different devices fail at
> > different times. Hence my initial run had a Samsung device plus a
> > device from another reputable flash manufacturer.
> >
> > I initially noticed this due to messages in domain 0's dmesg about
> > errors from the SATA device. It wasn't until rather later that I
> > noticed the IOMMU warnings in Xen's dmesg (perhaps post-domain 0
> > messages should be duplicated into domain 0's dmesg?).
> >
> > All of the failures consistently pointed at the Samsung device.
> > Due to the expectation it would fail first (lower quality offering
> > with lesser guarantees), I proceeded to replace it with an NVMe
> > device.
> >
> > With some monitoring I discovered the NVMe device was now
> > triggering IOMMU errors, though not nearly as many as the Samsung
> > SATA device did. As such, AMD-Vi plus MD-RAID1 appears to be
> > exposing some sort of IOMMU issue with Xen.
> >
> >
> > All I can do is offer speculation about the underlying cause.
> > There does seem to be a pattern of higher-performance flash
> > storage devices being more severely affected.
> >
> > I was speculating about the issue being the MD-RAID1 driver
> > abusing Linux's DMA infrastructure in some fashion.
> >
> > Upon further consideration, I'm wondering if this is perhaps a
> > latency issue. I imagine there is some sort of flush after the
> > IOMMU tables are modified. Perhaps the Samsung SATA (and all NVMe)
> > devices were trying to execute commands before the reload of the
> > IOMMU tables was complete.
>
> Ping!
>
> The recipe seems to be Linux MD RAID1, plus Samsung SATA or any
> NVMe.
>
> To make it explicit: when I tried Crucial SATA + Samsung SATA, the
> IOMMU errors matched the Samsung SATA (a number of times the SATA
> driver complained).
>
> As stated, I'm speculating that lower-latency devices are starting
> to execute commands before the IOMMU tables have finished reloading.
> When this was originally implemented, fast flash devices were rare.

I guess I'm lucky I ended up with some slightly higher-latency
hardware.

This is a very serious issue as data loss can occur. AMD needs to fund
their Xen engineers more, otherwise AMD hardware may soon no longer be
viable with Xen.

-- 
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Re: Serious AMD-Vi issue
On Thu, Jan 25, 2024 at 12:24:53PM -0800, Elliott Mitchell wrote:
> Apparently this was first noticed with 4.14, but more recently I've
> been able to reproduce the issue:
>
> https://bugs.debian.org/988477
>
> The original observation features MD-RAID1 using a pair of Samsung
> SATA-attached flash devices. The main line shows up in `xl dmesg`:
>
> (XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I
>
> Where the device points at the SATA controller. I've ended up
> reproducing this with some noticeable differences.
>
> A major goal of RAID is to have different devices fail at different
> times. Hence my initial run had a Samsung device plus a device from
> another reputable flash manufacturer.
>
> I initially noticed this due to messages in domain 0's dmesg about
> errors from the SATA device. It wasn't until rather later that I
> noticed the IOMMU warnings in Xen's dmesg (perhaps post-domain 0
> messages should be duplicated into domain 0's dmesg?).
>
> All of the failures consistently pointed at the Samsung device. Due
> to the expectation it would fail first (lower quality offering with
> lesser guarantees), I proceeded to replace it with an NVMe device.
>
> With some monitoring I discovered the NVMe device was now triggering
> IOMMU errors, though not nearly as many as the Samsung SATA device
> did. As such, AMD-Vi plus MD-RAID1 appears to be exposing some sort
> of IOMMU issue with Xen.
>
>
> All I can do is offer speculation about the underlying cause. There
> does seem to be a pattern of higher-performance flash storage
> devices being more severely affected.
>
> I was speculating about the issue being the MD-RAID1 driver abusing
> Linux's DMA infrastructure in some fashion.
>
> Upon further consideration, I'm wondering if this is perhaps a
> latency issue. I imagine there is some sort of flush after the IOMMU
> tables are modified. Perhaps the Samsung SATA (and all NVMe) devices
> were trying to execute commands before the reload of the IOMMU
> tables was complete.

Ping!

The recipe seems to be Linux MD RAID1, plus Samsung SATA or any NVMe.

To make it explicit: when I tried Crucial SATA + Samsung SATA, the
IOMMU errors matched the Samsung SATA (a number of times the SATA
driver complained).

As stated, I'm speculating that lower-latency devices are starting to
execute commands before the IOMMU tables have finished reloading. When
this was originally implemented, fast flash devices were rare.

-- 
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
Serious AMD-Vi issue
Apparently this was first noticed with 4.14, but more recently I've
been able to reproduce the issue:

https://bugs.debian.org/988477

The original observation features MD-RAID1 using a pair of Samsung
SATA-attached flash devices. The main line shows up in `xl dmesg`:

(XEN) AMD-Vi: IO_PAGE_FAULT: :bb:dd.f d0 addr ff???000 flags 0x8 I

Where the device points at the SATA controller. I've ended up
reproducing this with some noticeable differences.

A major goal of RAID is to have different devices fail at different
times. Hence my initial run had a Samsung device plus a device from
another reputable flash manufacturer.

I initially noticed this due to messages in domain 0's dmesg about
errors from the SATA device. It wasn't until rather later that I
noticed the IOMMU warnings in Xen's dmesg (perhaps post-domain 0
messages should be duplicated into domain 0's dmesg?).

All of the failures consistently pointed at the Samsung device. Due to
the expectation it would fail first (lower quality offering with
lesser guarantees), I proceeded to replace it with an NVMe device.

With some monitoring I discovered the NVMe device was now triggering
IOMMU errors, though not nearly as many as the Samsung SATA device
did. As such, AMD-Vi plus MD-RAID1 appears to be exposing some sort of
IOMMU issue with Xen.


All I can do is offer speculation about the underlying cause. There
does seem to be a pattern of higher-performance flash storage devices
being more severely affected.

I was speculating about the issue being the MD-RAID1 driver abusing
Linux's DMA infrastructure in some fashion.

Upon further consideration, I'm wondering if this is perhaps a latency
issue. I imagine there is some sort of flush after the IOMMU tables
are modified. Perhaps the Samsung SATA (and all NVMe) devices were
trying to execute commands before the reload of the IOMMU tables was
complete.

-- 
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
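The reported configuration boils down to: Xen dom0 on an AMD system
with the IOMMU enabled, plus a Linux MD-RAID1 mirror with at least one
fast flash member. A minimal reproduction sketch; the device names are
hypothetical and the commands are destructive, so adjust before
running:

    # Hypothetical members: one SATA flash partition, one NVMe
    # partition. mdadm --create destroys existing data on both.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/nvme0n1p1

    # Generate sustained write traffic through the mirror.
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt
    dd if=/dev/zero of=/mnt/fill bs=1M count=8192 conv=fsync

    # Then watch Xen's ring buffer for the faults described above.
    xl dmesg | grep 'AMD-Vi: IO_PAGE_FAULT'

Whether the faults appear presumably depends on the latency
characteristics of the particular devices, per the speculation in this
thread.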