Re: [qubes-users] Qubes/Xen doesn't comply with IOMMU grouping rules for PCI passthru

2020-01-06 Thread Claudia
December 30, 2019 7:12 PM, "qubes123"  wrote:

>>> The improper grouping is probably somewhere in AGESA, which is provided
>>> to the manufacturers by AMD. It could be because of hardware related
>>> limitations, which again are supplied by AMD. Sometimes vendors take
>>> liberties (cost cutting measures) with both and break functionality, as
>>> their primary/sole concern is that Windows boots. This can especially be
>>> the case with consumer class machines such as Ryzen. Agree it would be
>>> nice if Xen handled this failure mode more gracefully. Not sure there is
>>> much Qubes can do here, though. On the other hand, my older AMD
>>> (pre-Ryzen) consumer laptop running Coreboot has correct groupings.
> 
> I could be wrong, but aren't these PCI assignments and hierarchies coded 
> within the ACPI DSDT table
> in BIOS?

I guess in some cases they are, and in other cases they're in hardware. For 
example if you have two devices behind the same physical PCI bridge, 
communication between those two devices might be sent across the bridge without 
ever making it to the IOMMU. I don't think there's any software approach that 
could do anything about that kind of situation.

In my case, the USB controllers and most of the other devices are functions of 
the same PCI device, 03:00.{0,1,2,3,4,6}. Therefore most likely any 
communication between them is happening within the device and not going to the 
IOMMU (00:00.2). However I don't know if this is because of the physical 
structure, or if it could be changed by modifying ACPI tables. I guess the only 
way to know would be to try it.
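
To make "functions of the same PCI device" concrete, here is a minimal sketch 
that splits bus:device.function (BDF) addresses and groups functions by their 
parent device. The helper names are made up for illustration, and the sample 
addresses are the ones that appear in the kernel log later in this message. 
Sharing a device number only says the functions live in the same package; it 
does not by itself prove they bypass the IOMMU.

```python
import re
from collections import defaultdict

# Matches "[domain:]bus:device.function", e.g. "03:00.4" or "0000:03:00.4".
BDF_RE = re.compile(
    r"^(?:(?P<domain>[0-9a-fA-F]{4}):)?"
    r"(?P<bus>[0-9a-fA-F]{2}):(?P<dev>[0-9a-fA-F]{2})\.(?P<fn>[0-7])$"
)


def parse_bdf(text):
    """Return (domain, bus, device, function) as integers."""
    m = BDF_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a PCI address: {text!r}")
    # A missing domain is treated as 0 (assumption: single-segment machine).
    domain = int(m.group("domain") or "0", 16)
    return (domain, int(m.group("bus"), 16),
            int(m.group("dev"), 16), int(m.group("fn")))


def functions_by_device(addresses):
    """Map (domain, bus, device) to the sorted list of its function numbers."""
    groups = defaultdict(list)
    for addr in addresses:
        domain, bus, dev, fn = parse_bdf(addr)
        groups[(domain, bus, dev)].append(fn)
    return {key: sorted(fns) for key, fns in groups.items()}


if __name__ == "__main__":
    # The two USB controllers from the log, plus the IOMMU at 00:00.2.
    print(functions_by_device(["03:00.3", "03:00.4", "00:00.2"]))
```

Feeding it the output of "lspci" (first column only) gives a quick view of 
which controllers are multi-function siblings.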

> I seem to remember that in UEFI the ACPI tables could be overridden somehow...
> 
> Or - since kernel 5.3.x(?) you can supply certain ACPI tables (as files, 
> stored in initrd) to the
> kernel using commandline parameters* (some additional acpi manipulations are 
> needed to extract the
> current dsdt to see what is in there and make changes in aml...)

I understand the part about uploading the ACPI tables via initrd, but I would 
have no idea how to extract them, what they mean, or what changes to make to 
them.
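
On the extraction part, a rough sketch, assuming a Linux system that exposes 
the raw tables under /sys/firmware/acpi/tables (normally readable by root 
only): every ACPI table starts with the same 36-byte header, so the signature, 
length, and checksum can be read without any ACPI tooling. This only inspects 
the header; turning the AML body into something editable needs a disassembler 
and is not covered here. The function names are illustrative.

```python
import struct

# Common ACPI table header, 36 bytes, little-endian:
# signature, length, revision, checksum, OEM ID, OEM table ID,
# OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")


def parse_acpi_header(table):
    """Decode the header of a raw ACPI table and verify its checksum."""
    if len(table) < ACPI_HEADER.size:
        raise ValueError("table shorter than an ACPI header")
    (sig, length, revision, _checksum, oem_id, oem_table_id,
     oem_revision, _creator_id, _creator_revision) = ACPI_HEADER.unpack_from(table)
    return {
        "signature": sig.decode("ascii", "replace"),
        "length": length,
        "revision": revision,
        "oem_id": oem_id.decode("ascii", "replace").rstrip("\x00 "),
        "oem_table_id": oem_table_id.decode("ascii", "replace").rstrip("\x00 "),
        "oem_revision": oem_revision,
        # All bytes of a valid table sum to zero modulo 256.
        "checksum_ok": len(table) >= length and sum(table[:length]) % 256 == 0,
    }


def read_table(path="/sys/firmware/acpi/tables/DSDT"):
    """Read a raw table from sysfs (path is an assumption; needs root)."""
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    print(parse_acpi_header(read_table()))
```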

Also, I haven't figured out if ACPI override actually changes the behavior of 
PCI devices, or if it just spoofs the information provided to the 
kernel/hypervisor (which would make it unnecessary/ineffective on Xen). 
According to the OSDev wiki: "AML interpreter can build up a database of all 
devices within a system and the properties and functions they support (in 
reference to configuration and power management)."

> Or - before all - you can simply try to boot the kernel with cmdline: 
> acpi=nocrs (or off) and let
> the kernel "enroll" the PCI devices. Maybe worth a try - just one reboot...

I did some tests by playing sound in a VM and then binding pciback to the USB 
controllers to simulate passthru. None of them were successful. At the time of 
the bind command, audio stopped, and the screen would freeze unless nomodeset 
was on. I did the testing in the 4.1 pre-release.

I tested four combinations of parameters: (none), acpi=off, acpi=nocrs, and 
acpi=nocrs pci=nocrs, each with and without Xen. In the non-Xen tests, 
iommu_groups was the same every time. In the Xen tests, xl dmesg and xl info 
were identical every time. In all tests, lspci and lspci -t were identical. 
Kernel logs and lspci -kvvnn had some differences each time, but nothing that 
looked important. If I should look for anything specific please let me know. 
Note, the data was collected right after I logged in, before I performed the 
passthru. Not one of my better decisions. 

However the only thing I recall seeing in the logs at the time of the passthru 
was this, with acpi=nocrs pci=nocrs:
xhci_hcd :03:00.4: Host halt failed, -110
xhci_hcd :03:00.4: Host controller not halted, aborting reset.
xhci_hcd :03:00.4: USB bus 3 deregistered
pciback :03:00.3: seizing device
xen: registering gsi 55 triggering 0 polarity 1
Already setup the GSI :55
pciback :03:00.4: seizing device
xen: registering gsi 52 triggering 0 polarity 1
Already setup the GSI :52

Could it be a PCI reset related problem?
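
One cheap data point on the reset question: on Linux, error -110 is ETIMEDOUT, 
and the kernel only creates a "reset" attribute in a PCI function's sysfs 
directory when it knows some way to reset that function on its own. A minimal 
sketch for checking that, with an illustrative function name and the domain 
prefix 0000 assumed; a missing attribute would be a hint, not proof, that 
per-function reset is part of the problem.

```python
import os


def has_reset_method(bdf, sysfs_root="/sys"):
    """True if the kernel exposes a `reset` attribute for this PCI function."""
    return os.path.exists(
        os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "reset")
    )


if __name__ == "__main__":
    # Assumed full addresses of the two USB controllers from the log above.
    for bdf in ("0000:03:00.3", "0000:03:00.4"):
        print(bdf, "reset attribute present:", has_reset_method(bdf))
```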



Finally, a possible workaround I thought of is putting sys-usb into PV mode, 
since PV passthru doesn't use the IOMMU. It wouldn't be quite as secure as HVM, 
as it wouldn't prevent a DMA attack, but it would still be better than having 
USB in dom0. However it looks like Qubes 4.1 isn't going to support any kind of 
passthru for PVs, so I'll ultimately end up back where I started. I don't 
currently have sys-usb installed, but I might try it when I have some time.

-- 
You received this message because you are subscribed to the Google Groups 
"qubes-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to qubes-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/qubes-users/b88a85b62f2e84c57ec5fbf87f14b9ec%40disroot.org.


Re: [qubes-users] Qubes/Xen doesn't comply with IOMMU grouping rules for PCI passthru

2019-12-30 Thread qubes123
> > The improper grouping is probably somewhere in AGESA, which is provided
> > to the manufacturers by AMD. It could be because of hardware related
> > limitations, which again are supplied by AMD. Sometimes vendors take
> > liberties (cost cutting measures) with both and break functionality, as
> > their primary/sole concern is that Windows boots. This can especially be
> > the case with consumer class machines such as Ryzen. Agree it would be
> > nice if Xen handled this failure mode more gracefully. Not sure there is
> > much Qubes can do here, though. On the other hand, my older AMD
> > (pre-Ryzen) consumer laptop running Coreboot has correct groupings.
>

I could be wrong, but aren't these PCI assignments and hierarchies coded 
within the ACPI DSDT table in BIOS?
I seem to remember that in UEFI the ACPI tables could be overridden somehow...
Or - since kernel 5.3.x(?) you can supply certain ACPI tables (as files, 
stored in initrd) to the kernel using commandline parameters* (some 
additional acpi manipulations are needed to extract the current dsdt to see 
what is in there and make changes in aml...)

Or - before all - you can simply try to boot the kernel with cmdline: 
acpi=nocrs (or off) and let the kernel "enroll" the PCI devices. Maybe 
worth a try - just one reboot...

*:https://www.kernel.org/doc/html/latest/admin-guide/acpi/initrd_table_override.html



Re: [qubes-users] Qubes/Xen doesn't comply with IOMMU grouping rules for PCI passthru

2019-12-29 Thread brendan . hoar
On Sunday, December 29, 2019 at 7:25:49 PM UTC-5, Claudia wrote:
>
> Ha. Now that you mention it, I do remember laptops used to have PCIe 
> slots. But I think those days are pretty much over.
>
> On a side note, I remembered I saw some error about the IOMMU in the 
> kernel logs at some point. I just ignored it at the time because I was 
> dealing with bigger problems. I'm going to start a new thread for that. 
>

Yup, many early-mid 2010s Lenovo ThinkPads have an external ExpressCard 
slot: X230, T520, W520, T530, W530, T540, W540...

1 entire lane of PCIe 2.0 (3.2 Gbit/s ... ~300MB/s) bliss! 

But more seriously, people actually used to use these for external gaming 
GPUs way back when.

For Qubes, on some of these models, the slot is very helpful: you can add 
an additional USB 3.0 root hub for external devices that can be mapped 
independently, even if you can only get about half-throughput from it.

B

PS - Also, some internal laptop slots for wifi/etc. are mPCIe...but using 
them for other purposes generally means leaving the laptop disassembled, 
which means...well...why use a laptop?



Re: [qubes-users] Qubes/Xen doesn't comply with IOMMU grouping rules for PCI passthru

2019-12-29 Thread Claudia
December 29, 2019 2:19 PM, "awokd' via qubes-users" 
 wrote:

> Claudia:
> 
>> December 26, 2019 12:59 PM, "awokd' via qubes-users" 
>>  wrote:
>> 
>>> Claudia:
>>> 
>>> TLDR; check bottom of https://community.amd.com/thread/241650, looks
>>> like there was a recently released related update. Not sure if
>>> applicable to your situation.
>> 
>> Thanks for the link! I'm not sure if it affects me or not. I did install a 
>> Dell BIOS update dated
>> March 2019, so it sounds like that could have contained this Agesa update. 
>> So downgrading might fix
>> the grouping issue, but this update also contained an "urgent" security 
>> update which I'd have to
>> look into before downgrading.
> 
> I'd assumed AGESA version numbers were from a common code base, but
> apparently not. The one mentioned in that thread was released around
> Oct. 2019, but may not be applicable to your hardware. They also don't
> specifically reference USB controller grouping in that thread, so it
> might do nothing for you even if it is applicable.

The fixed version appears to be for 3000-series processors. At least, when I 
was googling around I didn't see any 2000's. I have the 2500U. And besides 
that, I don't think there's any way for me to install it without Dell releasing 
a firmware update, is there? The fix was from October, but the original/broken 
Agesa update was from July or earlier. So I thought maybe the March firmware 
update broke it, but the first thing I did was update firmware so I don't know 
if grouping was any different before.
 
>> I sort of blame Xen for not enforcing IOMMU grouping, especially considering 
>> that it hides that
>> info from the OS. KVM does enforce IOMMU grouping rules, so I don't see why 
>> Xen wouldn't. Xen
>> leaves it up to the user software to be careful what it passes where, but 
>> that's kind of hard when
>> you don't have /sys/kernel/iommu_groups for a hint.
> 
> I am a bit fuzzy here too. It seems like if ACS is working correctly,
> you can get better granularity within IOMMU groups. It would be
> disappointing if it does not on recently released hardware. In your

(TL;DR - I don't think ACS matters in Xen)

I do recall seeing some info about ACS. I don't know how to check if it's 
supported/working. But I don't think it matters. When I say IOMMU grouping I'm 
actually talking about two different things. One is the grouping "policy" (so 
to speak), that shows up in /sys/kernel/iommu_groups, and the grouping 
structure is determined using ACS (PCIe Access Control Services). This provides 
an interface so 
that software like KVM can prevent you from accidentally separating 
inter-dependent devices into different VMs, which can cause memory corruption 
or security holes or whatever. If ACS is not supported or not working, the 
kernel has to assume that basically all devices on the same bus(?) are 
interdependent, and then you end up with crappy grouping. However, unlike KVM, 
Xen does not, I repeat, does not enforce this policy. Xen leaves it up to the 
user to know what they're doing.
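
As a sketch of what a "policy" check could look like when booted without Xen 
(where /sys/kernel/iommu_groups is visible): read the groups from sysfs, then 
report any group that a planned assignment would split. The function names are 
made up, and this is only an illustration of the idea, not something Qubes or 
Xen does.

```python
import os


def read_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # e.g. IOMMU disabled, or not exposed to this kernel
    for name in os.listdir(root):
        devices_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(devices_dir):
            groups[int(name)] = sorted(os.listdir(devices_dir))
    return groups


def groups_split_by(groups, assigned):
    """Groups containing some, but not all, of their devices in `assigned`."""
    wanted = set(assigned)
    return {
        gid: devs
        for gid, devs in groups.items()
        if wanted & set(devs) and not set(devs) <= wanted
    }


if __name__ == "__main__":
    groups = read_iommu_groups()
    # Assumed full addresses of the two USB controllers discussed here.
    for gid, devs in groups_split_by(groups, ["0000:03:00.3", "0000:03:00.4"]).items():
        print(f"group {gid} would be split: {devs}")
```

An empty result means the assignment respects the kernel's grouping; it says 
nothing about the "de facto" grouping described next.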

Hence this leads us to the second sense of IOMMU grouping: the "de facto" 
grouping (so to speak), which means some set of devices really actually truly 
are interdependent, by virtue of directly sharing untranslated memory addresses 
for example, and will cause a crash if separated. Case in point: KVM users 
sometimes install an unofficial "ACS override patch" that lies about the 
"policy" part, in order to separate devices that normally belong to the same 
group, and sometimes it will work mostly fine as long as the devices in 
question are not "de facto" interdependent. (Patches are also added to the 
official kernel for specific devices when the vendor certifies that they can 
safely be separated.) There is no such thing for Xen, because Xen doesn't 
attempt to enforce the grouping policy in the first place. So ACS should be a 
non-issue in Xen.
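
On checking whether ACS is supported: as far as I know, "lspci -vvv" run as 
root prints an "Access Control Services" capability with ACSCap/ACSCtl lines 
for devices that implement it. The sketch below parses saved lspci output and 
collects those flags per device. The parsing is based on that assumed output 
layout, and the function name is made up, so treat it as a starting point.

```python
import re

# Device header lines start in column 0, e.g. "00:01.1 PCI bridge: ...".
DEVICE_RE = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
)
# Flags look like "SrcValid+" (set) or "EgressCtrl-" (clear).
FLAG_RE = re.compile(r"(\w+)([+-])")


def acs_status(lspci_text):
    """Map device address -> {"ACSCap": {...}, "ACSCtl": {...}} where listed."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        stripped = line.strip()
        for key in ("ACSCap", "ACSCtl"):
            if current and stripped.startswith(key + ":"):
                rest = stripped[len(key) + 1:]
                flags = {name: sign == "+" for name, sign in FLAG_RE.findall(rest)}
                result.setdefault(current, {})[key] = flags
    return result


if __name__ == "__main__":
    import sys
    # Usage (assumed): lspci -vvv > lspci.txt, then pass the file on stdin.
    for bdf, info in acs_status(sys.stdin.read()).items():
        print(bdf, info)
```

Devices that never show up in the result either lack the capability or were 
listed without root privileges.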

So in my case, I'm pretty sure that most of my devices are de facto 
interdependent, because separating the USB controllers from the rest of the 
group causes an instant crash. The de facto groups probably can be influenced 
by firmware/microcode in addition to the hardware.

That's my understanding anyway. I could be wrong.

> case, the USB controller appears as a different function of the same PCI
> device, which could be the case from a hardware perspective. This is
> even worse for a passthrough scenario than IOMMU grouping. There is a
> Realtek controller that often comes up on the list that makes people
> passthrough the SD card controller to their sys-net along with WIFI for
> the same reason.

That's something I haven't been able to figure out: are functions of the same 
device always inherently in the same de facto group? Or does the BDF structure 
have little to do with grouping? It seems likely that functions of the same 
device would communicate directly instead of via the bus/IOMMU. But it's also 
conceivable that some devices would intentionally send data 

Re: [qubes-users] Qubes/Xen doesn't comply with IOMMU grouping rules for PCI passthru

2019-12-29 Thread 'awokd' via qubes-users
Claudia:
> December 26, 2019 12:59 PM, "awokd' via qubes-users" 
>  wrote:
> 
>> Claudia:
>>
>> TLDR; check bottom of https://community.amd.com/thread/241650, looks
>> like there was a recently released related update. Not sure if
>> applicable to your situation.
> 
> Thanks for the link! I'm not sure if it affects me or not. I did install a 
> Dell BIOS update dated March 2019, so it sounds like that could have 
> contained this Agesa update. So downgrading might fix the grouping issue, but 
> this update also contained an "urgent" security update which I'd have to look 
> into before downgrading.

I'd assumed AGESA version numbers were from a common code base, but
apparently not. The one mentioned in that thread was released around
Oct. 2019, but may not be applicable to your hardware. They also don't
specifically reference USB controller grouping in that thread, so it
might do nothing for you even if it is applicable.

> I sort of blame Xen for not enforcing IOMMU grouping, especially considering 
> that it hides that
> info from the OS. KVM does enforce IOMMU grouping rules, so I don't see why 
> Xen wouldn't. Xen
> leaves it up to the user software to be careful what it passes where, but 
> that's kind of hard when
> you don't have /sys/kernel/iommu_groups for a hint.

I am a bit fuzzy here too. It seems like if ACS is working correctly,
you can get better granularity within IOMMU groups. It would be
disappointing if it does not on recently released hardware. In your
case, the USB controller appears as a different function of the same PCI
device, which could be the case from a hardware perspective. This is
even worse for a passthrough scenario than IOMMU grouping. There is a
Realtek controller that often comes up on the list that makes people
passthrough the SD card controller to their sys-net along with WIFI for
the same reason.

> This is a laptop, so I can't add any cards.

This didn't used to be mutually exclusive. Thanks, Apple.

-- 
- don't top post
Mailing list etiquette:
- trim quoted reply to only relevant portions
- when possible, copy and paste text instead of screenshots



Re: [qubes-users] Qubes/Xen doesn't comply with IOMMU grouping rules for PCI passthru

2019-12-27 Thread Claudia
December 26, 2019 12:59 PM, "awokd' via qubes-users" 
 wrote:

> Claudia:
> 
> TLDR; check bottom of https://community.amd.com/thread/241650, looks
> like there was a recently released related update. Not sure if
> applicable to your situation.

Thanks for the link! I'm not sure if it affects me or not. I did install a Dell 
BIOS update dated March 2019, so it sounds like that could have contained this 
Agesa update. So downgrading might fix the grouping issue, but this update also 
contained an "urgent" security update which I'd have to look into before 
downgrading.

>> This caused a very sneaky problem on my machine. My USB controllers are in 
>> the same group as my
>> GPU, sound card, and SATA controller. So when sys-usb (or 
>> rd.qubes.hide_all_usb) takes over those
>> two USB controllers, everything stops working. [4] It was quite difficult to 
>> trace. It would have
>> been much easier to diagnose if grouping was enforced somewhere. I would 
>> much rather have an error
>> in my logs about being unable to assign USB controllers, than have my whole 
>> screen freeze up with
>> no indication why. (I got lucky that it just crashed; if something 
>> interferes with your SATA
>> controller's address space it can cause disk corruption. [5])
>> 
>> I don't really know who's at fault here. Qubes? Xen? AMD? Dell?
> 
> The improper grouping is probably somewhere in AGESA, which is provided
> to the manufacturers by AMD. It could be because of hardware related
> limitations, which again are supplied by AMD. Sometimes vendors take
> liberties (cost cutting measures) with both and break functionality, as
> their primary/sole concern is that Windows boots. This can especially be
> the case with consumer class machines such as Ryzen. Agree it would be
> nice if Xen handled this failure mode more gracefully. Not sure there is
> much Qubes can do here, though. On the other hand, my older AMD
> (pre-Ryzen) consumer laptop running Coreboot has correct groupings.

Yeah, my impression is the firmware can influence IOMMU grouping to an extent, 
within the bounds of the physical hardware. If this problem was indeed caused 
by an update then I assume it's (at least partly) firmware-related. According 
to that thread, a fix has been released for some boards/CPUs, "ComboPI", but 
the only feedback I can find on it is for Ryzen 3000-series which doesn't help 
me. Also I don't even know if or when my machine will receive a BIOS update 
with this Agesa fix.

I sort of blame Xen for not enforcing IOMMU grouping, especially considering 
that it hides that info from the OS. KVM does enforce IOMMU grouping rules, so 
I don't see why Xen wouldn't. Xen leaves it up to the user software to be 
careful what it passes where, but that's kind of hard when you don't have 
/sys/kernel/iommu_groups for a hint.

>> Intel systems
>> seem to just have better grouping usually (or, are less likely to crash 
>> when grouping rules are
>> violated). [6]
> 
> I think that is overbroad. There are plenty of Intel systems with broken
> passthrough. iommu=no-igfx itself is a workaround for broken passthrough
> of Intel graphics. There are also plenty of AMD systems with properly
> implemented passthrough.

Very possible. I don't have experience with a lot of other hardware, so I'm 
just going by what I've heard. It definitely seems to be a Ryzen problem at 
least, maybe not AMD in general. I just seemed to come across a lot more 
complaints about AMD than Intel, though. It would be nice if the HCL contained 
more detailed information about the IOMMU such as grouping, so we could get a 
better idea. At any rate, that's the least of my worries.

TBH I don't really understand what no-igfx does, so I don't know if an 
AMD-equivalent option would help in this case or not. It's just worth noting 
that it's an Intel-specific fix which could improve Intel compatibility 
compared to AMD generally.

>> Thoughts? Is there anything Qubes can do to avoid splitting up IOMMU 
>> groups? Is there anything
>> Qubes *should* do? Should Qubes attempt to guess the IOMMU groups before 
>> taking over devices?
>> Should the USB Qube option be disabled on AMD systems (you can still 
>> manually set up sys-usb of
>> course)? Should we just blame Xen for not enforcing IOMMU groups in the 
>> first place?
> 
> Ultimately, it's a hardware/firmware issue. Threadripper and Epyc based
> AMD systems ought to be more thoroughly vetted to support passthrough.
> My suggestions are to disable automatic IOMMU grouping in your UEFI
> configuration, if possible. Otherwise, try a newer firmware version with
> updated AGESA code and see if it helps, or possibly add a card with
> additional USB controllers as that should appear in its own group.

There is no way to enable or disable automatic IOMMU grouping in my BIOS. The 
only options are IOMMU enabled or disabled, as far as I can tell. There is no 
newer firmware for this machine at this time. Not sure about microcode, though. 
This 

Re: [qubes-users] Qubes/Xen doesn't comply with IOMMU grouping rules for PCI passthru

2019-12-26 Thread 'awokd' via qubes-users
Claudia:

TLDR; check bottom of https://community.amd.com/thread/241650, looks
like there was a recently released related update. Not sure if
applicable to your situation.

> This caused a very sneaky problem on my machine. My USB controllers are in 
> the same group as my
> GPU, sound card, and SATA controller. So when sys-usb (or 
> rd.qubes.hide_all_usb) takes over those
> two USB controllers, everything stops working. [4] It was quite difficult to 
> trace. It would have
> been much easier to diagnose if grouping was enforced somewhere. I would much 
> rather have an error
> in my logs about being unable to assign USB controllers, than have my whole 
> screen freeze up with
> no indication why. (I got lucky that it just crashed; if something interferes 
> with your SATA 
> controller's address space it can cause disk corruption. [5])
> 
> I don't really know who's at fault here. Qubes? Xen? AMD? Dell?

The improper grouping is probably somewhere in AGESA, which is provided
to the manufacturers by AMD. It could be because of hardware related
limitations, which again are supplied by AMD. Sometimes vendors take
liberties (cost cutting measures) with both and break functionality, as
their primary/sole concern is that Windows boots. This can especially be
the case with consumer class machines such as Ryzen. Agree it would be
nice if Xen handled this failure mode more gracefully. Not sure there is
much Qubes can do here, though. On the other hand, my older AMD
(pre-Ryzen) consumer laptop running Coreboot has correct groupings.

> Intel systems
> seem to just have better grouping usually (or, are less likely to crash 
> when grouping rules are
> violated). [6]

I think that is overbroad. There are plenty of Intel systems with broken
passthrough. iommu=no-igfx itself is a workaround for broken passthrough
of Intel graphics. There are also plenty of AMD systems with properly
implemented passthrough.

> Thoughts? Is there anything Qubes can do to avoid splitting up IOMMU 
> groups? Is there anything
> Qubes *should* do? Should Qubes attempt to guess the IOMMU groups before 
> taking over devices?
> Should the USB Qube option be disabled on AMD systems (you can still manually 
> set up sys-usb of
> course)? Should we just blame Xen for not enforcing IOMMU groups in the first 
> place? 

Ultimately, it's a hardware/firmware issue. Threadripper and Epyc based
AMD systems ought to be more thoroughly vetted to support passthrough.
My suggestions are to disable automatic IOMMU grouping in your UEFI
configuration, if possible. Otherwise, try a newer firmware version with
updated AGESA code and see if it helps, or possibly add a card with
additional USB controllers as that should appear in its own group.
