Re: [PATCH v2 0/4] tools/xenstore: add some new features to the documentation

2022-07-18 Thread Jan Beulich
On 18.07.2022 18:28, Julien Grall wrote:
> On 18/07/2022 17:12, Jan Beulich wrote:
>> On 27.05.2022 09:24, Juergen Gross wrote:
>>> In the past some shortcomings have been spotted in the Xenstore
>>> interface, which should be repaired. In detail, those are:
>>>
>>> - Using driver domains for a large number of domains needs per-domain
>>>   Xenstore quota [1]. The feedback was rather slim (one reply), but it
>>>   preferred a new set of wire commands.
>>>
>>> - XSA-349 [2] has shown that the current definition of watches is not
>>>   optimal, as it will trigger lots of events when a single one would
>>>   suffice: for detecting new backend devices the backends in the Linux
>>>   kernel register a watch on e.g. "/local/domain/0/backend", which
>>>   will fire for ANY sub-node written below this node (on a test
>>>   machine this added up to 91 watch events for only 3 devices).
>>>   This can be limited dramatically by extending the XS_WATCH command
>>>   to take another optional parameter specifying the depth of
>>>   subdirectories to be considered for sending watch events ("0" would
>>>   trigger a watch event only if the watched node itself is written).
>>>
>>> - Adding new features like the above might make migration of guests
>>>   between hosts with different Xenstore variants harder, so it should
>>>   be possible to set the available feature set per domain. For socket
>>>   connections it should be possible to read the available features.
>>>
>>> - The special watches @introduceDomain and @releaseDomain are rather
>>>   cumbersome to use, as they only tell you that SOME domain has been
>>>   introduced/released. Any consumer of those watches needs to scan
>>>   all domains on the host in order to find out the domid, causing
>>>   significant pressure on the dominfo hypercall (imagine a system
>>>   with 1000 domains running and one domain dying - there will be more
>>>   than 1000 watch events triggered and 1000 xl daemons will try to
>>>   find out whether "their" domain has died). Those watches should be
>>>   enhanced to optionally be specific to a single domain and to let the
>>>   event carry the related domid.
>>>
>>> As some of those extensions will need to be considered in the Xenstore
>>> migration stream, they should be defined in one go (in fact the 4th one
>>> wouldn't need that, but it can easily be connected to the 2nd one).
>>> As such extensions need to be flagged in the "features" field of the ring
>>> page anyway, it is fine to implement them independently.
>>>
>>> Add the documentation of the new commands/features.
>>>
>>> [1]: https://lists.xen.org/archives/html/xen-devel/2020-06/msg00291.html
>>> [2]: http://xenbits.xen.org/xsa/advisory-349.html
>>>
>>> Changes in V2:
>>> - added new patch 1
>>> - remove feature bits for dom0-only features
>>> - get-features without domid returns Xenstore supported features
>>> - get/set-quota without domid for global quota access
>>>
>>> Juergen Gross (4):
>>>tools/xenstore: modify feature bit specification in xenstore-ring.txt
>>>tools/xenstore: add documentation for new set/get-feature commands
>>>tools/xenstore: add documentation for new set/get-quota commands
>>>tools/xenstore: add documentation for extended watch command
>>
>> Hmm, looks like I committed v1 of this series, not noticing the v2 _and_
>> seeing there had been R-b tags with no other follow-up (leaving aside the v2)
>> for a long time. Please advise if I should revert the commits. I'm sorry
>> for the confusion. (I also wonder why the R-b tags weren't carried over to v2.)
> 
> Patch #1 is a new patch. Patches #2, #3 and #4 have been reworded and the 
> overall interaction is different. So I don't think the Reviewed-by tags 
> should have been carried over.
> 
> I had some concerns in v1 which were addressed in v2. I have reviewed v2 
> a while ago. From my perspective, patch #1, #3, #4 are ready to go. 
> Patch #2 needs a respin and we also need to clarify the integration with 
> migration/live-update.
> 
> As you have already committed it, I would be OK if this is addressed in a 
> follow-up series. But this *must* be addressed by the time 4.17 is released, 
> because otherwise we will commit ourselves to a broken interface. @Henry, 
> please add this to the blocker list.

If you hadn't answered, I would have reverted these right away this
morning, in particular to remove the (now wrong) feature bit part
(patches 2 and 3 have dropped their feature bit additions in v2).
If you nevertheless think an incremental update is going to be okay,
I'll leave things alone. But personally I think this mistake of mine
had better be corrected immediately.

Jan
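
For readers not steeped in the Xenstore watch semantics discussed above, here is
a minimal sketch of how a consumer uses the existing @releaseDomain special watch
today with the libxenstore API (xs_open/xs_watch/xs_read_watch, declared in
xenstore.h; xs.h on older trees). It is purely illustrative and not part of the
series; it only makes concrete why every such consumer currently has to rescan
all domains, which is exactly the pressure the proposed domid-carrying watch
events would remove.

    /* Minimal sketch: consume @releaseDomain with the existing libxenstore API.
     * Error handling is kept minimal; the "rescan all domains" step is a stub. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    int main(void)
    {
        struct xs_handle *xsh = xs_open(0);

        if ( !xsh )
        {
            perror("xs_open");
            return 1;
        }

        /* Today this only says "some domain has been released"... */
        if ( !xs_watch(xsh, "@releaseDomain", "release-token") )
        {
            fprintf(stderr, "xs_watch failed\n");
            xs_close(xsh);
            return 1;
        }

        for ( ;; )
        {
            unsigned int num;
            char **event = xs_read_watch(xsh, &num);   /* blocks for an event */

            if ( !event )
                break;

            printf("watch fired: path=%s token=%s\n",
                   event[XS_WATCH_PATH], event[XS_WATCH_TOKEN]);
            /*
             * ...so every consumer now has to iterate over all domains (e.g.
             * via the dominfo hypercall) to find out which one went away.
             * With the proposed extension the event itself would carry the
             * domid, and a consumer could register for a single domain only.
             */
            free(event);
        }

        xs_close(xsh);
        return 0;
    }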



[linux-linus test] 171674: regressions - FAIL

2022-07-18 Thread osstest service owner
flight 171674 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171674/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-arm64   6 xen-build   fail REGR. vs. 171664

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-examine  1 build-check(1)   blocked  n/a
 build-arm64-libvirt   1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit1   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit2   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-seattle   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-thunderx  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-vhd   1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 171664
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 171664
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 171664
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 171664
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 171664
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 171664
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 171664
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 171664
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass

version targeted for testing:
 linux                80e19f34c2887a8881084b7bb7480e9544d56b91
baseline version:
 linux                ff6992735ade75aae3e35d16b17da1008d753d28

Last test of basis   171664  2022-07-17 21:11:10 Z1 days
Testing same since   171674  2022-07-18 19:40:02 Z0 days1 attempts


People who touched revisions under test:
  Andy Shevchenko 
  Bartosz Golaszewski 
  Dipen Patel 
  Linus Torvalds 
  Thierry Reding 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  fail
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  blocked 
 build-armhf-libvirt  

Re: [RFC PATCH 0/2] Yocto Gitlab CI support

2022-07-18 Thread Christopher Clark
On Thu, Jul 14, 2022 at 3:10 AM Bertrand Marquis
 wrote:
>
> This patch series is a first attempt to check if we could use Yocto in
> GitLab CI to build and run Xen on QEMU for arm, arm64 and x86.

Hi Bertrand, thanks for posting this.

I'm still making my way through it, and should be able to speak more
to the OE/Yocto aspects than the Xen automation integration, but at
first pass I think that this is work in the right direction.
A few quick early points:
- The build-yocto.sh script is clear to understand, which is helpful.
- The layers that you have selected to include in the build are good.
Might be worth considering using openembedded-core, which is poky's
upstream, but I think either is a valid choice.
- listing the layers one-per-line in the script might make it
easier to patch in additional layers downstream, if needed
- The target image of 'xen-image-minimal' is the right start; it would
be nice to be able to pass that as an input from the dockerfile to
allow for using this with other images.
- Possibly worth mentioning somewhere in the series description that
this introduces coverage for x86-64 but not 32-bit x86 guests - it's
the right choice given that this is just booting to a dom0.

Christopher

> The first patch creates a container with all elements required to
> build Yocto, a checkout of the required Yocto layers, and a helper
> script to build and run Xen on QEMU with Yocto.
>
> The second patch creates containers with a first build of Yocto already
> done, so that subsequent builds with those containers only rebuild what
> has changed and take the rest from the cache.
>
> This is mainly for discussion and sharing, as there are still some
> issues/problems to solve:
> - building the qemu* containers can take several hours depending on the
>   network bandwidth and computing power of the machine where those are
>   created
> - produced containers containing the cache have a size between 8 and
>   12GB depending on the architecture. We might need to store the build
>   cache somewhere else to reduce the size. If we choose to have one
>   single image, the needed size is around 20GB and we need up to 40GB
>   during the build, which is why I split them.
> - during the build and run, we use a bit more than 20GB of disk, which is
>   over the allowed size in GitLab
>
> Once all problems are solved, this can be used to build and run dom0 on
> QEMU with a modified Xen on the 3 architectures in less than 10 minutes.
>
> Bertrand Marquis (2):
>   automation: Add elements for Yocto test and run
>   automation: Add yocto containers with cache
>
>  automation/build/Makefile |   2 +
>  automation/build/yocto/build-yocto.sh | 322 ++
>  .../build/yocto/kirkstone-qemuarm.dockerfile  |  28 ++
>  .../yocto/kirkstone-qemuarm64.dockerfile  |  28 ++
>  .../yocto/kirkstone-qemux86-64.dockerfile |  28 ++
>  automation/build/yocto/kirkstone.dockerfile   |  98 ++
>  6 files changed, 506 insertions(+)
>  create mode 100755 automation/build/yocto/build-yocto.sh
>  create mode 100644 automation/build/yocto/kirkstone-qemuarm.dockerfile
>  create mode 100644 automation/build/yocto/kirkstone-qemuarm64.dockerfile
>  create mode 100644 automation/build/yocto/kirkstone-qemux86-64.dockerfile
>  create mode 100644 automation/build/yocto/kirkstone.dockerfile
>
> --
> 2.25.1
>
>



[ovmf test] 171677: all pass - PUSHED

2022-07-18 Thread osstest service owner
flight 171677 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171677/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf e21b2039112eeba9a93fdd7b70329a07a79c8f0e
baseline version:
 ovmf 586b4a104bfeae277ce16687f852c4f26c4a3b73

Last test of basis   171675  2022-07-18 21:11:53 Z0 days
Testing same since   171677  2022-07-19 01:11:58 Z0 days1 attempts


People who touched revisions under test:
  Guo Dong 
  James Lu 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   586b4a104b..e21b203911  e21b2039112eeba9a93fdd7b70329a07a79c8f0e -> 
xen-tested-master



[xen-unstable test] 171672: tolerable FAIL - PUSHED

2022-07-18 Thread osstest service owner
flight 171672 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171672/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 171666
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 171666
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 171666
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 171666
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 171666
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 171666
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 171666
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 171666
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 171666
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 171666
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 171666
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 171666
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  0af91dc0326cba12795e0b8fa8f665776e2a9e13
baseline version:
 xen  

RE: [PATCH v2 0/4] tools/xenstore: add some new features to the documentation

2022-07-18 Thread Henry Wang
Hi Julien,

> -Original Message-
> From: Julien Grall 
> Subject: Re: [PATCH v2 0/4] tools/xenstore: add some new features to the
> documentation
> 
> Hi Jan,
> 
> On 18/07/2022 17:12, Jan Beulich wrote:
> > On 27.05.2022 09:24, Juergen Gross wrote:
> >>
> >> Changes in V2:
> >> - added new patch 1
> >> - remove feature bits for dom0-only features
> >> - get-features without domid returns Xenstore supported features
> >> - get/set-quota without domid for global quota access
> >>
> >> Juergen Gross (4):
> >>tools/xenstore: modify feature bit specification in xenstore-ring.txt
> >>tools/xenstore: add documentation for new set/get-feature commands
> >>tools/xenstore: add documentation for new set/get-quota commands
> >>tools/xenstore: add documentation for extended watch command
> >
> > Hmm, looks like I did commit v1 of this series, not noticing the v2 _and_
> > seeing there had been R-b with no other follow-up (leaving aside the v2)
> > in a long time. Please advise if I should revert the commits. I'm sorry
> > for the confusion. (I also wonder why the R-b weren't carried over to v2.)
> 
> Patch #1 is a new patch. Patches #2, #3 and #4 have been reworded and the
> overall interaction is different. So I don't think the Reviewed-by tags
> should have been carried over.
> 
> I had some concerns in v1 which were addressed in v2. I have reviewed v2
> a while ago. From my perspective, patch #1, #3, #4 are ready to go.
> Patch #2 needs a respin and we also need to clarify the integration with
> migration/live-update.
> 
> As you have already committed it, I would be OK if this is addressed in a
> follow-up series. But this *must* be addressed by the time 4.17 is released,
> because otherwise we will commit ourselves to a broken interface. @Henry,
> please add this to the blocker list.

Thank you very much for this information. I've added this to the blocker list.
I will keep that in mind and send (proper) reminders during the timeline
of the 4.17 release process.

Kind regards,
Henry 

> 
> Cheers,
> 
> --
> Julien Grall


[ovmf test] 171675: all pass - PUSHED

2022-07-18 Thread osstest service owner
flight 171675 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171675/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 586b4a104bfeae277ce16687f852c4f26c4a3b73
baseline version:
 ovmf 792ebb6374f2b92e2c4b84f5b8d151a129ed81cc

Last test of basis   171671  2022-07-18 17:10:27 Z0 days
Testing same since   171675  2022-07-18 21:11:53 Z0 days1 attempts


People who touched revisions under test:
  Nate DeSimone 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   792ebb6374..586b4a104b  586b4a104bfeae277ce16687f852c4f26c4a3b73 -> 
xen-tested-master



Ryzen 6000 (Mobile)

2022-07-18 Thread Dylanger Daly
Hi All,

I'm having issues getting QubesOS running on my Lenovo Yoga Slim 7 ProX (AMD 
Ryzen 6800HS).

Firstly, in order to boot the device at all, I'm required to add 
`dom0_max_vcpus=1 dom0_vcpus_pin` to dom0's command line. This is similar to what 
I had to do previously with the Ryzen 4000 series - 
https://xen.markmail.org/search/?q=Ryzen#query:Ryzen+page:1+mid:f3hel4yj25qilabv+state:results
 - however, without these options added dom0 never 
fully boots into Fedora.

The other interesting issue I'm having is that upon booting any VM, just a normal 
simple VM without any PCI devices attached, it will successfully start, about 1 
second will pass, then the entire device will hang and reset. It's virtually 
impossible to get any logs at all out of the device when it's in this state.

FYI: QubesOS uses Xen 4.14

Thanks all

Re: Panic on CPU 0: FATAL TRAP: vec 7, #NM[0000]

2022-07-18 Thread Andrew Cooper
On 18/07/2022 22:31, ch...@dalessio.org wrote:
> I am trying to run Xen-4.16.1-4.fc36 on Fedora 36 on a brand new Lenovo
> ThinkStation p620, but I keep getting the following error booting the
> Xen kernel.
> 
> Panic on CPU 0:
> FATAL TRAP: vec 7, #NM[0000]
>
> Version info:
> Name        : xen
> Version     : 4.16.1
> Release     : 4.fc36

So https://koji.fedoraproject.org/koji/buildinfo?buildID=1991182 should
be the binary build in use, and looking at the debug syms, it really
does have:

82d040439c80 :
...
82d04043a00c:   0f 6e c2                movd   %edx,%mm0
82d04043a00f:   0f 62 c0                punpckldq %mm0,%mm0
82d04043a012:   49 89 87 c0 00 00 00    mov    %rax,0xc0(%r15)
82d04043a019:   41 0f 7f 87 d0 00 00    movq   %mm0,0xd0(%r15)
82d04043a020:   00

So hardware is correct - this build of Xen is nonsense.

The binary is also full of .annobin_ stuff which appears to be some kind
of GCC plugin for watermarking.

Michael: Any idea what's going on here?  Something has caused GCC to
emit some MMX logic which is ultimately why things exploded, but this
probably means that some of the build CFLAGS got dropped.

Thanks,

~Andrew


[qemu-mainline test] 171670: regressions - trouble: broken/fail/pass

2022-07-18 Thread osstest service owner
flight 171670 qemu-mainline real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171670/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-libvirt-pair broken
 test-amd64-amd64-libvirt-pair 7 host-install/dst_host(7) broken REGR. vs. 171648
 test-amd64-i386-pair 11 xen-install/dst_host fail REGR. vs. 171648
 test-amd64-amd64-xl-qcow2   21 guest-start/debian.repeat fail REGR. vs. 171648
 test-amd64-amd64-libvirt-vhd 20 guest-start.2 fail REGR. vs. 171648

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 171648
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 171648
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 171648
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 171648
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 171648
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 171648
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 171648
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 171648
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass

version targeted 

[xen-unstable-smoke test] 171673: tolerable all pass - PUSHED

2022-07-18 Thread osstest service owner
flight 171673 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171673/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  0e60f1d9d1970cae49ee9d03f5759f44afc1fdee
baseline version:
 xen  0af91dc0326cba12795e0b8fa8f665776e2a9e13

Last test of basis   171669  2022-07-18 15:01:42 Z0 days
Testing same since   171673  2022-07-18 19:00:24 Z0 days1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Henry Wang 
  Jan Beulich 
  Juergen Gross 
  Wei Chen  # arm
  Xenia Ragiadakou 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   0af91dc032..0e60f1d9d1  0e60f1d9d1970cae49ee9d03f5759f44afc1fdee -> smoke



[PATCH v2 3/4] vpci: use pcidevs locking to protect MMIO handlers

2022-07-18 Thread Volodymyr Babchuk
From: Oleksandr Andrushchenko 

vPCI MMIO handlers are accessing pdevs without protecting this access
with pcidevs_{lock|unlock}. This is not a problem as of now, as these
are only used by Dom0. But, as vPCI moves towards also being used for
guests, we need to properly protect pdev and pdev->vpci from being
removed while still in use.

For that, use the pcidevs_read_{un}lock helpers.

This patch adds ASSERTs in the code to check that the rwlock is taken
and in the appropriate mode. Some such checks require changes to the
initialization of local variables which may be accessed before the
ASSERT checks the locking. For example, see init_bars and mask_write.
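
For readers following the locking discussion: the trickiest part of this patch
is the vpci_msix_arch_print() path further down, which has to drop the read lock
around process_pending_softirqs() and then look the pdev up again and re-validate
it before continuing. The following self-contained userspace sketch shows the same
drop / re-acquire / re-validate pattern in isolation; it uses POSIX rwlocks and a
hypothetical device table, not Xen code or the actual pcidevs lock.

    /* Illustration only: the drop / re-acquire / re-validate pattern used around
     * process_pending_softirqs(), transplanted to a POSIX rwlock and a small
     * hypothetical device table.  Not Xen code. */
    #include <errno.h>
    #include <pthread.h>
    #include <stddef.h>

    #define MAX_DEVS 16

    struct dev {
        int id;                    /* stands in for the PCI SBDF */
        int present;
    };

    static pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;
    static struct dev table[MAX_DEVS];

    /* Caller must hold table_lock (read or write). */
    static struct dev *lookup(int id)
    {
        for ( unsigned int i = 0; i < MAX_DEVS; i++ )
            if ( table[i].present && table[i].id == id )
                return &table[i];
        return NULL;
    }

    static void do_unlocked_work(void)
    {
        /* Stand-in for process_pending_softirqs(). */
    }

    int print_with_preemption(int id)
    {
        pthread_rwlock_rdlock(&table_lock);

        struct dev *d = lookup(id);
        if ( !d )
        {
            pthread_rwlock_unlock(&table_lock);
            return -ENODEV;
        }

        /* Long-running work: drop the lock so writers can make progress. */
        pthread_rwlock_unlock(&table_lock);
        do_unlocked_work();

        /* Re-acquire and re-validate: the device may have been removed (or even
         * re-added at the same address) while the lock was not held. */
        if ( pthread_rwlock_tryrdlock(&table_lock) )
            return -EBUSY;
        if ( lookup(id) != d )
        {
            pthread_rwlock_unlock(&table_lock);
            return -EBUSY;
        }

        /* ... d is safe to use again here ... */
        pthread_rwlock_unlock(&table_lock);
        return 0;
    }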

Signed-off-by: Oleksandr Andrushchenko 
Signed-off-by: Volodymyr Babchuk 

---
Since v1:
- move pcidevs_read_{lock|unlock} into patch 1
---
 xen/arch/x86/hvm/vmsi.c   | 24 ++---
 xen/drivers/vpci/header.c | 24 +++--
 xen/drivers/vpci/msi.c| 21 ++-
 xen/drivers/vpci/msix.c   | 55 ++-
 xen/drivers/vpci/vpci.c   | 16 +---
 xen/include/xen/pci.h |  1 +
 xen/include/xen/vpci.h|  2 +-
 7 files changed, 121 insertions(+), 22 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index c1ede676d0..3f250f81a4 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -891,10 +891,16 @@ void vpci_msix_arch_init_entry(struct vpci_msix_entry 
*entry)
 entry->arch.pirq = INVALID_PIRQ;
 }
 
-int vpci_msix_arch_print(const struct vpci_msix *msix)
+int vpci_msix_arch_print(const struct domain *d, const struct vpci_msix *msix)
 {
 unsigned int i;
 
+/*
+ * FIXME: this is not immediately correct, as the lock can be grabbed
+ * by a different CPU. But this is better than nothing.
+ */
+ASSERT(pcidevs_read_locked());
+
 for ( i = 0; i < msix->max_entries; i++ )
 {
const struct vpci_msix_entry *entry = &msix->entries[i];
@@ -911,11 +917,23 @@ int vpci_msix_arch_print(const struct vpci_msix *msix)
 if ( i && !(i % 64) )
 {
 struct pci_dev *pdev = msix->pdev;
+pci_sbdf_t sbdf = pdev->sbdf;
 
spin_unlock(&msix->pdev->vpci->lock);
+pcidevs_read_unlock();
+
+/* NB: we still hold rcu_read_lock(&domlist_read_lock) here. */
 process_pending_softirqs();
-/* NB: we assume that pdev cannot go away for an alive domain. */
-if ( !pdev->vpci || !spin_trylock(&pdev->vpci->lock) )
+
+if ( !pcidevs_read_trylock() )
+return -EBUSY;
+pdev = pci_get_pdev_by_domain(d, sbdf.seg, sbdf.bus, sbdf.devfn);
+/*
+ * FIXME: we may find a re-allocated pdev's copy here.
+ * Even occupying the same address as before. Do our best.
+ */
+if ( !pdev || (pdev != msix->pdev) || !pdev->vpci ||
+ !spin_trylock(&pdev->vpci->lock) )
 return -EBUSY;
 if ( pdev->vpci->msix != msix )
 {
diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index a1c928a0d2..e0461b1139 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -142,16 +142,19 @@ bool vpci_process_pending(struct vcpu *v)
 if ( rc == -ERESTART )
 return true;
 
+pcidevs_read_lock();
spin_lock(&v->vpci.pdev->vpci->lock);
 /* Disable memory decoding unconditionally on failure. */
 modify_decoding(v->vpci.pdev,
 rc ? v->vpci.cmd & ~PCI_COMMAND_MEMORY : v->vpci.cmd,
 !rc && v->vpci.rom_only);
spin_unlock(&v->vpci.pdev->vpci->lock);
+pcidevs_read_unlock();
 
 rangeset_destroy(v->vpci.mem);
 v->vpci.mem = NULL;
 if ( rc )
+{
 /*
  * FIXME: in case of failure remove the device from the domain.
  * Note that there might still be leftover mappings. While this is
@@ -159,7 +162,10 @@ bool vpci_process_pending(struct vcpu *v)
  * killed in order to avoid leaking stale p2m mappings on
  * failure.
  */
+pcidevs_write_lock();
 vpci_remove_device(v->vpci.pdev);
+pcidevs_write_unlock();
+}
 }
 
 return false;
@@ -172,7 +178,16 @@ static int __init apply_map(struct domain *d, const struct 
pci_dev *pdev,
 int rc;
 
while ( (rc = rangeset_consume_ranges(mem, map_range, &data)) == -ERESTART )
+{
+/*
+ * It's safe to drop and re-acquire the lock in this context
+ * without risking pdev disappearing because devices cannot be
+ * removed until the initial domain has been started.
+ */
+pcidevs_write_unlock();
 process_pending_softirqs();
+pcidevs_write_lock();
+}
 rangeset_destroy(mem);
 if ( !rc )
 modify_decoding(pdev, cmd, false);
@@ -450,10 +465,15 @@ static int cf_check init_bars(struct pci_dev *pdev)
 

[PATCH v2 1/4] pci: add rwlock to pcidevs_lock machinery

2022-07-18 Thread Volodymyr Babchuk
From: Oleksandr Andrushchenko 

Currently the pcidevs lock is a global recursive spinlock, which is fine for
the existing use cases. It is used both to protect pdev instances
themselves from being removed while in use and to make sure the update
of the relevant pdev properties is synchronized.

Moving towards vPCI being used for guests, this becomes problematic in terms
of lock contention. For example, during vpci_{read|write} the access to
pdev must be protected to prevent pdev disappearing under our feet.
This needs to be done with the help of pcidevs_{lock|unlock}.
On the other hand it is highly undesirable to lock all other pdev accesses
which only use pdevs in read mode, e.g. those which do not remove or
add pdevs.

For the above reasons introduce a read/write lock which will help
prevent lock contention between pdev readers and writers. This
read/write lock replaces the current recursive spinlock.

By default all existing code uses pcidevs_lock(), which takes the write
lock. This ensures that all existing locking logic stays the same.

To justify this change, replace the locks in (V)MSI code with read
locks. This code is a perfect target, as the (V)MSI code only requires
that pdevs do not disappear during handling, while (V)MSI state
is protected by its own locks.

This is in preparation for vPCI being used for guests.
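
To make the "recursive read lock" idea concrete without the full diff at hand,
here is a minimal userspace analogy: a POSIX rwlock plus a per-thread nesting
counter, so that only the outermost pcidevs_read_lock()-style call actually takes
the lock. It is only a sketch; the series itself tracks recursion with a per-CPU
variable inside Xen, as noted in the changelog below, and all names here are
illustrative.

    /* Userspace analogy only: a reentrant read lock built from a pthread rwlock
     * and a per-thread nesting counter.  The series does the equivalent inside
     * Xen with a per-CPU counter; all names here are illustrative. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_rwlock_t devs_lock = PTHREAD_RWLOCK_INITIALIZER;
    static __thread unsigned int read_depth;   /* per-thread, like the per-CPU var */

    void devs_read_lock(void)
    {
        if ( read_depth++ == 0 )               /* only the outermost call locks */
            pthread_rwlock_rdlock(&devs_lock);
    }

    void devs_read_unlock(void)
    {
        if ( --read_depth == 0 )               /* only the outermost call unlocks */
            pthread_rwlock_unlock(&devs_lock);
    }

    bool devs_read_locked(void)
    {
        return read_depth > 0;                 /* what an ASSERT() would check */
    }

    /* Writers stay exclusive (and non-nested) in this sketch. */
    void devs_write_lock(void)   { pthread_rwlock_wrlock(&devs_lock); }
    void devs_write_unlock(void) { pthread_rwlock_unlock(&devs_lock); }

A reader protecting a pdev access then simply brackets it with
devs_read_lock()/devs_read_unlock(), which is the shape the diff below gives to
the vPCI MSI paths with pcidevs_read_lock()/pcidevs_read_unlock().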

Signed-off-by: Oleksandr Andrushchenko 
Signed-off-by: Volodymyr Babchuk 

---
Since v1:
- use per-CPU variable to track recursive readers and writers
- use read locks in vmsi code
---
 xen/arch/x86/hvm/vmsi.c   | 22 ++--
 xen/arch/x86/irq.c|  8 ++---
 xen/arch/x86/msi.c| 16 -
 xen/drivers/passthrough/pci.c | 65 +++
 xen/include/xen/pci.h | 10 +-
 5 files changed, 91 insertions(+), 30 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index d4a8c953e2..c1ede676d0 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -466,7 +466,7 @@ int msixtbl_pt_register(struct domain *d, struct pirq 
*pirq, uint64_t gtable)
 struct msixtbl_entry *entry, *new_entry;
 int r = -EINVAL;
 
-ASSERT(pcidevs_locked());
+ASSERT(pcidevs_read_locked());
ASSERT(spin_is_locked(&d->event_lock));
 
 if ( !msixtbl_initialised(d) )
@@ -536,7 +536,7 @@ void msixtbl_pt_unregister(struct domain *d, struct pirq 
*pirq)
 struct pci_dev *pdev;
 struct msixtbl_entry *entry;
 
-ASSERT(pcidevs_locked());
+ASSERT(pcidevs_read_locked());
ASSERT(spin_is_locked(&d->event_lock));
 
 if ( !msixtbl_initialised(d) )
@@ -682,7 +682,7 @@ static int vpci_msi_update(const struct pci_dev *pdev, 
uint32_t data,
 {
 unsigned int i;
 
-ASSERT(pcidevs_locked());
+ASSERT(pcidevs_read_locked());
 
 if ( (address & MSI_ADDR_BASE_MASK) != MSI_ADDR_HEADER )
 {
@@ -724,7 +724,7 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const 
struct pci_dev *pdev)
 
 ASSERT(msi->arch.pirq != INVALID_PIRQ);
 
-pcidevs_lock();
+pcidevs_read_lock();
 for ( i = 0; i < msi->vectors && msi->arch.bound; i++ )
 {
 struct xen_domctl_bind_pt_irq unbind = {
@@ -743,7 +743,7 @@ void vpci_msi_arch_update(struct vpci_msi *msi, const 
struct pci_dev *pdev)
 
 msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address,
msi->vectors, msi->arch.pirq, 
msi->mask);
-pcidevs_unlock();
+pcidevs_read_unlock();
 }
 
 static int vpci_msi_enable(const struct pci_dev *pdev, unsigned int nr,
@@ -783,10 +783,10 @@ int vpci_msi_arch_enable(struct vpci_msi *msi, const 
struct pci_dev *pdev,
 return rc;
 msi->arch.pirq = rc;
 
-pcidevs_lock();
+pcidevs_read_lock();
 msi->arch.bound = !vpci_msi_update(pdev, msi->data, msi->address, vectors,
msi->arch.pirq, msi->mask);
-pcidevs_unlock();
+pcidevs_read_unlock();
 
 return 0;
 }
@@ -798,7 +798,7 @@ static void vpci_msi_disable(const struct pci_dev *pdev, 
int pirq,
 
 ASSERT(pirq != INVALID_PIRQ);
 
-pcidevs_lock();
+pcidevs_read_lock();
 for ( i = 0; i < nr && bound; i++ )
 {
 struct xen_domctl_bind_pt_irq bind = {
@@ -814,7 +814,7 @@ static void vpci_msi_disable(const struct pci_dev *pdev, 
int pirq,
spin_lock(&pdev->domain->event_lock);
 unmap_domain_pirq(pdev->domain, pirq);
spin_unlock(&pdev->domain->event_lock);
-pcidevs_unlock();
+pcidevs_read_unlock();
 }
 
 void vpci_msi_arch_disable(struct vpci_msi *msi, const struct pci_dev *pdev)
@@ -861,7 +861,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry 
*entry,
 
 entry->arch.pirq = rc;
 
-pcidevs_lock();
+pcidevs_read_lock();
 rc = vpci_msi_update(pdev, entry->data, entry->addr, 1, entry->arch.pirq,
  entry->masked);
 if ( rc )
@@ -869,7 +869,7 @@ int vpci_msix_arch_enable_entry(struct vpci_msix_entry 
*entry,
 vpci_msi_disable(pdev, entry->arch.pirq, 1, 

[PATCH v2 4/4] vpci: include xen/vmap.h to fix build on ARM

2022-07-18 Thread Volodymyr Babchuk
Patch b4f211606011 ("vpci/msix: fix PBA accesses") introduced call to
iounmap(), but not added corresponding include.

Fixes: b4f211606011 ("vpci/msix: fix PBA accesses")

Signed-off-by: Volodymyr Babchuk 
---
 xen/drivers/vpci/vpci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 1559763479..674c9b347d 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -19,6 +19,7 @@
 
 #include <xen/sched.h>
 #include <xen/vpci.h>
+#include <xen/vmap.h>
 
 /* Internal struct to store the emulated PCI registers. */
 struct vpci_register {
-- 
2.36.1



[PATCH v2 2/4] vpci: restrict unhandled read/write operations for guests

2022-07-18 Thread Volodymyr Babchuk
From: Oleksandr Andrushchenko 

A guest would be able to read and write those registers which are not
emulated and have no respective vPCI handlers, so it would be possible
for it to access the hardware directly.
In order to prevent a guest from reading and writing the unhandled
registers, make sure only the hardware domain can access the hardware
directly, and restrict guests from doing so.

Suggested-by: Roger Pau Monné 
Signed-off-by: Oleksandr Andrushchenko 

---
Since v6:
- do not use is_hwdom parameter for vpci_{read|write}_hw and use
  current->domain internally
- update commit message
New in v6
Moved into another series
---
 xen/drivers/vpci/vpci.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index 9fb3c05b2b..c7a40a2f41 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -215,6 +215,10 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int 
reg,
 {
 uint32_t data;
 
+/* Guest domains are not allowed to read real hardware. */
+if ( !is_hardware_domain(current->domain) )
+return ~(uint32_t)0;
+
 switch ( size )
 {
 case 4:
@@ -255,9 +259,13 @@ static uint32_t vpci_read_hw(pci_sbdf_t sbdf, unsigned int 
reg,
 return data;
 }
 
-static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
-  uint32_t data)
+static void vpci_write_hw(pci_sbdf_t sbdf, unsigned int reg,
+  unsigned int size, uint32_t data)
 {
+/* Guest domains are not allowed to write real hardware. */
+if ( !is_hardware_domain(current->domain) )
+return;
+
 switch ( size )
 {
 case 4:
-- 
2.36.1


[PATCH v2 0/4] vpci: first series in preparation for vpci on ARM

2022-07-18 Thread Volodymyr Babchuk
Hello all,

While Oleksandr Andrushchenko is busy with more important matters, I
want to continue his work on preparing different Xen subsystems to
support PCI on the ARM platform.

This patch series is mostly focused on the next take of the PCI locking
rework. It introduces a reentrant read/write locking mechanism for
pcidevs, which is immediately used to decouple readers and writers in
the PCI code. There are also some minor changes to the related vpci
code.

Oleksandr Andrushchenko (3):
  pci: add rwlock to pcidevs_lock machinery
  vpci: restrict unhandled read/write operations for guests
  vpci: use pcidevs locking to protect MMIO handlers

Volodymyr Babchuk (1):
  vpci: include xen/vmap.h to fix build on ARM

 xen/arch/x86/hvm/vmsi.c   | 46 +
 xen/arch/x86/irq.c|  8 ++---
 xen/arch/x86/msi.c| 16 -
 xen/drivers/passthrough/pci.c | 65 +++
 xen/drivers/vpci/header.c | 24 +++--
 xen/drivers/vpci/msi.c| 21 +++
 xen/drivers/vpci/msix.c   | 55 +
 xen/drivers/vpci/vpci.c   | 29 +---
 xen/include/xen/pci.h | 11 +-
 xen/include/xen/vpci.h|  2 +-
 10 files changed, 223 insertions(+), 54 deletions(-)

-- 
2.36.1


[PATCH 3/3] x86/spec-ctrl: Shrink further entry paths due to %rdx being 0

2022-07-18 Thread Andrew Cooper
This is a continuation of the observation from:
  e9b8d31981f1 ("x86/spec-ctrl: Rework SPEC_CTRL_ENTRY_FROM_INTR_IST")
  53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry")

With %rdx known to be zero and not clobbered on the early entry path, we don't
need to re-zero it every time we want to write to an MSR.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 
CC: Wei Liu 
---
 xen/arch/x86/hvm/vmx/entry.S | 4 +---
 xen/arch/x86/include/asm/spec_ctrl_asm.h | 3 +--
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index 5f5de45a1309..392aca42b864 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -33,13 +33,12 @@ ENTRY(vmx_asm_vmexit_handler)
 movb $1,VCPU_vmx_launched(%rbx)
 mov  %rax,VCPU_hvm_guest_cr2(%rbx)
 
-/* SPEC_CTRL_ENTRY_FROM_VMXReq: b=curr %rsp=regs/cpuinfo, Clob: 
acd */
+/* SPEC_CTRL_ENTRY_FROM_VMXReq: %rsp=regs/cpuinfo, %rdx=0, Clob: 
acd */
 ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM
 
 .macro restore_spec_ctrl
 mov$MSR_SPEC_CTRL, %ecx
 movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax
-xor%edx, %edx
 wrmsr
 .endm
 ALTERNATIVE "", restore_spec_ctrl, X86_FEATURE_SC_MSR_HVM
@@ -49,7 +48,6 @@ ENTRY(vmx_asm_vmexit_handler)
 .macro restore_lbr
 mov $IA32_DEBUGCTLMSR_LBR, %eax
 mov $MSR_IA32_DEBUGCTLMSR, %ecx
-xor %edx, %edx
 wrmsr
 .endm
 ALTERNATIVE "", restore_lbr, X86_FEATURE_XEN_LBR
diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h 
b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index fab27ff5532b..61eed8510ba9 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -176,7 +176,7 @@
 .macro DO_SPEC_CTRL_ENTRY maybexen:req
 /*
  * Requires %rsp=regs (also cpuinfo if !maybexen)
- * Requires %r14=stack_end (if maybexen)
+ * Requires %r14=stack_end (if maybexen), %rdx=0
  * Clobbers %rax, %rcx, %rdx
  *
  * PV guests can't update MSR_SPEC_CTRL behind Xen's back, so no need to read
@@ -184,7 +184,6 @@
  * while entries from Xen must leave shadowing in its current state.
  */
 mov $MSR_SPEC_CTRL, %ecx
-xor %edx, %edx
 
 /*
  * Clear SPEC_CTRL shadowing *before* loading Xen's value.  If entering
-- 
2.11.0




[PATCH 2/3] x86/spec-ctrl: Make svm_vmexit_spec_ctrl conditional

2022-07-18 Thread Andrew Cooper
The logic was written this way out of an abundance of caution, but the reality
is that AMD parts don't currently have the RAS-flushing side effect, and nor
do they intend to gain it.

This removes one WRMSR from the VMExit path by default on Zen2 systems.

Fixes: 614cec7d79d7 ("x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL")
Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 
CC: Wei Liu 

Zen3 doesn't get a speedup in general, because we use the WRMSR's to clear
IBRS to avoid forcing it behind a VM's back.
---
 xen/arch/x86/hvm/svm/entry.S | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/svm/entry.S b/xen/arch/x86/hvm/svm/entry.S
index 0ff4008060fa..a60d759f7108 100644
--- a/xen/arch/x86/hvm/svm/entry.S
+++ b/xen/arch/x86/hvm/svm/entry.S
@@ -113,15 +113,15 @@ __UNLIKELY_END(nsvm_hap)
 ALTERNATIVE "", DO_OVERWRITE_RSB, X86_FEATURE_SC_RSB_HVM
 
 .macro svm_vmexit_spec_ctrl
-/*
- * Write to MSR_SPEC_CTRL unconditionally, for the RAS[:32]
- * flushing side effect.
- */
-mov$MSR_SPEC_CTRL, %ecx
 movzbl CPUINFO_xen_spec_ctrl(%rsp), %eax
+movzbl CPUINFO_last_spec_ctrl(%rsp), %edx
+cmp%edx, %eax
+je 1f /* Skip write if value is correct. */
+mov$MSR_SPEC_CTRL, %ecx
 xor%edx, %edx
 wrmsr
 mov%al, CPUINFO_last_spec_ctrl(%rsp)
+1:
 .endm
 ALTERNATIVE "", svm_vmexit_spec_ctrl, X86_FEATURE_SC_MSR_HVM
 /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
-- 
2.11.0




[PATCH 1/3] x86/spec-ctrl: Consistently halt speculation using int3

2022-07-18 Thread Andrew Cooper
The RSB stuffing loop and retpoline thunks date from the very beginning, when
halting speculation was a brand new field.

These days, we've largely settled on int3 for halting speculation in
non-architectural paths.  It's a single byte, and is fully serialising - a
requirement for delivering #BP if it were to execute.

Update the thunks.  Mostly for consistency across the codebase, but it does
shrink every entrypath in Xen by 6 bytes which is a marginal win.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 
CC: Wei Liu 
---
 xen/arch/x86/include/asm/spec_ctrl_asm.h | 11 +++
 xen/arch/x86/indirect-thunk.S|  6 ++
 2 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/include/asm/spec_ctrl_asm.h 
b/xen/arch/x86/include/asm/spec_ctrl_asm.h
index 9eb4ad9ab71d..fab27ff5532b 100644
--- a/xen/arch/x86/include/asm/spec_ctrl_asm.h
+++ b/xen/arch/x86/include/asm/spec_ctrl_asm.h
@@ -126,9 +126,8 @@
  * change. Based on Google's performance numbers, the loop is unrolled to 16
  * iterations and two calls per iteration.
  *
- * The call filling the RSB needs a nonzero displacement.  A nop would do, but
- * we use "1: pause; lfence; jmp 1b" to safely contains any ret-based
- * speculation, even if the loop is speculatively executed prematurely.
+ * The call filling the RSB needs a nonzero displacement, and int3 halts
+ * speculation.
  *
  * %rsp is preserved by using an extra GPR because a) we've got plenty spare,
  * b) the two movs are shorter to encode than `add $32*8, %rsp`, and c) can be
@@ -141,11 +140,7 @@
 
 .irp n, 1, 2/* Unrolled twice. */
 call .L\@_insert_rsb_entry_\n   /* Create an RSB entry. */
-
-.L\@_capture_speculation_\n:
-pause
-lfence
-jmp .L\@_capture_speculation_\n /* Capture rogue speculation. */
+int3/* Halt rogue speculation. */
 
 .L\@_insert_rsb_entry_\n:
 .endr
diff --git a/xen/arch/x86/indirect-thunk.S b/xen/arch/x86/indirect-thunk.S
index 7cc22da0ef93..de6aef606832 100644
--- a/xen/arch/x86/indirect-thunk.S
+++ b/xen/arch/x86/indirect-thunk.S
@@ -12,11 +12,9 @@
 #include 
 
 .macro IND_THUNK_RETPOLINE reg:req
-call 2f
+call 1f
+int3
 1:
-lfence
-jmp 1b
-2:
 mov %\reg, (%rsp)
 ret
 .endm
-- 
2.11.0




[PATCH 0/3] XSA-407 followon fixes

2022-07-18 Thread Andrew Cooper
Fixes accumulated during XSA-407 which weren't security worthy in and of
themselves.

Andrew Cooper (3):
  x86/spec-ctrl: Consistently halt speculation using int3
  x86/spec-ctrl: Make svm_vmexit_spec_ctrl conditional
  x86/spec-ctrl: Shrink further entry paths due to %rdx being 0

 xen/arch/x86/hvm/svm/entry.S | 10 +-
 xen/arch/x86/hvm/vmx/entry.S |  4 +---
 xen/arch/x86/include/asm/spec_ctrl_asm.h | 14 --
 xen/arch/x86/indirect-thunk.S|  6 ++
 4 files changed, 12 insertions(+), 22 deletions(-)

-- 
2.11.0




[ovmf test] 171671: all pass - PUSHED

2022-07-18 Thread osstest service owner
flight 171671 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171671/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 792ebb6374f2b92e2c4b84f5b8d151a129ed81cc
baseline version:
 ovmf fc4a132c0e9d108d30115cb8729f8fcd5564b855

Last test of basis   171668  2022-07-18 13:13:08 Z0 days
Testing same since   171671  2022-07-18 17:10:27 Z0 days1 attempts


People who touched revisions under test:
  Pierre Gondois 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   fc4a132c0e..792ebb6374  792ebb6374f2b92e2c4b84f5b8d151a129ed81cc -> 
xen-tested-master



[xen-unstable-smoke test] 171669: tolerable all pass - PUSHED

2022-07-18 Thread osstest service owner
flight 171669 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171669/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass

version targeted for testing:
 xen  0af91dc0326cba12795e0b8fa8f665776e2a9e13
baseline version:
 xen  a5fb66f4513c2c2d222dcc3753163b15690bd003

Last test of basis   171661  2022-07-17 14:00:25 Z1 days
Testing same since   171669  2022-07-18 15:01:42 Z0 days1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Anthony PERARD 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   a5fb66f451..0af91dc032  0af91dc0326cba12795e0b8fa8f665776e2a9e13 -> smoke



[PATCH 5/5] Ignore failure to unmap -1

2022-07-18 Thread Demi Marie Obenour
[ Upstream commit 166d3863231667c4f64dee72b77d1102cdfad11f ]

The error paths of gntdev_mmap() can call unmap_grant_pages() even
though not all of the pages have been successfully mapped.  This will
trigger the WARN_ON()s in __unmap_grant_pages_done().  The number of
warnings can be very large; I have observed thousands of lines of
warnings in the systemd journal.

Avoid this problem by only warning on unmapping failure if the handle
being unmapped is not -1.  The handle field of any page that was not
successfully mapped will be -1, so this catches all cases where
unmapping can legitimately fail.

Suggested-by: Juergen Gross 
Cc: sta...@vger.kernel.org
Signed-off-by: Demi Marie Obenour 
Fixes: 79963021fd71 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
---
 drivers/xen/gntdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 
f415c056ff8ab8d808ee2bacfaa3cad57af28204..54fee4087bf1078803c230ad2081aafa8415cf53
 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -401,7 +401,8 @@ static void __unmap_grant_pages_done(int result,
unsigned int offset = data->unmap_ops - map->unmap_ops;
 
for (i = 0; i < data->count; i++) {
-   WARN_ON(map->unmap_ops[offset+i].status);
+   WARN_ON(map->unmap_ops[offset+i].status &&
+   map->unmap_ops[offset+i].handle != -1);
pr_debug("unmap handle=%d st=%d\n",
map->unmap_ops[offset+i].handle,
map->unmap_ops[offset+i].status);
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



[PATCH 4/5] Ignore failure to unmap -1

2022-07-18 Thread Demi Marie Obenour
[ Upstream commit 166d3863231667c4f64dee72b77d1102cdfad11f ]

The error paths of gntdev_mmap() can call unmap_grant_pages() even
though not all of the pages have been successfully mapped.  This will
trigger the WARN_ON()s in __unmap_grant_pages_done().  The number of
warnings can be very large; I have observed thousands of lines of
warnings in the systemd journal.

Avoid this problem by only warning on unmapping failure if the handle
being unmapped is not -1.  The handle field of any page that was not
successfully mapped will be -1, so this catches all cases where
unmapping can legitimately fail.

Suggested-by: Juergen Gross 
Cc: sta...@vger.kernel.org
Signed-off-by: Demi Marie Obenour 
Fixes: ee25841221c1 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
---
 drivers/xen/gntdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 
f464793477650e631c8928e85c1990c5964c2e94..bba849e5d8a7b4d54925b842fbe3c6792e0f0214
 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -413,7 +413,8 @@ static void __unmap_grant_pages_done(int result,
unsigned int offset = data->unmap_ops - map->unmap_ops;
 
for (i = 0; i < data->count; i++) {
-   WARN_ON(map->unmap_ops[offset+i].status);
+   WARN_ON(map->unmap_ops[offset+i].status &&
+   map->unmap_ops[offset+i].handle != -1);
pr_debug("unmap handle=%d st=%d\n",
map->unmap_ops[offset+i].handle,
map->unmap_ops[offset+i].status);
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



[PATCH 3/5] Ignore failure to unmap -1

2022-07-18 Thread Demi Marie Obenour
[ Upstream commit 166d3863231667c4f64dee72b77d1102cdfad11f ]

The error paths of gntdev_mmap() can call unmap_grant_pages() even
though not all of the pages have been successfully mapped.  This will
trigger the WARN_ON()s in __unmap_grant_pages_done().  The number of
warnings can be very large; I have observed thousands of lines of
warnings in the systemd journal.

Avoid this problem by only warning on unmapping failure if the handle
being unmapped is not -1.  The handle field of any page that was not
successfully mapped will be -1, so this catches all cases where
unmapping can legitimately fail.

Suggested-by: Juergen Gross 
Cc: sta...@vger.kernel.org
Signed-off-by: Demi Marie Obenour 
Fixes: 73e9e72247b9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
---
 drivers/xen/gntdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 492084814f55d8ed46d2d656db75b28a91dd7f06..27d955c5d9f9076266f77b9fedfa1a6a2ba08f56 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -416,7 +416,8 @@ static void __unmap_grant_pages_done(int result,
unsigned int offset = data->unmap_ops - map->unmap_ops;
 
for (i = 0; i < data->count; i++) {
-   WARN_ON(map->unmap_ops[offset+i].status);
+   WARN_ON(map->unmap_ops[offset+i].status &&
+   map->unmap_ops[offset+i].handle != -1);
pr_debug("unmap handle=%d st=%d\n",
map->unmap_ops[offset+i].handle,
map->unmap_ops[offset+i].status);
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



[PATCH 2/5] Ignore failure to unmap -1

2022-07-18 Thread Demi Marie Obenour
[ Upstream commit 166d3863231667c4f64dee72b77d1102cdfad11f ]

The error paths of gntdev_mmap() can call unmap_grant_pages() even
though not all of the pages have been successfully mapped.  This will
trigger the WARN_ON()s in __unmap_grant_pages_done().  The number of
warnings can be very large; I have observed thousands of lines of
warnings in the systemd journal.

Avoid this problem by only warning on unmapping failure if the handle
being unmapped is not -1.  The handle field of any page that was not
successfully mapped will be -1, so this catches all cases where
unmapping can legitimately fail.

Suggested-by: Juergen Gross 
Cc: sta...@vger.kernel.org
Signed-off-by: Demi Marie Obenour 
Fixes: 2fe26a9a7048 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
---
 drivers/xen/gntdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 2827015604fbaa45f0ca60c342bbf71ce4132dec..799173755b785e2f3e2483855f75af1eba8e9373 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -393,7 +393,8 @@ static void __unmap_grant_pages_done(int result,
unsigned int offset = data->unmap_ops - map->unmap_ops;
 
for (i = 0; i < data->count; i++) {
-   WARN_ON(map->unmap_ops[offset+i].status);
+   WARN_ON(map->unmap_ops[offset+i].status &&
+   map->unmap_ops[offset+i].handle != -1);
pr_debug("unmap handle=%d st=%d\n",
map->unmap_ops[offset+i].handle,
map->unmap_ops[offset+i].status);
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



[PATCH 1/5] Ignore failure to unmap -1

2022-07-18 Thread Demi Marie Obenour
[ Upstream commit 166d3863231667c4f64dee72b77d1102cdfad11f ]

The error paths of gntdev_mmap() can call unmap_grant_pages() even
though not all of the pages have been successfully mapped.  This will
trigger the WARN_ON()s in __unmap_grant_pages_done().  The number of
warnings can be very large; I have observed thousands of lines of
warnings in the systemd journal.

Avoid this problem by only warning on unmapping failure if the handle
being unmapped is not -1.  The handle field of any page that was not
successfully mapped will be -1, so this catches all cases where
unmapping can legitimately fail.

Suggested-by: Juergen Gross 
Cc: sta...@vger.kernel.org
Signed-off-by: Demi Marie Obenour 
Fixes: 36cd49b071fc ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
---
 drivers/xen/gntdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 2c3248e71e9c1a3e032b847d177b02855cdda1a1..a6585854a85fc6fffc16c3498ba73fbee84ad6ca 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -390,7 +390,8 @@ static void __unmap_grant_pages_done(int result,
unsigned int offset = data->unmap_ops - map->unmap_ops;
 
for (i = 0; i < data->count; i++) {
-   WARN_ON(map->unmap_ops[offset+i].status);
+   WARN_ON(map->unmap_ops[offset+i].status &&
+   map->unmap_ops[offset+i].handle != -1);
pr_debug("unmap handle=%d st=%d\n",
map->unmap_ops[offset+i].handle,
map->unmap_ops[offset+i].status);
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



Backport of 166d38632316 ("xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE")

2022-07-18 Thread Demi Marie Obenour
This series backports upstream commit 166d3863231667c4f64dee72b77d1102cdfad11f
to all supported stable kernel trees.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



Re: [PATCH v2 3/3] xen/heap: pass order to free_heap_pages() in heap init

2022-07-18 Thread Julien Grall

Hi Jan,

On 18/07/2022 12:02, Jan Beulich wrote:

On 18.07.2022 12:24, Julien Grall wrote:
3)
 unsigned int inc_order = min(MAX_ORDER, flsl(e - s) - 1);

 if ( s )
 inc_order = min(inc_order, ffsl(s) - 1U);


I like this idea!
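
For illustration, a worked example with hypothetical frame numbers (not taken
from the series; flsl()/ffsl() are Xen's find-last/first-set on a long,
counting bit positions from 1):

    /*
     * s = 0x23000, e = 0x40000:
     *   flsl(e - s) - 1 = flsl(0x1d000) - 1 = 17 - 1 = 16   (range size bound)
     *   ffsl(s) - 1     = ffsl(0x23000) - 1 = 13 - 1 = 12   (alignment of s)
     *   inc_order = min(MAX_ORDER, min(16, 12)) = 12        (if MAX_ORDER >= 12)
     * so a 2^12-page chunk would be freed at s before continuing from
     * s + (1 << 12).
     */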



No compilation issues to expect here, afaict.


Correct, GCC is happy with this approach. Assuming there are no other 
comments, are you happy if I make the change on commit?


Cheers,


--
Julien Grall



Re: [PATCH v2 0/4] tools/xenstore: add some new features to the documentation

2022-07-18 Thread Julien Grall

Hi Jan,

On 18/07/2022 17:12, Jan Beulich wrote:

On 27.05.2022 09:24, Juergen Gross wrote:

In the past there have been spotted some shortcomings in the Xenstore
interface, which should be repaired. Those are in detail:

- Using driver domains for large number of domains needs per domain
   Xenstore quota [1]. The feedback sent was rather slim (one reply),
   but it was preferring a new set of wire commands.

- XSA-349 [2] has shown that the current definition of watches is not
   optimal, as it will trigger lots of events when a single one would
   suffice: for detecting new backend devices the backends in the Linux
   kernel are registering a watch for e.g. "/local/domain/0/backend"
   which will fire for ANY sub-node written below this node (on a test
   machine this added up to 91 watch events for only 3 devices).
   This can be limited dramatically by extending the XS_WATCH command
   to take another optional parameter specifying the depth of
   subdirectories to be considered for sending watch events ("0" would
   trigger a watch event only if the watched node itself being written).

- New features like above being added might make migration of guests
   between hosts with different Xenstore variants harder, so it should
   be possible to set the available feature set per domain. For socket
   connections it should be possible to read the available features.

- The special watches @introduceDomain and @releaseDomain are rather
   cumbersome to use, as they only tell you that SOME domain has been
   introduced/released. Any consumer of those watches needs to scan
   all domains on the host in order to find out the domid, causing
   significant pressure on the dominfo hypercall (imagine a system
   with 1000 domains running and one domain dying - there will be more
   than 1000 watch events triggered and 1000 xl daemons will try to
   find out whether "their" domain has died). Those watches should be
   enhanced to optionally be specific to a single domain and to let the
   event carry the related domid.

As some of those extensions will need to be considered in the Xenstore
migration stream, they should be defined in one go (in fact the 4th one
wouldn't need that, but it can easily be connected to the 2nd one).
As such extensions need to be flagged in the "features" in the ring
page anyway, it is fine to implement them independently.

Add the documentation of the new commands/features.

[1]: https://lists.xen.org/archives/html/xen-devel/2020-06/msg00291.html
[2]: http://xenbits.xen.org/xsa/advisory-349.html

Changes in V2:
- added new patch 1
- remove feature bits for dom0-only features
- get-features without domid returns Xenstore supported features
- get/set-quota without domid for global quota access

Juergen Gross (4):
   tools/xenstore: modify feature bit specification in xenstore-ring.txt
   tools/xenstore: add documentation for new set/get-feature commands
   tools/xenstore: add documentation for new set/get-quota commands
   tools/xenstore: add documentation for extended watch command


Hmm, looks like I did commit v1 of this series, not noticing the v2 _and_
seeing there had been R-b with no other follow-up (leaving aside the v2)
in a long time. Please advise if I should revert the commits. I'm sorry
for the confusion. (I also wonder why the R-b weren't carried over to v2.)


Patch #1 is a new patch. Patches #2, #3 and #4 have been reworded and the 
overall interaction is different. So I don't think the reviewed-by tags 
should have been carried over.


I had some concerns in v1 which were addressed in v2. I reviewed v2 
a while ago. From my perspective, patches #1, #3 and #4 are ready to go. 
Patch #2 needs a respin and we also need to clarify the integration with 
migration/live-update.


As you have committed it, I would be OK if this is addressed in a follow-up 
series. But this *must* be addressed by the time 4.17 is released, 
because otherwise we will commit ourselves to a broken interface. @Henry, 
please add this in the blocker list.


Cheers,

--
Julien Grall



Re: [PATCH v2 0/4] tools/xenstore: add some new features to the documentation

2022-07-18 Thread Jan Beulich
On 27.05.2022 09:24, Juergen Gross wrote:
> In the past there have been spotted some shortcomings in the Xenstore
> interface, which should be repaired. Those are in detail:
> 
> - Using driver domains for large number of domains needs per domain
>   Xenstore quota [1]. The feedback sent was rather slim (one reply),
>   but it was preferring a new set of wire commands.
> 
> - XSA-349 [2] has shown that the current definition of watches is not
>   optimal, as it will trigger lots of events when a single one would
>   suffice: for detecting new backend devices the backends in the Linux
>   kernel are registering a watch for e.g. "/local/domain/0/backend"
>   which will fire for ANY sub-node written below this node (on a test
>   machine this added up to 91 watch events for only 3 devices).
>   This can be limited dramatically by extending the XS_WATCH command
>   to take another optional parameter specifying the depth of
>   subdirectories to be considered for sending watch events ("0" would
>   trigger a watch event only if the watched node itself being written).
> 
> - New features like above being added might make migration of guests
>   between hosts with different Xenstore variants harder, so it should
>   be possible to set the available feature set per domain. For socket
>   connections it should be possible to read the available features.
> 
> - The special watches @introduceDomain and @releaseDomain are rather
>   cumbersome to use, as they only tell you that SOME domain has been
>   introduced/released. Any consumer of those watches needs to scan
>   all domains on the host in order to find out the domid, causing
>   significant pressure on the dominfo hypercall (imagine a system
>   with 1000 domains running and one domain dying - there will be more
>   than 1000 watch events triggered and 1000 xl daemons will try to
>   find out whether "their" domain has died). Those watches should be
>   enhanced to optionally be specific to a single domain and to let the
>   event carry the related domid.
> 
> As some of those extensions will need to be considered in the Xenstore
> migration stream, they should be defined in one go (in fact the 4th one
> wouldn't need that, but it can easily be connected to the 2nd one).
> As such extensions need to be flagged in the "features" in the ring
> page anyway, it is fine to implement them independently.
> 
> Add the documentation of the new commands/features.
> 
> [1]: https://lists.xen.org/archives/html/xen-devel/2020-06/msg00291.html
> [2]: http://xenbits.xen.org/xsa/advisory-349.html
> 
> Changes in V2:
> - added new patch 1
> - remove feature bits for dom0-only features
> - get-features without domid returns Xenstore supported features
> - get/set-quota without domid for global quota access
> 
> Juergen Gross (4):
>   tools/xenstore: modify feature bit specification in xenstore-ring.txt
>   tools/xenstore: add documentation for new set/get-feature commands
>   tools/xenstore: add documentation for new set/get-quota commands
>   tools/xenstore: add documentation for extended watch command

Hmm, looks like I did commit v1 of this series, not noticing the v2 _and_
seeing there had been R-b with no other follow-up (leaving aside the v2)
in a long time. Please advise if I should revert the commits. I'm sorry
for the confusion. (I also wonder why the R-b weren't carried over to v2.)

Jan



Re: [PATCH v4 0/3] xen-blk{back,front}: Fix two bugs in 'feature_persistent'

2022-07-18 Thread Andrii Chepurnyi
Hello SeongJae,

Thanks for the efforts.
I've tested backend patches(1,2) on my custom 5.10 kernel (since I
can't test on vanilla) and it works for me.

Best regards,
Andrii


On Sat, Jul 16, 2022 at 1:51 AM SeongJae Park  wrote:
>
> Introduction of 'feature_persistent' made two bugs.  First one is wrong
> overwrite of 'vbd->feature_gnt_persistent' in 'blkback' due to wrong
> parameter value caching position, and the second one is unintended
> behavioral change that could break previous dynamic frontend/backend
> persistent feature support changes.  This patchset fixes the issues.
>
> Changes from v3
> (https://lore.kernel.org/xen-devel/20220715175521.126649-1...@kernel.org/)
> - Split 'blkback' patch for each of the two issues
> - Add 'Reported-by: Andrii Chepurnyi '
>
> Changes from v2
> (https://lore.kernel.org/xen-devel/20220714224410.51147-1...@kernel.org/)
> - Keep the behavioral change of v1
> - Update blkfront's counterpart to follow the changed behavior
> - Update documents for the changed behavior
>
> Changes from v1
> (https://lore.kernel.org/xen-devel/20220106091013.126076-1-mhe...@amazon.de/)
> - Avoid the behavioral change
>   (https://lore.kernel.org/xen-devel/20220121102309.27802-1...@kernel.org/)
> - Rebase on latest xen/tip/linux-next
> - Re-work by SeongJae Park 
> - Cc stable@
>
> Maximilian Heyne (1):
>   xen-blkback: Apply 'feature_persistent' parameter when connect
>
> SeongJae Park (2):
>   xen-blkback: fix persistent grants negotiation
>   xen-blkfront: Apply 'feature_persistent' parameter when connect
>
>  .../ABI/testing/sysfs-driver-xen-blkback  |  2 +-
>  .../ABI/testing/sysfs-driver-xen-blkfront |  2 +-
>  drivers/block/xen-blkback/xenbus.c| 20 ---
>  drivers/block/xen-blkfront.c  |  4 +---
>  4 files changed, 11 insertions(+), 17 deletions(-)
>
> --
> 2.25.1
>



Re: [PATCH] x86: deal with gcc12 release build issues

2022-07-18 Thread Andrew Cooper
On 14/07/2022 08:53, Jan Beulich wrote:
> While a number of issues we previously had with pre-release gcc12 were
> fixed in the final release, we continue to have one issue (with multiple
> instances) when doing release builds (i.e. at higher optimization
> levels): The compiler takes issue with subtracting (always 1 in our
> case) from artifical labels (expressed as array) marking the end of
> certain regions. This isn't an unreasonable position to take. Simply
> hide the "array-ness" by casting to an integer type. To keep things
> looking consistently, apply the same cast also on the respective
> expressions dealing with the starting addresses. (Note how
> efi_arch_memory_setup()'s l2_table_offset() invocations avoid a similar
> issue by already having the necessary casts.) In is_xen_fixed_mfn()
> further switch from __pa() to virt_to_maddr() to better match the left
> sides of the <= operators.
>
> Reported-by: Charles Arnold 
> Signed-off-by: Jan Beulich 
> ---
> Initially I had considered introducing something like END_MINUS_1(), but
> in the end I did consider this uglier than explicitly dealing with the
> two instances we have.

Yeah, I prefer this form.

Acked-by: Andrew Cooper 
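
For illustration only (the names below are made up, not the actual Xen
symbols): the pattern being adjusted is a bounds check against an artificial
end-of-region label declared as an array, where gcc12 objects to forming a
pointer just below the array, so the comparison is done on integer addresses
instead:

    extern const char _region_start[], _region_end[];

    /* sketch: compare integer addresses rather than using &_region_end[-1] */
    static inline bool addr_in_region(unsigned long addr)
    {
        return addr >= (unsigned long)_region_start &&
               addr <= (unsigned long)_region_end - 1;
    }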


Re: [PATCH] x86/spec-ctrl: correct per-guest-type reporting of MD_CLEAR

2022-07-18 Thread Andrew Cooper
On 13/07/2022 08:52, Jan Beulich wrote:
> There are command line controls for this and the default also isn't "always
> enable when hardware supports it", which logging should take into account.
>
> Signed-off-by: Jan Beulich 

Reviewed-by: Andrew Cooper 


Re: [PATCH] x86: log non-responding CPUs in fatal_trap()

2022-07-18 Thread Andrew Cooper
On 13/07/2022 07:51, Jan Beulich wrote:
> This eases recognizing that something odd is going on.
>
> Signed-off-by: Jan Beulich 
>
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -845,6 +845,9 @@ void fatal_trap(const struct cpu_user_re
>  msecs = 10;
>  }
>  }
> +if ( pending )
> +printk("Non-responding CPUs: %*pbl\n",
> +   CPUMASK_PR(_state_mask));

Prevailing style elsewhere is {%*pbl}.

Preferably with that adjusted, Acked-by: Andrew Cooper



[ovmf test] 171668: all pass - PUSHED

2022-07-18 Thread osstest service owner
flight 171668 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171668/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf fc4a132c0e9d108d30115cb8729f8fcd5564b855
baseline version:
 ovmf 039bdb4d3e96f9c9264abf135b8a0eef2e2b4860

Last test of basis   171659  2022-07-17 07:43:05 Z1 days
Testing same since   171668  2022-07-18 13:13:08 Z0 days1 attempts


People who touched revisions under test:
  Pierre Gondois 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvopspass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   039bdb4d3e..fc4a132c0e  fc4a132c0e9d108d30115cb8729f8fcd5564b855 -> xen-tested-master



Re: [PATCH v2 9/9] xue: allow driving the rest of XHCI by a domain while Xen uses DbC

2022-07-18 Thread Jan Beulich
On 18.07.2022 14:54, Marek Marczykowski-Górecki wrote:
> On Thu, Jul 14, 2022 at 02:06:07PM +0200, Jan Beulich wrote:
>> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
>>> That's possible, because the capability was designed specifically to
>>> allow separate driver handle it, in parallel to unmodified xhci driver
>>> (separate set of registers, pretending the port is "disconnected" for
>>> the main xhci driver etc). It works with Linux dom0, although requires
>>> an awful hack - re-enabling bus mastering behind dom0's backs.
>>> Linux driver does similar thing - see
>>> drivers/usb/early/xhci-dbc.c:xdbc_handle_events().
>>
>> Isn't there a risk that intermediately data was lost?
> 
> Yes, there is such risk (although minimal in practice - it happens just
> once during dom0 boot). You can avoid it by instructing dom0 to not use
> that USB controller.
> Having this capability is really helpful (compared with the alternative
> of using the whole USB controller by either Xen or Linux), as many
> (most) systems have only a single USB controller.

No question about the usefulness. But this aspect wants spelling out,
and it is one of the arguments against allowing use of the device by
other than hwdom.

>>> To avoid Linux messing with the DbC, mark this MMIO area as read-only.
>>
>> In principle this would want to happen quite a bit earlier in the
>> series. I'm okay with it being kept here as long as it is made
>> very obvious to and easily noticeable by committers that this series
>> should only be committed all in one go.
>>
>> Also along with this is where I'd see the pci_hide_device() go.
> 
> Earlier version of the patch set had pci_ro_device() before this patch.
> I can add pci_ro_device() in the initial patch, and drop it in this one.

Having pci_ro_device() up to here sounds reasonable, but then you still
want to flip to using pci_hide_device() rather than just dropping the
call.

Jan



Re: [PATCH v2 4/9] console: support multiple serial console simultaneously

2022-07-18 Thread Jan Beulich
On 18.07.2022 14:48, Marek Marczykowski-Górecki wrote:
> On Wed, Jul 13, 2022 at 11:39:07AM +0200, Jan Beulich wrote:
>> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
>>> --- a/xen/drivers/char/console.c
>>> +++ b/xen/drivers/char/console.c
>>> @@ -113,7 +113,9 @@ static char *__read_mostly conring = _conring;
>>>  static uint32_t __read_mostly conring_size = _CONRING_SIZE;
>>>  static uint32_t conringc, conringp;
>>>  
>>> -static int __read_mostly sercon_handle = -1;
>>> +#define MAX_SERCONS 4
>>
>> Might this want to be a Kconfig setting?
> 
> Is that going to be useful for anybody (who isn't modifying the code
> anyway, for example to add yet another console driver)?

If allowing multiple serial consoles is deemed useful, then making
their maximum count build-time configurable is quite likely useful.
People may not want to allow multiple of them, for example.

>>> @@ -1291,7 +1322,8 @@ static int suspend_steal_id;
>>>  
>>>  int console_suspend(void)
>>>  {
>>> -suspend_steal_id = console_steal(sercon_handle, suspend_steal_fn);
>>> +if ( nr_sercon_handle )
>>> +suspend_steal_id = console_steal(sercon_handle[0], 
>>> suspend_steal_fn);
>>>  serial_suspend();
>>>  return 0;
>>>  }
>>
>> The commit message gives no explanation why only the first handle
>> would want/need dealing with here.
> 
> Sure, I can add an explanation. I'm adding this comment to console_steal():
> /* Redirect any console output to *fn*, if *handle* is configured as a 
> console. */
> 
> So, calling console_steal() is about all serial consoles, not just a
> specific one. The use case for this "if" part is gdbstub, which wants
> to redirect serial output only if that serial was configured as both
> console and gdb. Having proper per-console stealing is doable, but IMO
> not worth it (it would require also avoiding duplicated output in case
> of multiple serial consoles, and probably few more corner cases).

And what if the one handle you pass on isn't the one matching the
console the gdbstub is using? While I understand that per-console
stealing may have some sharp edges, I don't currently see how we can
get away here without handling things per-console.

>> One overall remark: Especially with sync_console latency is going to
>> be yet worse with all output being done sequentially. The help text
>> for "console=" will want to mention this, up and until this would be
>> parallelized.
> 
> I don't think it was parallelized anywhere. All the relevant functions
> (__putstr especially) write to various console types sequentially. The
> difference is that previously only the last "serial" console was used,
> all the other were silently ignored. So, it was "parallel" with all
> _zero other_ serial consoles, but not other console types.

Parallelizing vga and serial likely wasn't deemed very useful, as
vga has negligible latency compared to a (slow) serial line (albeit
I leave aside software scrolling here, which indeed is slow). There
are also no commands involved in vga output which may require waiting
for their completion - it's all simple MMIO writes (and hence the
slowness of scrolling could only be dealt with by involving a 2nd
CPU, as the one doing the scrolling can't at the same time do output
to another device; nevertheless some of the latency could be
compensated by doing output in suitable order). This is quite
different when it comes to multiple serial consoles.

> Anyway, both help text and boot warning for sync_console already warn
> about it. Do you want me to include relation to number of configured
> console explicitly?

I think it should be made explicit, yes.

Jan



Re: [PATCH] x86: Add MMIO Stale Data arch_caps to hardware domain

2022-07-18 Thread Jason Andryuk
On Mon, Jul 18, 2022 at 10:18 AM Jan Beulich  wrote:
>
> On 18.07.2022 16:05, Jason Andryuk wrote:
> > Let the hardware domain know about the hardware it is running on.  This
> > allows a linux Dom0 to know that it has the appropriate microcode via
> > FB_CLEAR.  /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
> > changes from:
> > "Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state
> > unknown"
> > to:
> > "Mitigation: Clear CPU buffers; SMT Host state unknown"
> >
> > Signed-off-by: Jason Andryuk 
> > ---
> > Should calculate_host_policy()'s arch_caps mask also be updated?  They
> > are not identical today, but I don't know this code well enough to
> > understand why they differ.
>
> I think this wants updating too, yes. I'm afraid I have to leave it to
> Andrew to provide the reasons for the differences between the two.
>
> I would further suggest to also consider adding RRSBA and BHI_NO, even
> if then the title would want adjusting. And finally I'd like to ask to
> add a proper Fixes: tag (or more), as it looks like the updating here
> was simply forgotten when the bits were introduced. Ideally we'd have
> a way for the compiler to remind us of updates being needed (or at
> least be considered) here.

That all sounds good.

Thanks,
Jason



Re: [PATCH] x86: Add MMIO Stale Data arch_caps to hardware domain

2022-07-18 Thread Jan Beulich
On 18.07.2022 16:05, Jason Andryuk wrote:
> Let the hardware domain know about the hardware it is running on.  This
> allows a linux Dom0 to know that it has the appropriate microcode via
> FB_CLEAR.  /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
> changes from:
> "Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state
> unknown"
> to:
> "Mitigation: Clear CPU buffers; SMT Host state unknown"
> 
> Signed-off-by: Jason Andryuk 
> ---
> Should calculate_host_policy()'s arch_caps mask also be updated?  They
> are not identical today, but I don't know this code well enough to
> understand why they differ.

I think this wants updating too, yes. I'm afraid I have to leave it to
Andrew to provide the reasons for the differences between the two.

I would further suggest to also consider adding RRSBA and BHI_NO, even
if then the title would want adjusting. And finally I'd like to ask to
add a proper Fixes: tag (or more), as it looks like the updating here
was simply forgotten when the bits were introduced. Ideally we'd have
a way for the compiler to remind us of updates being needed (or at
least be considered) here.
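
For illustration, the suggested extension might then look along these lines
(a sketch only; it assumes ARCH_CAPS_RRSBA and ARCH_CAPS_BHI_NO are defined
in Xen's msr-index.h alongside the bits already listed in the patch):

    mp->arch_caps.raw = val &
        (ARCH_CAPS_RDCL_NO | ARCH_CAPS_IBRS_ALL | ARCH_CAPS_RSBA |
         ARCH_CAPS_SSB_NO | ARCH_CAPS_MDS_NO | ARCH_CAPS_TAA_NO |
         ARCH_CAPS_SBDR_SSDP_NO | ARCH_CAPS_FBSDP_NO | ARCH_CAPS_PSDP_NO |
         ARCH_CAPS_FB_CLEAR | ARCH_CAPS_RRSBA | ARCH_CAPS_BHI_NO);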

Jan



Re: [PATCH v2 2/2] xen: Fix latent check-endbr.sh bug with 32bit build environments

2022-07-18 Thread Jan Beulich
On 18.07.2022 14:07, Andrew Cooper wrote:
> On 18/07/2022 10:49, Jan Beulich wrote:
>> On 18.07.2022 11:31, Andrew Cooper wrote:
>>> On 18/07/2022 10:11, Jan Beulich wrote:
 On 15.07.2022 15:26, Andrew Cooper wrote:
> --- a/xen/tools/check-endbr.sh
> +++ b/xen/tools/check-endbr.sh
> @@ -61,19 +61,36 @@ ${OBJDUMP} -j .text $1 -d -w | grep ' endbr64 *$' | 
> cut -f 1 -d ':' > $VALID &
>  #the lower bits, rounding integers to the nearest 4k.
>  #
>  #Instead, use the fact that Xen's .text is within a 1G aligned 
> region, and
> -#split the VMA in half so AWK's numeric addition is only working on 
> 32 bit
> -#numbers, which don't lose precision.
> +#split the VMA so AWK's numeric addition is only working on <32 bit
> +#numbers, which don't lose precision.  (See point 5)
>  #
>  # 4) MAWK doesn't support plain hex constants (an optional part of the 
> POSIX
>  #spec), and GAWK and MAWK can't agree on how to work with hex 
> constants in
>  #a string.  Use the shell to convert $vma_lo to decimal before 
> passing to
>  #AWK.
>  #
> +# 5) Point 4 isn't fully portable.  POSIX only requires that $((0xN)) be
> +#evaluated as long, which in 32bit shells turns negative if bit 31 
> of the
> +#VMA is set.  AWK then interprets this negative number as a double 
> before
> +#adding the offsets from the binary grep.
> +#
> +#Instead of doing an 8/8 split with vma_hi/lo, do a 9/7 split.
> +#
> +#The consequence of this is that for all offsets, $vma_lo + offset 
> needs
> +#to be less that 256M (i.e. 7 nibbles) so as to be successfully 
> recombined
> +#with the 9 nibbles of $vma_hi.  This is fine; .text is at the start 
> of a
> +#1G aligned region, and Xen is far far smaller than 256M, but leave 
> safety
> +#check nevertheless.
> +#
>  eval $(${OBJDUMP} -j .text $1 -h |
> -$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 
> 8), substr($4, 9, 16)}')
> +$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 
> 9), substr($4, 10, 16)}')
>  
>  ${OBJCOPY} -j .text $1 -O binary $TEXT_BIN
>  
> +bin_sz=$(stat -c '%s' $TEXT_BIN)
> +[ "$bin_sz" -ge $(((1 << 28) - $vma_lo)) ] &&
> +{ echo "$MSG_PFX Error: .text offsets can exceed 256M" >&2; exit 1; }
 ... s/can/cannot/ ?
>>> Why?  "Can" is correct here.  If the offsets can't exceed 256M, then
>>> everything is good.
>> Hmm, the wording then indeed is ambiguous.
> 
> I see your point.  In this case it's meant as "are able to", but this is
> still clearer than using "can't" because at least the text matches the
> check which triggered it.
> 
>>  I read "can" as "are allowed
>> to", when we mean "aren't allowed to". Maybe ".text is 256M or more in
>> size"? If you mention "offsets", then I think the check should be based
>> on actually observing an offset which is too large (which .text size
>> alone doesn't guarantee will happen).
> 
> It's not just .text on its own because the VMA is offset by 2M, hence
> the subtraction of $vma_lo in the main calculation.
> 
> There's no point searching for offsets.  There will be one near the end,
> so all searching for an offset would do is complicate the critical loop.
> 
> How about ".text offsets must not exceed 256M" ?
> 
> That should be unambiguous.

Yes, that reads fine. Thanks.

Jan



Re: [PATCH 4/7] xen/arm: mm: Add more ASSERT() in {destroy, modify}_xen_mappings()

2022-07-18 Thread Julien Grall

Hi Jan,

On 18/07/2022 09:47, Jan Beulich wrote:

On 16.07.2022 16:38, Julien Grall wrote:

On 04/07/2022 13:35, Bertrand Marquis wrote:

On 24 Jun 2022, at 10:11, Julien Grall  wrote:

From: Julien Grall 

Both destroy_xen_mappings() and modify_xen_mappings() will take in
parameter a range [start, end[. Both end should be page aligned.

Add extra ASSERT() to ensure start and end are page aligned. Take the
opportunity to rename 'v' to 's' to be consistent with the other helper.

Signed-off-by: Julien Grall 


With the prototype updated in mm.h as suggested by Michal:


Done. I will send a new version shortly.


I notice you had applied and then reverted this, with the revert saying
an x86 ack was missing. I don't see any need for an x86 ack here though,
so I'm puzzled ...


Sorry, I am not sure why I thought this needed an x86 ack. I will 
(re-)commit it shortly.


Cheers,

--
Julien Grall



[PATCH] x86: Add MMIO Stale Data arch_caps to hardware domain

2022-07-18 Thread Jason Andryuk
Let the hardware domain know about the hardware it is running on.  This
allows a linux Dom0 to know that it has the appropriate microcode via
FB_CLEAR.  /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
changes from:
"Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state
unknown"
to:
"Mitigation: Clear CPU buffers; SMT Host state unknown"

Signed-off-by: Jason Andryuk 
---
Should calculate_host_policy()'s arch_caps mask also be updated?  They
are not identical today, but I don't know this code well enough to
understand why they differ.

 xen/arch/x86/msr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
index c2c0025e3a..f1c36d423f 100644
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -163,7 +163,9 @@ int init_domain_msr_policy(struct domain *d)
 
 mp->arch_caps.raw = val &
 (ARCH_CAPS_RDCL_NO | ARCH_CAPS_IBRS_ALL | ARCH_CAPS_RSBA |
- ARCH_CAPS_SSB_NO | ARCH_CAPS_MDS_NO | ARCH_CAPS_TAA_NO);
+ ARCH_CAPS_SSB_NO | ARCH_CAPS_MDS_NO | ARCH_CAPS_TAA_NO |
+ ARCH_CAPS_SBDR_SSDP_NO | ARCH_CAPS_FBSDP_NO | ARCH_CAPS_PSDP_NO |
+ ARCH_CAPS_FB_CLEAR);
 }
 
 d->arch.msr = mp;
-- 
2.36.1




Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector

2022-07-18 Thread Boris Ostrovsky



On 7/18/22 4:56 AM, Andrew Cooper wrote:

On 15/07/2022 14:10, Boris Ostrovsky wrote:

On 7/15/22 5:50 AM, Andrew Cooper wrote:

On 15/07/2022 09:18, Jane Malalane wrote:

On 14/07/2022 00:27, Boris Ostrovsky wrote:

    xen_hvm_smp_init();
    WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm,
xen_cpu_dead_hvm));
diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c
index 9d548b0c772f..be66e027ef28 100644
--- a/arch/x86/xen/suspend_hvm.c
+++ b/arch/x86/xen/suspend_hvm.c
@@ -5,6 +5,7 @@
    #include 
    #include 
    #include 
+#include 
    #include "xen-ops.h"
@@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled)
    xen_hvm_init_shared_info();
    xen_vcpu_restore();
    }
-    xen_setup_callback_vector();
+    if (xen_ack_upcall) {
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu) {
+    xen_hvm_evtchn_upcall_vector_t op = {
+    .vector = HYPERVISOR_CALLBACK_VECTOR,
+    .vcpu = per_cpu(xen_vcpu_id, cpu),
+    };
+
+    BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector,
+ &op));
+    /* Trick toolstack to think we are enlightened. */
+    if (!cpu)
+    BUG_ON(xen_set_callback_via(1));

What are you trying to make the toolstack aware of? That we have *a*
callback (either global or percpu)?

Yes, specifically for the check in libxl__domain_pvcontrol_available.

And others.

This is all a giant bodge, but basically a lot of tooling uses the
non-zero-ness of the CALLBACK_VIA param to determine whether the VM has
Xen-aware drivers loaded or not.

The value 1 is a CALLBACK_VIA value which encodes GSI 1, and the only
reason this doesn't explode everywhere is because the
evtchn_upcall_vector registration takes priority over GSI delivery.

This is decades of tech debt piled on top of tech debt.


Feels like it (setting the callback parameter) is something that the
hypervisor should do --- no need to expose guests to this.

Sensible or not, it is the ABI.

Linux still needs to work (nicely) with older Xen's in the world, and we
can't just retrofit a change in the hypervisor which says "btw, this ABI
we've just changed now has a side effect of modifying a field that you
also logically own".



The hypercall has been around for a while so I understand ABI concerns there 
but XEN_HVM_CPUID_UPCALL_VECTOR was introduced only a month ago. Why not tie 
presence of this bit to no longer having to explicitly set the callback field?


-boris




RE: [PATCH v1 10/18] x86: introduce the domain builder

2022-07-18 Thread Smith, Jackson
Hi Daniel,

> -Original Message-
> Subject: [PATCH v1 10/18] x86: introduce the domain builder
> 
> This commit introduces the domain builder configuration FDT parser along
> with the domain builder core for domain creation. To enable domain builder
> to be a cross architecture internal API, a new arch domain creation call
is
> introduced for use by the domain builder.

> diff --git a/xen/common/domain-builder/core.c

> +void __init builder_init(struct boot_info *info) {
> +struct boot_domain *d = NULL;
> +
> +info->builder = 
> +
> +if ( IS_ENABLED(CONFIG_BUILDER_FDT) )
> +{

> +}
> +
> +/*
> + * No FDT config support or an FDT wasn't present, do an initial
> + * domain construction
> + */
> +printk("Domain Builder: falling back to initial domain build\n");
> +info->builder->nr_doms = 1;
> +d = &info->builder->domains[0];
> +
> +d->mode = opt_dom0_pvh ? 0 : BUILD_MODE_PARAVIRTUALIZED;
> +
> +d->kernel = >mods[0];
> +d->kernel->kind = BOOTMOD_KERNEL;
> +
> +d->permissions = BUILD_PERMISSION_CONTROL |
> BUILD_PERMISSION_HARDWARE;
> +d->functions = BUILD_FUNCTION_CONSOLE |
> BUILD_FUNCTION_XENSTORE |
> + BUILD_FUNCTION_INITIAL_DOM;
> +
> +d->kernel->arch->headroom = bzimage_headroom(bootstrap_map(d-
> >kernel),
> +   d->kernel->size);
> +bootstrap_map(NULL);
> +
> +if ( d->kernel->string.len )
> +d->kernel->string.kind = BOOTSTR_CMDLINE; }

Forgive me if I'm incorrect, but I believe there is an issue with this
fallback logic for the case where no FDT was provided.

If dom0_mem is not supplied to the xen cmd line, then d->meminfo is never
initialized. (See dom0_compute_nr_pages/dom0_build.c:335)
This was giving me trouble because bd->meminfo.mem_max.nr_pages was left at
0, effectively clamping dom0 to 0 pages of RAM.

I'm not sure what the best solution is but one (easy) possibility is just
initializing meminfo to the dom0 defaults near the end of this function:
d->meminfo.mem_size = dom0_size;
d->meminfo.mem_min = dom0_min_size;
d->meminfo.mem_max = dom0_max_size;

Thanks,
Jackson




RE: [PATCH v1 04/18] x86: refactor entrypoints to new boot info

2022-07-18 Thread Smith, Jackson
Hi Daniel,

I hope outlook gets this reply right.

> -Original Message-
> Subject: [PATCH v1 04/18] x86: refactor entrypoints to new boot info

> diff --git a/xen/arch/x86/guest/xen/pvh-boot.c
> b/xen/arch/x86/guest/xen/pvh-boot.c
> index 834b1ad16b..28cf5df0a3 100644
> --- a/xen/arch/x86/guest/xen/pvh-boot.c
> +++ b/xen/arch/x86/guest/xen/pvh-boot.c

> @@ -99,13 +118,16 @@ static void __init get_memory_map(void)
>  sanitize_e820_map(e820_raw.map, &e820_raw.nr_map);
>  }
>
> -void __init pvh_init(multiboot_info_t **mbi, module_t **mod)
> +void __init pvh_init(struct boot_info **bi)
>  {
> -convert_pvh_info(mbi, mod);
> +*bi = init_pvh_info();
> +convert_pvh_info(*bi);
>
>  hypervisor_probe();
>  ASSERT(xen_guest);
>
> +(*bi)->arch->xen_guest = xen_guest;

I think you may have a typo/missed refactoring here?
I changed this line to "(*bi)->arch->xenguest = xen_guest;" to get the 
patchset to build.

The arch_boot_info struct in boot_info32.h has a field 'xen_guest' but the 
same field in asm/bootinfo.h was re-named from 'xen_guest' to 'xenguest' in 
the 'x86: adopt new boot info structures' commit.

What was your intent?

> +
>  get_memory_map();
>  }
>

Thanks,
Jackson Smith




RE: [PATCH v1 07/18] docs: update hyperlaunch device tree documentation

2022-07-18 Thread Smith, Jackson
Hi Daniel,

> -Original Message-
> Subject: [PATCH v1 07/18] docs: update hyperlaunch device tree
> documentation


> diff --git a/docs/designs/launch/hyperlaunch-devicetree.rst
> b/docs/designs/launch/hyperlaunch-devicetree.rst
> index b49c98cfbd..ae1a786d0b 100644
> --- a/docs/designs/launch/hyperlaunch-devicetree.rst
> +++ b/docs/designs/launch/hyperlaunch-devicetree.rst
> @@ -13,12 +13,268 @@ difference is the introduction of the ``hypervisor``

> +
> +The Hypervisor node
> +---
> +
> +The ``hypervisor`` node is a top level container for the domains that
> +will be
> built
> +by hypervisor on start up. The node will be named ``hypervisor``  with
> +a
> ``compatible``
> +property to identify which hypervisors the configuration is intended.
^^^ Should there be a note here that hypervisor node also needs a compatible 
"xen,"?

> +The
> hypervisor
> +node will consist of one or more config nodes and one or more domain
> nodes.
> +
> +Properties
> +""
> +
> +compatible
> +  Identifies which hypervisors the configuration is compatible. Required.
> +
> +  Format: "hypervisor,", e.g "hypervisor,xen"
^^^ Same here: compatible ","?

>  Example Configuration
>  -
> +
> +Multiboot x86 Configuration Dom0-only:
> +""
> +The following dts file can be provided to the Device Tree compiler,
> +``dtc``,
> to
> +produce a dtb file.
> +::
> +
> +  /dts-v1/;
> +
> +  / {
> +  chosen {
> +  hypervisor {
> +  compatible = "hypervisor,xen";
  compatible = "hypervisor,xen", "xen,x86";

> +
> +  dom0 {
> +  compatible = "xen,domain";
> +
> +  domid = <0>;
> +
> +  permissions = <3>;
> +  functions = <0xC00F>;
> +  mode = <5>;
> +
> +  domain-uuid = [B3 FB 98 FB 8F 9F 67 A3 8A 6E 62 5A 09
> + 13 F0
> 8C];
> +
> +  cpus = <1>;
> +  memory = <0x0 0x2000>;
^^ memory = "2048M";
Needs to be updated to new format for mem.

> +
> +  kernel {
> +  compatible = "module,kernel", "module,index";
> +  module-index = <1>;
> +  };
> +  };
> +
> +  };
> +  };
> +  };
> +

Similar adjustments are needed for the rest of the examples I believe.

Also, two typos:
Line 287 is missing a line ending semi-colon.
Line 82 has a double space between 'node' and 'may'.

Best,
Jackson




RE: Xen Memory Sharing Query

2022-07-18 Thread Marc Bonnici
Hi Andrew,

Thank you very much for your detailed reply, that makes things 
a lot clearer. I did have a few follow-up questions.

> gnttab v2 is horribly more complicated.  

With v2, do the high level behaviours change much from 
what you have outlined here? Do you expect in the long term 
for v2 to be the preferred implementation or are they more like 
alternatives?


> While a gref is mapped, domA is not permitted to edit the 
> associated entry.  

So IIUC once the gref is mapped then domA can't make any changes 
to the entry at all, (or at least they won't be reflected). So if 
domA wants to make any modifications (extend the shared memory 
region, change permissions etc.), then it would just create another
entry and share the new gref? 


> From a grant perspective, Xen doesn't enforce any policy.  domA's grefs
> can be mapped in the manner permitted by what domA wrote into the grant
> table.

So does this mean that if domA grants access to domB for a given frame, 
and then adds a new entry in its grant table with the same frame details 
but with "domid = domC" instead, would this be allowed? And if so, would this 
result in a 3-way shared buffer?

And finally a similar scenario, if a frame was shared from domA to domB, 
would domB be able to add an entry in its own grant table to share the 
same frame with domC and end up with the same outcome?

Thanks

Kind Regards
Marc Bonnici

> -Original Message-
> From: Andrew Cooper 
> Sent: 15 July 2022 18:29
> To: Marc Bonnici ; xen-
> de...@lists.xenproject.org
> Subject: Re: Xen Memory Sharing Query
> 
> On 15/07/2022 16:56, Marc Bonnici wrote:
> > Hi All,
> >
> > I was wondering if someone could help me understand some of the
> rules of the
> > memory sharing implementation in Xen?
> >
> > Specifically I'm looking to understand what restrictions Xen
> places on
> > granting access from one Dom to another from Xen's perspective,
> and what types
> > of grant requests would be allowed/rejected by Xen?
> >
> > I.e. How would the situation be handled if the same frame of
> memory was attempted
> > to be shared multiple times?
> >
> > As an example scenario, DomA shares 1 physical page of memory in a
> transaction
> > with DomB. And then without releasing any memory, DomA attempts to
> share
> > another region of memory, which includes the same physical page of
> the previous share
> > with DomB again. This would result in two concurrent shares
> containing an overlap.
> >
> > Apologies if I've missed something but is there any documentation
> / threat model
> > that would cover these types of scenarios? So far I have been
> trying to read through
> > the code but was wondering if there is something else I could
> refer to help me
> > with my understanding?
> 
> There's nothing adequately written down.  It ought to live in sphinx
> docs, but my copious free time is non-existent for speculative
> security
> reasons.
> 
> This all pertains to gnttab v1 which is the only supported one on
> ARM
> right now.  gnttab v2 is horribly more complicated.  Refer to
> https://github.com/xen-
> project/xen/blob/master/xen/include/public/grant_table.h#L132-L186
> 
> When DomA and DomB are set up and running, they each have a grant
> table.  The grant table is some shared memory (of Xen's) mapped into
> the
> guest, and is a bidirectional communication interface between the
> guest
> kernel and Xen.
> 
> The guest kernel logically owns the grant table, and it's a simple
> array
> of grant entries.  Entries 0 thru 7 are reserved for system use, and
> indeed two entries (one for xenstore, one for xenconsole) are set up
> on
> the guest kernel's behalf by the domain builder.  Entries 8 thru $N
> are
> entirely under the guest's control.
> 
> A guest kernel (domA) creates a grant by filling in a grant table
> entry,
> and passing the grant reference (the entry's index in the table) to
> some
> other entity in the system (in this case, domB).
> 
> The grant table entry is formed of:
> 
> u16 flags
> u16 domid
> u32 frame
> 
> so for domA to grant a frame to domB, it would pick a free gref (any
> entry in the table with flags=0) and fill in:
> 
> frame = f
> domid = domB
> smp_wmb()
> flags = GTF_permit_access (think "grant usable")
> 
> GFT_readonly is another relevant flag that domA might choose to set.
> 
> Then, domB would take the gref it has been given by domA, and make a
> gnttab_op_map() hypercall, passing {domA, gref} as an input.
> 
> Xen looks up gref in domA's grant table, checks e.g. domA granted
> access
> to domB, and if everything is happy, sets the the
> GFT_{reading,writing}
> flags (as appropriate) in flags.  This tells domA that the grant is
> currently mapped readably and/or writeably.
> 
> Later, when domB unmaps the grant, Xen clears the
> GFT_{reading,writing}
> bits, telling domA that the grant is no longer in use.
> 
> DomA then clears GTF_permit_access to mark this as gref as invalid,
> and
> can then free the frame.
> 
> 
> Now, that's the 
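
To make the flow described above concrete, a minimal guest-side sketch
(illustrative only: get_free_gref() and gnttab_entry() are assumed helpers
rather than real kernel APIs - a Linux guest would normally use wrappers such
as gnttab_grant_foreign_access() instead):

    #include <xen/interface/grant_table.h>
    #include <asm/xen/hypercall.h>

    /* domA: publish frame 'f' for domB (read/write). */
    static grant_ref_t grant_frame(unsigned long f, domid_t domb)
    {
        grant_ref_t ref = get_free_gref();          /* assumed allocator */
        grant_entry_v1_t *e = gnttab_entry(ref);    /* assumed accessor  */

        e->frame = f;
        e->domid = domb;
        smp_wmb();                      /* publish frame/domid before flags */
        e->flags = GTF_permit_access;   /* the gref becomes usable here */
        return ref;
    }

    /* domB: map {domA, ref} at a chosen virtual address. */
    static int map_gref(domid_t doma, grant_ref_t ref, void *addr)
    {
        struct gnttab_map_grant_ref op = {
            .host_addr = (unsigned long)addr,
            .flags     = GNTMAP_host_map,
            .ref       = ref,
            .dom       = doma,
        };

        HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &op, 1);
        return op.status;               /* GNTST_okay on success */
    }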

Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports

2022-07-18 Thread Boris Ostrovsky



On 7/17/22 1:20 AM, Juergen Gross wrote:


What about filling the complete hypercall page just with "int 3" or "ud2"?

Any try to do a hypercall before the hypercall page has been initialized
is a bug anyway. What good can come from calling a function which will
return a basically random value instead of doing a privileged operation?



This is all about objtool, that's why 'ret' was added there originally by f4b4bc10b0b8 
("x86/xen: Support objtool vmlinux.o validation in xen-head.S").


Before that it was all 'nop' which is similar to what you are suggesting 
('int3' or 'ud2' would of course be better)
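
As a rough illustration of the idea being discussed (a sketch only, written
in C for brevity - the real page and its fill directives live in xen-head.S):

    extern char hypercall_page[PAGE_SIZE];  /* filled in by the hypervisor once registered */

    /* Fill the not-yet-initialized hypercall page with int3 (0xcc) so any
     * premature hypercall traps loudly instead of returning garbage. */
    static void __init poison_hypercall_page(void)
    {
        memset(hypercall_page, 0xcc, PAGE_SIZE);
    }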


-boris




Re: [PATCH v2] EFI: strip xen.efi when putting it on the EFI partition

2022-07-18 Thread Anthony PERARD
On Mon, Jul 11, 2022 at 06:05:37PM +0200, Jan Beulich wrote:
> With debug info retained, xen.efi can be quite large. Unlike for xen.gz
> there's no intermediate step (mkelf32 there) involved which would strip
> debug info kind of as a side effect. While the installing of xen.efi on
> the EFI partition is an optional step (intended to be a courtesy to the
> developer), adjust it also for the purpose of documenting what distros
> would be expected to do during boot loader configuration (which is what
> would normally put xen.efi into the EFI partition).
> 
> Model the control over stripping after Linux'es module installation,
> except that the stripped executable is constructed in the build area
> instead of in the destination location. This is to conserve on space
> used there - EFI partitions tend to be only a few hundred Mb in size.
> 
> Signed-off-by: Jan Beulich 
> Tested-by: Henry Wang 
> Tested-by: Wei Chen  # arm

Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH v2 8/9] xue: mark DMA buffers as reserved for the device

2022-07-18 Thread Marek Marczykowski-Górecki
On Thu, Jul 14, 2022 at 01:51:06PM +0200, Jan Beulich wrote:
> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
> > The important part is to include those buffers in IOMMU page table
> > relevant for the USB controller. Otherwise, DbC will stop working as
> > soon as IOMMU is enabled, regardless of to which domain device assigned
> > (be it xen or dom0).
> > If the device is passed through to dom0 or other domain (see later
> > patches), that domain will effectively have access to those buffers too.
> > It does give such domain yet another way to DoS the system (as is the
> > case when having PCI device assigned already), but also possibly steal
> > the console ring content. Thus, such domain should be a trusted one.
> > In any case, prevent anything else being placed on those pages by adding
> > artificial padding.
> 
> I don't think this device should be allowed to be assigned to any
> DomU. Just like the EHCI driver, at some point in the series you
> will want to call pci_hide_device() (you may already do, and I may
> merely have missed it).

There is the major difference compared to the EHCI driver: the XHCI
hardware interface was designed for debug capability to be driven by
separate driver (see description of patch 9). The device reset is the
only action that needs some coordination and this patch series follows
what Linux does (re-enable DbC part, when it detects the XHCI reset).
Having this capability is especially important when one has only a
single USB controller (many, if not most recent Intel platforms) and has
some important devices on USB (for example system disk, or the only ethernet
controller - I have both of those cases in my test lab...).

Anyway, this patch is necessary even if no domain would use the device.
As Andrew explained in response to the cover letter of the RFC series,
Xen doesn't have IOMMU context for devices used by Xen exclusively. So,
with the current model, it would be pci_ro_device() + dom0's IOMMU
context.

> > Using this API for DbC pages requires raising MAX_USER_RMRR_PAGES.
> 
> I disagree here: This is merely an effect of you re-using the pre-
> existing functionality there with too little customization. I think
> the respective check shouldn't be applied for internal uses.

Ok, I'll move the check.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH] xl: move freemem()'s "credit expired" loop exit

2022-07-18 Thread Anthony PERARD
On Tue, Jul 12, 2022 at 04:08:12PM +0200, Jan Beulich wrote:
> Move the "credit expired" loop exit to the middle of the loop,
> immediately after "return true". This way having reached the goal on the
> last iteration would be reported as success to the caller, rather than
> as "timed out".
> 
> Signed-off-by: Jan Beulich 
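
(For illustration, the resulting control flow is roughly the following
sketch - not the actual xl code; the helper names are placeholders:)

    /* Sketch: the "credit expired" exit sits after the success check, so
     * reaching the goal on the final iteration still reports success. */
    static bool freemem(uint64_t need_kb)
    {
        int credit = 30;                        /* placeholder retry budget */

        for ( ;; )
        {
            if ( free_memory_kb() >= need_kb )  /* placeholder check */
                return true;

            if ( credit-- <= 0 )                /* credit expired */
                return false;

            balloon_and_wait();                 /* placeholder: set target, wait */
        }
    }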

Reviewed-by: Anthony PERARD 

Thanks,

-- 
Anthony PERARD



Re: [PATCH v2 9/9] xue: allow driving the rest of XHCI by a domain while Xen uses DbC

2022-07-18 Thread Marek Marczykowski-Górecki
On Thu, Jul 14, 2022 at 02:06:07PM +0200, Jan Beulich wrote:
> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
> > That's possible, because the capability was designed specifically to
> > allow separate driver handle it, in parallel to unmodified xhci driver
> > (separate set of registers, pretending the port is "disconnected" for
> > the main xhci driver etc). It works with Linux dom0, although requires
> > an awful hack - re-enabling bus mastering behind dom0's backs.
> > Linux driver does similar thing - see
> > drivers/usb/early/xhci-dbc.c:xdbc_handle_events().
> 
> Isn't there a risk that intermediately data was lost?

Yes, there is such risk (although minimal in practice - it happens just
once during dom0 boot). You can avoid it by instructing dom0 to not use
that USB controller.
Having this capability is really helpful (compared with the alternative
of using the whole USB controller by either Xen or Linux), as many
(most) systems have only a single USB controller.

> > To avoid Linux messing with the DbC, mark this MMIO area as read-only.
> 
> In principle this would want to happen quite a bit earlier in the
> series. I'm okay with it being kept here as long as it is made
> very obvious to and easily noticeable by committers that this series
> should only be committed all in one go.
> 
> Also along with this is where I'd see the pci_hide_device() go.

Earlier version of the patch set had pci_ro_device() before this patch.
I can add pci_ro_device() in the initial patch, and drop it in this one.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v2 4/9] console: support multiple serial console simultaneously

2022-07-18 Thread Marek Marczykowski-Górecki
On Wed, Jul 13, 2022 at 11:39:07AM +0200, Jan Beulich wrote:
> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
> > Previously only one serial console was supported at the same time. Using
> > console=com1,dbgp,vga silently ignored all but last serial console (in
> > this case: only dbgp and vga were active).
> > 
> > Fix this by storing not a single sercon_handle, but an array of them, up
> > to MAX_SERCONS entries. The value of MAX_SERCONS (4) is arbitrary,
> > inspired by the number of SERHND_IDX values.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> >  xen/drivers/char/console.c | 58 ++-
> >  1 file changed, 45 insertions(+), 13 deletions(-)
> > 
> > diff --git a/xen/drivers/char/console.c b/xen/drivers/char/console.c
> > index f9937c5134c0..44b703296487 100644
> > --- a/xen/drivers/char/console.c
> > +++ b/xen/drivers/char/console.c
> > @@ -113,7 +113,9 @@ static char *__read_mostly conring = _conring;
> >  static uint32_t __read_mostly conring_size = _CONRING_SIZE;
> >  static uint32_t conringc, conringp;
> >  
> > -static int __read_mostly sercon_handle = -1;
> > +#define MAX_SERCONS 4
> 
> Might this want to be a Kconfig setting?

Is that going to be useful for anybody (who isn't modifying the code
anyway, for example to add yet another console driver)?

> > +static int __read_mostly sercon_handle[MAX_SERCONS];
> > +static int __read_mostly nr_sercon_handle = 0;
> 
> I would have wanted to ask for __ro_after_init here, but Arm still
> hasn't enabled support for it.
> 
> > @@ -395,9 +397,17 @@ static unsigned int serial_rx_cons, serial_rx_prod;
> >  
> >  static void (*serial_steal_fn)(const char *, size_t nr) = early_puts;
> >  
> > +/* Redirect any console output to *fn*, if *handle* is configured as a 
> > console. */
> >  int console_steal(int handle, void (*fn)(const char *, size_t nr))
> >  {
> > -if ( (handle == -1) || (handle != sercon_handle) )
> > +int i;
> 
> unsigned int please (also elsewhere).
> 
> > +if ( handle == -1 )
> > +return 0;
> > +for ( i = 0; i < nr_sercon_handle; i++ )
> > +if ( handle == sercon_handle[i] )
> > +break;
> > +if ( nr_sercon_handle && i == nr_sercon_handle )
> >  return 0;
> 
> No need for the left side of the &&, I think. I guess it's actively
> wrong to be there.
> 
> >  if ( serial_steal_fn != NULL )
> 
> I guess you then also want to make serial_steal_fn an array, and no
> longer return constant 1 from the function.
> 
> > @@ -977,7 +990,8 @@ void __init console_init_preirq(void)
> >  continue;
> >  else if ( (sh = serial_parse_handle(p)) >= 0 )
> >  {
> > -sercon_handle = sh;
> > +if ( nr_sercon_handle < MAX_SERCONS )
> > +sercon_handle[nr_sercon_handle++] = sh;
> 
> else 
> 
> > @@ -1291,7 +1322,8 @@ static int suspend_steal_id;
> >  
> >  int console_suspend(void)
> >  {
> > -suspend_steal_id = console_steal(sercon_handle, suspend_steal_fn);
> > +if ( nr_sercon_handle )
> > +suspend_steal_id = console_steal(sercon_handle[0], 
> > suspend_steal_fn);
> >  serial_suspend();
> >  return 0;
> >  }
> 
> The commit message gives no explanation why only the first handle
> would want/need dealing with here.

Sure, I can add an explanation. I'm adding this comment to console_steal():
/* Redirect any console output to *fn*, if *handle* is configured as a console. 
*/

So, calling console_steal() is about all serial consoles, not just a
specific one. The use case for this "if" part is gdbstub, which wants
to redirect serial output only if that serial was configured as both
console and gdb. Having proper per-console stealing is doable, but IMO
not worth it (it would require also avoiding duplicated output in case
of multiple serial consoles, and probably few more corner cases).

> One overall remark: Especially with sync_console latency is going to
> be yet worse with all output being done sequentially. The help text
> for "console=" will want to mention this, up and until this would be
> parallelized.

I don't think it was parallelized anywhere. All the relevant functions
(__putstr especially) write to various console types sequentially. The
difference is that previously only the last "serial" console was used;
all the others were silently ignored. So, it was "parallel" with all
_zero other_ serial consoles, but not with other console types.
Anyway, both the help text and the boot warning for sync_console already
warn about it. Do you want me to mention the relation to the number of
configured consoles explicitly?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH 1/3] x86: move some code out of arch/x86/kernel/cpu/mtrr

2022-07-18 Thread Borislav Petkov
On Fri, Jul 15, 2022 at 04:25:47PM +0200, Juergen Gross wrote:
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 736262a76a12..e43322f8a4ef 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c

I guess the move's ok but not into cpu/common.c pls. That thing is
huuuge and is a dumping ground for everything.

arch/x86/kernel/cpu/cacheinfo.c looks like a more viable candidate for
all things cache.

Rest looks trivial.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v2] Subject: x86/PAT: Report PAT on CPUs that support PAT without MTRR

2022-07-18 Thread Chuck Zmudzinski
On 7/18/2022 7:39 AM, Jan Beulich wrote:
> On 18.07.2022 13:31, Chuck Zmudzinski wrote:
> > On 7/18/2022 2:07 AM, Jan Beulich wrote:
> >> On 15.07.2022 21:53, Chuck Zmudzinski wrote:
> >>> Two things I see here in my efforts to get a patch to fix this regression:
> >>>
> >>> 1. Does Xen have plans to give Linux running in Dom0 write-access to the
> >>> PAT MSR?
> >>
> >> No, as this is not technically feasible (all physical CPUs should run
> >> with the same value in the MSR, or else other issues arise).
> >>
> >>> 2. Does Xen have plans to expose MTRRs to Linux running in Dom0?
> >>
> >> Xen does expose MTRRs to PV Dom0, but via a hypercall mechanism. I
> >> don't think there are plans on the Xen side to support the MSR
> >> interface (and hence to expose the CPUID bit), and iirc there are
> >> no plans on the Linux side to use the MTRR interface. This also
> >> wouldn't really make sense anymore now that it has become quite
> >> clear that Linux wants to have PAT working without depending on
> >> MTRR.
> > 
> > I am not so sure about that, given what Borislav Petkov
> > said when commenting on your patch here:
> > 
> > https://lore.kernel.org/lkml/ysrjx%2fu1xn8rq...@zn.tnic/
> > 
> > Specifically, Borislav Petkov wrote on Tue, 5 Jul 2022 18:14:23 +0200:
> > 
> > Actually, the current goal is to adjust Xen dom0 because:
> > 
> > 1. it uses the PAT code
> > 
> > 2. but then it does something special and hides the MTRRs
> > 
> > which is not something real hardware does.
> > 
> > So this one-off thing should be prominent, visible and not get in the
> > way.
> > 
> > --end of Borislav Petkov quote---
>
> And then, a day later, he said
>
> "So I'm being told that it would be generally beneficial for all kinds of
>  virtualization solutions to be able to support PAT only, without MTRRs
>  so it would be interesting to see how ugly it would become to decouple
>  PAT from MTRRs in Linux..."

What if it proves to be too ugly to decouple PAT from MTRRs? Then
I doubt that "Linux wants to have PAT working without depending
on MTRR." We can hope that Juergen's work to decouple PAT from
MTRRs is not so ugly that it cannot be done, but that is by no means
certain at this point. At the very least, this means the fix for the
regression will be delayed considerably, and possibly this regression
will never be fixed in a mainline kernel release.

Chuck



Re: [PATCH v2 2/2] xen: Fix latent check-endbr.sh bug with 32bit build environments

2022-07-18 Thread Andrew Cooper
On 18/07/2022 10:49, Jan Beulich wrote:
> On 18.07.2022 11:31, Andrew Cooper wrote:
>> On 18/07/2022 10:11, Jan Beulich wrote:
>>> On 15.07.2022 15:26, Andrew Cooper wrote:
 --- a/xen/tools/check-endbr.sh
 +++ b/xen/tools/check-endbr.sh
 @@ -61,19 +61,36 @@ ${OBJDUMP} -j .text $1 -d -w | grep '  endbr64 *$' | 
 cut -f 1 -d ':' > $VALID &
  #the lower bits, rounding integers to the nearest 4k.
  #
  #Instead, use the fact that Xen's .text is within a 1G aligned 
 region, and
 -#split the VMA in half so AWK's numeric addition is only working on 
 32 bit
 -#numbers, which don't lose precision.
 +#split the VMA so AWK's numeric addition is only working on <32 bit
 +#numbers, which don't lose precision.  (See point 5)
  #
  # 4) MAWK doesn't support plain hex constants (an optional part of the 
 POSIX
  #spec), and GAWK and MAWK can't agree on how to work with hex 
 constants in
  #a string.  Use the shell to convert $vma_lo to decimal before 
 passing to
  #AWK.
  #
 +# 5) Point 4 isn't fully portable.  POSIX only requires that $((0xN)) be
 +#evaluated as long, which in 32bit shells turns negative if bit 31 of 
 the
 +#VMA is set.  AWK then interprets this negative number as a double 
 before
 +#adding the offsets from the binary grep.
 +#
 +#Instead of doing an 8/8 split with vma_hi/lo, do a 9/7 split.
 +#
 +#The consequence of this is that for all offsets, $vma_lo + offset 
 needs
 +#to be less that 256M (i.e. 7 nibbles) so as to be successfully 
 recombined
 +#with the 9 nibbles of $vma_hi.  This is fine; .text is at the start 
 of a
 +#1G aligned region, and Xen is far far smaller than 256M, but leave 
 safety
 +#check nevertheless.
 +#
  eval $(${OBJDUMP} -j .text $1 -h |
 -$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 
 8), substr($4, 9, 16)}')
 +$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 
 9), substr($4, 10, 16)}')
  
  ${OBJCOPY} -j .text $1 -O binary $TEXT_BIN
  
 +bin_sz=$(stat -c '%s' $TEXT_BIN)
 +[ "$bin_sz" -ge $(((1 << 28) - $vma_lo)) ] &&
 +{ echo "$MSG_PFX Error: .text offsets can exceed 256M" >&2; exit 1; }
>>> ... s/can/cannot/ ?
>> Why?  "Can" is correct here.  If the offsets can't exceed 256M, then
>> everything is good.
> Hmm, the wording then indeed is ambiguous.

I see your point.  In this case it's meant as "are able to", but this is
still clearer than using "can't" because at least the text matches the
check which triggered it.

>  I read "can" as "are allowed
> to", when we mean "aren't allowed to". Maybe ".text is 256M or more in
> size"? If you mention "offsets", then I think the check should be based
> on actually observing an offset which is too large (which .text size
> alone doesn't guarantee will happen).

It's not just .text on its own because the VMA is offset by 2M, hence
the subtraction of $vma_lo in the main calculation.

There's no point searching for offsets.  There will be one near the end,
so all searching for an offset would do is complicate the critical loop.

How about ".text offsets must not exceed 256M" ?

That should be unambiguous.

~Andrew


Re: [PATCH v2] Subject: x86/PAT: Report PAT on CPUs that support PAT without MTRR

2022-07-18 Thread Chuck Zmudzinski
On 7/18/2022 7:39 AM, Jan Beulich wrote:
> On 18.07.2022 13:31, Chuck Zmudzinski wrote:
> > On 7/18/2022 2:07 AM, Jan Beulich wrote:
> >> On 15.07.2022 21:53, Chuck Zmudzinski wrote:
> >>> Two things I see here in my efforts to get a patch to fix this regression:
> >>>
> >>> 1. Does Xen have plans to give Linux running in Dom0 write-access to the
> >>> PAT MSR?
> >>
> >> No, as this is not technically feasible (all physical CPUs should run
> >> with the same value in the MSR, or else other issues arise).
> >>
> >>> 2. Does Xen have plans to expose MTRRs to Linux running in Dom0?
> >>
> >> Xen does expose MTRRs to PV Dom0, but via a hypercall mechanism. I
> >> don't think there are plans on the Xen side to support the MSR
> >> interface (and hence to expose the CPUID bit), and iirc there are
> >> no plans on the Linux side to use the MTRR interface. This also
> >> wouldn't really make sense anymore now that it has become quite
> >> clear that Linux wants to have PAT working without depending on
> >> MTRR.
> > 
> > I am not so sure about that, given what Borislav Petkov
> > said when commenting on your patch here:
> > 
> > https://lore.kernel.org/lkml/ysrjx%2fu1xn8rq...@zn.tnic/
> > 
> > Specifically, Borislav Petkov wrote on Tue, 5 Jul 2022 18:14:23 +0200:
> > 
> > Actually, the current goal is to adjust Xen dom0 because:
> > 
> > 1. it uses the PAT code
> > 
> > 2. but then it does something special and hides the MTRRs
> > 
> > which is not something real hardware does.
> > 
> > So this one-off thing should be prominent, visible and not get in the
> > way.
> > 
> > --end of Borislav Petkov quote---
>
> And then, a day later, he said
>
> "So I'm being told that it would be generally beneficial for all kinds of
>  virtualization solutions to be able to support PAT only, without MTRRs
>  so it would be interesting to see how ugly it would become to decouple
>  PAT from MTRRs in Linux..."
>
> > Jan, can you explain this comment by Borislav Petkov about
> > Xen being a "one-off thing" that hides MTRRs and needs
> > to be "adjusted" so it does "not get in the way"?
>
> I'm afraid this isn't the first time that you ask people to explain
> what somebody else said. I don't follow why you think I could better
> explain what Boris said and why than he could do himself.

That is why I also asked Boris to say something now to clarify his
opinion on these matters. Let's wait and see whether he does.

Chuck

>
> Jan




Re: [PATCH v2 7/9] IOMMU/AMD: wire common device reserved memory API

2022-07-18 Thread Jan Beulich
On 18.07.2022 13:35, Marek Marczykowski-Górecki wrote:
> On Thu, Jul 14, 2022 at 12:22:36PM +0200, Jan Beulich wrote:
>> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
>>> --- a/xen/drivers/passthrough/amd/iommu_acpi.c
>>> +++ b/xen/drivers/passthrough/amd/iommu_acpi.c
>>> @@ -1078,6 +1078,20 @@ static inline bool_t is_ivmd_block(u8 type)
>>>  type == ACPI_IVRS_TYPE_MEMORY_IOMMU);
>>>  }
>>>  
>>> +static int __init cf_check add_one_extra_ivmd(xen_pfn_t start, xen_ulong_t 
>>> nr, u32 id, void *ctxt)
>>> +{
>>> +struct acpi_ivrs_memory ivmd;
>>> +
>>> +ivmd.start_address = start << PAGE_SHIFT;
>>> +ivmd.memory_length = nr << PAGE_SHIFT;
>>
>> Aren't these at risk of truncation on 32-bit architectures? We have
>> suitable wrappers for such conversions, avoiding such issues.
> 
> Well, this (and the vtd equivalent) is x86-only, so it's always 64-bit.

Nevertheless writing things cleanly everywhere will reduce the risk of
somebody cloning this code on the assumption that it's correct.

> Anyway, what are those wrappers?

pfn_to_paddr()
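
For illustration, the hunk in question rewritten along those lines, also folding the field assignments into the initializer as suggested elsewhere in the thread (a sketch only, not the actual v3 code):

    static int __init cf_check add_one_extra_ivmd(xen_pfn_t start, xen_ulong_t nr,
                                                  u32 id, void *ctxt)
    {
        struct acpi_ivrs_memory ivmd = {
            .header = {
                .length = sizeof(ivmd),
                .flags = ACPI_IVMD_UNITY | ACPI_IVMD_READ | ACPI_IVMD_WRITE,
                .device_id = id,
                .type = ACPI_IVRS_TYPE_MEMORY_ONE,
            },
            .start_address = pfn_to_paddr(start),
            .memory_length = pfn_to_paddr(nr), /* page count expressed in bytes */
        };

        /* ... remainder of the function unchanged ... */
    }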

Jan



Re: [PATCH v2] Subject: x86/PAT: Report PAT on CPUs that support PAT without MTRR

2022-07-18 Thread Jan Beulich
On 18.07.2022 13:31, Chuck Zmudzinski wrote:
> On 7/18/2022 2:07 AM, Jan Beulich wrote:
>> On 15.07.2022 21:53, Chuck Zmudzinski wrote:
>>> Two things I see here in my efforts to get a patch to fix this regression:
>>>
>>> 1. Does Xen have plans to give Linux running in Dom0 write-access to the
>>> PAT MSR?
>>
>> No, as this is not technically feasible (all physical CPUs should run
>> with the same value in the MSR, or else other issues arise).
>>
>>> 2. Does Xen have plans to expose MTRRs to Linux running in Dom0?
>>
>> Xen does expose MTRRs to PV Dom0, but via a hypercall mechanism. I
>> don't think there are plans on the Xen side to support the MSR
>> interface (and hence to expose the CPUID bit), and iirc there are
>> no plans on the Linux side to use the MTRR interface. This also
>> wouldn't really make sense anymore now that it has become quite
>> clear that Linux wants to have PAT working without depending on
>> MTRR.
> 
> I am not so sure about that, given what Borislav Petkov
> said when commenting on your patch here:
> 
> https://lore.kernel.org/lkml/ysrjx%2fu1xn8rq...@zn.tnic/
> 
> Specifically, Borislav Petkov wrote on Tue, 5 Jul 2022 18:14:23 +0200:
> 
> Actually, the current goal is to adjust Xen dom0 because:
> 
> 1. it uses the PAT code
> 
> 2. but then it does something special and hides the MTRRs
> 
> which is not something real hardware does.
> 
> So this one-off thing should be prominent, visible and not get in the
> way.
> 
> --end of Borislav Petkov quote---

And then, a day later, he said

"So I'm being told that it would be generally beneficial for all kinds of
 virtualization solutions to be able to support PAT only, without MTRRs
 so it would be interesting to see how ugly it would become to decouple
 PAT from MTRRs in Linux..."

> Jan, can you explain this comment by Borislav Petkov about
> Xen being a "one-off thing" that hides MTRRs and needs
> to be "adjusted" so it does "not get in the way"?

I'm afraid this isn't the first time that you ask people to explain
what somebody else said. I don't follow why you think I could better
explain what Boris said and why than he could do himself.

Jan



Re: [PATCH v2 7/9] IOMMU/AMD: wire common device reserved memory API

2022-07-18 Thread Marek Marczykowski-Górecki
On Thu, Jul 14, 2022 at 12:22:36PM +0200, Jan Beulich wrote:
> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
> > --- a/xen/drivers/passthrough/amd/iommu_acpi.c
> > +++ b/xen/drivers/passthrough/amd/iommu_acpi.c
> > @@ -1078,6 +1078,20 @@ static inline bool_t is_ivmd_block(u8 type)
> >  type == ACPI_IVRS_TYPE_MEMORY_IOMMU);
> >  }
> >  
> > +static int __init cf_check add_one_extra_ivmd(xen_pfn_t start, xen_ulong_t 
> > nr, u32 id, void *ctxt)
> > +{
> > +struct acpi_ivrs_memory ivmd;
> > +
> > +ivmd.start_address = start << PAGE_SHIFT;
> > +ivmd.memory_length = nr << PAGE_SHIFT;
> 
> Aren't these at risk of truncation on 32-bit architectures? We have
> suitable wrappers for such conversions, avoiding such issues.

Well, this (and the vtd equivalent) is x86-only, so it's always 64-bit.
Anyway, what are those wrappers?

> > +ivmd.header.flags = ACPI_IVMD_UNITY |
> > +ACPI_IVMD_READ | ACPI_IVMD_WRITE;
> > +ivmd.header.length = sizeof(ivmd);
> > +ivmd.header.device_id = id;
> > +ivmd.header.type = ACPI_IVRS_TYPE_MEMORY_ONE;
> 
> Please make these the variable's initializer.
> 
> Jan

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH 0/3] x86: make pat and mtrr independent from each other

2022-07-18 Thread Chuck Zmudzinski
On 7/17/2022 3:55 AM, Thorsten Leemhuis wrote:
> Hi Juergen!
>
> On 15.07.22 16:25, Juergen Gross wrote:
> > Today PAT can't be used without MTRR being available, unless MTRR is at
> > least configured via CONFIG_MTRR and the system is running as Xen PV
> > guest. In this case PAT is automatically available via the hypervisor,
> > but the PAT MSR can't be modified by the kernel and MTRR is disabled.
> > 
> > As an additional complexity the availability of PAT can't be queried
> > via pat_enabled() in the Xen PV case, as the lack of MTRR will set PAT
> > to be disabled. This leads to some drivers believing that not all cache
> > modes are available, resulting in failures or degraded functionality.
> > 
> > The same applies to a kernel built with no MTRR support: it won't
> > allow to use the PAT MSR, even if there is no technical reason for
> > that, other than setting up PAT on all cpus the same way (which is a
> > requirement of the processor's cache management) is relying on some
> > MTRR specific code.
> > 
> > Fix all of that by:
> > 
> > - moving the function needed by PAT from MTRR specific code one level
> >   up
> > - adding a PAT indirection layer supporting the 3 cases "no or disabled
> >   PAT", "PAT under kernel control", and "PAT under Xen control"
> > - removing the dependency of PAT on MTRR
>
> Thx for working on this. If you need to respin these patches for one
> reason or another, could you do me a favor and add proper 'Link:' tags
> pointing to all reports about this issue? e.g. like this:
>
>  Link: https://lore.kernel.org/regressions/YnHK1Z3o99eMXsVK@mail-itl/
>
> These tags are considered important by Linus[1] and others, as they
> allow anyone to look into the backstory weeks or years from now. That is
> why they should be placed in cases like this, as
> Documentation/process/submitting-patches.rst and
> Documentation/process/5.Posting.rst explain in more detail. I care
> personally, because these tags make my regression tracking efforts a
> whole lot easier, as they allow my tracking bot 'regzbot' to
> automatically connect reports with patches posted or committed to fix
> tracked regressions.
>
> [1] see for example:
> https://lore.kernel.org/all/CAHk-=wjMmSZzMJ3Xnskdg4+GGz=5p5p+gsyyfbth0f-dgvd...@mail.gmail.com/
> https://lore.kernel.org/all/CAHk-=wgs38ZrfPvy=nowvkvzjpm3vfu1zobp37fwd_h9iad...@mail.gmail.com/
> https://lore.kernel.org/all/CAHk-=wjxzafG-=j8ot30s7upn4rhbs6tx-uvfz5rme+l5_d...@mail.gmail.com/
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>

I echo Thorsten's thanks for starting on this now instead of waiting until
September, which I think is when Juergen said last week he could start
working on this. I agree with Thorsten that Link tags are needed.
Since multiple patches have been proposed to fix this regression,
perhaps add a Link to each proposed patch, and a note that
the original report identified a specific commit which, when reverted,
also fixes it. IMO, this is all part of the backstory Thorsten refers to.

It looks like with this approach, a fix will not be coming real soon,
and Borislav Petkov also discouraged me from testing this
patch set until I receive a ping telling me it is ready for testing,
which seems to confirm that this regression will not be fixed
very soon. Please correct me if I am wrong about how long
it will take to fix it with this approach.

Also, is there any guarantee this approach is endorsed by
all the maintainers who will need to sign-off, especially
Linus? I say this because some of the discussion on the
earlier proposed patches makes me doubt this. I am especially
referring to this discussion:

https://lore.kernel.org/lkml/4c8c9d4c-1c6b-8e9f-fa47-918a64898...@leemhuis.info/

and also, here:

https://lore.kernel.org/lkml/ysrjx%2fu1xn8rq...@zn.tnic/

where Borislav Petkov argues that Linux should not be
patched at all to fix this regression but instead the fix
should come by patching the Xen hypervisor.

So I have several questions, presuming the fix is going
to be delayed for some time, and also presuming this approach
does not yet have the blessing of the maintainers
who will need to sign off:

1. Can you estimate when the patch series will be ready for
testing and suitable for a prepatch or RC release?

2. Can you estimate when the patch series will be ready to be
merged into the mainline release? Is there any hope it will be
fixed before the next longterm release hosted on kernel.org?

3. Since a fix is likely not coming soon, can you explain
why the commit that was mentioned in the original
report cannot be reverted as a temporary solution while
we wait for the full fix to come later? I can say that
reverting that commit (It was a commit affecting
drm/i915) does fix the issue on my system with no
negative side effects at all. In such a case, it seems
contrary to Linus' regression rule to not revert the
offending commit, even if reverting the offending
commit is not going to be the final 

Re: [PATCH v2] Subject: x86/PAT: Report PAT on CPUs that support PAT without MTRR

2022-07-18 Thread Chuck Zmudzinski
On 7/18/2022 2:07 AM, Jan Beulich wrote:
> On 15.07.2022 21:53, Chuck Zmudzinski wrote:
> > Two things I see here in my efforts to get a patch to fix this regression:
> > 
> > 1. Does Xen have plans to give Linux running in Dom0 write-access to the
> > PAT MSR?
>
> No, as this is not technically feasible (all physical CPUs should run
> with the same value in the MSR, or else other issues arise).
>
> > 2. Does Xen have plans to expose MTRRs to Linux running in Dom0?
>
> Xen does expose MTRRs to PV Dom0, but via a hypercall mechanism. I
> don't think there are plans on the Xen side to support the MSR
> interface (and hence to expose the CPUID bit), and iirc there are
> no plans on the Linux side to use the MTRR interface. This also
> wouldn't really make sense anymore now that it has become quite
> clear that Linux wants to have PAT working without depending on
> MTRR.

I am not so sure about that, given what Borislav Petkov
said when commenting on your patch here:

https://lore.kernel.org/lkml/ysrjx%2fu1xn8rq...@zn.tnic/

Specifically, Borislav Petkov wrote on Tue, 5 Jul 2022 18:14:23 +0200:

Actually, the current goal is to adjust Xen dom0 because:

1. it uses the PAT code

2. but then it does something special and hides the MTRRs

which is not something real hardware does.

So this one-off thing should be prominent, visible and not get in the
way.

--end of Borislav Petkov quote---

Jan, can you explain this comment by Borislav Petkov about
Xen being a "one-off thing" that hides MTRRs and needs
to be "adjusted" so it does "not get in the way"?

Borislav, are you willing to retract the comments you made
on Tue, 5 Jul 2022 18:14:23 +0200 about Xen?

>
> Jan

Jan, thanks for answering my questions.

Chuck



Re: [PATCH 0/5] xen/wait: Improvements

2022-07-18 Thread Andrew Cooper
On 18/07/2022 12:11, Jan Beulich wrote:
> On 18.07.2022 09:18, Andrew Cooper wrote:
>> This started out as patch 2 for a different task, and quickly identified some
>> technical debt, long overdue for cleaning up.
>>
>> Andrew Cooper (5):
>>   xen/wait: Drop vestigial remnants of TRAP_regs_partial
>>   xen/wait: Extend the description of how this logic actually works
>>   xen/wait: Minor asm improvements
>>   xen/wait: Use relative stack adjustments
>>   xen/wait: Remove VCPU_AFFINITY_WAIT
> While going through this series I came to notice that we build wait.c even
> on Arm, and I couldn't convince myself that wait_event() is actually not
> reachable there when e.g. there's an Arm-specific vm_event.c. I would
> generally prefer if non-functioning code paths were actually compiled out.
>
> Thoughts?

I've been wanting to delete wait.c fully for a long time.

It is only needed because the event/paging/access interfaces still use a
single 4k shared ring, and even then, only when you happen to fill the
4k ring, which is 11 vCPUs on x86 last I checked.

I could easily believe that limit has never been tripped on ARM.

There are plenty of perf wins to be had by building a new Xen=>agent
interface using acquire_resource, which include being able to guarantee
that we never need to pause a vCPU to wait for space in the ring to post
an event.

With the interface fixed, wait.c ceases to have any use whatsoever.

~Andrew


Re: [PATCH 5/5] xen/wait: Remove VCPU_AFFINITY_WAIT

2022-07-18 Thread Andrew Cooper
On 18/07/2022 11:48, Jan Beulich wrote:
> On 18.07.2022 09:18, Andrew Cooper wrote:
>> With the waitqueue logic updated to not use an absolute stack pointer
>> reference, the vCPU can safely be resumed anywhere.
>>
>> Remove VCPU_AFFINITY_WAIT completely, getting rid of two domain crashes,
> I understand you mean two domain_crash() invocations here, but ...
>
>> and a
>> logical corner case where resetting the vcpu with an outstanding waitqueue
>> would crash the domain.
> ... some other domain crash here?

One of the two above.  It's more that resetting (would have) broken the
affinity and would have triggered the domain crash.

>
>> Signed-off-by: Andrew Cooper 
> I assume you've checked thoroughly that calling code hasn't
> grown dependencies on execution coming back on the same CPU?

Urgh yes, my trivial test case didn't encounter it, but anything with an
smp_processor_id() stashed on the stack is going to end up unhappy.

I'm going to have to retract half this series.  (I'll follow up on the
0/$N with the longer term plan to remove this mess).

~Andrew


Re: [PATCH v2 5/9] IOMMU: add common API for device reserved memory

2022-07-18 Thread Jan Beulich
On 18.07.2022 13:03, Marek Marczykowski-Górecki wrote:
> On Thu, Jul 14, 2022 at 12:17:53PM +0200, Jan Beulich wrote:
>> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
>>> +};
>>> +static unsigned int __initdata nr_extra_reserved_ranges;
>>> +static struct extra_reserved_range __initdata 
>>> extra_reserved_ranges[MAX_EXTRA_RESERVED_RANGES];
>>
>> Overly long line.
> 
> I can't find what the coding style dictates in such a case. Should the second
> line be indented (and how much)?

I'd recommend

static struct extra_reserved_range __initdata
   extra_reserved_ranges[MAX_EXTRA_RESERVED_RANGES];

Jan



Re: [PATCH v2 5/9] IOMMU: add common API for device reserved memory

2022-07-18 Thread Jan Beulich
On 18.07.2022 12:53, Marek Marczykowski-Górecki wrote:
> On Thu, Jul 14, 2022 at 12:17:53PM +0200, Jan Beulich wrote:
>> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
>>> --- a/xen/drivers/passthrough/iommu.c
>>> +++ b/xen/drivers/passthrough/iommu.c
>>> @@ -651,6 +651,46 @@ bool_t iommu_has_feature(struct domain *d, enum 
>>> iommu_feature feature)
>>>  return is_iommu_enabled(d) && test_bit(feature, 
>>> dom_iommu(d)->features);
>>>  }
>>>  
>>> +#define MAX_EXTRA_RESERVED_RANGES 20
>>> +struct extra_reserved_range {
>>> +xen_pfn_t start;
>>
>> Why not unsigned long?
> 
> Isn't this the correct type for the page number?

In the public interface - yes. But internally we (almost) universally
use unsigned long. xen_pfn_t is inefficient to use on Arm32 (and then
presumably also other future 32-bit counterparts of 64-bit
architectures, inheriting the choices made for Arm and differing from
what we have on x86).
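
For reference, the public headers define the type per architecture roughly as follows (quoted from memory, so treat it as illustrative rather than authoritative):

    /* xen/include/public/arch-x86/xen.h */
    typedef unsigned long xen_pfn_t;

    /* xen/include/public/arch-arm.h */
    typedef uint64_t xen_pfn_t; /* 64-bit even on 32-bit Arm, hence the
                                   inefficiency mentioned above */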

Jan



Re: [PATCH 0/5] xen/wait: Improvements

2022-07-18 Thread Jan Beulich
On 18.07.2022 09:18, Andrew Cooper wrote:
> This started out as patch 2 for a different task, and quickly identified some
> technical debt, long overdue for cleaning up.
> 
> Andrew Cooper (5):
>   xen/wait: Drop vestigial remnants of TRAP_regs_partial
>   xen/wait: Extend the description of how this logic actually works
>   xen/wait: Minor asm improvements
>   xen/wait: Use relative stack adjustments
>   xen/wait: Remove VCPU_AFFINITY_WAIT

While going through this series I came to notice that we build wait.c even
on Arm, and I couldn't convince myself that wait_event() is actually not
reachable there when e.g. there's an Arm-specific vm_event.c. I would
generally prefer if non-functioning code paths were actually compiled out.

Thoughts?

Jan



Re: [PATCH v2 5/9] IOMMU: add common API for device reserved memory

2022-07-18 Thread Marek Marczykowski-Górecki
On Thu, Jul 14, 2022 at 12:17:53PM +0200, Jan Beulich wrote:
> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
> > +};
> > +static unsigned int __initdata nr_extra_reserved_ranges;
> > +static struct extra_reserved_range __initdata 
> > extra_reserved_ranges[MAX_EXTRA_RESERVED_RANGES];
> 
> Overly long line.

I can't find what the coding style dictates in such a case. Should the second
line be indented (and how much)?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v2 3/3] xen/heap: pass order to free_heap_pages() in heap init

2022-07-18 Thread Jan Beulich
On 18.07.2022 12:24, Julien Grall wrote:
> Hi Jan,
> 
> On 18/07/2022 10:43, Jan Beulich wrote:
>> On 15.07.2022 19:03, Julien Grall wrote:
>>> @@ -1806,8 +1806,22 @@ static void _init_heap_pages(const struct page_info 
>>> *pg,
>>>   
>>>   while ( s < e )
>>>   {
>>> -free_heap_pages(mfn_to_page(_mfn(s)), 0, need_scrub);
>>> -s += 1UL;
>>> +/*
>>> + * For s == 0, we simply use the largest increment by checking the
>>> + * MSB of the region size. For s != 0, we also need to ensure that 
>>> the
>>> + * chunk is properly sized to end at power-of-two alignment. We do 
>>> this
>>> + * by checking the LSB of the start address and use its index as
>>> + * the increment. Both cases need to be guarded by MAX_ORDER.
>>
>> s/guarded/bounded/ ?
>>
>>> + * Note that the value of ffsl() and flsl() starts from 1 so we 
>>> need
>>> + * to decrement it by 1.
>>> + */
>>> +int inc_order = min(MAX_ORDER, flsl(e - s) - 1);
>>
>> unsigned int, since ...
> 
> MAX_ORDER, flsl() and ffsl() are all returning signed values. So inc_order 
> should be 'int' to avoid any compilation issue.
> 
>>
>>> +if ( s )
>>> +inc_order = min(inc_order, ffsl(s) - 1);
>>> +free_heap_pages(mfn_to_page(_mfn(s)), inc_order, need_scrub);
>>
>> ... that's what the function parameter says, and the value also can go
>> negative?
> 
> AFAICT, none of the values are negative. But per above, if we use min() 
> then the local variable should be 'int'. The two possible alternatives are:
>1) Use min_t()
>2) Update MAX_ORDER, flsl(), ffsl() to return an unsigned value
> 
> The ideal would be #2 but it will require at least an extra patch and 
> effort. If we use #1, then their use may become stale if we implement #2.

I'm not sure #2 is a good option - we'd deviate from conventions used
elsewhere on what flsl() and ffsl() return. And constants would imo
best remain suffix-less unless a suffix is very relevant to have (for
MAX_ORDER, which isn't at risk of being subject to ~ or -, it may be
warranted).

3)
unsigned int inc_order = min(MAX_ORDER, flsl(e - s) - 1);

if ( s )
inc_order = min(inc_order, ffsl(s) - 1U);

No compilation issues to expect here, afaict.

> So I would prefer to keep min(). That said, I would be open to use 
> min_t() if you strongly prefer it.

I consider max_t() / min_t() dangerous, so I'd like to limit their use
to cases where no good alternatives exist.
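
A hypothetical illustration of the hazard (not code from this series): min_t() forcibly casts both arguments to the named type, so a wrong type choice silently truncates, while plain min() at least warns when the argument types disagree.

    uint64_t bytes = 5ULL << 32;    /* 20G */
    unsigned int chunk;

    /* min(bytes, 4096u) would warn about the mismatched types - good. */

    /* min_t() compiles silently, but truncates 20G to 0 before comparing. */
    chunk = min_t(unsigned int, bytes, 4096u);  /* chunk ends up 0, not 4096 */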

Jan



Re: [PATCH v2 2/3] xen/heap: Split init_heap_pages() in two

2022-07-18 Thread Jan Beulich
On 18.07.2022 12:08, Julien Grall wrote:
> On 18/07/2022 10:31, Jan Beulich wrote:
>> On 15.07.2022 19:03, Julien Grall wrote:
>>> --- a/xen/common/page_alloc.c
>>> +++ b/xen/common/page_alloc.c
>>> @@ -1778,16 +1778,44 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
>>>   }
>>>   
>>>   /*
>>> - * Hand the specified arbitrary page range to the specified heap zone
>>> - * checking the node_id of the previous page.  If they differ and the
>>> - * latter is not on a MAX_ORDER boundary, then we reserve the page by
>>> - * not freeing it to the buddy allocator.
>>> + * This function should only be called with valid pages from the same NUMA
>>> + * node.
>>>*/
>>> +static void _init_heap_pages(const struct page_info *pg,
>>> + unsigned long nr_pages,
>>> + bool need_scrub)
>>> +{
>>> +unsigned long s, e;
>>> +unsigned int nid = phys_to_nid(page_to_maddr(pg));
>>> +
>>> +s = mfn_x(page_to_mfn(pg));
>>> +e = mfn_x(mfn_add(page_to_mfn(pg + nr_pages - 1), 1));
>>> +if ( unlikely(!avail[nid]) )
>>> +{
>>> +bool use_tail = IS_ALIGNED(s, 1UL << MAX_ORDER) &&
>>> +(find_first_set_bit(e) <= find_first_set_bit(s));
>>> +unsigned long n;
>>> +
>>> +n = init_node_heap(nid, s, nr_pages, &use_tail);
>>> +BUG_ON(n > nr_pages);
>>> +if ( use_tail )
>>> +e -= n;
>>> +else
>>> +s += n;
>>> +}
>>> +
>>> +while ( s < e )
>>> +{
>>> +free_heap_pages(mfn_to_page(_mfn(s)), 0, need_scrub);
>>> +s += 1UL;
>>
>> ... the more conventional s++ or ++s used here?
> 
> I would prefer to keep using "s += 1UL" here because:
>* it will be replaced with a proper order in the follow-up patch. So 
> this is temporary.

Fair enough.

Jan

>* one could argue that if I use "s++" then I should also switch to a 
> for loop which would make sense here but not in the next patch.
> 
> Cheers,
> 




Re: [PATCH v2 1/9] drivers/char: Add support for Xue USB3 debugger

2022-07-18 Thread Jan Beulich
On 18.07.2022 12:45, Marek Marczykowski-Górecki wrote:
> On Tue, Jul 12, 2022 at 05:59:51PM +0200, Jan Beulich wrote:
>> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
>>> +static bool __init xue_init_xhc(struct xue *xue)
>>> +{
>>> +uint32_t bar0;
>>> +uint64_t bar1;
>>> +uint64_t devfn;
>>> +
>>> +/*
>>> + * Search PCI bus 0 for the xHC. All the host controllers supported so 
>>> far
>>> + * are part of the chipset and are on bus 0.
>>> + */
>>> +for ( devfn = 0; devfn < 256; devfn++ )
>>> +{
>>> +pci_sbdf_t sbdf = PCI_SBDF(0, 0, devfn);
>>> +uint32_t hdr = pci_conf_read8(sbdf, PCI_HEADER_TYPE);
>>> +
>>> +if ( hdr == 0 || hdr == 0x80 )
>>> +{
>>> +if ( (pci_conf_read32(sbdf, PCI_CLASS_REVISION) >> 8) == 
>>> XUE_XHC_CLASSC )
>>> +{
>>> +xue->sbdf = sbdf;
>>> +break;
>>> +}
>>> +}
>>> +}
>>> +
>>> +if ( !xue->sbdf.sbdf )
>>> +{
>>> +xue_error("Compatible xHC not found on bus 0\n");
>>> +return false;
>>> +}
>>> +
>>> +/* ...we found it, so parse the BAR and map the registers */
>>> +bar0 = pci_conf_read32(xue->sbdf, PCI_BASE_ADDRESS_0);
>>> +bar1 = pci_conf_read32(xue->sbdf, PCI_BASE_ADDRESS_1);
>>> +
>>> +/* IO BARs not allowed; BAR must be 64-bit */
>>> +if ( (bar0 & PCI_BASE_ADDRESS_SPACE) != PCI_BASE_ADDRESS_SPACE_MEMORY 
>>> ||
>>> + (bar0 & PCI_BASE_ADDRESS_MEM_TYPE_MASK) != 
>>> PCI_BASE_ADDRESS_MEM_TYPE_64 )
>>> +return false;
>>> +
>>> +pci_conf_write32(xue->sbdf, PCI_BASE_ADDRESS_0, 0x);
>>> +xue->xhc_mmio_size = ~(pci_conf_read32(xue->sbdf, PCI_BASE_ADDRESS_0) 
>>> & 0xFFF0) + 1;
>>> +pci_conf_write32(xue->sbdf, PCI_BASE_ADDRESS_0, bar0);
>>
>> Why is a 64-bit BAR required when you size only the low 32 bits?
> 
> xHCI spec says the first BAR is required to be 64-bit, so I'm checking
> this assumption to handle just this one case. But then, the size is 64K
> in practice (and xue_sys_map_xhc() checks for that), so just 32 bits are
> enough. Anyway, I can add sizing the whole thing, for consistency.
> 
>> Also you need to disable memory decoding around this (and
>> somewhere you also need to explicitly enable it, assuming here
>> you would afterwards restore the original value of the command
>> register). 
> 
> Actually, this is a good place to enable memory decoding.

It might seem so, I agree, but then upon encountering a later error
you'll need more precautions so you would be able to restore the command
register to its original value. I think it's easier / clearer when
you keep command register save/restore to within functions.
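
Presumably something along these lines, i.e. sizing both halves of the 64-bit BAR with memory decoding disabled and restoring the command register afterwards (a sketch only, error handling omitted; decoding would be re-enabled where the device is actually set up):

    uint16_t cmd = pci_conf_read16(xue->sbdf, PCI_COMMAND);
    uint64_t sz_lo, sz_hi;

    /* Disable memory decoding while sizing the BAR. */
    pci_conf_write16(xue->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

    pci_conf_write32(xue->sbdf, PCI_BASE_ADDRESS_0, 0xffffffff);
    pci_conf_write32(xue->sbdf, PCI_BASE_ADDRESS_1, 0xffffffff);
    sz_lo = pci_conf_read32(xue->sbdf, PCI_BASE_ADDRESS_0) & 0xfffffff0;
    sz_hi = pci_conf_read32(xue->sbdf, PCI_BASE_ADDRESS_1);
    xue->xhc_mmio_size = ~((sz_hi << 32) | sz_lo) + 1;
    pci_conf_write32(xue->sbdf, PCI_BASE_ADDRESS_0, bar0);
    pci_conf_write32(xue->sbdf, PCI_BASE_ADDRESS_1, bar1);

    /* Restore the original command register value. */
    pci_conf_write16(xue->sbdf, PCI_COMMAND, cmd);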

Jan



Re: [PATCH v2 5/9] IOMMU: add common API for device reserved memory

2022-07-18 Thread Marek Marczykowski-Górecki
On Thu, Jul 14, 2022 at 12:17:53PM +0200, Jan Beulich wrote:
> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
> > --- a/xen/drivers/passthrough/iommu.c
> > +++ b/xen/drivers/passthrough/iommu.c
> > @@ -651,6 +651,46 @@ bool_t iommu_has_feature(struct domain *d, enum 
> > iommu_feature feature)
> >  return is_iommu_enabled(d) && test_bit(feature, 
> > dom_iommu(d)->features);
> >  }
> >  
> > +#define MAX_EXTRA_RESERVED_RANGES 20
> > +struct extra_reserved_range {
> > +xen_pfn_t start;
> 
> Why not unsigned long?

Isn't this the correct type for the page number?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH 4/5] xen/wait: Use relative stack adjustments

2022-07-18 Thread Jan Beulich
On 18.07.2022 12:41, Andrew Cooper wrote:
> On 18/07/2022 11:29, Jan Beulich wrote:
>> On 18.07.2022 09:18, Andrew Cooper wrote:
>>> -"mov %%rsp,%%rsi;"
>>> +
>>> +"mov %%ecx, %%eax;"
>>> +"mov %%rsp, %%rsi;" /* Copy from the stack, into wqv->stack */
>>>  
>>>  /* check_wakeup_from_wait() longjmp()'s to this point. */
>>>  ".L_wq_resume: rep movsb;"
>>> -"mov %%rsp,%%rsi;"
>>>  
>>>  ".L_skip:"
>>>  "pop %%r15; pop %%r14; pop %%r13;"
>>>  "pop %%r12; pop %%rbp; pop %%rbx;"
>>> -: "=" (wqv->esp), "=" (dummy), "=" (dummy)
>>> -: "0" (0), "1" (cpu_info), "2" (wqv->stack),
>>> +: "=a" (used), "=D" (dummy), "=c" (dummy), "=" 
>>> (dummy)
>> You can't validly drop & from =D and =c.
> 
> On the contrary, GCC demands it.
> 
> & is only relevant, and only permitted, when there is not an explicit
> input tied to the same register.
> 
> When there is an explicit input tied to the same register, of course it
> can't be reused for another input before being used as an output.

Oh, sorry - I neglected to take into account this adding of inputs.

Jan



Re: [PATCH 2/5] xen/wait: Extend the description of how this logic actually works

2022-07-18 Thread Jan Beulich
On 18.07.2022 12:11, Andrew Cooper wrote:
> On 18/07/2022 11:00, Jan Beulich wrote:
>> On 18.07.2022 09:18, Andrew Cooper wrote:
>>> @@ -199,9 +211,18 @@ void check_wakeup_from_wait(void)
>>>  }
>>>  
>>>  /*
>>> - * Hand-rolled longjmp().  Returns to __prepare_to_wait(), and lands 
>>> on a
>>> - * `rep movs` instruction.  All other GPRs are restored from the 
>>> stack, so
>>> - * are available for use here.
>>> + * Hand-rolled longjmp().
>>> + *
>>> + * check_wakeup_from_wait() is always called with a shallow stack,
>>> + * immediately after the vCPU has been rescheduled.
>>> + *
>>> + * Adjust %rsp to be the correct depth for the (deeper) stack we want 
>>> to
>>> + * restore, then prepare %rsi, %rdi and %rcx such that when we 
>>> intercept
>>> + * the rep movs in __prepare_to_wait(), it copies from wqv->stack over 
>>> the
>>> + * active stack.
>> I'm struggling with the use of "intercept" here, but I guess that's just
>> because I'm not a native speaker.
> 
> "intercept" is the same terminology used in the middle of
> __prepare_to_wait()'s block.
> 
> It's because we have a weird setup where this is (now) a noreturn
> function merging into the middle of a function which already executed once.
> 
> I'm happy to change it if it's unclear, but I can't think of a better
> description.

"... when we go back to ..."? (But I'm not meaning to insist on a
wording change.)

Jan



Re: [PATCH 5/5] xen/wait: Remove VCPU_AFFINITY_WAIT

2022-07-18 Thread Jan Beulich
On 18.07.2022 09:18, Andrew Cooper wrote:
> With the waitqueue logic updated to not use an absolute stack pointer
> reference, the vCPU can safely be resumed anywhere.
> 
> Remove VCPU_AFFINITY_WAIT completely, getting rid of two domain crashes,

I understand you mean two domain_crash() invocations here, but ...

> and a
> logical corner case where resetting the vcpu with an outstanding waitqueue
> would crash the domain.

... some other domain crash here?

> Signed-off-by: Andrew Cooper 

I assume you've checked thoroughly that calling code hasn't
grown dependencies on execution coming back on the same CPU?

Jan



Re: [PATCH v2 1/9] drivers/char: Add support for Xue USB3 debugger

2022-07-18 Thread Marek Marczykowski-Górecki
On Tue, Jul 12, 2022 at 05:59:51PM +0200, Jan Beulich wrote:
> On 06.07.2022 17:32, Marek Marczykowski-Górecki wrote:
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -721,10 +721,15 @@ Available alternatives, with their meaning, are:
> >  
> >  ### dbgp
> >  > `= ehci[ <integer> | @pci<bus>:<slot>.<func> ]`
> > +> `= xue`
> >  
> >  Specify the USB controller to use, either by instance number (when going
> >  over the PCI busses sequentially) or by PCI device (must be on segment 0).
> >  
> > +Use `ehci` for EHCI debug port, use `xue` for XHCI debug capability.
> > +Xue driver will wait indefinitely for the debug host to connect - make 
> > sure the
> > +cable is connected.
> 
> Especially without it being clear what "xue" stands for, I wonder
> whether "xhci" would be the better (more commonly known) token to
> use here.

Sure, I can change that. I modify this code heavily anyway, so there is
little point in keeping it similar to the original xue driver.

> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -946,6 +946,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >  ns16550.irq = 3;
> >  ns16550_init(1, );
> >  ehci_dbgp_init();
> > +#ifdef CONFIG_HAS_XHCI
> > +xue_uart_init();
> > +#endif
> 
> Can you make an empty inline stub to avoid the #ifdef here?

Ok.
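
Presumably with the usual stub pattern in the driver's header, so the call in __start_xen() can stay unconditional (illustrative only; the option may yet be renamed given the Kconfig comments below):

    #ifdef CONFIG_HAS_XHCI
    void xue_uart_init(void);
    #else
    static inline void xue_uart_init(void) {}
    #endif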

> > --- a/xen/drivers/char/Kconfig
> > +++ b/xen/drivers/char/Kconfig
> > @@ -74,3 +74,12 @@ config HAS_EHCI
> > help
> >   This selects the USB based EHCI debug port to be used as a UART. If
> >   you have an x86 based system with USB, say Y.
> > +
> > +config HAS_XHCI
> > +   bool "XHCI DbC UART driver"
> 
> I'm afraid I consider most of the other options here wrong in
> starting with HAS_: Such named options should have no prompt, and
> be exclusively engaged by "select". Hence I'd like to ask to drop
> the HAS_ here.

Ok.

> > +   depends on X86
> > +   help
> > + This selects the USB based XHCI debug capability to be used as a UART.
> 
> s/used/usable/?

Yes.

> > --- /dev/null
> > +++ b/xen/drivers/char/xue.c
> > @@ -0,0 +1,933 @@
> > +/*
> > + * drivers/char/xue.c
> > + *
> > + * Xen port for the xue debugger
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; If not, see <http://www.gnu.org/licenses/>.
> > + *
> > + * Copyright (c) 2019 Assured Information Security.
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> 
> Please sort xen/ before asm/ and alphabetically within each group.

Ok.

> > +/* uncomment to have xue_uart_dump() debug function */
> > +/* #define XUE_DEBUG 1 */
> > +
> > +#define XUE_POLL_INTERVAL 100 /* us */
> > +
> > +#define XUE_PAGE_SIZE 4096ULL
> 
> I think I had asked before - why this odd suffix?
> 
> > +static void xue_sys_pause(void)
> > +{
> > +asm volatile("pause" ::: "memory");
> > +}
> 
> I wonder whether the open-coded inline assembly is really needed
> here: Can't you use cpu_relax()? If not, style nit: Several blanks
> missing.

Probably I can.
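
i.e. roughly the following, given that cpu_relax() already expands to a pause ("rep; nop") on x86, if memory serves:

    #include <asm/processor.h>

    static void xue_sys_pause(void)
    {
        cpu_relax();
    }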

> > +static bool __init xue_init_xhc(struct xue *xue)
> > +{
> > +uint32_t bar0;
> > +uint64_t bar1;
> > +uint64_t devfn;
> > +
> > +/*
> > + * Search PCI bus 0 for the xHC. All the host controllers supported so 
> > far
> > + * are part of the chipset and are on bus 0.
> > + */
> > +for ( devfn = 0; devfn < 256; devfn++ )
> > +{
> > +pci_sbdf_t sbdf = PCI_SBDF(0, 0, devfn);
> > +uint32_t hdr = pci_conf_read8(sbdf, PCI_HEADER_TYPE);
> > +
> > +if ( hdr == 0 || hdr == 0x80 )
> > +{
> > +if ( (pci_conf_read32(sbdf, PCI_CLASS_REVISION) >> 8) == 
> > XUE_XHC_CLASSC )
> > +{
> > +xue->sbdf = sbdf;
> > +break;
> > +}
> > +}
> > +}
> > +
> > +if ( !xue->sbdf.sbdf )
> > +{
> > +xue_error("Compatible xHC not found on bus 0\n");
> > +return false;
> > +}
> > +
> > +/* ...we found it, so parse the BAR and map the registers */
> > +bar0 = pci_conf_read32(xue->sbdf, PCI_BASE_ADDRESS_0);
> > +bar1 = pci_conf_read32(xue->sbdf, PCI_BASE_ADDRESS_1);
> > +
> > +/* IO BARs not allowed; BAR must be 64-bit */
> > +if ( (bar0 & PCI_BASE_ADDRESS_SPACE) != PCI_BASE_ADDRESS_SPACE_MEMORY 

Re: [PATCH 4/5] xen/wait: Use relative stack adjustments

2022-07-18 Thread Andrew Cooper
On 18/07/2022 11:29, Jan Beulich wrote:
> On 18.07.2022 09:18, Andrew Cooper wrote:
>> -"mov %%rsp,%%rsi;"
>> +
>> +"mov %%ecx, %%eax;"
>> +"mov %%rsp, %%rsi;" /* Copy from the stack, into wqv->stack */
>>  
>>  /* check_wakeup_from_wait() longjmp()'s to this point. */
>>  ".L_wq_resume: rep movsb;"
>> -"mov %%rsp,%%rsi;"
>>  
>>  ".L_skip:"
>>  "pop %%r15; pop %%r14; pop %%r13;"
>>  "pop %%r12; pop %%rbp; pop %%rbx;"
>> -: "=" (wqv->esp), "=" (dummy), "=" (dummy)
>> -: "0" (0), "1" (cpu_info), "2" (wqv->stack),
>> +: "=a" (used), "=D" (dummy), "=c" (dummy), "=" (dummy)
> You can't validly drop & from =D and =c.

On the contrary, GCC demands it.

& is only relevant, and only permitted, when there is not an explicit
input tied to the same register.

When there is an explicit input tied to the same register, of course it
can't be reused for another input before being used as an output.

https://godbolt.org/z/4vWP4PKc5

~Andrew


Re: [PATCH 4/5] xen/wait: Use relative stack adjustments

2022-07-18 Thread Jan Beulich
On 18.07.2022 09:18, Andrew Cooper wrote:
> @@ -121,11 +121,11 @@ void wake_up_all(struct waitqueue_head *wq)
>  
>  static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>  {
> -struct cpu_info *cpu_info = get_cpu_info();
>  struct vcpu *curr = current;
>  unsigned long dummy;
> +unsigned int used;
>  
> -ASSERT(wqv->esp == 0);
> +ASSERT(wqv->used == 0);

Minor: Use ! like you do further down?

> @@ -154,24 +154,25 @@ static void __prepare_to_wait(struct waitqueue_vcpu 
> *wqv)
>  "push %%rbx; push %%rbp; push %%r12;"
>  "push %%r13; push %%r14; push %%r15;"
>  
> -"sub %%esp,%%ecx;"
> +"sub %%esp, %%ecx;" /* ecx = delta to cpu_info */
>  "cmp %[sz], %%ecx;"
>  "ja .L_skip;"   /* Bail if >4k */

According to the inputs, %eax is still 0 when bailing here, so the
check below won't find "used > PAGE_SIZE". I further wonder why you
don't store directly into wqv->used, and go through %eax instead.

> -"mov %%rsp,%%rsi;"
> +
> +"mov %%ecx, %%eax;"
> +"mov %%rsp, %%rsi;" /* Copy from the stack, into wqv->stack */
>  
>  /* check_wakeup_from_wait() longjmp()'s to this point. */
>  ".L_wq_resume: rep movsb;"
> -"mov %%rsp,%%rsi;"
>  
>  ".L_skip:"
>  "pop %%r15; pop %%r14; pop %%r13;"
>  "pop %%r12; pop %%rbp; pop %%rbx;"
> -: "=" (wqv->esp), "=" (dummy), "=" (dummy)
> -: "0" (0), "1" (cpu_info), "2" (wqv->stack),
> +: "=a" (used), "=D" (dummy), "=c" (dummy), "=" (dummy)

You can't validly drop & from =D and =c. If you want to stick to
going through %eax, I think that one wants & added as well and ...

> +: "a" (0), "D" (wqv->stack), "c" (get_cpu_info()),

... the (unused) input here dropped.

> @@ -220,14 +224,22 @@ void check_wakeup_from_wait(void)
>   * the rep movs in __prepare_to_wait(), it copies from wqv->stack over 
> the
>   * active stack.
>   *
> + * We are also bound by __prepare_to_wait()'s output constraints, so %eax
> + * needs to be wqv->used.
> + *
>   * All other GPRs are available for use; they're either restored from
>   * wqv->stack or explicitly clobbered.
>   */
> -asm volatile ( "mov %%rdi, %%rsp;"
> +asm volatile ( "sub %%esp, %k[var];" /* var = delta to cpu_info */
> +   "neg %k[var];"
> +   "add %%ecx, %k[var];" /* var = -delta + wqv->used */
> +
> +   "sub %[var], %%rsp;"  /* Adjust %rsp down to make room */
> +   "mov %%rsp, %%rdi;"   /* Copy from wqv->stack, into the 
> stack */
> "jmp .L_wq_resume;"
> -   :
> -   : "S" (wqv->stack), "D" (wqv->esp),
> - "c" ((char *)get_cpu_info() - (char *)wqv->esp)
> +   : "=D" (tmp), [var] "=" (tmp)
> +   : "S" (wqv->stack), "c" (wqv->used), "a" (wqv->used),

If you want to stick to going through %eax, then I think you need to
make it an output here: "+a" (wqv->used), so it is clear that the
register is blocked from any other use throughout the asm(). Or you
could use "=" and copy %ecx into %eax inside the asm(). Both may
cause the compiler to emit dead code updating wqv->used right after
the asm(), so I think not going through %eax is the more desirable
approach (but I may well be overlooking a reason why directly
dealing with wqv->used in __prepare_to_wait() isn't an option).

Strictly speaking (in particular if [right now] there wasn't just a
branch after updating %rdi) you also again want "=" here.

Jan



Re: [PATCH v2 3/3] xen/heap: pass order to free_heap_pages() in heap init

2022-07-18 Thread Julien Grall

Hi Jan,

On 18/07/2022 10:43, Jan Beulich wrote:

On 15.07.2022 19:03, Julien Grall wrote:

@@ -1806,8 +1806,22 @@ static void _init_heap_pages(const struct page_info *pg,
  
  while ( s < e )

  {
-free_heap_pages(mfn_to_page(_mfn(s)), 0, need_scrub);
-s += 1UL;
+/*
+ * For s == 0, we simply use the largest increment by checking the
+ * MSB of the region size. For s != 0, we also need to ensure that the
+ * chunk is properly sized to end at power-of-two alignment. We do this
+ * by checking the LSB of the start address and use its index as
+ * the increment. Both cases need to be guarded by MAX_ORDER.


s/guarded/bounded/ ?


+ * Note that the value of ffsl() and flsl() starts from 1 so we need
+ * to decrement it by 1.
+ */
+int inc_order = min(MAX_ORDER, flsl(e - s) - 1);


unsigned int, since ...


MAX_ORDER, flsl() and ffsl() are all returning signed values. So inc_order 
should be 'int' to avoid any compilation issue.





+if ( s )
+inc_order = min(inc_order, ffsl(s) - 1);
+free_heap_pages(mfn_to_page(_mfn(s)), inc_order, need_scrub);


... that's what the function parameter says, and the value also can go
negative?


AFAICT, none of the values are negative. But per above, if we use min() 
then the local variable should be 'int'. The two possible alternatives are:

  1) Use min_t()
  2) Update MAX_ORDER, flsl(), ffsl() to return an unsigned value

The ideal would be #2 but it will require at least an extra patch and 
effort. If we use #1, then their use may become stale if we implement #2.


So I would prefer to keep min(). That said, I would be open to use 
min_t() if you strongly prefer it.



Preferably with these adjustments
Reviewed-by: Jan Beulich 


Thanks!

Cheers,

--
Julien Grall



Re: [PATCH 2/5] xen/wait: Extend the description of how this logic actually works

2022-07-18 Thread Andrew Cooper
On 18/07/2022 11:00, Jan Beulich wrote:
> On 18.07.2022 09:18, Andrew Cooper wrote:
>> @@ -199,9 +211,18 @@ void check_wakeup_from_wait(void)
>>  }
>>  
>>  /*
>> - * Hand-rolled longjmp().  Returns to __prepare_to_wait(), and lands on 
>> a
>> - * `rep movs` instruction.  All other GPRs are restored from the stack, 
>> so
>> - * are available for use here.
>> + * Hand-rolled longjmp().
>> + *
>> + * check_wakeup_from_wait() is always called with a shallow stack,
>> + * immediately after the vCPU has been rescheduled.
>> + *
>> + * Adjust %rsp to be the correct depth for the (deeper) stack we want to
>> + * restore, then prepare %rsi, %rdi and %rcx such that when we intercept
>> + * the rep movs in __prepare_to_wait(), it copies from wqv->stack over 
>> the
>> + * active stack.
> I'm struggling with the use of "intercept" here, but I guess that's just
> because I'm not a native speaker.

"intercept" is the same terminology used in the middle of
__prepare_to_wait()'s block.

It's because we have a weird setup where this is (now) a noreturn
function merging into the middle of a function which already executed once.

I'm happy to change it if it's unclear, but I can't think of a better
description.

>> + * All other GPRs are available for use; they're either restored from
>> + * wqv->stack or explicitly clobbered.
> You talking of "other GPRs" - there aren't any which are explicitly
> clobbered. It's only the previously named ones which are. Hence I'd like
> to ask that the respective parts of the sentence be dropped. Then
> Reviewed-by: Jan Beulich 

It becomes true in the next patch.  I'll try and shuffle things.

~Andrew


Re: [PATCH v2 2/3] xen/heap: Split init_heap_pages() in two

2022-07-18 Thread Julien Grall

Hi Jan,

On 18/07/2022 10:31, Jan Beulich wrote:

On 15.07.2022 19:03, Julien Grall wrote:

From: Julien Grall 

At the moment, init_heap_pages() will call free_heap_pages() page
by page. To reduce the time to initialize the heap, we will want
to provide multiple pages at the same time.

init_heap_pages() is now split in two parts:
 - init_heap_pages(): will break down the range in multiple set
   of contiguous pages. For now, the criteria is the pages should
   belong to the same NUMA node.
 - _init_heap_pages(): will initialize a set of pages belonging to
   the same NUMA node. In a follow-up patch, new requirements will
   be added (e.g. pages should belong to the same zone). For now the
   pages are still passed one by one to free_heap_pages().

Note that the comment before init_heap_pages() is heavily outdated and
does not reflect the current code. So update it.

This patch is a merge/rework of patches from David Woodhouse and
Hongyan Xia.

Signed-off-by: Julien Grall 


Reviewed-by: Jan Beulich 


Thanks.


--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1778,16 +1778,44 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
  }
  
  /*

- * Hand the specified arbitrary page range to the specified heap zone
- * checking the node_id of the previous page.  If they differ and the
- * latter is not on a MAX_ORDER boundary, then we reserve the page by
- * not freeing it to the buddy allocator.
+ * This function should only be called with valid pages from the same NUMA
+ * node.
   */
+static void _init_heap_pages(const struct page_info *pg,
+ unsigned long nr_pages,
+ bool need_scrub)
+{
+unsigned long s, e;
+unsigned int nid = phys_to_nid(page_to_maddr(pg));
+
+s = mfn_x(page_to_mfn(pg));
+e = mfn_x(mfn_add(page_to_mfn(pg + nr_pages - 1), 1));
+if ( unlikely(!avail[nid]) )
+{
+bool use_tail = IS_ALIGNED(s, 1UL << MAX_ORDER) &&
+(find_first_set_bit(e) <= find_first_set_bit(s));
+unsigned long n;
+
+n = init_node_heap(nid, s, nr_pages, &use_tail);
+BUG_ON(n > nr_pages);
+if ( use_tail )
+e -= n;
+else
+s += n;
+}
+
+while ( s < e )
+{
+free_heap_pages(mfn_to_page(_mfn(s)), 0, need_scrub);
+s += 1UL;


... the more conventional s++ or ++s used here?


I would prefer to keep using "s += 1UL" here because:
  * it will be replaced with a proper order in the follow-up patch. So 
this is temporary.
  * one could argue that if I use "s++" then I should also switch to a 
for loop which would make sense here but not in the next patch.


Cheers,

--
Julien Grall



Re: [PATCH v2 2/3] xen/heap: Split init_heap_pages() in two

2022-07-18 Thread Julien Grall




On 18/07/2022 09:18, Wei Chen wrote:

  static void init_heap_pages(
  struct page_info *pg, unsigned long nr_pages)
  {
  unsigned long i;
-    bool idle_scrub = false;
+    bool need_scrub = scrub_debug;



You have changed idle_scrub to need_scrub, but haven't mentioned this
in commit log, and I also haven't found related discussion in v1. I
am not very clear about this change.


The meaning/use of the variable is now different. Before this patch, the 
variable was only indicating whether idle scrub was enabled (this is 
configurable by the admin). This was then or-ed with  'scrub_debug' when 
calling free_heap_pages().


With this patch, we now store the or-ed result in the local variable.
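
In other words, roughly (the surrounding logic paraphrased from memory, so details may differ):

    /* before: the debug part was or-ed in only at the call site */
    bool idle_scrub = false;
    /* ... idle_scrub set when boot scrubbing is deferred to idle time ... */
    free_heap_pages(pg + i, 0, scrub_debug || idle_scrub);

    /* after: the or-ed result lives in the local variable from the start */
    bool need_scrub = scrub_debug;
    /* ... need_scrub set to true when boot scrubbing is deferred to idle time ... */
    free_heap_pages(pg + i, 0, need_scrub);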

This is not something I felt was necessary to mention in the commit message.

Cheers,

--
Julien Grall



Re: [PATCH 3/5] xen/wait: Minor asm improvements

2022-07-18 Thread Jan Beulich
On 18.07.2022 09:18, Andrew Cooper wrote:
> --- a/xen/common/wait.c
> +++ b/xen/common/wait.c
> @@ -151,13 +151,12 @@ static void __prepare_to_wait(struct waitqueue_vcpu 
> *wqv)
>   * copies in from wqv->stack over the active stack.
>   */
>  asm volatile (
> -"push %%rax; push %%rbx; push %%rdx; push %%rbp;"
> -"push %%r8;  push %%r9;  push %%r10; push %%r11;"
> -"push %%r12; push %%r13; push %%r14; push %%r15;"
> +"push %%rbx; push %%rbp; push %%r12;"
> +"push %%r13; push %%r14; push %%r15;"
>  
>  "sub %%esp,%%ecx;"
> -"cmp %3,%%ecx;"
> -"ja .L_skip;"
> +"cmp %[sz], %%ecx;"
> +"ja .L_skip;"   /* Bail if >4k */
>  "mov %%rsp,%%rsi;"
>  
>  /* check_wakeup_from_wait() longjmp()'s to this point. */
> @@ -165,12 +164,12 @@ static void __prepare_to_wait(struct waitqueue_vcpu 
> *wqv)
>  "mov %%rsp,%%rsi;"
>  
>  ".L_skip:"
> -"pop %%r15; pop %%r14; pop %%r13; pop %%r12;"
> -"pop %%r11; pop %%r10; pop %%r9;  pop %%r8;"
> -"pop %%rbp; pop %%rdx; pop %%rbx; pop %%rax"
> +"pop %%r15; pop %%r14; pop %%r13;"
> +"pop %%r12; pop %%rbp; pop %%rbx;"
>  : "=" (wqv->esp), "=" (dummy), "=" (dummy)
> -: "i" (PAGE_SIZE), "0" (0), "1" (cpu_info), "2" (wqv->stack)
> -: "memory" );
> +: "0" (0), "1" (cpu_info), "2" (wqv->stack),
> +  [sz] "i" (PAGE_SIZE)
> +: "memory", "rax", "rdx", "r8", "r9", "r10", "r11" );
>  
>  if ( unlikely(wqv->esp == 0) )
>  {
> @@ -224,11 +223,12 @@ void check_wakeup_from_wait(void)
>   * All other GPRs are available for use; they're either restored from
>   * wqv->stack or explicitly clobbered.
>   */

Ah, this is where the comment stops being misleading. I think it would
be better if you had it correct there and adjusted it here.

> -asm volatile (
> -"mov %1,%%"__OP"sp; jmp .L_wq_resume;"
> -: : "S" (wqv->stack), "D" (wqv->esp),
> -  "c" ((char *)get_cpu_info() - (char *)wqv->esp)
> -: "memory" );
> +asm volatile ( "mov %%rdi, %%rsp;"
> +   "jmp .L_wq_resume;"

Minor remark: No need for trailing semicolons like this one. Anyway:
Reviewed-by: Jan Beulich 

Jan



Re: [PATCH 2/5] xen/wait: Extend the description of how this logic actually works

2022-07-18 Thread Jan Beulich
On 18.07.2022 09:18, Andrew Cooper wrote:
> @@ -199,9 +211,18 @@ void check_wakeup_from_wait(void)
>  }
>  
>  /*
> - * Hand-rolled longjmp().  Returns to __prepare_to_wait(), and lands on a
> - * `rep movs` instruction.  All other GPRs are restored from the stack, 
> so
> - * are available for use here.
> + * Hand-rolled longjmp().
> + *
> + * check_wakeup_from_wait() is always called with a shallow stack,
> + * immediately after the vCPU has been rescheduled.
> + *
> + * Adjust %rsp to be the correct depth for the (deeper) stack we want to
> + * restore, then prepare %rsi, %rdi and %rcx such that when we intercept
> + * the rep movs in __prepare_to_wait(), it copies from wqv->stack over 
> the
> + * active stack.

I'm struggling with the use of "intercept" here, but I guess that's just
because I'm not a native speaker.

> + * All other GPRs are available for use; they're either restored from
> + * wqv->stack or explicitly clobbered.

You're talking of "other GPRs" - there aren't any which are explicitly
clobbered. It's only the previously named ones which are. Hence I'd like
to ask that the respective parts of the sentence be dropped. Then
Reviewed-by: Jan Beulich 

Jan



Re: [PATCH 1/5] xen/wait: Drop vestigial remnants of TRAP_regs_partial

2022-07-18 Thread Jan Beulich
On 18.07.2022 09:18, Andrew Cooper wrote:
> The preservation of entry_vector was introduced with ecf9846a6a20 ("x86:
> save/restore only partial register state where possible") where
> TRAP_regs_partial was introduced, but missed from f9eb74789af7 ("x86/entry:
> Remove support for partial cpu_user_regs frames") where TRAP_regs_partial was
> removed.
> 
> Fixes: f9eb74789af7 ("x86/entry: Remove support for partial cpu_user_regs 
> frames")
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 




Re: [PATCH v2 2/2] xen: Fix latent check-endbr.sh bug with 32bit build environments

2022-07-18 Thread Jan Beulich
On 18.07.2022 11:31, Andrew Cooper wrote:
> On 18/07/2022 10:11, Jan Beulich wrote:
>> On 15.07.2022 15:26, Andrew Cooper wrote:
>>> --- a/xen/tools/check-endbr.sh
>>> +++ b/xen/tools/check-endbr.sh
>>> @@ -61,19 +61,36 @@ ${OBJDUMP} -j .text $1 -d -w | grep '   endbr64 *$' | 
>>> cut -f 1 -d ':' > $VALID &
>>>  #the lower bits, rounding integers to the nearest 4k.
>>>  #
>>>  #Instead, use the fact that Xen's .text is within a 1G aligned region, 
>>> and
>>> -#split the VMA in half so AWK's numeric addition is only working on 32 
>>> bit
>>> -#numbers, which don't lose precision.
>>> +#split the VMA so AWK's numeric addition is only working on <32 bit
>>> +#numbers, which don't lose precision.  (See point 5)
>>>  #
>>>  # 4) MAWK doesn't support plain hex constants (an optional part of the 
>>> POSIX
>>>  #spec), and GAWK and MAWK can't agree on how to work with hex 
>>> constants in
>>>  #a string.  Use the shell to convert $vma_lo to decimal before passing 
>>> to
>>>  #AWK.
>>>  #
>>> +# 5) Point 4 isn't fully portable.  POSIX only requires that $((0xN)) be
>>> +#evaluated as long, which in 32bit shells turns negative if bit 31 of 
>>> the
>>> +#VMA is set.  AWK then interprets this negative number as a double 
>>> before
>>> +#adding the offsets from the binary grep.
>>> +#
>>> +#Instead of doing an 8/8 split with vma_hi/lo, do a 9/7 split.
>>> +#
>>> +#The consequence of this is that for all offsets, $vma_lo + offset 
>>> needs
>>> +#to be less than 256M (i.e. 7 nibbles) so as to be successfully 
>>> recombined
>>> +#with the 9 nibbles of $vma_hi.  This is fine; .text is at the start 
>>> of a
>>> +#1G aligned region, and Xen is far far smaller than 256M, but leave 
>>> safety
>>> +#check nevertheless.
>>> +#
>>>  eval $(${OBJDUMP} -j .text $1 -h |
>>> -$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 
>>> 8), substr($4, 9, 16)}')
>>> +$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 
>>> 9), substr($4, 10, 16)}')
>>>  
>>>  ${OBJCOPY} -j .text $1 -O binary $TEXT_BIN
>>>  
>>> +bin_sz=$(stat -c '%s' $TEXT_BIN)
>>> +[ "$bin_sz" -ge $(((1 << 28) - $vma_lo)) ] &&
>>> +{ echo "$MSG_PFX Error: .text offsets can exceed 256M" >&2; exit 1; }
>> ... s/can/cannot/ ?
> 
> Why?  "Can" is correct here.  If the offsets can't exceed 256M, then
> everything is good.

Hmm, the wording then indeed is ambiguous. I read "can" as "are allowed
to", when we mean "aren't allowed to". Maybe ".text is 256M or more in
size"? If you mention "offsets", then I think the check should be based
on actually observing an offset which is too large (which .text size
alone doesn't guarantee will happen).

Jan
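
As a standalone illustration of the 32-bit arithmetic pitfall described in
point 5 of the quoted comment (the address below is an arbitrary example with
bit 31 set, not Xen's actual VMA):

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      /* What a shell whose $((...)) arithmetic uses a 32-bit signed long
       * effectively does with a hex constant that has bit 31 set. */
      uint32_t raw = 0x82d04000u;          /* arbitrary example value */
      int32_t  shell32 = (int32_t)raw;     /* comes out negative      */
      int64_t  shell64 = (int64_t)raw;     /* a 64-bit long is fine   */

      printf("32-bit long: %" PRId32 "\n", shell32);
      printf("64-bit long: %" PRId64 "\n", shell64);

      /* A 9/7 split keeps the low part to at most 7 nibbles, i.e. below
       * 2^28, so it can never reach the sign bit of a 32-bit long. */
      printf("low 7 nibbles: 0x%" PRIx32 " (< 2^28)\n", raw & 0x0fffffffu);
      return 0;
  }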



Re: [PATCH v2 3/3] xen/heap: pass order to free_heap_pages() in heap init

2022-07-18 Thread Jan Beulich
On 15.07.2022 19:03, Julien Grall wrote:
> @@ -1806,8 +1806,22 @@ static void _init_heap_pages(const struct page_info 
> *pg,
>  
>  while ( s < e )
>  {
> -free_heap_pages(mfn_to_page(_mfn(s)), 0, need_scrub);
> -s += 1UL;
> +/*
> + * For s == 0, we simply use the largest increment by checking the
> + * MSB of the region size. For s != 0, we also need to ensure that 
> the
> + * chunk is properly sized to end at power-of-two alignment. We do 
> this
> + * by checking the LSB of the start address and use its index as
> + * the increment. Both cases need to be guarded by MAX_ORDER.

s/guarded/bounded/ ?

> + * Note that the value of ffsl() and flsl() starts from 1 so we need
> + * to decrement it by 1.
> + */
> +int inc_order = min(MAX_ORDER, flsl(e - s) - 1);

unsigned int, since ...

> +if ( s )
> +inc_order = min(inc_order, ffsl(s) - 1);
> +free_heap_pages(mfn_to_page(_mfn(s)), inc_order, need_scrub);

... that's what the function parameter says, and the value also can't go
negative? Preferably with these adjustments
Reviewed-by: Jan Beulich 

Jan



Re: [PATCH v2 2/2] xen: Fix latent check-endbr.sh bug with 32bit build environments

2022-07-18 Thread Andrew Cooper
On 18/07/2022 10:11, Jan Beulich wrote:
> On 15.07.2022 15:26, Andrew Cooper wrote:
>> While Xen's current VMA means it works, the mawk fix (i.e. using $((0xN)) in
>> the shell) isn't portable in 32bit shells.  See the code comment for the fix.
>>
>> The fix found a second latent bug.  Recombining $vma_hi/lo should have used
>> printf "%s%08x" and only worked previously because $vma_lo had bits set in
>> its top nibble.  Combining with the main fix, %08x becomes %07x.
>>
>> Fixes: $XXX patch 1
>> Reported-by: Jan Beulich 
>> Signed-off-by: Andrew Cooper 
> Reviewed-by: Jan Beulich 

Thanks, but...

> with, I guess, ...
>
>> --- a/xen/tools/check-endbr.sh
>> +++ b/xen/tools/check-endbr.sh
>> @@ -61,19 +61,36 @@ ${OBJDUMP} -j .text $1 -d -w | grep 'endbr64 *$' | 
>> cut -f 1 -d ':' > $VALID &
>>  #the lower bits, rounding integers to the nearest 4k.
>>  #
>>  #Instead, use the fact that Xen's .text is within a 1G aligned region, 
>> and
>> -#split the VMA in half so AWK's numeric addition is only working on 32 
>> bit
>> -#numbers, which don't lose precision.
>> +#split the VMA so AWK's numeric addition is only working on <32 bit
>> +#numbers, which don't lose precision.  (See point 5)
>>  #
>>  # 4) MAWK doesn't support plain hex constants (an optional part of the POSIX
>>  #spec), and GAWK and MAWK can't agree on how to work with hex constants 
>> in
>>  #a string.  Use the shell to convert $vma_lo to decimal before passing 
>> to
>>  #AWK.
>>  #
>> +# 5) Point 4 isn't fully portable.  POSIX only requires that $((0xN)) be
>> +#evaluated as long, which in 32bit shells turns negative if bit 31 of 
>> the
>> +#VMA is set.  AWK then interprets this negative number as a double 
>> before
>> +#adding the offsets from the binary grep.
>> +#
>> +#Instead of doing an 8/8 split with vma_hi/lo, do a 9/7 split.
>> +#
>> +#The consequence of this is that for all offsets, $vma_lo + offset needs
>> +#to be less than 256M (i.e. 7 nibbles) so as to be successfully 
>> recombined
>> +#with the 9 nibbles of $vma_hi.  This is fine; .text is at the start of 
>> a
>> +#1G aligned region, and Xen is far far smaller than 256M, but leave 
>> safety
>> +#check nevertheless.
>> +#
>>  eval $(${OBJDUMP} -j .text $1 -h |
>> -$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 8), 
>> substr($4, 9, 16)}')
>> +$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 9), 
>> substr($4, 10, 16)}')
>>  
>>  ${OBJCOPY} -j .text $1 -O binary $TEXT_BIN
>>  
>> +bin_sz=$(stat -c '%s' $TEXT_BIN)
>> +[ "$bin_sz" -ge $(((1 << 28) - $vma_lo)) ] &&
>> +{ echo "$MSG_PFX Error: .text offsets can exceed 256M" >&2; exit 1; }
> ... s/can/cannot/ ?

Why?  "Can" is correct here.  If the offsets can't exceed 256M, then
everything is good.

~Andrew


Re: [PATCH v2 2/3] xen/heap: Split init_heap_pages() in two

2022-07-18 Thread Jan Beulich
On 15.07.2022 19:03, Julien Grall wrote:
> From: Julien Grall 
> 
> At the moment, init_heap_pages() will call free_heap_pages() page
> by page. To reduce the time to initialize the heap, we will want
> to provide multiple pages at the same time.
> 
> init_heap_pages() is now split in two parts:
> - init_heap_pages(): will break down the range in multiple set
>   of contiguous pages. For now, the criteria is the pages should
>   belong to the same NUMA node.
> - _init_heap_pages(): will initialize a set of pages belonging to
>   the same NUMA node. In a follow-up patch, new requirements will
>   be added (e.g. pages should belong to the same zone). For now the
>   pages are still passed one by one to free_heap_pages().
> 
> Note that the comment before init_heap_pages() is heavily outdated and
> does not reflect the current code. So update it.
> 
> This patch is a merge/rework of patches from David Woodhouse and
> Hongyan Xia.
> 
> Signed-off-by: Julien Grall 

Reviewed-by: Jan Beulich 
Albeit maybe with ...

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1778,16 +1778,44 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
>  }
>  
>  /*
> - * Hand the specified arbitrary page range to the specified heap zone
> - * checking the node_id of the previous page.  If they differ and the
> - * latter is not on a MAX_ORDER boundary, then we reserve the page by
> - * not freeing it to the buddy allocator.
> + * This function should only be called with valid pages from the same NUMA
> + * node.
>   */
> +static void _init_heap_pages(const struct page_info *pg,
> + unsigned long nr_pages,
> + bool need_scrub)
> +{
> +unsigned long s, e;
> +unsigned int nid = phys_to_nid(page_to_maddr(pg));
> +
> +s = mfn_x(page_to_mfn(pg));
> +e = mfn_x(mfn_add(page_to_mfn(pg + nr_pages - 1), 1));
> +if ( unlikely(!avail[nid]) )
> +{
> +bool use_tail = IS_ALIGNED(s, 1UL << MAX_ORDER) &&
> +(find_first_set_bit(e) <= find_first_set_bit(s));
> +unsigned long n;
> +
> +n = init_node_heap(nid, s, nr_pages, &use_tail);
> +BUG_ON(n > nr_pages);
> +if ( use_tail )
> +e -= n;
> +else
> +s += n;
> +}
> +
> +while ( s < e )
> +{
> +free_heap_pages(mfn_to_page(_mfn(s)), 0, need_scrub);
> +s += 1UL;

... the more conventional s++ or ++s used here?

Jan



[PATCH] golang/xenlight: Update generated code

2022-07-18 Thread Oleksandr Tyshchenko
From: Oleksandr Tyshchenko 

Re-generate golang bindings to reflect changes to libxl_types.idl
from the following commit:
54d8f27d0477 tools/libxl: report trusted backend status to frontends

Signed-off-by: Oleksandr Tyshchenko 
---
 tools/golang/xenlight/helpers.gen.go | 12 
 tools/golang/xenlight/types.gen.go   |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/tools/golang/xenlight/helpers.gen.go 
b/tools/golang/xenlight/helpers.gen.go
index dece545ee0..33fe03971f 100644
--- a/tools/golang/xenlight/helpers.gen.go
+++ b/tools/golang/xenlight/helpers.gen.go
@@ -1774,6 +1774,9 @@ x.ColoPort = int(xc.colo_port)
 x.ColoExport = C.GoString(xc.colo_export)
 x.ActiveDisk = C.GoString(xc.active_disk)
 x.HiddenDisk = C.GoString(xc.hidden_disk)
+if err := x.Trusted.fromC();err != nil {
+return fmt.Errorf("converting field Trusted: %v", err)
+}
 
  return nil}
 
@@ -1815,6 +1818,9 @@ if x.ActiveDisk != "" {
 xc.active_disk = C.CString(x.ActiveDisk)}
 if x.HiddenDisk != "" {
 xc.hidden_disk = C.CString(x.HiddenDisk)}
+if err := x.Trusted.toC(); err != nil {
+return fmt.Errorf("converting field Trusted: %v", err)
+}
 
  return nil
  }
@@ -1899,6 +1905,9 @@ x.ColoFilterSecRedirector1Outdev = 
C.GoString(xc.colo_filter_sec_redirector1_out
 x.ColoFilterSecRewriter0Queue = C.GoString(xc.colo_filter_sec_rewriter0_queue)
 x.ColoCheckpointHost = C.GoString(xc.colo_checkpoint_host)
 x.ColoCheckpointPort = C.GoString(xc.colo_checkpoint_port)
+if err := x.Trusted.fromC();err != nil {
+return fmt.Errorf("converting field Trusted: %v", err)
+}
 
  return nil}
 
@@ -2028,6 +2037,9 @@ if x.ColoCheckpointHost != "" {
 xc.colo_checkpoint_host = C.CString(x.ColoCheckpointHost)}
 if x.ColoCheckpointPort != "" {
 xc.colo_checkpoint_port = C.CString(x.ColoCheckpointPort)}
+if err := x.Trusted.toC(); err != nil {
+return fmt.Errorf("converting field Trusted: %v", err)
+}
 
  return nil
  }
diff --git a/tools/golang/xenlight/types.gen.go 
b/tools/golang/xenlight/types.gen.go
index 253c9ad93d..bb149547fd 100644
--- a/tools/golang/xenlight/types.gen.go
+++ b/tools/golang/xenlight/types.gen.go
@@ -652,6 +652,7 @@ ColoPort int
 ColoExport string
 ActiveDisk string
 HiddenDisk string
+Trusted Defbool
 }
 
 type DeviceNic struct {
@@ -718,6 +719,7 @@ ColoFilterSecRedirector1Outdev string
 ColoFilterSecRewriter0Queue string
 ColoCheckpointHost string
 ColoCheckpointPort string
+Trusted Defbool
 }
 
 type DevicePci struct {
-- 
2.25.1




[xen-unstable test] 171666: tolerable FAIL

2022-07-18 Thread osstest service owner
flight 171666 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/171666/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-i386-examine-bios  6 xen-install  fail in 171663 pass in 171666
 test-amd64-i386-qemut-rhel6hvm-amd  7 xen-install  fail pass in 171663
 test-armhf-armhf-xl-rtds 14 guest-startfail pass in 171663

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-rtds15 migrate-support-check fail in 171663 never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail in 171663 never pass
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stopfail like 171663
 test-armhf-armhf-libvirt 16 saverestore-support-checkfail  like 171663
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 171663
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stopfail like 171663
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 171663
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 171663
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 171663
 test-armhf-armhf-libvirt-raw 15 saverestore-support-checkfail  like 171663
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 171663
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stopfail like 171663
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 171663
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stopfail like 171663
 test-amd64-i386-xl-pvshim14 guest-start  fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-seattle  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-checkfail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check 
fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-checkfail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-cubietruck 15 migrate-support-checkfail never pass
 test-armhf-armhf-xl-cubietruck 16 saverestore-support-checkfail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-checkfail   never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-checkfail  never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-checkfail  never pass
 test-armhf-armhf-libvirt 15 migrate-support-checkfail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-checkfail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-checkfail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  15 migrate-support-checkfail   never pass
 test-armhf-armhf-xl  16 saverestore-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  14 migrate-support-checkfail   never pass
 test-armhf-armhf-xl-vhd  15 saverestore-support-checkfail   never pass
 

Re: [PATCH v2 2/2] xen: Fix latent check-endbr.sh bug with 32bit build environments

2022-07-18 Thread Jan Beulich
On 15.07.2022 15:26, Andrew Cooper wrote:
> While Xen's current VMA means it works, the mawk fix (i.e. using $((0xN)) in
> the shell) isn't portable in 32bit shells.  See the code comment for the fix.
> 
> The fix found a second latent bug.  Recombining $vma_hi/lo should have used
> printf "%s%08x" and only worked previously because $vma_lo had bits set in
> its top nibble.  Combining with the main fix, %08x becomes %07x.
> 
> Fixes: $XXX patch 1
> Reported-by: Jan Beulich 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 
with, I guess, ...

> --- a/xen/tools/check-endbr.sh
> +++ b/xen/tools/check-endbr.sh
> @@ -61,19 +61,36 @@ ${OBJDUMP} -j .text $1 -d -w | grep ' endbr64 *$' | 
> cut -f 1 -d ':' > $VALID &
>  #the lower bits, rounding integers to the nearest 4k.
>  #
>  #Instead, use the fact that Xen's .text is within a 1G aligned region, 
> and
> -#split the VMA in half so AWK's numeric addition is only working on 32 
> bit
> -#numbers, which don't lose precision.
> +#split the VMA so AWK's numeric addition is only working on <32 bit
> +#numbers, which don't lose precision.  (See point 5)
>  #
>  # 4) MAWK doesn't support plain hex constants (an optional part of the POSIX
>  #spec), and GAWK and MAWK can't agree on how to work with hex constants 
> in
>  #a string.  Use the shell to convert $vma_lo to decimal before passing to
>  #AWK.
>  #
> +# 5) Point 4 isn't fully portable.  POSIX only requires that $((0xN)) be
> +#evaluated as long, which in 32bit shells turns negative if bit 31 of the
> +#VMA is set.  AWK then interprets this negative number as a double before
> +#adding the offsets from the binary grep.
> +#
> +#Instead of doing an 8/8 split with vma_hi/lo, do a 9/7 split.
> +#
> +#The consequence of this is that for all offsets, $vma_lo + offset needs
> +#to be less than 256M (i.e. 7 nibbles) so as to be successfully 
> recombined
> +#with the 9 nibbles of $vma_hi.  This is fine; .text is at the start of a
> +#1G aligned region, and Xen is far far smaller than 256M, but leave 
> safety
> +#check nevertheless.
> +#
>  eval $(${OBJDUMP} -j .text $1 -h |
> -$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 8), 
> substr($4, 9, 16)}')
> +$AWK '$2 == ".text" {printf "vma_hi=%s\nvma_lo=%s\n", substr($4, 1, 9), 
> substr($4, 10, 16)}')
>  
>  ${OBJCOPY} -j .text $1 -O binary $TEXT_BIN
>  
> +bin_sz=$(stat -c '%s' $TEXT_BIN)
> +[ "$bin_sz" -ge $(((1 << 28) - $vma_lo)) ] &&
> +{ echo "$MSG_PFX Error: .text offsets can exceed 256M" >&2; exit 1; }

... s/can/cannot/ ?

Jan



xenstored socket backlog length

2022-07-18 Thread Roger Pau Monné
Hello,

It has been raised on the freebsd-xen mailing list [0] that the socket
queue length for the xenstored local domain socket is set to 1, which
can cause concurrent executions of xl commands to fail.

I see in the xenstored implementation (xenstored_core.c init_sockets())
that the call to listen() is made with a backlog length of 1, and hence
would like to ask whether there's a reason for this, as I would think
having a slightly longer pending-connections queue shouldn't be an issue.

Was this value chosen based on a toolstack that has a central daemon
with a single connection to xenstored?

Thanks, Roger.

[0] https://lists.freebsd.org/archives/freebsd-xen/2022-July/000116.html
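
For context, a minimal sketch of the kind of listener setup being discussed.
The socket path and the backlog choice here are assumptions for illustration,
not the actual xenstored_core.c code:

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int main(void)
  {
      struct sockaddr_un addr = { .sun_family = AF_UNIX };
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);

      if ( fd < 0 )
          return 1;

      /* Hypothetical path, for illustration only. */
      strncpy(addr.sun_path, "/tmp/example-xenstored.sock",
              sizeof(addr.sun_path) - 1);
      unlink(addr.sun_path);

      if ( bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 )
          return 1;

      /*
       * With a backlog of 1, a second client connecting while another
       * connection is still waiting to be accept()ed may be refused; a
       * larger value (e.g. SOMAXCONN) lets several xl invocations connect
       * concurrently without racing for the single pending slot.
       */
      if ( listen(fd, 1 /* vs. something like SOMAXCONN */) < 0 )
          return 1;

      close(fd);
      return 0;
  }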



Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector

2022-07-18 Thread Andrew Cooper
On 15/07/2022 14:10, Boris Ostrovsky wrote:
>
> On 7/15/22 5:50 AM, Andrew Cooper wrote:
>> On 15/07/2022 09:18, Jane Malalane wrote:
>>> On 14/07/2022 00:27, Boris Ostrovsky wrote:
>    xen_hvm_smp_init();
>    WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm,
> xen_cpu_dead_hvm));
> diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c
> index 9d548b0c772f..be66e027ef28 100644
> --- a/arch/x86/xen/suspend_hvm.c
> +++ b/arch/x86/xen/suspend_hvm.c
> @@ -5,6 +5,7 @@
>    #include 
>    #include 
>    #include 
> +#include 
>    #include "xen-ops.h"
> @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled)
>    xen_hvm_init_shared_info();
>    xen_vcpu_restore();
>    }
> -    xen_setup_callback_vector();
> +    if (xen_ack_upcall) {
> +    unsigned int cpu;
> +
> +    for_each_online_cpu(cpu) {
> +    xen_hvm_evtchn_upcall_vector_t op = {
> +    .vector = HYPERVISOR_CALLBACK_VECTOR,
> +    .vcpu = per_cpu(xen_vcpu_id, cpu),
> +    };
> +
> +    BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector,
> + &op));
> +    /* Trick toolstack to think we are enlightened. */
> +    if (!cpu)
> +    BUG_ON(xen_set_callback_via(1));
 What are you trying to make the toolstack aware of? That we have *a*
 callback (either global or percpu)?
>>> Yes, specifically for the check in libxl__domain_pvcontrol_available.
>> And others.
>>
>> This is all a giant bodge, but basically a lot of tooling uses the
>> non-zero-ness of the CALLBACK_VIA param to determine whether the VM has
>> Xen-aware drivers loaded or not.
>>
>> The value 1 is a CALLBACK_VIA value which encodes GSI 1, and the only
>> reason this doesn't explode everywhere is because the
>> evtchn_upcall_vector registration takes priority over GSI delivery.
>>
>> This is decades of tech debt piled on top of tech debt.
>
>
> Feels like it (setting the callback parameter) is something that the
> hypervisor should do --- no need to expose guests to this.

Sensible or not, it is the ABI.

Linux still needs to work (nicely) with older Xen's in the world, and we
can't just retrofit a change in the hypervisor which says "btw, this ABI
we've just changed now has a side effect of modifying a field that you
also logically own".

~Andrew
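
For reference, the CALLBACK_VIA encoding referred to above can be illustrated
with a small decoder. This is a sketch only: the field layout is paraphrased
from memory of xen/include/public/hvm/params.h and should be treated as an
assumption rather than the authoritative definition:

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  static void decode_callback_via(uint64_t via)
  {
      unsigned int type = via >> 56;                       /* top byte selects the type */
      uint64_t payload = via & ((UINT64_C(1) << 56) - 1);  /* remaining bits */

      switch ( type )
      {
      case 0:
          printf("GSI %" PRIu64 "\n", payload);
          break;
      case 1:
          printf("PCI INTx (encoded 0x%" PRIx64 ")\n", payload);
          break;
      case 2:
          printf("vector 0x%" PRIx64 "\n", payload & 0xff);
          break;
      default:
          printf("unknown type %u\n", type);
          break;
      }
  }

  int main(void)
  {
      decode_callback_via(1);   /* the "trick" value: decodes as GSI 1 */
      return 0;
  }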


Re: [PATCH V7 2/2] xen/gnttab: Store frame GFN in struct page_info on Arm

2022-07-18 Thread Jan Beulich
On 16.07.2022 16:56, Oleksandr Tyshchenko wrote:
> From: Oleksandr Tyshchenko 
> 
> Rework Arm implementation to store grant table frame GFN
> in struct page_info directly instead of keeping it in
> standalone status/shared arrays. This patch is based on
> the assumption that a grant table page is a xenheap page.
> 
> To cover 64-bit/40-bit IPA on Arm64/Arm32 we need the space
> to hold 52-bit/28-bit + extra bit value respectively. In order
> to not grow the size of struct page_info borrow the required
> amount of bits from type_info's count portion which current
> context won't suffer (currently only 1 bit is used on Arm).
> Please note, to minimize code changes and avoid introducing
> an extra #ifdef-s to the header, we keep the same amount of
> bits on both subarches, although the count portion on Arm64
> could be wider, so we waste some bits here.
> 
> Introduce corresponding PGT_* constructs and access macro
> page_get(set)_xenheap_gfn. Please note, all accesses to
> the GFN portion of type_info field should always be protected
> by the P2M lock. In case when it is not feasible to satisfy
> that requirement (risk of deadlock, lock inversion, etc)
> it is important to make sure that all non-protected updates
> to this field are atomic.
> As several non-protected read accesses still exist within
> current code (most calls to page_get_xenheap_gfn() are not
> protected by the P2M lock) the subsequent patch will introduce
> hardening code for p2m_remove_mapping() to be called with P2M
> lock held in order to check any difference between what is
> already mapped and what is requested to be ummapped.
> 
> Update existing gnttab macros to deal with GFN value according
> to new location. Also update the use of count portion of type_info
> field on Arm in share_xen_page_with_guest().
> 
> While at it, extend this simplified M2P-like approach for any
> xenheap pages which are processed in xenmem_add_to_physmap_one()
> except foreign ones. Update the code to set GFN portion after
> establishing new mapping for the xenheap page in said function
> and to clean GFN portion when putting a reference on that page
> in p2m_put_l3_page().
> 
> And for everything to work correctly introduce arch-specific
> initialization pattern PGT_TYPE_INFO_INITIALIZER to be applied
> to type_info field during initialization at alloc_heap_pages()
> and acquire_staticmem_pages(). The pattern's purpose on Arm
> is to clear the GFN portion before use, on x86 it is just
> a stub.
> 
> This patch is intended to fix the potential issue on Arm
> which might happen when remapping grant-table frame.
> A guest (or the toolstack) will unmap the grant-table frame
> using XENMEM_remove_physmap. This is a generic hypercall,
> so on x86, we are relying on the fact the M2P entry will
> be cleared on removal. For architectures without the M2P,
> the GFN would still be present in the grant frame/status
> array. So on the next call to map the page, we will end up
> requesting the P2M to remove whatever mapping was at the given GFN.
> This could well be another mapping.
> 
> Please note, this patch also changes the behavior how the shared_info
> page (which is xenheap RAM page) is mapped in xenmem_add_to_physmap_one().
> Now, we only allow to map the shared_info at once. The subsequent
> attempts to map it will result in -EBUSY. Doing that we mandate
> the caller to first unmap the page before mapping it again. This is
> to prevent Xen creating an unwanted hole in the P2M. For instance,
> this could happen if the firmware stole a RAM address for mapping
> the shared_info page into but forgot to unmap it afterwards.
> 
> Besides that, this patch simplifies arch code on Arm by
> removing arrays and corresponding management code and
> as a result the gnttab_init_arch/gnttab_destroy_arch helpers
> and struct grant_table_arch become useless and can be
> dropped globally.
> 
> Suggested-by: Julien Grall 
> Signed-off-by: Oleksandr Tyshchenko 

Acked-by: Jan Beulich 




Re: [PATCH 4/7] xen/arm: mm: Add more ASSERT() in {destroy, modify}_xen_mappings()

2022-07-18 Thread Jan Beulich
On 16.07.2022 16:38, Julien Grall wrote:
> On 04/07/2022 13:35, Bertrand Marquis wrote:
>>> On 24 Jun 2022, at 10:11, Julien Grall  wrote:
>>>
>>> From: Julien Grall 
>>>
>>> Both destroy_xen_mappings() and modify_xen_mappings() will take in
>>> parameter a range [start, end[. Both ends should be page aligned.
>>>
>>> Add extra ASSERT() to ensure start and end are page aligned. Take the
>>> opportunity to rename 'v' to 's' to be consistent with the other helper.
>>>
>>> Signed-off-by: Julien Grall 
>>
>> With the prototype updated in mm.h as suggested by Michal:
> 
> Done. I will send a new version shortly.

I notice you had applied and then reverted this, with the revert saying
an x86 ack was missing. I don't see any need for an x86 ack here though,
so I'm puzzled ...

Jan



Re: [PATCH v2 3/3] xen/heap: pass order to free_heap_pages() in heap init

2022-07-18 Thread Wei Chen

Hi Julien,

On 2022/7/16 1:03, Julien Grall wrote:

From: Hongyan Xia 

The idea is to split the range into multiple aligned power-of-2 regions
which only needs to call free_heap_pages() once each. We check the least
significant set bit of the start address and use its bit index as the
order of this increment. This makes sure that each increment is both
power-of-2 and properly aligned, which can be safely passed to
free_heap_pages(). Of course, the order also needs to be sanity checked
against the upper bound and MAX_ORDER.

Tested on a nested environment on c5.metal with various amounts
of RAM and CONFIG_DEBUG=n. Time for end_boot_allocator() to complete:
 Before After
 - 90GB: 1445 ms 96 ms
 -  8GB:  126 ms  8 ms
 -  4GB:   62 ms  4 ms

Signed-off-by: Hongyan Xia 
Signed-off-by: Julien Grall 

---

Changes in v2:
 - Update comment
 - Update the numbers. They are slightly better as is_contig_page()
   has been folded in init_heap_pages().
---
  xen/common/page_alloc.c | 35 ---
  1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index eedb2fed77c3..2b99801d2ea3 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1779,7 +1779,7 @@ int query_page_offline(mfn_t mfn, uint32_t *status)
  
  /*

   * This function should only be called with valid pages from the same NUMA
- * node.
+ * node and zone.
   */
  static void _init_heap_pages(const struct page_info *pg,
   unsigned long nr_pages,
@@ -1806,8 +1806,22 @@ static void _init_heap_pages(const struct page_info *pg,
  
  while ( s < e )

  {
-free_heap_pages(mfn_to_page(_mfn(s)), 0, need_scrub);
-s += 1UL;
+/*
+ * For s == 0, we simply use the largest increment by checking the
+ * MSB of the region size. For s != 0, we also need to ensure that the
+ * chunk is properly sized to end at power-of-two alignment. We do this
+ * by checking the LSB of the start address and use its index as
+ * the increment. Both cases need to be guarded by MAX_ORDER.
+ *
+ * Note that the value of ffsl() and flsl() starts from 1 so we need
+ * to decrement it by 1.
+ */
+int inc_order = min(MAX_ORDER, flsl(e - s) - 1);
+
+if ( s )
+inc_order = min(inc_order, ffsl(s) - 1);
+free_heap_pages(mfn_to_page(_mfn(s)), inc_order, need_scrub);
+s += (1UL << inc_order);
  }
  }
  
@@ -1844,6 +1858,9 @@ static void init_heap_pages(
  
  for ( i = 0; i < nr_pages; )

  {
+#ifdef CONFIG_SEPARATE_XENHEAP
+unsigned int zone = page_to_zone(pg);
+#endif
  unsigned int nid = phys_to_nid(page_to_maddr(pg));
  unsigned long left = nr_pages - i;
  unsigned long contig_pages;
@@ -1856,6 +1873,18 @@ static void init_heap_pages(
   */
  for ( contig_pages = 1; contig_pages < left; contig_pages++ )
  {
+/*
+ * No need to check for the zone when !CONFIG_SEPARATE_XENHEAP
+ * because free_heap_pages() can only take power-of-two ranges
+ * which never cross zone boundaries. But for separate xenheap
+ * which is manually defined, it is possible for power-of-two
+ * range to cross zones.
+ */
+#ifdef CONFIG_SEPARATE_XENHEAP
+if ( zone != page_to_zone(pg) )
+break;
+#endif
+
  if ( nid != (phys_to_nid(page_to_maddr(pg))) )
  break;
  }


Reviewed-by: Wei Chen 
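
As an aside, the chunking rule described in the commit message can be tried
out in isolation with a small standalone program. It is plain C, independent
of the Xen allocator; the MAX_ORDER stand-in and the example range are
assumptions made for the demo, and the GCC/Clang builtins stand in for Xen's
flsl()/ffsl():

  #include <stdio.h>

  #define EXAMPLE_MAX_ORDER 18   /* stand-in for MAX_ORDER, assumed for the demo */

  int main(void)
  {
      unsigned long long s = 0x12345, e = 0x40000;   /* arbitrary page range */

      while ( s < e )
      {
          /* Largest power-of-two chunk bounded by the remaining size... */
          unsigned int order = 63 - __builtin_clzll(e - s);

          /* ...and, for s != 0, by the alignment of the start address. */
          if ( s )
          {
              unsigned int align = __builtin_ctzll(s);

              if ( align < order )
                  order = align;
          }

          if ( order > EXAMPLE_MAX_ORDER )
              order = EXAMPLE_MAX_ORDER;

          printf("chunk at %#llx, order %2u (%llu pages)\n",
                 s, order, 1ULL << order);
          s += 1ULL << order;
      }

      return 0;
  }

Every chunk printed is both a power of two in size and naturally aligned, so
each could be handed to free_heap_pages() in a single call.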




Re: Arm64's xen.efi vs GNU binutils (and alike)

2022-07-18 Thread Jan Beulich
On 18.07.2022 09:43, Wei Chen wrote:
> On 2022/7/11 22:32, Jan Beulich wrote:
>> the other day I wanted to look at the basic structure of xen.efi. First
>> I used my own dumping tool, which didn't work. Then I used objdump,
>> which appeared to work. I decided that I should look into what they do
>> different, and whether I could make mine work as well, or whether
>> instead objdump is broken and shouldn't work on this sort of binary.
>> While I'm not fully certain yet, I'm leaning to the latter. This is
>> supported by GNU objcopy corrupting the binary (I assume this is known
>> and considered okay-ish).
>>
> 
> Did you use x86's objcopy? AArch64 GNU objcopy does not support any
> PE format file. So I'm curious about the version of objcopy you are using.

I did use an aarch64 one, yes. Are you sure (partial) support wasn't
added? I've had no error messages, just a corrupt output binary. Btw,
this is 2.38's entry in bfd/config.bfd:

  aarch64-*-linux* | aarch64-*-netbsd*)
targ_defvec=aarch64_elf64_le_vec
targ_selvecs="aarch64_elf64_be_vec aarch64_elf32_le_vec 
aarch64_elf32_be_vec arm_elf32_le_vec arm_elf32_be_vec aarch64_pei_vec"
want64=true
;;

Clearly PEI (the name used in GNU binutils) is included by default.

>> Many problems boil down to the (ab)use of the DOS executable header
>> fields, yielding an invalid header. The first 8 bytes are instructions,
>> with the first carefully chosen to produce 'MZ' in its low half.
>> (Oddly enough Xen and Linux use different insns there.) This doesn't
>> play well with the meaning of the respective fields in the DOS header.
> 
> UEFI executables are regular PE32/PE32+ images, Arm64 EFI applications 
> use a subsystem "0xAA64". PE32/PE32+ require images to have a DOS header
> for option#1 backwards compatibility, or option#2 to prevent images from being
> run in DOS. I think Arm64 EFI applications select option#2. In this case
> I don't understand why we need a valid DOS header? For my understanding,
> we just need 'MZ' for file type identify and "offset to the PE header".

This last step requires reading a field from the DOS header which hadn't
been there forever. Therefore one first needs to establish whether the
field is actually inside the header. Yet the fields used to determine
header size have been re-used (abused).
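
To make that concrete, a minimal sketch of the lookup in question. The 'MZ'
magic at offset 0 and the e_lfanew field at offset 0x3c are standard PE/COFF;
the sanity checks below are only one plausible way a strict tool might behave,
not what any particular tool actually does:

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  int32_t pe_header_offset(const uint8_t *image, size_t len)
  {
      int32_t e_lfanew;

      if ( len < 0x40 || image[0] != 'M' || image[1] != 'Z' )
          return -1;                  /* no usable DOS header at all */

      memcpy(&e_lfanew, image + 0x3c, sizeof(e_lfanew));

      /* A well-formed DOS header is at least 0x40 bytes, so a PE offset
       * pointing back into it (or past the image) is suspicious. */
      if ( e_lfanew < 0x40 || (size_t)e_lfanew + 4 > len )
          return -1;

      if ( memcmp(image + e_lfanew, "PE\0\0", 4) != 0 )
          return -1;                  /* no PE signature where expected */

      return e_lfanew;
  }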

> Other fields have been re-used for other purposes when loading the Xen image as
> a binary. And lots of bootloaders are following this header format to load 
> Xen (Linux, or other Arm64 OS/VMM) images. Therefore, it is not 
> currently possible to construct a valid DOS header.

Which would carry the implication that well-behaved PE processing tools
should refuse to work with these binaries.

>> Subsequently there are a number of .quad-s, some of which again yield
>> an invalid DOS header. I'm therefore inclined to submit a patch to
>> make objdump properly fail on this binary. But of course with both
> 
> I have not used objdump to dump the xen image successfully. I always use
> xen-syms for objdump. Sorry, maybe I didn't understand your question clearly.

xen-syms is an ELF binary. That's of course easily dumpable. The
question very specifically is xen.efi, which ought to be a valid
binary (and hence possible to process by tools understanding the
format), but isn't really. As a result the question is: Should GNU
binutils be able to deal with this half-broken format? Imo the
answer can only be yes (requiring all tools to properly handle it)
or no (suggesting all tools to properly refuse to work with it).

>> Xen and Linux (and who knows who else) using this hairy approach, it
>> may end up necessary to continue to "support" this special case,
>> which is why I'm seeking your input here first.
>>
> 
> Yes, like I said above, most OSs, VMMs and bootloaders currently follow 
> this format and boot protocol. Therefore, it is difficult for us to 
> completely remove it all at once.
> 
> 
> 
>> Furthermore the fake .reloc section overlaps the file header. The
>> section is zero size (i.e. empty), but a reasonable PE loader might
>> still object to its RVA being zero.
>>
> 
> I am not very clear about "overlaps". Is it because we are setting
> PointerToRelocations to zero?

What is PointerToRelocations? There's an NT header field (entry 5
of the Data Directory) which is supposed to hold the same address as
the .reloc section's RVA. And it is the .reloc section's RVA being
zero which makes that section live at the same address as the image
header (both at RVA 0). The section being zero size, it can
effectively be put anywhere, and hence I cannot see why it isn't put
at a valid address (outside of the header). As long as it comes
ahead of .text in the section table, it would e.g. be fine to live
at the same RVA as .text. (Note how on x86 we had to adjust the RVAs
of .debug_* to match the general expectation of RVAs of successive
sections to never go backwards. Without that linker script
adjustment GNU ld would have produced a broken and unusable binary.)


RE: [PATCH v2 9/9] xen: introduce a Kconfig option to configure NUMA nodes number

2022-07-18 Thread Wei Chen
Hi Jan,

> -Original Message-
> From: Jan Beulich 
> Sent: 18 July 2022 16:10
> To: Wei Chen 
> Cc: nd ; Andrew Cooper ; George
> Dunlap ; Stefano Stabellini
> ; Wei Liu ; Roger Pau Monné
> ; xen-devel@lists.xenproject.org; Julien Grall
> 
> Subject: Re: [PATCH v2 9/9] xen: introduce a Kconfig option to configure
> NUMA nodes number
> 
> > Sent: 12 July 2022 22:34
> >
> > On 08.07.2022 16:54, Wei Chen wrote:
> >> --- a/xen/arch/Kconfig
> >> +++ b/xen/arch/Kconfig
> >> @@ -17,3 +17,14 @@ config NR_CPUS
> >>  For CPU cores which support Simultaneous Multi-Threading or
> > similar
> >>  technologies, this the number of logical threads which Xen
> >> will
> >>  support.
> >> +
> >> +config NR_NUMA_NODES
> >> +  int "Maximum number of NUMA nodes supported"
> >> +  range 1 255
> >> +  default "64"
> >> +  depends on NUMA
> >
> > Does 1 make sense? That's not going to be NUMA then, I would say.
> >
> 
>  Ok, we need at least 2 nodes to be a real NUMA.
> 
> > Does the value being (perhaps far) larger than NR_CPUS make sense?
> >
> 
>  Arm has 128 default NR_CPUS (except some platforms) and x86 has 256.
>  So I am not very clear about your comment about "far larger". In my
>  understanding, one node having 2 or 4 cores is very common in a NUMA
>  system.
> >>>
> >>> The defaults are fine. But does it make sense to have 255 nodes when
> >>> just 32 CPUs were selected? I'm afraid kconfig is going to get in the
> >>> way, but I think I'd like the upper bound to be min(NR_CPUS, 255).
> >>
> >> Looking around NUMA nodes with 0 CPUs exists. So I don't think we
> should
> >> tie the two.
> >>
> >
> > Yes, some nodes can only have RAM and some nodes can only have CPUs.
> > So is it ok to use 2-255 for node range?
> 
> Personally I think we shouldn't allow unreasonably high node counts,
> unless proven by real hardware existing. Which goes hand in hand with
> me wanting the upper bound to be a power of 2 value, if for nothing
> else than a hint that using power-of-2 values here is preferable.
> Hence I'd like to propose 64 or 128 as upper bound, in this order of
> (my personal) preference.
> 

Thanks, I will use 64 in the next version.

Wei Chen

> Jan

