Re: [PATCH 10/11] block: xen-blkfront: Demote kernel-doc abuses

2021-04-06 Thread Roger Pau Monné
On Fri, Mar 12, 2021 at 10:55:29AM +, Lee Jones wrote:
> Fixes the following W=1 kernel build warning(s):
> 
>  drivers/block/xen-blkfront.c:1960: warning: Function parameter or member 
> 'dev' not described in 'blkfront_probe'
>  drivers/block/xen-blkfront.c:1960: warning: Function parameter or member 
> 'id' not described in 'blkfront_probe'
>  drivers/block/xen-blkfront.c:1960: warning: expecting prototype for Allocate 
> the basic(). Prototype was for blkfront_probe() instead
>  drivers/block/xen-blkfront.c:2085: warning: Function parameter or member 
> 'dev' not described in 'blkfront_resume'
>  drivers/block/xen-blkfront.c:2085: warning: expecting prototype for or a 
> backend(). Prototype was for blkfront_resume() instead
>  drivers/block/xen-blkfront.c:2444: warning: wrong kernel-doc identifier on 
> line:
> 
> Cc: Konrad Rzeszutek Wilk 
> Cc: "Roger Pau Monné" 
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Cc: Jens Axboe 
> Cc: xen-de...@lists.xenproject.org
> Cc: linux-bl...@vger.kernel.org
> Signed-off-by: Lee Jones 

Acked-by: Roger Pau Monné 

Thanks, Roger.


Re: [PATCH RESEND] intel/pinctrl: check capability offset is between MMIO region

2021-03-25 Thread Roger Pau Monné
On Wed, Mar 24, 2021 at 06:57:12PM +0200, Andy Shevchenko wrote:
> On Wed, Mar 24, 2021 at 04:13:59PM +0100, Roger Pau Monné wrote:
> > On Wed, Mar 24, 2021 at 04:22:44PM +0200, Andy Shevchenko wrote:
> > > On Wed, Mar 24, 2021 at 02:55:15PM +0100, Roger Pau Monné wrote:
> > > > On Wed, Mar 24, 2021 at 02:58:07PM +0200, Andy Shevchenko wrote:
> > > > > On Wed, Mar 24, 2021 at 01:31:18PM +0100, Roger Pau Monne wrote:
> > > > > Moreover, it seems you are bailing out and basically denying driver 
> > > > > to load.
> > > > > This does look that capability is simply the first register that 
> > > > > blows the setup.
> > > > > I think you have to fix something into Xen to avoid loading these 
> > > > > drivers or
> > > > > check with something like pci_device_is_present() approach.
> > > > 
> > > > Is there a backing PCI device BAR for those MMIO regions that the
> > > > pinctrl driver is trying to access? AFAICT those regions are only
> > > > reported in the ACPI DSDT table on the _CRS method of the object (at
> > > > least on my system).
> > > 
> > > Unfortunately it does not expose PCI configuration space.
> > 
> > Are those regions supposed to be marked as reserved in the memory map,
> > or that's left to the discretion of the hardware vendor?
> 
> I didn't get. The OS doesn't see them and an internal backbone simply drops 
> any
> IO access to that region.

I'm not sure I understand the above reply. My question was whether the
MMIO regions used by the pinctrl device (as fetched from the ACPI DSDT
table) are supposed belong to regions marked as RESERVED in the
firmware memory map (ie: either the e820 or the EFI one).

> > > > Doing something like pci_device_is_present would require a register
> > > > that we know will never return ~0 unless the device is not present. As
> > > > said above, maybe we could use REVID to that end?
> > > 
> > > Yes, that's good, see above.
> > > 
> > > WRT capabilities, if we crash we will see the report immediately on the
> > > hardware which has such an issue. (It's quite unlikely we will ever have 
> > > one,
> > > that's why I consider it's not critical)
> > 
> > I would rather prefer to not crash, because I think the kernel should
> > only resort to crashing when there's no alternative, and here it's
> > perfectly fine to just print an error message and don't load the
> > driver.
> 
> Are we speaking about real hardware that has an issue? I eagerly want to know
> what is that beast.

OK, I'm not going to resend this anymore. I'm happy with just getting
the first patch in.

I think you trust the hardware more that I would do, and I also think
the check added here is very minimal an unintrusive and serves as a
way to sanitize the data fetched from the hardware in order to prevent
a kernel page fault if such data turns out to be wrong.

Taking a reactive approach of requiring a broken piece of hardware to
exist in order to sanitize a fetched value seems too risky. I could
add a WARN_ON or similar if you want some kind of splat that's very
noticeable when this goes wrong but that doesn't end up in a fatal
kernel page fault.

Thanks, Roger.


Re: [PATCH v2 1/2] intel/pinctrl: check REVID register value for device presence

2021-03-25 Thread Roger Pau Monné
On Wed, Mar 24, 2021 at 07:01:18PM +0200, Andy Shevchenko wrote:
> On Wed, Mar 24, 2021 at 04:43:11PM +0100, Roger Pau Monne wrote:
> 
> Thanks for a fix! My comments below.
> 
> > Use the value read from the REVID register in order to check for the
> > presence of the device. A read of all ones is treated as if the device
> > is not present, and hence probing is ended.
> > 
> > This fixes an issue when running as a Xen PVH dom0, where the ACPI
> > DSDT table is provided unmodified to dom0 and hence contains the
> > pinctrl devices, but the MMIO region(s) containing the device
> > registers might not be mapped in the guest physical memory map if such
> > region(s) are not exposed on a PCI device BAR or marked as reserved in
> > the host memory map.
> 
> Any particular point that we can use in the Fixes tag?

Hm, I haven't seen those issues up until 91d898e51e60 ('pinctrl:
intel: Convert capability list to features'), but the device wasn't
working properly for sure, as the registers where not accessible, it
just didn't lead to a kernel crash.

> ...
> 
> > Suggested-by: Andy Shevchenko 
> 
> Hmm... was it that address I have used? In any case I think my 
> @linux.intel.com
> is better.

I just used the same as the one that's on the MAINTAINERS file,
because I already had that n my Cc list. I can change to the @intel
one if that's your preference.

> ...
> 
> > /* Determine community features based on the revision */
> > value = readl(regs + REVID);
> > +   if (value == ~0u)
> > +   return -ENODATA;
> 
> I think -ENODEV is more appropriate here.
> Also comment above should be adjusted to explain this check.

Right, will change and send v3.

Thanks.


Re: [PATCH RESEND] intel/pinctrl: check capability offset is between MMIO region

2021-03-24 Thread Roger Pau Monné
On Wed, Mar 24, 2021 at 04:22:44PM +0200, Andy Shevchenko wrote:
> On Wed, Mar 24, 2021 at 02:55:15PM +0100, Roger Pau Monné wrote:
> > On Wed, Mar 24, 2021 at 02:58:07PM +0200, Andy Shevchenko wrote:
> > > On Wed, Mar 24, 2021 at 01:31:18PM +0100, Roger Pau Monne wrote:
> > > Moreover, it seems you are bailing out and basically denying driver to 
> > > load.
> > > This does look that capability is simply the first register that blows 
> > > the setup.
> > > I think you have to fix something into Xen to avoid loading these drivers 
> > > or
> > > check with something like pci_device_is_present() approach.
> > 
> > Is there a backing PCI device BAR for those MMIO regions that the
> > pinctrl driver is trying to access? AFAICT those regions are only
> > reported in the ACPI DSDT table on the _CRS method of the object (at
> > least on my system).
> 
> Unfortunately it does not expose PCI configuration space.

Are those regions supposed to be marked as reserved in the memory map,
or that's left to the discretion of the hardware vendor?

> > Doing something like pci_device_is_present would require a register
> > that we know will never return ~0 unless the device is not present. As
> > said above, maybe we could use REVID to that end?
> 
> Yes, that's good, see above.
> 
> WRT capabilities, if we crash we will see the report immediately on the
> hardware which has such an issue. (It's quite unlikely we will ever have one,
> that's why I consider it's not critical)

I would rather prefer to not crash, because I think the kernel should
only resort to crashing when there's no alternative, and here it's
perfectly fine to just print an error message and don't load the
driver. IMO I would rather boot without pinctrl than get a panic if
it turns out pinctrl capabilities list is somehow corrupted. It's a
long shot, but the check added in order to prevent this scenario is
minimal.

In any case I will send a new version with the REVID check and this
current patch.

Thanks, Roger.


Re: [PATCH RESEND] intel/pinctrl: check capability offset is between MMIO region

2021-03-24 Thread Roger Pau Monné
On Wed, Mar 24, 2021 at 02:58:07PM +0200, Andy Shevchenko wrote:
> On Wed, Mar 24, 2021 at 01:31:18PM +0100, Roger Pau Monne wrote:
> > When parsing the capability list make sure the offset is between the
> > MMIO region mapped in 'regs', or else the kernel hits a page fault.
> > 
> > This fault has been seen when running as a Xen PVH dom0, which doesn't
> > have the MMIO regions mapped into the domain physical memory map,
> > despite having the device reported in the ACPI DSDT table. This
> > results in reporting a capability offset of 0x (because the kernel
> > is accessing unpopulated memory), and such offset is outside of the
> > mapped region.
> > 
> > Adding the check is harmless, and prevents buggy or broken systems
> > from crashing the kernel if the MMIO region is not properly reported.
> 
> Thanks for the report.
> 
> Looking into the code I would like rather see the explicit comparison to 
> 0x
> or ~0 against entire register b/c it's (one of) standard way of devices to 
> tell
> that something is not supported.

That can be done also. I think what I've proposed is slightly more
robust, as it will prevent a kernel page fault if somehow the offset
to the next capability is below ~0, but past the end of the MMIO
region. Unlikely I know, but it's not worth a kernel panic.

What could be done is check whether reading REVID returns ~0 and exit
at that point, if ~0 will never be a valid value returned by that
register. I think that should be a separate patch however.

> Moreover, it seems you are bailing out and basically denying driver to load.
> This does look that capability is simply the first register that blows the 
> setup.
> I think you have to fix something into Xen to avoid loading these drivers or
> check with something like pci_device_is_present() approach.

Is there a backing PCI device BAR for those MMIO regions that the
pinctrl driver is trying to access? AFAICT those regions are only
reported in the ACPI DSDT table on the _CRS method of the object (at
least on my system).

Doing something like pci_device_is_present would require a register
that we know will never return ~0 unless the device is not present. As
said above, maybe we could use REVID to that end?

> > Fixes: 91d898e51e60 ('pinctrl: intel: Convert capability list to features')
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Cc: Mika Westerberg 
> > Cc: Andy Shevchenko 
> > Cc: Linus Walleij 
> > Cc: linux-g...@vger.kernel.org
> > ---
> > Resend because I've missed adding the maintainers, sorry for the spam.
> 
> I have a script to make it easier: 
> https://github.com/andy-shev/home-bin-tools/blob/master/ge2maintainer.sh

Thanks!


Re: [PATCH v3 5/8] xen/events: link interdomain events to associated xenbus device

2021-02-22 Thread Roger Pau Monné
On Fri, Feb 19, 2021 at 04:40:27PM +0100, Juergen Gross wrote:
> In order to support the possibility of per-device event channel
> settings (e.g. lateeoi spurious event thresholds) add a xenbus device
> pointer to struct irq_info() and modify the related event channel
> binding interfaces to take the pointer to the xenbus device as a
> parameter instead of the domain id of the other side.
> 
> While at it remove the stale prototype of bind_evtchn_to_irq_lateeoi().
> 
> Signed-off-by: Juergen Gross 
> Reviewed-by: Boris Ostrovsky 
> Reviewed-by: Wei Liu 
> Reviewed-by: Paul Durrant 
> ---
>  drivers/block/xen-blkback/xenbus.c  |  2 +-

Reviewed-by: Roger Pau Monné 

Thanks, Roger.


Re: [PATCH v2] xen-blkback: fix compatibility bug with single page rings

2021-02-02 Thread Roger Pau Monné
On Thu, Jan 28, 2021 at 01:04:41PM +, Paul Durrant wrote:
> From: Paul Durrant 
> 
> Prior to commit 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid
> inconsistent xenstore 'ring-page-order' set by malicious blkfront"), the
> behaviour of xen-blkback when connecting to a frontend was:
> 
> - read 'ring-page-order'
> - if not present then expect a single page ring specified by 'ring-ref'
> - else expect a ring specified by 'ring-refX' where X is between 0 and
>   1 << ring-page-order
> 
> This was correct behaviour, but was broken by the afforementioned commit to
> become:
> 
> - read 'ring-page-order'
> - if not present then expect a single page ring (i.e. ring-page-order = 0)
> - expect a ring specified by 'ring-refX' where X is between 0 and
>   1 << ring-page-order
> - if that didn't work then see if there's a single page ring specified by
>   'ring-ref'
> 
> This incorrect behaviour works most of the time but fails when a frontend
> that sets 'ring-page-order' is unloaded and replaced by one that does not
> because, instead of reading 'ring-ref', xen-blkback will read the stale
> 'ring-ref0' left around by the previous frontend will try to map the wrong
> grant reference.
> 
> This patch restores the original behaviour.
> 
> Fixes: 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid 
> inconsistent xenstore 'ring-page-order' set by malicious blkfront")
> Signed-off-by: Paul Durrant 
> ---
> Cc: Konrad Rzeszutek Wilk 
> Cc: "Roger Pau Monné" 
> Cc: Jens Axboe 
> Cc: Dongli Zhang 
> 
> v2:
>  - Remove now-spurious error path special-case when nr_grefs == 1
> ---
>  drivers/block/xen-blkback/common.h |  1 +
>  drivers/block/xen-blkback/xenbus.c | 38 +-
>  2 files changed, 17 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/common.h 
> b/drivers/block/xen-blkback/common.h
> index b0c71d3a81a0..524a79f10de6 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -313,6 +313,7 @@ struct xen_blkif {
>  
>   struct work_struct  free_work;
>   unsigned int    nr_ring_pages;
> + boolmulti_ref;

You seem to have used spaces between the type and the variable name
here, while neighbors also use hard tabs.

The rest LGTM:

Reviewed-by: Roger Pau Monné 

We should have forbidden the usage of ring-page-order = 0 and we could
have avoided having to add the multi_ref variable, but that's too late
now.

Thanks, Roger.


Re: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-20 Thread Roger Pau Monné
On Wed, Jan 20, 2021 at 03:37:57PM +0100, Jan Beulich wrote:
> On 20.01.2021 15:35, Roger Pau Monné wrote:
> > On Wed, Jan 20, 2021 at 03:23:30PM +0100, Arthur Borsboom wrote:
> >> Hi Roger,
> >>
> >> I have set up a test environment based on Linux 5.11.0-rc4.
> >> The patch did not apply clean, so I copied/pasted the patch manually.
> >>
> >> Without the patch the call trace (as reported) is visible in dmesg.
> >> With the patch the call trace in dmesg is gone, but ... (there is always a
> >> but) ...
> >>
> >> Now the discard action returns the following.
> >>
> >> [arthur@test-arch ~]$ sudo fstrim -v /
> >> fstrim: /: the discard operation is not supported
> >>
> >> It might be correct, but of course I was hoping the Xen VM guest would pass
> >> on the discard request to the block device in the Xen VM host, which is a
> >> disk partition.
> >> Any suggestions?
> > 
> > Hm, that's not what I did see on my testing, the operation worked OK,
> > and that's what I would expect to happen in your case also, since I
> > know the xenstore keys.
> 
> Does discard work on partitions (and not just on entire disks)?
> It's been a while since I last had anything to do with this code,
> so I may well misremember.

The command provided did work for me, ie:

# fstrim -v /
/: 19.8 GiB (21190717440 bytes) trimmed

AFAICT the blkif discard interface we provide operates at physical
block granularity, so it's possible to discard a single block, and
hence should work against partitions.

Roger.


Re: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-20 Thread Roger Pau Monné
On Wed, Jan 20, 2021 at 03:23:30PM +0100, Arthur Borsboom wrote:
> Hi Roger,
> 
> I have set up a test environment based on Linux 5.11.0-rc4.
> The patch did not apply clean, so I copied/pasted the patch manually.
> 
> Without the patch the call trace (as reported) is visible in dmesg.
> With the patch the call trace in dmesg is gone, but ... (there is always a
> but) ...
> 
> Now the discard action returns the following.
> 
> [arthur@test-arch ~]$ sudo fstrim -v /
> fstrim: /: the discard operation is not supported
> 
> It might be correct, but of course I was hoping the Xen VM guest would pass
> on the discard request to the block device in the Xen VM host, which is a
> disk partition.
> Any suggestions?

Hm, that's not what I did see on my testing, the operation worked OK,
and that's what I would expect to happen in your case also, since I
know the xenstore keys.

I think it's possible your email client has mangled the patch, I'm
attaching the same patch to this email, could you try to apply it
again and report back? (this time it should apply cleanly)

Thanks, Roger.
>From f09acedb2e40fa84f31cfe4c960dea2037d06e3f Mon Sep 17 00:00:00 2001
From: Roger Pau Monne 
Date: Mon, 18 Jan 2021 11:47:05 +0100
Subject: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This is inline with the specification described in blkif.h:

 * discard-granularity: should be set to the physical block size if
   node is not present.
 * discard-alignment, discard-secure: should be set to 0 if node not
   present.

This was detected as QEMU would only create the discard-granularity
node but not discard-alignment, and thus the setup done in
blkfront_setup_discard would fail.

Fix blkfront_setup_discard to not fail on missing nodes, and also fix
blkif_set_queue_limits to set the discard granularity to the physical
block size if none is specified in xenbus.

Fixes: ed30bf317c5ce ('xen-blkfront: Handle discard requests.')
Reported-by: Arthur Borsboom 
Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Jens Axboe 
Cc: xen-de...@lists.xenproject.org
Cc: linux-bl...@vger.kernel.org
Cc: Arthur Borsboom 
---
Changes since v2:
 - Allow all discard-* nodes to be optional.
---
 drivers/block/xen-blkfront.c | 20 +++-
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5265975b3fba..e1c6798889f4 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -945,7 +945,8 @@ static void blkif_set_queue_limits(struct blkfront_info 
*info)
if (info->feature_discard) {
blk_queue_flag_set(QUEUE_FLAG_DISCARD, rq);
blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_granularity = info->discard_granularity ?:
+info->physical_sector_size;
rq->limits.discard_alignment = info->discard_alignment;
if (info->feature_secdiscard)
blk_queue_flag_set(QUEUE_FLAG_SECERASE, rq);
@@ -2179,19 +2180,12 @@ static void blkfront_closing(struct blkfront_info *info)
 
 static void blkfront_setup_discard(struct blkfront_info *info)
 {
-   int err;
-   unsigned int discard_granularity;
-   unsigned int discard_alignment;
-
info->feature_discard = 1;
-   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
-   "discard-granularity", "%u", _granularity,
-   "discard-alignment", "%u", _alignment,
-   NULL);
-   if (!err) {
-   info->discard_granularity = discard_granularity;
-   info->discard_alignment = discard_alignment;
-   }
+   info->discard_granularity = xenbus_read_unsigned(info->xbdev->otherend,
+"discard-granularity",
+0);
+   info->discard_alignment = xenbus_read_unsigned(info->xbdev->otherend,
+  "discard-alignment", 0);
info->feature_secdiscard =
!!xenbus_read_unsigned(info->xbdev->otherend, "discard-secure",
   0);
-- 
2.29.2



Re: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-19 Thread Roger Pau Monné
Forgot to Cc stable for the Fixes tag. Doing it now.

On Tue, Jan 19, 2021 at 11:57:27AM +0100, Roger Pau Monne wrote:
> This is inline with the specification described in blkif.h:
> 
>  * discard-granularity: should be set to the physical block size if
>node is not present.
>  * discard-alignment, discard-secure: should be set to 0 if node not
>present.
> 
> This was detected as QEMU would only create the discard-granularity
> node but not discard-alignment, and thus the setup done in
> blkfront_setup_discard would fail.
> 
> Fix blkfront_setup_discard to not fail on missing nodes, and also fix
> blkif_set_queue_limits to set the discard granularity to the physical
> block size if none is specified in xenbus.
> 
> Fixes: ed30bf317c5ce ('xen-blkfront: Handle discard requests.')
> Reported-by: Arthur Borsboom 
> Signed-off-by: Roger Pau Monné 
> ---
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Cc: Konrad Rzeszutek Wilk 
> Cc: "Roger Pau Monné" 
> Cc: Jens Axboe 
> Cc: xen-de...@lists.xenproject.org
> Cc: linux-bl...@vger.kernel.org
> Cc: Arthur Borsboom 
> ---
> Changes since v2:
>  - Allow all discard-* nodes to be optional.
> ---
>  drivers/block/xen-blkfront.c | 20 +++-
>  1 file changed, 7 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 5265975b3fba..e1c6798889f4 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -945,7 +945,8 @@ static void blkif_set_queue_limits(struct blkfront_info 
> *info)
>   if (info->feature_discard) {
>   blk_queue_flag_set(QUEUE_FLAG_DISCARD, rq);
>   blk_queue_max_discard_sectors(rq, get_capacity(gd));
> - rq->limits.discard_granularity = info->discard_granularity;
> + rq->limits.discard_granularity = info->discard_granularity ?:
> +  info->physical_sector_size;
>   rq->limits.discard_alignment = info->discard_alignment;
>   if (info->feature_secdiscard)
>   blk_queue_flag_set(QUEUE_FLAG_SECERASE, rq);
> @@ -2179,19 +2180,12 @@ static void blkfront_closing(struct blkfront_info 
> *info)
>  
>  static void blkfront_setup_discard(struct blkfront_info *info)
>  {
> - int err;
> - unsigned int discard_granularity;
> - unsigned int discard_alignment;
> -
>   info->feature_discard = 1;
> - err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> - "discard-granularity", "%u", _granularity,
> - "discard-alignment", "%u", _alignment,
> - NULL);
> - if (!err) {
> - info->discard_granularity = discard_granularity;
> - info->discard_alignment = discard_alignment;
> - }
> + info->discard_granularity = xenbus_read_unsigned(info->xbdev->otherend,
> +  "discard-granularity",
> +  0);
> + info->discard_alignment = xenbus_read_unsigned(info->xbdev->otherend,
> +"discard-alignment", 0);
>   info->feature_secdiscard =
>   !!xenbus_read_unsigned(info->xbdev->otherend, "discard-secure",
>  0);
> -- 
> 2.29.2
> 


Re: [PATCH] xen-blkfront: don't make discard-alignment mandatory

2021-01-19 Thread Roger Pau Monné
On Tue, Jan 19, 2021 at 11:11:26AM +0100, Jürgen Groß wrote:
> On 19.01.21 11:06, Roger Pau Monné wrote:
> > On Tue, Jan 19, 2021 at 08:43:01AM +0100, Jürgen Groß wrote:
> > > On 18.01.21 16:15, Roger Pau Monne wrote:
> > > > Don't require the discard-alignment xenstore node to be present in
> > > > order to correctly setup the feature. This can happen with versions of
> > > > QEMU that only write the discard-granularity but not the
> > > > discard-alignment node.
> > > > 
> > > > Assume discard-alignment is 0 if not present. While there also fix the
> > > > logic to not enable the discard feature if discard-granularity is not
> > > > present.
> > > > 
> > > > Reported-by: Arthur Borsboom 
> > > > Signed-off-by: Roger Pau Monné 
> > > > ---
> > > > Cc: Boris Ostrovsky 
> > > > Cc: Juergen Gross 
> > > > Cc: Stefano Stabellini 
> > > > Cc: Konrad Rzeszutek Wilk 
> > > > Cc: "Roger Pau Monné" 
> > > > Cc: Jens Axboe 
> > > > Cc: xen-de...@lists.xenproject.org
> > > > Cc: linux-bl...@vger.kernel.org
> > > > Cc: Arthur Borsboom 
> > > > ---
> > > >drivers/block/xen-blkfront.c | 25 +
> > > >1 file changed, 13 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > > > index 5265975b3fba..5a93f7cc2939 100644
> > > > --- a/drivers/block/xen-blkfront.c
> > > > +++ b/drivers/block/xen-blkfront.c
> > > > @@ -2179,22 +2179,23 @@ static void blkfront_closing(struct 
> > > > blkfront_info *info)
> > > >static void blkfront_setup_discard(struct blkfront_info *info)
> > > >{
> > > > -   int err;
> > > > -   unsigned int discard_granularity;
> > > > -   unsigned int discard_alignment;
> > > > +   unsigned int discard_granularity = 0;
> > > > +   unsigned int discard_alignment = 0;
> > > > +   unsigned int discard_secure = 0;
> > > > -   info->feature_discard = 1;
> > > > -   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > > > +   xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > > > "discard-granularity", "%u", _granularity,
> > > > "discard-alignment", "%u", _alignment,
> > > > +   "discard-secure", "%u", _secure,
> > > > NULL);
> > > 
> > > This would mean that "discard-secure" will be evaluated only if the
> > > other two items are set in Xenstore. From blkif.h I can't see this is
> > > required, and your patch is modifying today's behavior in this regard.
> > > 
> > > You might want to have three xenbus_read_unsigned() calls instead.
> > 
> > You are right, discard-secure should be fetched regardless of whether
> > discard-alignment exists.
> > 
> > I can fetch discard-granularity and discard-alignment using
> > xenbus_gather and keep discard-secure using xenbus_read_unsigned. Let
> > me send a new version.
> 
> I'm still not convinced this is correct. blkif.h doesn't mention that
> discard-alignment will be valid only with discard-granularity being
> present.

No, in fact I think I need to rework this a little further. Just
having feature-discard = 1 should enable the discard functionality, by
setting discard-granularity = physical block size and
discard-alignment = 0.

Roger.


Re: [PATCH] xen-blkfront: don't make discard-alignment mandatory

2021-01-19 Thread Roger Pau Monné
On Tue, Jan 19, 2021 at 08:43:01AM +0100, Jürgen Groß wrote:
> On 18.01.21 16:15, Roger Pau Monne wrote:
> > Don't require the discard-alignment xenstore node to be present in
> > order to correctly setup the feature. This can happen with versions of
> > QEMU that only write the discard-granularity but not the
> > discard-alignment node.
> > 
> > Assume discard-alignment is 0 if not present. While there also fix the
> > logic to not enable the discard feature if discard-granularity is not
> > present.
> > 
> > Reported-by: Arthur Borsboom 
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Cc: Boris Ostrovsky 
> > Cc: Juergen Gross 
> > Cc: Stefano Stabellini 
> > Cc: Konrad Rzeszutek Wilk 
> > Cc: "Roger Pau Monné" 
> > Cc: Jens Axboe 
> > Cc: xen-de...@lists.xenproject.org
> > Cc: linux-bl...@vger.kernel.org
> > Cc: Arthur Borsboom 
> > ---
> >   drivers/block/xen-blkfront.c | 25 +
> >   1 file changed, 13 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > index 5265975b3fba..5a93f7cc2939 100644
> > --- a/drivers/block/xen-blkfront.c
> > +++ b/drivers/block/xen-blkfront.c
> > @@ -2179,22 +2179,23 @@ static void blkfront_closing(struct blkfront_info 
> > *info)
> >   static void blkfront_setup_discard(struct blkfront_info *info)
> >   {
> > -   int err;
> > -   unsigned int discard_granularity;
> > -   unsigned int discard_alignment;
> > +   unsigned int discard_granularity = 0;
> > +   unsigned int discard_alignment = 0;
> > +   unsigned int discard_secure = 0;
> > -   info->feature_discard = 1;
> > -   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > +   xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > "discard-granularity", "%u", _granularity,
> > "discard-alignment", "%u", _alignment,
> > +   "discard-secure", "%u", _secure,
> > NULL);
> 
> This would mean that "discard-secure" will be evaluated only if the
> other two items are set in Xenstore. From blkif.h I can't see this is
> required, and your patch is modifying today's behavior in this regard.
> 
> You might want to have three xenbus_read_unsigned() calls instead.

You are right, discard-secure should be fetched regardless of whether
discard-alignment exists.

I can fetch discard-granularity and discard-alignment using
xenbus_gather and keep discard-secure using xenbus_read_unsigned. Let
me send a new version.

Thanks, Roger.


Re: [PATCH] xen/privcmd: allow fetching resource sizes

2021-01-12 Thread Roger Pau Monné
On Tue, Jan 12, 2021 at 06:57:30AM +0100, Jürgen Groß wrote:
> On 11.01.21 16:29, Roger Pau Monne wrote:
> > Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
> > addr = 0 in order to fetch the size of a specific resource.
> > 
> > Add a shortcut to the default map resource path, since fetching the
> > size requires no address to be passed in, and thus no VMA to setup.
> > 
> > Fixes: 3ad0876554caf ('xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE')
> 
> I don't think this addition is a reason to add a "Fixes:" tag. This is
> clearly new functionality.

It could be argued that not allowing to query the resource size was a
shortcoming of the original implementation, but a backport request to
stable would be more appropriate than a fixes tag I think. Will drop
on next version and add a backport request if you agree.

Thanks, Roger.


Re: [PATCH] xen: fix: use WARN_ON instead of if condition followed by BUG.

2020-12-30 Thread Roger Pau Monné
On Wed, Dec 30, 2020 at 02:38:06PM +0800, YANG LI wrote:
> Use WARN_ON instead of if condition followed by BUG in
> gnttab_batch_map() and gnttab_batch_copy().

But those are not equivalent as far as I'm aware. BUG will stop
execution, while WARN_ON will print a splat and continue executing.

If switching to WARN_ON is indeed fine it needs to be explained in the
commit message that returning to the caller(s) with
HYPERVISOR_grant_table_op having returned an error code is fine, and
that it's not going to create other issues, like memory corruption or
leaks.

Thanks, Roger.


Re: [PATCH 2/2] xen: don't use page->lru for ZONE_DEVICE memory

2020-12-10 Thread Roger Pau Monné
On Tue, Dec 08, 2020 at 07:45:00AM +0100, Jürgen Groß wrote:
> On 07.12.20 21:48, Jason Andryuk wrote:
> > On Mon, Dec 7, 2020 at 8:30 AM Juergen Gross  wrote:
> > > 
> > > Commit 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated
> > > memory") introduced usage of ZONE_DEVICE memory for foreign memory
> > > mappings.
> > > 
> > > Unfortunately this collides with using page->lru for Xen backend
> > > private page caches.
> > > 
> > > Fix that by using page->zone_device_data instead.
> > > 
> > > Fixes: 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated memory")
> > > Signed-off-by: Juergen Gross 
> > 
> > Would it make sense to add BUG_ON(is_zone_device_page(page)) and the
> > opposite as appropriate to cache_enq?
> 
> No, I don't think so. At least in the CONFIG_ZONE_DEVICE case the
> initial list in a PV dom0 is populated from extra memory (basically
> the same, but not marked as zone device memory explicitly).

I assume it's fine for us to then use page->zone_device_data even if
the page is not explicitly marked as ZONE_DEVICE memory?

If that's fine, add my:

Acked-by: Roger Pau Monné 

To both patches, and thank you very much for fixing this!

Roger.


Re: [PATCH v3 3/6] mm: support THP migration to device private memory

2020-12-05 Thread Roger Pau Monné
On Wed, Dec 02, 2020 at 11:08:54AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 20, 2020 at 04:01:33PM -0400, Jason Gunthorpe wrote:
> > On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> > 
> > > MEMORY_DEVICE_GENERIC:
> > > Struct pages are created in dev_dax_probe() and represent non-volatile 
> > > memory.
> > > The device can be mmap()'ed which calls dax_mmap() which sets
> > > vma->vm_flags | VM_HUGEPAGE.
> > > A CPU page fault will result in a PTE, PMD, or PUD sized page
> > > (but not compound) to be inserted by vmf_insert_mixed() which will call 
> > > either
> > > insert_pfn() or insert_page().
> > > Neither insert_pfn() nor insert_page() increments the page reference
> > > count.
> > 
> > But why was this done? It seems very strange to put a pfn with a
> > struct page into a VMA and then deliberately not take the refcount for
> > the duration of that pfn being in the VMA?
> > 
> > What prevents memunmap_pages() from progressing while VMAs still point
> > at the memory?
> 
> Agreed.  Adding Roger who added MEMORY_DEVICE_GENERIC and the only
> user.

MEMORY_DEVICE_GENERIC is just a rename of the previous
MEMORY_DEVICE_DEVDAX, and seems to be used by the DAX device apart
from Xen?

It's main purpose is to be able to allocate unused physical memory
ranges and have a baking struct page for them, so they can be used to
map foreign memory when running on Xen.

I'm currently on leave and won't be back until the end of the month,
could you please Cc the Xen maintainers if you modify the logic here
in order to make sure it will work for Xen?

Thanks, Roger.


Re: [PATCH 058/141] xen-blkfront: Fix fall-through warnings for Clang

2020-11-23 Thread Roger Pau Monné
On Fri, Nov 20, 2020 at 12:32:58PM -0600, Gustavo A. R. Silva wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
> by explicitly adding a break statement instead of letting the code fall
> through to the next case.
> 
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/block/xen-blkfront.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 48629d3433b4..34b028be78ab 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -2462,6 +2462,7 @@ static void blkback_changed(struct xenbus_device *dev,
>   break;
>   if (talk_to_blkback(dev, info))
>   break;
> + break;

I would have added a fallthrough like it's done below in
XenbusStateClosed.

Also, FWIW, I think clang's fallthrough warnings are a bit too verbose.
Falling through to a break like the case here shouldn't cause a
warning IMO, falling through to anything != break should indeed cause
those warnings to appear.

Thanks, Roger.


Re: [PATCH 1/2] xen: Remove Xen PVH/PVHVM dependency on PCI

2020-10-15 Thread Roger Pau Monné
On Thu, Oct 15, 2020 at 05:02:21PM +0200, Jan Beulich wrote:
> On 15.10.2020 16:59, Jason Andryuk wrote:
> > On Thu, Oct 15, 2020 at 4:10 AM Jan Beulich  wrote:
> >>
> >> On 14.10.2020 19:53, Jason Andryuk wrote:
> >>> @@ -76,7 +80,9 @@ config XEN_DEBUG_FS
> >>> Enabling this option may incur a significant performance overhead.
> >>>
> >>>  config XEN_PVH
> >>> - bool "Support for running as a Xen PVH guest"
> >>> + bool "Xen PVH guest support"
> >>
> >> Tangential question: Is "guest" here still appropriate, i.e.
> >> isn't this option also controlling whether the kernel can be
> >> used in a PVH Dom0?
> > 
> > Would you like something more generic like "Xen PVH support" and
> > "Support for running in Xen PVH mode"?
> 
> Yeah, just dropping "guest" would be fine with me. No idea how
> to reflect that PVH Dom0 isn't supported, yet.

The fact that it isn't supported by Xen shouldn't be reflected on the
Linux configuration, as it's independent. Ie: you could run this Linux
kernel on a future version of Xen where PVH dom0 is supported.

There's already a warning printed by Xen when booting PVH dom0 about
not being a supported mode.

Thanks, Roger.


Re: [PATCH 2/2] xen/blkback: turn the cache purge percent into a parameter

2020-10-15 Thread Roger Pau Monné
On Thu, Oct 15, 2020 at 04:37:52PM +0200, Jürgen Groß wrote:
> On 15.10.20 16:24, Roger Pau Monne wrote:
> > Assume that reads and writes to the variable will be atomic. The worse
> > that could happen is that one of the purges removes a partially
> > written percentage of grants, but the cache itself will recover.
> > 
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Cc: Konrad Rzeszutek Wilk 
> > Cc: Jens Axboe 
> > Cc: Boris Ostrovsky 
> > Cc: SeongJae Park 
> > Cc: xen-de...@lists.xenproject.org
> > Cc: linux-bl...@vger.kernel.org
> > Cc: J. Roeleveld 
> > Cc: Jürgen Groß 
> > ---
> >   Documentation/ABI/testing/sysfs-driver-xen-blkback | 9 +
> >   drivers/block/xen-blkback/blkback.c| 7 +--
> >   2 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> > b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > index 776f25d335ca..7de791ad61f9 100644
> > --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > @@ -45,3 +45,12 @@ Description:
> >   to be executed periodically. This parameter controls the 
> > time
> >   interval between consecutive executions of the purge 
> > mechanism
> >       is set in ms.
> > +
> > +What:   /sys/module/xen_blkback/parameters/lru_percent_clean
> > +Date:   October 2020
> > +KernelVersion:  5.10
> > +Contact:Roger Pau Monné 
> > +Description:
> > +When the persistent grants list is full we will remove 
> > unused
> > +grants from the list. The percent number of grants to be
> > +removed at each LRU execution.
> > diff --git a/drivers/block/xen-blkback/blkback.c 
> > b/drivers/block/xen-blkback/blkback.c
> > index 6ad9b76fdb2b..772852d45a5a 100644
> > --- a/drivers/block/xen-blkback/blkback.c
> > +++ b/drivers/block/xen-blkback/blkback.c
> > @@ -127,7 +127,10 @@ MODULE_PARM_DESC(lru_internval,
> >* from the list. The percent number of grants to be removed at each LRU
> >* execution.
> >*/
> > -#define LRU_PERCENT_CLEAN 5
> > +static unsigned int lru_percent_clean = 5;
> > +module_param_named(lru_percent_clean, lru_percent_clean, uint, 0644);
> > +MODULE_PARM_DESC(lru_percent_clean,
> > +"Percentage of persistent grants to remove from the cache when 
> > full");
> >   /* Run-time switchable: /sys/module/blkback/parameters/ */
> >   static unsigned int log_stats;
> > @@ -404,7 +407,7 @@ static void purge_persistent_gnt(struct xen_blkif_ring 
> > *ring)
> > !ring->blkif->vbd.overflow_max_grants)) {
> > num_clean = 0;
> > } else {
> > -   num_clean = (max_pgrants / 100) * LRU_PERCENT_CLEAN;
> > +   num_clean = (max_pgrants / 100) * lru_percent_clean;
> 
> Hmm, wouldn't it be better to use (max_grants * lru_percent_clean) / 100
> here in order to support max_grants values less than 100?

Yes, we should have done that when moving max_pgrants to a parameter,
it used to be fixed first.

Will do in next version since I'm already changing the line.

Thanks, Roger.


Re: [PATCH] xen-blkback: add a parameter for disabling of persistent grants

2020-09-24 Thread Roger Pau Monné
On Thu, Sep 24, 2020 at 12:27:14PM +0200, SeongJae Park wrote:
> On Thu, 24 Sep 2020 12:13:44 +0200 "Roger Pau Monné"  
> wrote:
> 
> > On Wed, Sep 23, 2020 at 04:09:30PM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Sep 22, 2020 at 09:01:25AM +0200, SeongJae Park wrote:
> > > > From: SeongJae Park 
> > > > 
> > > > Persistent grants feature provides high scalability.  On some small
> > > > systems, however, it could incur data copy overhead[1] and thus it is
> > > > required to be disabled.  But, there is no option to disable it.  For
> > > > the reason, this commit adds a module parameter for disabling of the
> > > > feature.
> > > 
> > > Would it be better suited to have it per guest?
> > 
> > I think having a per-backend policy that could be specified at the
> > toolstack level would be nice, but I see that as a further
> > improvement.
> 
> Agreed.
> 
> > 
> > Having a global backend domain policy of whether persistent grants are
> > enabled or not seems desirable, and if someone wants even more fine
> > grained control this change is AFAICT not incompatible with a
> > per-backend option anyway.
> 
> I think we could extend this design by receiving list of exceptional domains.
> For example, if 'feature_persistent' is True and exceptions list has '123,
> 456', domains of domid 123 and 456 will not use persistent grants, and vice
> versa.

I think that would be quite fragile IMO, I wouldn't recommend relying
on domain IDs.

What I would do instead is add a new attribute to
xl-disk-configuration [0] that allows setting the persistent grants
usage on a per-disk basis, and that should be passed to blkback in a
xenstore node.

> I could implement this, but... to be honest, I don't really understand the
> needs of the fine-grained control.  AFAIU, the problem is 'scalability' vs
> 'data copy overhead'.  So, only small systems would want to turn persistent
> grants off.  In such a small system, why would we need fine-grained control?
> I'm worrying if I would implement and maintain a feature without real use 
> case.
> 
> For the reason, I'd like to suggest to keep this as is for now and expand it
> with the 'exceptions list' idea or something better, if a real use case comes
> out later.

I agree. I'm happy to take patches to implement more fine grained
control, but that shouldn't prevent us from having a global policy if
that's useful to users.

Roger.

[0] https://xenbits.xen.org/docs/unstable/man/xl-disk-configuration.5.html


Re: [PATCH] xen-blkback: add a parameter for disabling of persistent grants

2020-09-24 Thread Roger Pau Monné
On Wed, Sep 23, 2020 at 04:09:30PM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 22, 2020 at 09:01:25AM +0200, SeongJae Park wrote:
> > From: SeongJae Park 
> > 
> > Persistent grants feature provides high scalability.  On some small
> > systems, however, it could incur data copy overhead[1] and thus it is
> > required to be disabled.  But, there is no option to disable it.  For
> > the reason, this commit adds a module parameter for disabling of the
> > feature.
> 
> Would it be better suited to have it per guest?

I think having a per-backend policy that could be specified at the
toolstack level would be nice, but I see that as a further
improvement.

Having a global backend domain policy of whether persistent grants are
enabled or not seems desirable, and if someone wants even more fine
grained control this change is AFAICT not incompatible with a
per-backend option anyway.

Roger.


Re: [PATCH v3 0/3] xen-blk(back|front): Let users disable persistent grants

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 04:15:46PM +0200, SeongJae Park wrote:
> From: SeongJae Park 
> 
> Persistent grants feature provides high scalability.  On some small
> systems, however, it could incur data copy overheads[1] and thus it is
> required to be disabled.  But, there is no option to disable it.  For
> the reason, this commit adds module parameters for disabling of the
> feature.

With the changes requested by Jürgen you can add my:

Acked-by: Roger Pau Monné 

Thanks, Roger.


Re: [PATCH v3 3/3] xen-blkfront: Apply changed parameter name to the document

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 04:27:39PM +0200, Jürgen Groß wrote:
> On 22.09.20 16:15, SeongJae Park wrote:
> > From: SeongJae Park 
> > 
> > Commit 14e710fe7897 ("xen-blkfront: rename indirect descriptor
> > parameter") changed the name of the module parameter for the maximum
> > amount of segments in indirect requests but missed updating the
> > document.  This commit updates the document.
> > 
> > Signed-off-by: SeongJae Park 
> 
> Reviewed-by: Juergen Gross 

Does this need to be backported to stable branches?

Thanks, Roger.


Re: [PATCH v2 1/3] xen-blkback: add a parameter for disabling of persistent grants

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 01:26:38PM +0200, SeongJae Park wrote:
> On Tue, 22 Sep 2020 13:12:59 +0200 "Roger Pau Monné"  
> wrote:
> 
> > On Tue, Sep 22, 2020 at 12:52:07PM +0200, SeongJae Park wrote:
> > > From: SeongJae Park 
> > > 
> > > Persistent grants feature provides high scalability.  On some small
> > > systems, however, it could incur data copy overheads[1] and thus it is
> > > required to be disabled.  But, there is no option to disable it.  For
> > > the reason, this commit adds a module parameter for disabling of the
> > > feature.
> > > 
> > > [1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
> > > 
> > > Signed-off-by: Anthony Liguori 
> > > Signed-off-by: SeongJae Park 
> > > ---
> > >  .../ABI/testing/sysfs-driver-xen-blkback  |  9 ++
> > >  drivers/block/xen-blkback/xenbus.c| 28 ++-
> > >  2 files changed, 30 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> > > b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > > index ecb7942ff146..ac2947b98950 100644
> > > --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > > +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > > @@ -35,3 +35,12 @@ Description:
> > >  controls the duration in milliseconds that blkback will 
> > > not
> > >  cache any page not backed by a grant mapping.
> > >  The default is 10ms.
> > > +
> > > +What:   /sys/module/xen_blkback/parameters/feature_persistent
> > > +Date:   September 2020
> > > +KernelVersion:  5.10
> > > +Contact:SeongJae Park 
> > > +Description:
> > > +Whether to enable the persistent grants feature or not.  
> > > Note
> > > +that this option only takes effect on newly created 
> > > backends.
> > > +The default is Y (enable).
> > > diff --git a/drivers/block/xen-blkback/xenbus.c 
> > > b/drivers/block/xen-blkback/xenbus.c
> > > index b9aa5d1ac10b..8a95ddd08b13 100644
> > > --- a/drivers/block/xen-blkback/xenbus.c
> > > +++ b/drivers/block/xen-blkback/xenbus.c
> > > @@ -879,6 +879,12 @@ static void reclaim_memory(struct xenbus_device *dev)
> > >  
> > >  /* ** Connection ** */
> > >  
> > > +/* Enable the persistent grants feature. */
> > > +static bool feature_persistent = true;
> > > +module_param(feature_persistent, bool, 0644);
> > > +MODULE_PARM_DESC(feature_persistent,
> > > + "Enables the persistent grants feature");
> > > +
> > >  /*
> > >   * Write the physical details regarding the block device to the store, 
> > > and
> > >   * switch to Connected state.
> > > @@ -906,11 +912,15 @@ static void connect(struct backend_info *be)
> > >  
> > >   xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);
> > >  
> > > - err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
> > > - if (err) {
> > > - xenbus_dev_fatal(dev, err, "writing %s/feature-persistent",
> > > -  dev->nodename);
> > > - goto abort;
> > > + if (feature_persistent) {
> > > + err = xenbus_printf(xbt, dev->nodename, "feature-persistent",
> > > + "%u", feature_persistent);
> > > + if (err) {
> > > + xenbus_dev_fatal(dev, err,
> > > + "writing %s/feature-persistent",
> > > + dev->nodename);
> > > + goto abort;
> > > + }
> > >   }
> > >  
> > >   err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
> > > @@ -1093,8 +1103,12 @@ static int connect_ring(struct backend_info *be)
> > >   xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
> > >   return -ENOSYS;
> > >   }
> > > - pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
> > > -0);
> > > + if (feature_persistent)
> > > + pers_grants = xenbus_read_unsigned(dev->otherend,
> > > +

Re: [PATCH v2 1/3] xen-blkback: add a parameter for disabling of persistent grants

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 12:52:07PM +0200, SeongJae Park wrote:
> From: SeongJae Park 
> 
> Persistent grants feature provides high scalability.  On some small
> systems, however, it could incur data copy overheads[1] and thus it is
> required to be disabled.  But, there is no option to disable it.  For
> the reason, this commit adds a module parameter for disabling of the
> feature.
> 
> [1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
> 
> Signed-off-by: Anthony Liguori 
> Signed-off-by: SeongJae Park 
> ---
>  .../ABI/testing/sysfs-driver-xen-blkback  |  9 ++
>  drivers/block/xen-blkback/xenbus.c| 28 ++-
>  2 files changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> index ecb7942ff146..ac2947b98950 100644
> --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> @@ -35,3 +35,12 @@ Description:
>  controls the duration in milliseconds that blkback will not
>  cache any page not backed by a grant mapping.
>  The default is 10ms.
> +
> +What:   /sys/module/xen_blkback/parameters/feature_persistent
> +Date:   September 2020
> +KernelVersion:  5.10
> +Contact:SeongJae Park 
> +Description:
> +Whether to enable the persistent grants feature or not.  Note
> +that this option only takes effect on newly created backends.
> +The default is Y (enable).
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index b9aa5d1ac10b..8a95ddd08b13 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -879,6 +879,12 @@ static void reclaim_memory(struct xenbus_device *dev)
>  
>  /* ** Connection ** */
>  
> +/* Enable the persistent grants feature. */
> +static bool feature_persistent = true;
> +module_param(feature_persistent, bool, 0644);
> +MODULE_PARM_DESC(feature_persistent,
> + "Enables the persistent grants feature");
> +
>  /*
>   * Write the physical details regarding the block device to the store, and
>   * switch to Connected state.
> @@ -906,11 +912,15 @@ static void connect(struct backend_info *be)
>  
>   xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);
>  
> - err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
> - if (err) {
> - xenbus_dev_fatal(dev, err, "writing %s/feature-persistent",
> -  dev->nodename);
> - goto abort;
> + if (feature_persistent) {
> + err = xenbus_printf(xbt, dev->nodename, "feature-persistent",
> + "%u", feature_persistent);
> + if (err) {
> + xenbus_dev_fatal(dev, err,
> + "writing %s/feature-persistent",
> + dev->nodename);
> + goto abort;
> + }
>   }
>  
>   err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
> @@ -1093,8 +1103,12 @@ static int connect_ring(struct backend_info *be)
>   xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
>   return -ENOSYS;
>   }
> - pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
> -0);
> + if (feature_persistent)
> + pers_grants = xenbus_read_unsigned(dev->otherend,
> + "feature-persistent", 0);
> + else
> + pers_grants = 0;
> +

Sorry for not realizing earlier, but looking at it again I think you
need to cache the value of feature_persistent when it's first used in
the blkback state data, so that it's consistent.

What would happen for example with the following flow (assume a
persistent grants enabled frontend):

feature_persistent = false

connect(...)
feature-persistent is not written to xenstore

User changes feature_persistent = true

connect_ring(...)
pers_grants = true, because feature-persistent is set unconditionally
by the frontend and feature_persistent variable is now true.

Then blkback will try to use persistent grants and the whole
connection will malfunction because the frontend won't.

The other option is to prevent changing the variable when there are
blkback instances already running.

Thanks, Roger.


Re: [PATCH] xen-blkback: add a parameter for disabling of persistent grants

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 09:01:25AM +0200, SeongJae Park wrote:
> From: SeongJae Park 
> 
> Persistent grants feature provides high scalability.  On some small
> systems, however, it could incur data copy overhead[1] and thus it is
> required to be disabled.  But, there is no option to disable it.  For
> the reason, this commit adds a module parameter for disabling of the
> feature.

Have you considered adding a similar option for blkfront?

> 
> [1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
> 
> Signed-off-by: Anthony Liguori 
> Signed-off-by: SeongJae Park 
> ---
>  .../ABI/testing/sysfs-driver-xen-blkback|  8 
>  drivers/block/xen-blkback/xenbus.c  | 17 ++---
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> index ecb7942ff146..0c42285c75ee 100644
> --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> @@ -35,3 +35,11 @@ Description:
>  controls the duration in milliseconds that blkback will not
>  cache any page not backed by a grant mapping.
>  The default is 10ms.
> +
> +What:   /sys/module/xen_blkback/parameters/feature_persistent
> +Date:   September 2020
> +KernelVersion:  5.10
> +Contact:SeongJae Park 
> +Description:
> +Whether to enable the persistent grants feature or not.
> +The default is 1 (enable).

I would add that this option only takes effect on newly created
backends, existing backends when the option is set will continue to
use persistent grants.

For already running backends you could drain the buffer of persistent
grants and flip the option, but that's more complex and not required.

> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index b9aa5d1ac10b..9c03d70469f4 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -879,6 +879,12 @@ static void reclaim_memory(struct xenbus_device *dev)
>  
>  /* ** Connection ** */
>  
> +/* Enable the persistent grants feature. */
> +static unsigned int feature_persistent = 1;
> +module_param_named(feature_persistent, feature_persistent, int, 0644);
> +MODULE_PARM_DESC(feature_persistent,
> + "Enables the persistent grants feature");
> +
>  /*
>   * Write the physical details regarding the block device to the store, and
>   * switch to Connected state.
> @@ -906,7 +912,8 @@ static void connect(struct backend_info *be)
>  
>   xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);
>  
> - err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
> + err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u",
> + feature_persistent ? 1 : 0);

You can avoid writing the feature altogether if it's not enabled,
there's no need to set feature-persistent = 0.

Thanks, Roger.


Re: linux-next: build failure after merge of the akpm-current tree

2020-09-09 Thread Roger Pau Monné
On Tue, Sep 08, 2020 at 08:09:50PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the akpm-current tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> drivers/xen/unpopulated-alloc.c: In function 'fill_list':
> drivers/xen/unpopulated-alloc.c:30:9: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>30 |  pgmap->res.name = "Xen scratch";
>   | ^~~
>   | ref
> drivers/xen/unpopulated-alloc.c:31:9: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>31 |  pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
>   | ^~~
>   | ref
> drivers/xen/unpopulated-alloc.c:33:51: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>33 |  ret = allocate_resource(_resource, >res,
>   |   ^~~
>   |   ref
> In file included from include/asm-generic/memory_model.h:5,
>  from arch/x86/include/asm/page.h:76,
>  from arch/x86/include/asm/thread_info.h:12,
>  from include/linux/thread_info.h:38,
>  from arch/x86/include/asm/preempt.h:7,
>  from include/linux/preempt.h:78,
>  from include/linux/spinlock.h:51,
>  from include/linux/mmzone.h:8,
>  from include/linux/gfp.h:6,
>  from drivers/xen/unpopulated-alloc.c:3:
> drivers/xen/unpopulated-alloc.c:53:35: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>53 |   xen_pfn_t pfn = PFN_DOWN(pgmap->res.start);
>   |   ^~~
> include/linux/pfn.h:20:23: note: in definition of macro 'PFN_DOWN'
>20 | #define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
>   |   ^
> drivers/xen/unpopulated-alloc.c:58:30: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>58 | release_resource(>res);
>   |  ^~~
>   |  ref
> drivers/xen/unpopulated-alloc.c:69:28: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>69 |   release_resource(>res);
>   |^~~
>   |ref
> fs/fuse/virtio_fs.c: In function 'virtio_fs_setup_dax':
> fs/fuse/virtio_fs.c:838:9: error: 'struct dev_pagemap' has no member named 
> 'res'; did you mean 'ref'?
>   838 |  pgmap->res = (struct resource){
>   | ^~~
>   | ref
> 
> Caused by commit
> 
>   b3e022c5a68c ("mm/memremap_pages: convert to 'struct range'")
> 
> interacting with commit
> 
>   9e2369c06c8a ("xen: add helpers to allocate unpopulated memory")
> 
> from Linus' tree (in v5.9-rc4) and commit
> 
>   7e833303db20 ("virtiofs: set up virtio_fs dax_device")
> 
> from the fuse tree.
> 
> I have added the following patch which may require more work but at
> least makes it all build.
> 
> From: Stephen Rothwell 
> Date: Tue, 8 Sep 2020 20:00:20 +1000
> Subject: [PATCH] merge fix up for "mm/memremap_pages: convert to 'struct
>  range'"
> 
> Signed-off-by: Stephen Rothwell 

Thanks, LGTM.

> ---
>  drivers/xen/unpopulated-alloc.c | 15 +--
>  fs/fuse/virtio_fs.c |  3 +--
>  2 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/xen/unpopulated-alloc.c b/drivers/xen/unpopulated-alloc.c
> index 3b98dc921426..9fa7ce330628 100644
> --- a/drivers/xen/unpopulated-alloc.c
> +++ b/drivers/xen/unpopulated-alloc.c
> @@ -18,6 +18,7 @@ static unsigned int list_count;
>  static int fill_list(unsigned int nr_pages)
>  {
>   struct dev_pagemap *pgmap;
> + struct resource res;
>   void *vaddr;
>   unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
>   int ret;
> @@ -27,10 +28,10 @@ static int fill_list(unsigned int nr_pages)
>   return -ENOMEM;
>  
>   pgmap->type = MEMORY_DEVICE_GENERIC;
> - pgmap->res.name = "Xen scratch";
> - pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> + res.name = "Xen scratch";
> + res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;

You could init the fields at res definition time now, since it's no
longer dynamically allocated.

Roger.


Re: [PATCH v5 3/3] xen: add helpers to allocate unpopulated memory

2020-09-04 Thread Roger Pau Monné
On Fri, Sep 04, 2020 at 09:00:18AM +0200, Jürgen Groß wrote:
> On 03.09.20 18:38, Roger Pau Monné wrote:
> > On Thu, Sep 03, 2020 at 05:30:07PM +0200, Jürgen Groß wrote:
> > > On 01.09.20 10:33, Roger Pau Monne wrote:
> > > > To be used in order to create foreign mappings. This is based on the
> > > > ZONE_DEVICE facility which is used by persistent memory devices in
> > > > order to create struct pages and kernel virtual mappings for the IOMEM
> > > > areas of such devices. Note that on kernels without support for
> > > > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > > > create foreign mappings.
> > > > 
> > > > The newly added helpers use the same parameters as the existing
> > > > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > > > replacement of the callers. Once a memory region has been added to be
> > > > used as scratch mapping space it will no longer be released, and pages
> > > > returned are kept in a linked list. This allows to have a buffer of
> > > > pages and prevents resorting to frequent additions and removals of
> > > > regions.
> > > > 
> > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > functionality untangles Xen balloon and RAM hotplug from the usage of
> > > > unpopulated physical memory ranges to map foreign pages, which is the
> > > > correct thing to do in order to avoid mappings of foreign pages depend
> > > > on memory hotplug.
> > > > 
> > > > Note the driver is currently not enabled on Arm platforms because it
> > > > would interfere with the identity mapping required on some platforms.
> > > > 
> > > > Signed-off-by: Roger Pau Monné 
> > > 
> > > Sorry, I just got a build error for x86 32-bit build:
> > > 
> > > WARNING: unmet direct dependencies detected for ZONE_DEVICE
> > >Depends on [n]: MEMORY_HOTPLUG [=n] && MEMORY_HOTREMOVE [=n] &&
> > > SPARSEMEM_VMEMMAP [=n] && ARCH_HAS_PTE_DEVMAP [=n]
> > >Selected by [y]:
> > >- XEN_UNPOPULATED_ALLOC [=y] && XEN [=y] && X86 [=y]
> > >GEN Makefile
> > >CC  kernel/bounds.s
> > >CALL/home/gross/korg/src/scripts/atomic/check-atomics.sh
> > >UPD include/generated/bounds.h
> > >CC  arch/x86/kernel/asm-offsets.s
> > > In file included from /home/gross/korg/src/include/linux/mmzone.h:19:0,
> > >   from /home/gross/korg/src/include/linux/gfp.h:6,
> > >   from /home/gross/korg/src/include/linux/slab.h:15,
> > >   from /home/gross/korg/src/include/linux/crypto.h:19,
> > >   from 
> > > /home/gross/korg/src/arch/x86/kernel/asm-offsets.c:9:
> > > /home/gross/korg/src/include/linux/page-flags-layout.h:95:2: error: #error
> > > "Not enough bits in page flags"
> > >   #error "Not enough bits in page flags"
> > >^
> > > make[2]: *** [/home/gross/korg/src/scripts/Makefile.build:114:
> > > arch/x86/kernel/asm-offsets.s] Error 1
> > > make[1]: *** [/home/gross/korg/src/Makefile:1175: prepare0] Error 2
> > > make[1]: Leaving directory '/home/gross/korg/x8632'
> > > make: *** [Makefile:185: __sub-make] Error 2
> > 
> > Sorry for this. I've tested a 32bit build but I think it was before
> > the last Kconfig changes. I'm a little unsure how to solve this, as
> > ZONE_DEVICE doesn't select the required options for it to run, but
> > rather depends on them to be available.
> > 
> > You can trigger something similar on x86-64 by doing:
> > 
> > $ make ARCH=x86_64 xen.config
> > Using .config as base
> > Merging ./kernel/configs/xen.config
> > Merging ./arch/x86/configs/xen.config
> > #
> > # merged configuration written to .config (needs make)
> > #
> > scripts/kconfig/conf  --olddefconfig Kconfig
> > 
> > WARNING: unmet direct dependencies detected for ZONE_DEVICE
> >Depends on [n]: MEMORY_HOTPLUG [=y] && MEMORY_HOTREMOVE [=n] && 
> > SPARSEMEM_VMEMMAP [=y] && ARCH_HAS_PTE_DEVMAP [=y]
> >Selected by [y]:
> >- XEN_UNPOPULATED_ALLOC [=y] && XEN [=y] && X86_64 [=y]
> > #
> > # configuration written to .config
> > #
> > 
> > I think the only solution is to have XEN_UNPOPULATED_ALLOC depend on
> > ZONE_DEVICE rather than select it?
> 
> Yes, I think so.
> 
> I've folded that in and now build is fine.

Thanks, I assume no further action is needed on my side.

Roger.


Re: [PATCH v5 3/3] xen: add helpers to allocate unpopulated memory

2020-09-03 Thread Roger Pau Monné
On Thu, Sep 03, 2020 at 05:30:07PM +0200, Jürgen Groß wrote:
> On 01.09.20 10:33, Roger Pau Monne wrote:
> > To be used in order to create foreign mappings. This is based on the
> > ZONE_DEVICE facility which is used by persistent memory devices in
> > order to create struct pages and kernel virtual mappings for the IOMEM
> > areas of such devices. Note that on kernels without support for
> > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > create foreign mappings.
> > 
> > The newly added helpers use the same parameters as the existing
> > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > replacement of the callers. Once a memory region has been added to be
> > used as scratch mapping space it will no longer be released, and pages
> > returned are kept in a linked list. This allows to have a buffer of
> > pages and prevents resorting to frequent additions and removals of
> > regions.
> > 
> > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > functionality untangles Xen balloon and RAM hotplug from the usage of
> > unpopulated physical memory ranges to map foreign pages, which is the
> > correct thing to do in order to avoid mappings of foreign pages depend
> > on memory hotplug.
> > 
> > Note the driver is currently not enabled on Arm platforms because it
> > would interfere with the identity mapping required on some platforms.
> > 
> > Signed-off-by: Roger Pau Monné 
> 
> Sorry, I just got a build error for x86 32-bit build:
> 
> WARNING: unmet direct dependencies detected for ZONE_DEVICE
>   Depends on [n]: MEMORY_HOTPLUG [=n] && MEMORY_HOTREMOVE [=n] &&
> SPARSEMEM_VMEMMAP [=n] && ARCH_HAS_PTE_DEVMAP [=n]
>   Selected by [y]:
>   - XEN_UNPOPULATED_ALLOC [=y] && XEN [=y] && X86 [=y]
>   GEN Makefile
>   CC  kernel/bounds.s
>   CALL/home/gross/korg/src/scripts/atomic/check-atomics.sh
>   UPD include/generated/bounds.h
>   CC  arch/x86/kernel/asm-offsets.s
> In file included from /home/gross/korg/src/include/linux/mmzone.h:19:0,
>  from /home/gross/korg/src/include/linux/gfp.h:6,
>  from /home/gross/korg/src/include/linux/slab.h:15,
>  from /home/gross/korg/src/include/linux/crypto.h:19,
>  from /home/gross/korg/src/arch/x86/kernel/asm-offsets.c:9:
> /home/gross/korg/src/include/linux/page-flags-layout.h:95:2: error: #error
> "Not enough bits in page flags"
>  #error "Not enough bits in page flags"
>   ^
> make[2]: *** [/home/gross/korg/src/scripts/Makefile.build:114:
> arch/x86/kernel/asm-offsets.s] Error 1
> make[1]: *** [/home/gross/korg/src/Makefile:1175: prepare0] Error 2
> make[1]: Leaving directory '/home/gross/korg/x8632'
> make: *** [Makefile:185: __sub-make] Error 2

Sorry for this. I've tested a 32bit build but I think it was before
the last Kconfig changes. I'm a little unsure how to solve this, as
ZONE_DEVICE doesn't select the required options for it to run, but
rather depends on them to be available.

You can trigger something similar on x86-64 by doing:

$ make ARCH=x86_64 xen.config
Using .config as base
Merging ./kernel/configs/xen.config
Merging ./arch/x86/configs/xen.config
#
# merged configuration written to .config (needs make)
#
scripts/kconfig/conf  --olddefconfig Kconfig

WARNING: unmet direct dependencies detected for ZONE_DEVICE
  Depends on [n]: MEMORY_HOTPLUG [=y] && MEMORY_HOTREMOVE [=n] && 
SPARSEMEM_VMEMMAP [=y] && ARCH_HAS_PTE_DEVMAP [=y]
  Selected by [y]:
  - XEN_UNPOPULATED_ALLOC [=y] && XEN [=y] && X86_64 [=y]
#
# configuration written to .config
#

I think the only solution is to have XEN_UNPOPULATED_ALLOC depend on
ZONE_DEVICE rather than select it?

Thanks, Roger.


Re: [PATCH v5 3/3] xen: add helpers to allocate unpopulated memory

2020-09-01 Thread Roger Pau Monné
On Tue, Sep 01, 2020 at 10:33:26AM +0200, Roger Pau Monne wrote:
> +static int fill_list(unsigned int nr_pages)
> +{
> + struct dev_pagemap *pgmap;
> + void *vaddr;
> + unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
> + int nid, ret;
> +
> + pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
> + if (!pgmap)
> + return -ENOMEM;
> +
> + pgmap->type = MEMORY_DEVICE_GENERIC;
> + pgmap->res.name = "Xen scratch";
> + pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> +
> + ret = allocate_resource(_resource, >res,
> + alloc_pages * PAGE_SIZE, 0, -1,
> + PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL);
> + if (ret < 0) {
> + pr_err("Cannot allocate new IOMEM resource\n");
> + kfree(pgmap);
> + return ret;
> + }
> +
> + nid = memory_add_physaddr_to_nid(pgmap->res.start);

I think this is not needed ...

> +
> +#ifdef CONFIG_XEN_HAVE_PVMMU
> +/*
> + * memremap will build page tables for the new memory so
> + * the p2m must contain invalid entries so the correct
> + * non-present PTEs will be written.
> + *
> + * If a failure occurs, the original (identity) p2m entries
> + * are not restored since this region is now known not to
> + * conflict with any devices.
> + */
> + if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> + xen_pfn_t pfn = PFN_DOWN(pgmap->res.start);
> +
> + for (i = 0; i < alloc_pages; i++) {
> + if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) {
> + pr_warn("set_phys_to_machine() failed, no 
> memory added\n");
> + release_resource(>res);
> + kfree(pgmap);
> + return -ENOMEM;
> + }
> +}
> + }
> +#endif
> +
> + vaddr = memremap_pages(pgmap, nid);

... and NUMA_NO_NODE should be used here instead, as this memory is just
fictitious space to map foreign memory, and shouldn't be related to
any NUMA node.

The following chunk should be folded in, or I can resend.

Thanks, Roger.
---8<---
diff --git a/drivers/xen/unpopulated-alloc.c b/drivers/xen/unpopulated-alloc.c
index 1b5d157c6977..3b98dc921426 100644
--- a/drivers/xen/unpopulated-alloc.c
+++ b/drivers/xen/unpopulated-alloc.c
@@ -20,7 +20,7 @@ static int fill_list(unsigned int nr_pages)
struct dev_pagemap *pgmap;
void *vaddr;
unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
-   int nid, ret;
+   int ret;
 
pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
if (!pgmap)
@@ -39,8 +39,6 @@ static int fill_list(unsigned int nr_pages)
return ret;
}
 
-   nid = memory_add_physaddr_to_nid(pgmap->res.start);
-
 #ifdef CONFIG_XEN_HAVE_PVMMU
 /*
  * memremap will build page tables for the new memory so
@@ -65,7 +63,7 @@ static int fill_list(unsigned int nr_pages)
}
 #endif
 
-   vaddr = memremap_pages(pgmap, nid);
+   vaddr = memremap_pages(pgmap, NUMA_NO_NODE);
if (IS_ERR(vaddr)) {
pr_err("Cannot remap memory range\n");
release_resource(>res);



Re: [PATCH v4 1/2] memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC

2020-08-31 Thread Roger Pau Monné
On Thu, Aug 20, 2020 at 01:37:41PM +0200, Roger Pau Monné wrote:
> On Tue, Aug 11, 2020 at 11:07:36PM +0200, David Hildenbrand wrote:
> > On 11.08.20 11:44, Roger Pau Monne wrote:
> > > This is in preparation for the logic behind MEMORY_DEVICE_DEVDAX also
> > > being used by non DAX devices.
> > > 
> > > No functional change intended.
> > > 
> > > Signed-off-by: Roger Pau Monné 
> > > ---
> > > Cc: Dan Williams 
> > > Cc: Vishal Verma 
> > > Cc: Dave Jiang 
> > > Cc: Andrew Morton 
> > > Cc: Jason Gunthorpe 
> > > Cc: Ira Weiny 
> > > Cc: "Aneesh Kumar K.V" 
> > > Cc: Johannes Thumshirn 
> > > Cc: Logan Gunthorpe 
> > > Cc: linux-nvd...@lists.01.org
> > > Cc: xen-de...@lists.xenproject.org
> > > Cc: linux...@kvack.org
> > > ---
> > >  drivers/dax/device.c | 2 +-
> > >  include/linux/memremap.h | 9 -
> > >  mm/memremap.c| 2 +-
> > >  3 files changed, 6 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > > index 4c0af2eb7e19..1e89513f3c59 100644
> > > --- a/drivers/dax/device.c
> > > +++ b/drivers/dax/device.c
> > > @@ -429,7 +429,7 @@ int dev_dax_probe(struct device *dev)
> > >   return -EBUSY;
> > >   }
> > >  
> > > - dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
> > > + dev_dax->pgmap.type = MEMORY_DEVICE_GENERIC;
> > >   addr = devm_memremap_pages(dev, _dax->pgmap);
> > >   if (IS_ERR(addr))
> > >   return PTR_ERR(addr);
> > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > index 5f5b2df06e61..e5862746751b 100644
> > > --- a/include/linux/memremap.h
> > > +++ b/include/linux/memremap.h
> > > @@ -46,11 +46,10 @@ struct vmem_altmap {
> > >   * wakeup is used to coordinate physical address space management (ex:
> > >   * fs truncate/hole punch) vs pinned pages (ex: device dma).
> > >   *
> > > - * MEMORY_DEVICE_DEVDAX:
> > > + * MEMORY_DEVICE_GENERIC:
> > >   * Host memory that has similar access semantics as System RAM i.e. DMA
> > > - * coherent and supports page pinning. In contrast to
> > > - * MEMORY_DEVICE_FS_DAX, this memory is access via a device-dax
> > > - * character device.
> > > + * coherent and supports page pinning. This is for example used by DAX 
> > > devices
> > > + * that expose memory using a character device.
> > >   *
> > >   * MEMORY_DEVICE_PCI_P2PDMA:
> > >   * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
> > > @@ -60,7 +59,7 @@ enum memory_type {
> > >   /* 0 is reserved to catch uninitialized type fields */
> > >   MEMORY_DEVICE_PRIVATE = 1,
> > >   MEMORY_DEVICE_FS_DAX,
> > > - MEMORY_DEVICE_DEVDAX,
> > > + MEMORY_DEVICE_GENERIC,
> > >   MEMORY_DEVICE_PCI_P2PDMA,
> > >  };
> > >  
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 03e38b7a38f1..006dace60b1a 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -216,7 +216,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int 
> > > nid)
> > >   return ERR_PTR(-EINVAL);
> > >   }
> > >   break;
> > > - case MEMORY_DEVICE_DEVDAX:
> > > + case MEMORY_DEVICE_GENERIC:
> > >   need_devmap_managed = false;
> > >   break;
> > >   case MEMORY_DEVICE_PCI_P2PDMA:
> > > 
> > 
> > No strong opinion (@Dan?), I do wonder if a separate type would make sense.
> 
> Gentle ping.

Sorry to ping again, but I would rather get this out of my queue if
possible, seeing as the other patch is OK to go in but depends on this
one going in first.

Thanks, Roger.


Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-20 Thread Roger Pau Monné
On Wed, Aug 19, 2020 at 01:33:39PM +0200, Norbert Kaminski wrote:
> 
> On 19.08.2020 10:19, Roger Pau Monné wrote:
> > On Tue, Aug 18, 2020 at 08:40:18PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, Aug 18, 2020 at 07:21:14PM +0200, Roger Pau Monné wrote:
> > > > > Let me draw the picture from the beginning.
> > > > Thanks, greatly appreciated.
> > > > 
> > > > > EFI memory map contains various memory regions. Some of them are 
> > > > > marked
> > > > > as not needed after ExitBootServices() call (done in Xen before
> > > > > launching dom0). This includes EFI_BOOT_SERVICES_DATA and
> > > > > EFI_BOOT_SERVICES_CODE.
> > > > > 
> > > > > EFI SystemTable contains pointers to various ConfigurationTables -
> > > > > physical addresses (at least in this case). Xen does interpret some of
> > > > > them, but not ESRT. Xen pass the whole (address of) SystemTable to 
> > > > > Linux
> > > > > dom0 (at least in PV case). Xen doesn't do anything about tables it
> > > > > doesn't understand.
> > > > > 
> > > > > Now, the code in Linux takes the (ESRT) table address early and checks
> > > > > the memory map for it. We have 3 cases:
> > > > >   - it points at area marked as neither EFI_*_SERVICES_DATA, nor with
> > > > > EFI_MEMORY_RUNTIME attribute -> Linux refuse to use it
> > > > >   - it points to EFI_RUNTIME_SERVICES_DATA or with EFI_MEMORY_RUNTIME
> > > > > attribute - Linux uses the table; memory map already says the area
> > > > > belongs to EFI and the OS should not use it for something else
> > > > >   - it points to EFI_BOOT_SERVICES_DATA - Linux mark the area as 
> > > > > reserved
> > > > > to not release it after calling ExitBootServices()
> > > > > 
> > > > > The problematic is the third case - at the time when Linux dom0 is 
> > > > > run,
> > > > > ExitBootServices() was already called and EFI_BOOT_SERVICES_* memory 
> > > > > was
> > > > > already released. It could be already used for something else (for
> > > > > example Xen could overwrite it while loading dom0).
> > > > > 
> > > > > Note the problematic case should be the most common - UEFI 
> > > > > specification
> > > > > says "The ESRT shall be stored in memory of type EfiBootServicesData"
> > > > > (chapter 22.3 of UEFI Spec v2.6).
> > > > > 
> > > > > For this reason, to use ESRT in dom0, Xen should do something about it
> > > > > before ExitBootServices() call. While analyzing all the EFI tables is
> > > > > probably not a viable option, it can do some simple action:
> > > > >   - retains all the EFI_BOOT_SERVICES_* areas - there is already code
> > > > > for that, controlled with /mapbs boot switch (to xen.efi, would 
> > > > > need
> > > > > another option for multiboot2+efi)
> > > > >   - have a list of tables to retain - since Xen already do analyze 
> > > > > some
> > > > > of the ConfigurationTables, it can also have a list of those to
> > > > > preserve even if they live in EFI_BOOT_SERVICES_DATA. In this 
> > > > > case,
> > > > > while Xen doesn't need to parse the whole table, it need to parse 
> > > > > it's
> > > > > header to get the table size - to reserve that memory and not 
> > > > > reuse
> > > > > it after ExitBootServices().
> > > > Xen seems to already contain skeleton
> > > > XEN_EFI_query_capsule_capabilities and XEN_EFI_update_capsule
> > > > hypercalls which is what should be used in order to perform the
> > > > updates?
> > > I think those covers only runtime service calls similarly named. But you
> > > need also ESRT table to collect info about devices that you can even
> > > attempt to update.
> > Right, the ESRT must be available so that dom0 can discover the
> > resources.
> > 
> > > TBH, I'm not sure if those runtime services are really needed. I think
> > > Norbert succeeded UEFI update from within Xen PV dom0 with just access
> > > to the ESRT table, but without those services.
> > > 
> Marek is right here. I was able to successfully update and downgrade
> UFEI when 

Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-20 Thread Roger Pau Monné
On Thu, Aug 20, 2020 at 11:34:54AM +0200, Marek Marczykowski-Górecki wrote:
> On Thu, Aug 20, 2020 at 11:30:25AM +0200, Roger Pau Monné wrote:
> > Right, so you only need access to the ESRT table, that's all. Then I
> > think we need to make sure Xen doesn't use this memory for anything
> > else, which will require some changes in Xen (or at least some
> > checks?).
> > 
> > We also need to decide what to do if the table turns out to be placed
> > in a wrong region. How are we going to prevent dom0 from using it
> > then? My preference would be to completely hide it from dom0 in that
> > case, such that it believes there's no ESRT at all if possible.
> 
> Yes, that makes sense. As discussed earlier, that probably means
> re-constructing SystemTable before giving it to dom0. We'd need to do
> that in PVH case anyway, to adjust addresses, right?

Not really, on PVH dom0 we should be able to identity map the required
EFI regions in the dom0 p2m, so the only difference between a classic
PV dom0 is that we need to assure that those regions are correctly
identity mapped in the p2m, but that shouldn't require any change to
the SystemTable unless we need to craft custom tables (see below).

> Is there something
> like this in the Xen codebase already, or it needs to be written from
> scratch?

AFAICT it needs to be written for EFI. For the purposes here I think
you could copy the SystemTable and modify the NumberOfTableEntries and
ConfigurationTable fields in the copy in order to delete the ESRT if
found to be placed in a non suitable region?

At that point we can remove the checks from Linux since Xen will
assert that whatever gets passed to dom0 is in a suitable region. It
would be nice to have a way to signal that the placement of the ESRT
has been checked, but I'm not sure how to do this, do you have any
ideas?

Roger.


Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-19 Thread Roger Pau Monné
On Tue, Aug 18, 2020 at 08:40:18PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Aug 18, 2020 at 07:21:14PM +0200, Roger Pau Monné wrote:
> > > Let me draw the picture from the beginning.
> > 
> > Thanks, greatly appreciated.
> > 
> > > EFI memory map contains various memory regions. Some of them are marked
> > > as not needed after ExitBootServices() call (done in Xen before
> > > launching dom0). This includes EFI_BOOT_SERVICES_DATA and
> > > EFI_BOOT_SERVICES_CODE.
> > > 
> > > EFI SystemTable contains pointers to various ConfigurationTables -
> > > physical addresses (at least in this case). Xen does interpret some of
> > > them, but not ESRT. Xen pass the whole (address of) SystemTable to Linux
> > > dom0 (at least in PV case). Xen doesn't do anything about tables it
> > > doesn't understand.
> > > 
> > > Now, the code in Linux takes the (ESRT) table address early and checks
> > > the memory map for it. We have 3 cases:
> > >  - it points at area marked as neither EFI_*_SERVICES_DATA, nor with
> > >EFI_MEMORY_RUNTIME attribute -> Linux refuse to use it
> > >  - it points to EFI_RUNTIME_SERVICES_DATA or with EFI_MEMORY_RUNTIME
> > >attribute - Linux uses the table; memory map already says the area
> > >belongs to EFI and the OS should not use it for something else
> > >  - it points to EFI_BOOT_SERVICES_DATA - Linux mark the area as reserved
> > >to not release it after calling ExitBootServices()
> > > 
> > > The problematic is the third case - at the time when Linux dom0 is run,
> > > ExitBootServices() was already called and EFI_BOOT_SERVICES_* memory was
> > > already released. It could be already used for something else (for
> > > example Xen could overwrite it while loading dom0).
> > > 
> > > Note the problematic case should be the most common - UEFI specification
> > > says "The ESRT shall be stored in memory of type EfiBootServicesData"
> > > (chapter 22.3 of UEFI Spec v2.6).
> > > 
> > > For this reason, to use ESRT in dom0, Xen should do something about it
> > > before ExitBootServices() call. While analyzing all the EFI tables is
> > > probably not a viable option, it can do some simple action:
> > >  - retains all the EFI_BOOT_SERVICES_* areas - there is already code
> > >for that, controlled with /mapbs boot switch (to xen.efi, would need
> > >another option for multiboot2+efi)
> > >  - have a list of tables to retain - since Xen already do analyze some
> > >of the ConfigurationTables, it can also have a list of those to
> > >preserve even if they live in EFI_BOOT_SERVICES_DATA. In this case,
> > >while Xen doesn't need to parse the whole table, it need to parse it's
> > >header to get the table size - to reserve that memory and not reuse
> > >it after ExitBootServices().
> > 
> > Xen seems to already contain skeleton
> > XEN_EFI_query_capsule_capabilities and XEN_EFI_update_capsule
> > hypercalls which is what should be used in order to perform the
> > updates?
> 
> I think those covers only runtime service calls similarly named. But you
> need also ESRT table to collect info about devices that you can even
> attempt to update.

Right, the ESRT must be available so that dom0 can discover the
resources.

> TBH, I'm not sure if those runtime services are really needed. I think
> Norbert succeeded UEFI update from within Xen PV dom0 with just access
> to the ESRT table, but without those services.

OK, by reading the UEFI spec I assumed that you needed access to
QueryCapsuleCapabilities and UpdateCapsule in order to perform the
updates, and those should be proxied using hyopercalls. Maybe this is
not mandatory and there's a side-band mechanism of doing this?

I think we need more info here.

> > So yes, I agree Xen should make sure the region of the table is not
> > freed when exiting boot services, and that dom0 can access it. I
> > guess we should move the checks done by Linux to Xen, and then only
> > provide the ESRT table to dom0 if the checks (now done by Xen) pass.
> 
> Yes, something like this. But note currently in the (PV) dom0 case, Xen
> provides dom0 with a pointer to the whole SystemTable, not individual
> ConfigurationTables. Making it filter what dom0 gets would require Xen
> to re-construct the whole thing with just those elements that are
> desired. Not exactly sure if worth the effort given the privilege dom0
> has.

We already do this for ACPI in PVH dom0, where Xen rebuilds the RSDT
in order to filter out 

Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-18 Thread Roger Pau Monné
On Tue, Aug 18, 2020 at 05:00:20PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Aug 18, 2020 at 02:47:10PM +0200, Roger Pau Monné wrote:
> > On Tue, Aug 18, 2020 at 02:01:35PM +0200, Marek Marczykowski-Górecki wrote:
> > > Do you mean PV dom0 should receive full EFI memory map? Jan already
> > > objected this as it would be a layering violation.
> > 
> > dom0 is already capable of getting the native e820 memory map using
> > XENMEM_machine_memory_map, I'm not sure why allowing to return the
> > memory map in EFI form would be any different (or a layering
> > violation in the PV dom0 case).
> >
> > Do you have a reference to that thread? I certainly missed it.
> 
> See this thread: http://markmail.org/message/nrrvuau5whebksy2
> 
> > For PVH dom0 we could consider something different, since in that case
> > there's a guest memory map which could likely be returned in EFI
> > format and with the EFI regions if required.
> > 
> > > > > Skip this part on Xen PV (let Xen do the right thing if it deems
> > > > > necessary) and use ESRT table normally.
> > > > 
> > > > Maybe it would be better to introduce a new hypercall (or add a
> > > > parameter to XENMEM_machine_memory_map) in order to be able to fetch
> > > > the EFI memory map?
> > > >
> > > > That should allow a PV dom0 to check the ESRT is correct and thus not
> > > > diverge from bate metal.
> > > 
> > > Note the EFI memory map there is used not just to check things, but to
> > > actually modify it to reserve the region. I think that's rather Xen
> > > responsibility, not dom0. See the comment from Ard.
> > 
> > But that would modify Linux copy of the memory map, which is fine? My
> > understanding of EFI is limited, but I don't think such changes are
> > feed back into EFI, so Linux is completely free to do whatever it
> > wants with it's copy of the EFI memory map.
> 
> Yes, but the thing is to make sure Xen doesn't use that memory, not only
> dom0. See below.
> 
> > > > > + if (efi_enabled(EFI_MEMMAP)) {
> > > > > + rc = efi_mem_desc_lookup(efi.esrt, );
> > > > > + if (rc < 0 ||
> > > > > + (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> > > > > +  md.type != EFI_BOOT_SERVICES_DATA &&
> > > > > +  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> > > > > + pr_warn("ESRT header is not in the memory 
> > > > > map.\n");
> > > > > + return;
> > > > > + }
> > > > 
> > > > Here you blindly trust the data in the ESRT in the PV case, without
> > > > checking it matches the regions on the memory map, which could lead to
> > > > errors if ESRT turns to be wrong.
> > > 
> > > I don't think checking merely if ESRT lives somewhere in
> > > EFI_{BOOT,RUNTIME}_SERVICES_DATA area guarantees its correctness.
> > > 
> > > On the other hand, this seems to be done to prevent overwriting that
> > > memory with something else (see that in case of EFI_BOOT_SERVICES_DATA
> > > it is later marked as reserved. I think it should be rather done by Xen,
> > > not dom0.
> > 
> > Forcing Xen to do all those checks seems quite a tedious work, and in
> > fact the memory map that dom0 has might be more complete than the one
> > Xen is able to construct, as Xen doesn't have an AML parser so it's
> > not able to get all the possible info from ACPI.
> 
> Let me draw the picture from the beginning.

Thanks, greatly appreciated.

> EFI memory map contains various memory regions. Some of them are marked
> as not needed after ExitBootServices() call (done in Xen before
> launching dom0). This includes EFI_BOOT_SERVICES_DATA and
> EFI_BOOT_SERVICES_CODE.
> 
> EFI SystemTable contains pointers to various ConfigurationTables -
> physical addresses (at least in this case). Xen does interpret some of
> them, but not ESRT. Xen pass the whole (address of) SystemTable to Linux
> dom0 (at least in PV case). Xen doesn't do anything about tables it
> doesn't understand.
> 
> Now, the code in Linux takes the (ESRT) table address early and checks
> the memory map for it. We have 3 cases:
>  - it points at area marked as neither EFI_*_SERVICES_DATA, nor with
>EFI_MEMORY_RUNTIME attribute -> Linux refuse to use it
>  - it points to EFI_RUNTIME_SERVICES_DATA or with EFI_MEMORY_RUNTIME
>attribute - Li

Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-18 Thread Roger Pau Monné
On Tue, Aug 18, 2020 at 02:01:35PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Aug 17, 2020 at 11:00:13AM +0200, Roger Pau Monné wrote:
> > On Sun, Aug 16, 2020 at 02:19:49AM +0200, Marek Marczykowski-Górecki wrote:
> > > In case of Xen PV dom0, Xen passes along info about system tables (see
> > > arch/x86/xen/efi.c), but not the memory map from EFI.
> > 
> > I think that's because the memory map returned by
> > XENMEM_machine_memory_map is in e820 form, and doesn't contain the
> > required information about the EFI regions due to the translation done
> > by efi_arch_process_memory_map in Xen?
> 
> Yes, I think so.
> 
> > > This makes sense
> > > as it is Xen responsible for managing physical memory address space.
> > > In this case, it doesn't make sense to condition using ESRT table on
> > > availability of EFI memory map, as it isn't Linux kernel responsible for
> > > it.
> > 
> > PV dom0 is kind of special in that regard as it can create mappings to
> > (almost) any MMIO regions, and hence can change it's memory map
> > substantially.
> 
> Do you mean PV dom0 should receive full EFI memory map? Jan already
> objected this as it would be a layering violation.

dom0 is already capable of getting the native e820 memory map using
XENMEM_machine_memory_map, I'm not sure why allowing to return the
memory map in EFI form would be any different (or a layering
violation in the PV dom0 case).

Do you have a reference to that thread? I certainly missed it.

For PVH dom0 we could consider something different, since in that case
there's a guest memory map which could likely be returned in EFI
format and with the EFI regions if required.

> > > Skip this part on Xen PV (let Xen do the right thing if it deems
> > > necessary) and use ESRT table normally.
> > 
> > Maybe it would be better to introduce a new hypercall (or add a
> > parameter to XENMEM_machine_memory_map) in order to be able to fetch
> > the EFI memory map?
> >
> > That should allow a PV dom0 to check the ESRT is correct and thus not
> > diverge from bate metal.
> 
> Note the EFI memory map there is used not just to check things, but to
> actually modify it to reserve the region. I think that's rather Xen
> responsibility, not dom0. See the comment from Ard.

But that would modify Linux copy of the memory map, which is fine? My
understanding of EFI is limited, but I don't think such changes are
feed back into EFI, so Linux is completely free to do whatever it
wants with it's copy of the EFI memory map.

> > > 
> > > This is a requirement for using fwupd in PV dom0 to update UEFI using
> > > capsules.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki 
> > > 
> > > ---
> > >  drivers/firmware/efi/esrt.c | 47 -
> > >  1 file changed, 25 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c
> > > index d5915272141f..5c49f2aaa4b1 100644
> > > --- a/drivers/firmware/efi/esrt.c
> > > +++ b/drivers/firmware/efi/esrt.c
> > > @@ -245,36 +245,38 @@ void __init efi_esrt_init(void)
> > >   int rc;
> > >   phys_addr_t end;
> > >  
> > > - if (!efi_enabled(EFI_MEMMAP))
> > > + if (!efi_enabled(EFI_MEMMAP) && !efi_enabled(EFI_PARAVIRT))
> > >   return;
> > >  
> > >   pr_debug("esrt-init: loading.\n");
> > >   if (!esrt_table_exists())
> > >   return;
> > >  
> > > - rc = efi_mem_desc_lookup(efi.esrt, );
> > > - if (rc < 0 ||
> > > - (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> > > -  md.type != EFI_BOOT_SERVICES_DATA &&
> > > -  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> > > - pr_warn("ESRT header is not in the memory map.\n");
> > > - return;
> > > - }
> > > + if (efi_enabled(EFI_MEMMAP)) {
> > > + rc = efi_mem_desc_lookup(efi.esrt, );
> > > + if (rc < 0 ||
> > > + (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> > > +  md.type != EFI_BOOT_SERVICES_DATA &&
> > > +  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> > > + pr_warn("ESRT header is not in the memory map.\n");
> > > + return;
> > > + }
> > 
> > Here you blindly trust the data in the ESRT in the PV case, without
> > checking it matches the regions on the memory map, which could lead to
> > errors if ESRT turns to be wrong.
> 
> I don't think checking merely if ESRT lives somewhere in
> EFI_{BOOT,RUNTIME}_SERVICES_DATA area guarantees its correctness.
> 
> On the other hand, this seems to be done to prevent overwriting that
> memory with something else (see that in case of EFI_BOOT_SERVICES_DATA
> it is later marked as reserved. I think it should be rather done by Xen,
> not dom0.

Forcing Xen to do all those checks seems quite a tedious work, and in
fact the memory map that dom0 has might be more complete than the one
Xen is able to construct, as Xen doesn't have an AML parser so it's
not able to get all the possible info from ACPI.

Thanks, Roger.


Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-17 Thread Roger Pau Monné
On Sun, Aug 16, 2020 at 02:19:49AM +0200, Marek Marczykowski-Górecki wrote:
> In case of Xen PV dom0, Xen passes along info about system tables (see
> arch/x86/xen/efi.c), but not the memory map from EFI.

I think that's because the memory map returned by
XENMEM_machine_memory_map is in e820 form, and doesn't contain the
required information about the EFI regions due to the translation done
by efi_arch_process_memory_map in Xen?

> This makes sense
> as it is Xen responsible for managing physical memory address space.
> In this case, it doesn't make sense to condition using ESRT table on
> availability of EFI memory map, as it isn't Linux kernel responsible for
> it.

PV dom0 is kind of special in that regard as it can create mappings to
(almost) any MMIO regions, and hence can change it's memory map
substantially.

> Skip this part on Xen PV (let Xen do the right thing if it deems
> necessary) and use ESRT table normally.

Maybe it would be better to introduce a new hypercall (or add a
parameter to XENMEM_machine_memory_map) in order to be able to fetch
the EFI memory map?

That should allow a PV dom0 to check the ESRT is correct and thus not
diverge from bate metal.

> 
> This is a requirement for using fwupd in PV dom0 to update UEFI using
> capsules.
> 
> Signed-off-by: Marek Marczykowski-Górecki 
> ---
>  drivers/firmware/efi/esrt.c | 47 -
>  1 file changed, 25 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c
> index d5915272141f..5c49f2aaa4b1 100644
> --- a/drivers/firmware/efi/esrt.c
> +++ b/drivers/firmware/efi/esrt.c
> @@ -245,36 +245,38 @@ void __init efi_esrt_init(void)
>   int rc;
>   phys_addr_t end;
>  
> - if (!efi_enabled(EFI_MEMMAP))
> + if (!efi_enabled(EFI_MEMMAP) && !efi_enabled(EFI_PARAVIRT))
>   return;
>  
>   pr_debug("esrt-init: loading.\n");
>   if (!esrt_table_exists())
>   return;
>  
> - rc = efi_mem_desc_lookup(efi.esrt, );
> - if (rc < 0 ||
> - (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> -  md.type != EFI_BOOT_SERVICES_DATA &&
> -  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> - pr_warn("ESRT header is not in the memory map.\n");
> - return;
> - }
> + if (efi_enabled(EFI_MEMMAP)) {
> + rc = efi_mem_desc_lookup(efi.esrt, );
> + if (rc < 0 ||
> + (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> +  md.type != EFI_BOOT_SERVICES_DATA &&
> +  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> + pr_warn("ESRT header is not in the memory map.\n");
> + return;
> + }

Here you blindly trust the data in the ESRT in the PV case, without
checking it matches the regions on the memory map, which could lead to
errors if ESRT turns to be wrong.

Thanks, Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-14 Thread Roger Pau Monné
On Fri, Aug 14, 2020 at 02:54:38PM +0200, Jürgen Groß wrote:
> On 14.08.20 14:47, Roger Pau Monné wrote:
> > On Fri, Aug 14, 2020 at 12:27:32PM +0200, Jürgen Groß wrote:
> > > On 14.08.20 11:56, Roger Pau Monné wrote:
> > > > On Fri, Aug 14, 2020 at 08:29:20AM +0100, Christoph Hellwig wrote:
> > > > > On Thu, Aug 13, 2020 at 09:54:20AM +0200, Roger Pau Monn?? wrote:
> > > > > > On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> > > > > > > On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > > > > > > > If enabled (because ZONE_DEVICE is supported) the usage of the 
> > > > > > > > new
> > > > > > > > functionality untangles Xen balloon and RAM hotplug from the 
> > > > > > > > usage of
> > > > > > > > unpopulated physical memory ranges to map foreign pages, which 
> > > > > > > > is the
> > > > > > > > correct thing to do in order to avoid mappings of foreign pages 
> > > > > > > > depend
> > > > > > > > on memory hotplug.
> > > > > > > 
> > > > > > > So please just select ZONE_DEVICE if this is so much better rather
> > > > > > > than maintaining two variants.
> > > > > > 
> > > > > > We still need to other variant for Arm at least, so both need to be
> > > > > > maintained anyway, even if we force ZONE_DEVICE on x86.
> > > > > 
> > > > > Well, it still really helps reproducability if you stick to one
> > > > > implementation of x86.
> > > > > 
> > > > > The alternative would be an explicit config option to opt into it,
> > > > > but just getting a different implementation based on a random
> > > > > kernel option is strange.
> > > > 
> > > > Would adding something like the chunk below to the patch be OK?
> > > > 
> > > > ---8<---
> > > > diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> > > > index 018020b91baa..5f321a1319e6 100644
> > > > --- a/drivers/xen/Kconfig
> > > > +++ b/drivers/xen/Kconfig
> > > > @@ -328,7 +328,14 @@ config XEN_FRONT_PGDIR_SHBUF
> > > > tristate
> > > >config XEN_UNPOPULATED_ALLOC
> > > > -   bool
> > > > -   default y if ZONE_DEVICE && !ARM && !ARM64
> > > > +   bool "Use unpopulated memory ranges for guest mappings"
> > > > +   depends on X86
> > > > +   select ZONE_DEVICE
> > > > +   default y
> > > 
> > > I'd rather use "default XEN_BACKEND" here, as mappings of other guest's
> > > memory is rarely used for non-backend guests.
> > 
> > There's also the privcmd and gnt devices which make heavy use of this,
> > so I'm not sure only selecting by default on XEN_BACKEND is the best
> > option.
> 
> I just want to avoid that kernels built for running as Xen guest, but
> not as dom0, will be forced to select ZONE_DEVICE.
> 
> As privcmd is dom0-only, this is no problem.

Oh, didn't know that, I somehow assumed privcmd would be available to
all Xen guests regardless of whether dom0 support was selected.

> In case you are worrying about gnt devices, I'd be fine to switch to
> 
> default XEN_BACKEND || XEN_GNTDEV

Sure. maybe even:

default XEN_BACKEND || XEN_GNTDEV || XEN_DOM0

Do you want me to resend with this changed or are you happy to fixup
if there are no further comments?

Thanks, Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-14 Thread Roger Pau Monné
On Fri, Aug 14, 2020 at 12:27:32PM +0200, Jürgen Groß wrote:
> On 14.08.20 11:56, Roger Pau Monné wrote:
> > On Fri, Aug 14, 2020 at 08:29:20AM +0100, Christoph Hellwig wrote:
> > > On Thu, Aug 13, 2020 at 09:54:20AM +0200, Roger Pau Monn?? wrote:
> > > > On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> > > > > On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > > > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > > > functionality untangles Xen balloon and RAM hotplug from the usage 
> > > > > > of
> > > > > > unpopulated physical memory ranges to map foreign pages, which is 
> > > > > > the
> > > > > > correct thing to do in order to avoid mappings of foreign pages 
> > > > > > depend
> > > > > > on memory hotplug.
> > > > > 
> > > > > So please just select ZONE_DEVICE if this is so much better rather
> > > > > than maintaining two variants.
> > > > 
> > > > We still need to other variant for Arm at least, so both need to be
> > > > maintained anyway, even if we force ZONE_DEVICE on x86.
> > > 
> > > Well, it still really helps reproducability if you stick to one
> > > implementation of x86.
> > > 
> > > The alternative would be an explicit config option to opt into it,
> > > but just getting a different implementation based on a random
> > > kernel option is strange.
> > 
> > Would adding something like the chunk below to the patch be OK?
> > 
> > ---8<---
> > diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> > index 018020b91baa..5f321a1319e6 100644
> > --- a/drivers/xen/Kconfig
> > +++ b/drivers/xen/Kconfig
> > @@ -328,7 +328,14 @@ config XEN_FRONT_PGDIR_SHBUF
> > tristate
> >   config XEN_UNPOPULATED_ALLOC
> > -   bool
> > -   default y if ZONE_DEVICE && !ARM && !ARM64
> > +   bool "Use unpopulated memory ranges for guest mappings"
> > +   depends on X86
> > +   select ZONE_DEVICE
> > +   default y
> 
> I'd rather use "default XEN_BACKEND" here, as mappings of other guest's
> memory is rarely used for non-backend guests.

There's also the privcmd and gnt devices which make heavy use of this,
so I'm not sure only selecting by default on XEN_BACKEND is the best
option.

Thanks, Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-14 Thread Roger Pau Monné
On Fri, Aug 14, 2020 at 08:29:20AM +0100, Christoph Hellwig wrote:
> On Thu, Aug 13, 2020 at 09:54:20AM +0200, Roger Pau Monn?? wrote:
> > On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> > > On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > functionality untangles Xen balloon and RAM hotplug from the usage of
> > > > unpopulated physical memory ranges to map foreign pages, which is the
> > > > correct thing to do in order to avoid mappings of foreign pages depend
> > > > on memory hotplug.
> > > 
> > > So please just select ZONE_DEVICE if this is so much better rather
> > > than maintaining two variants.
> > 
> > We still need to other variant for Arm at least, so both need to be
> > maintained anyway, even if we force ZONE_DEVICE on x86.
> 
> Well, it still really helps reproducability if you stick to one
> implementation of x86.
> 
> The alternative would be an explicit config option to opt into it,
> but just getting a different implementation based on a random
> kernel option is strange.

Would adding something like the chunk below to the patch be OK?

---8<---
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 018020b91baa..5f321a1319e6 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -328,7 +328,14 @@ config XEN_FRONT_PGDIR_SHBUF
tristate
 
 config XEN_UNPOPULATED_ALLOC
-   bool
-   default y if ZONE_DEVICE && !ARM && !ARM64
+   bool "Use unpopulated memory ranges for guest mappings"
+   depends on X86
+   select ZONE_DEVICE
+   default y
+   help
+ Use unpopulated memory ranges in order to create mappings for guest
+ memory regions, including grants maps and foreign pages. This avoids
+ having to balloon out RAM regions in order to obtain physical memory
+ space to create such mappings.
 
 endmenu



Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-13 Thread Roger Pau Monné
Your email client seems to set 'Reply-to:' to point to everyone on the
'Cc:' field, but not yourself, which is kind of weird. I've manually
fixed it on this reply by moving everyone to the 'Cc:' field and
setting you on 'To:'.

On Thu, Aug 13, 2020 at 11:49:46AM +0200, Daniel Vetter wrote:
> On Thu, Aug 13, 2020 at 09:54:20AM +0200, Roger Pau Monné wrote:
> > On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> > > On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > functionality untangles Xen balloon and RAM hotplug from the usage of
> > > > unpopulated physical memory ranges to map foreign pages, which is the
> > > > correct thing to do in order to avoid mappings of foreign pages depend
> > > > on memory hotplug.
> > > 
> > > So please just select ZONE_DEVICE if this is so much better rather
> > > than maintaining two variants.
> > 
> > We still need to other variant for Arm at least, so both need to be
> > maintained anyway, even if we force ZONE_DEVICE on x86.
> 
> Why does arm not have ZONE_DEVICE?

It's not that Arm doesn't have ZONE_DEVICE, it's just that the
approach used here won't work correctly on an Arm Xen dom0 as-is.

This is due to the usage of an identity second stage translation in
order to workaround the lack of an IOMMU in some Arm boards.

It can be made to work on Arm, but will likely require someone from
the Arm side doing that.

Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-13 Thread Roger Pau Monné
On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > functionality untangles Xen balloon and RAM hotplug from the usage of
> > unpopulated physical memory ranges to map foreign pages, which is the
> > correct thing to do in order to avoid mappings of foreign pages depend
> > on memory hotplug.
> 
> So please just select ZONE_DEVICE if this is so much better rather
> than maintaining two variants.

We still need to other variant for Arm at least, so both need to be
maintained anyway, even if we force ZONE_DEVICE on x86.

Thanks, Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-12 Thread Roger Pau Monné
On Wed, Aug 12, 2020 at 09:28:45AM +0200, Jürgen Groß wrote:
> On 11.08.20 11:44, Roger Pau Monne wrote:
> > To be used in order to create foreign mappings. This is based on the
> > ZONE_DEVICE facility which is used by persistent memory devices in
> > order to create struct pages and kernel virtual mappings for the IOMEM
> > areas of such devices. Note that on kernels without support for
> > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > create foreign mappings.
> > 
> > The newly added helpers use the same parameters as the existing
> > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > replacement of the callers. Once a memory region has been added to be
> > used as scratch mapping space it will no longer be released, and pages
> > returned are kept in a linked list. This allows to have a buffer of
> > pages and prevents resorting to frequent additions and removals of
> > regions.
> > 
> > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > functionality untangles Xen balloon and RAM hotplug from the usage of
> > unpopulated physical memory ranges to map foreign pages, which is the
> > correct thing to do in order to avoid mappings of foreign pages depend
> > on memory hotplug.
> > 
> > Note the driver is currently not enabled on Arm platforms because it
> > would interfere with the identity mapping required on some platforms.
> > 
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Cc: Oleksandr Andrushchenko 
> > Cc: David Airlie 
> > Cc: Daniel Vetter 
> > Cc: Boris Ostrovsky 
> > Cc: Juergen Gross 
> > Cc: Stefano Stabellini 
> > Cc: Dan Carpenter 
> > Cc: Roger Pau Monne 
> > Cc: Wei Liu 
> > Cc: Yan Yankovskyi 
> > Cc: dri-de...@lists.freedesktop.org
> > Cc: xen-de...@lists.xenproject.org
> > Cc: linux...@kvack.org
> > Cc: David Hildenbrand 
> > Cc: Michal Hocko 
> > Cc: Dan Williams 
> > ---
> > Changes since v3:
> >   - Introduce a Kconfig option that gates the addition of the
> > unpopulated alloc driver. This allows to easily disable it on Arm
> > platforms.
> >   - Dropped Juergen RB due to the addition of the Kconfig option.
> >   - Switched from MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC.
> > 
> > Changes since v2:
> >   - Drop BUILD_BUG_ON regarding PVMMU page sizes.
> >   - Use a SPDX license identifier.
> >   - Call fill with only the minimum required number of pages.
> >   - Include xen.h header in xen_drm_front_gem.c.
> >   - Use less generic function names.
> >   - Exit early from the init function if not a PV guest.
> >   - Don't use all caps for region name.
> > ---
> >   drivers/gpu/drm/xen/xen_drm_front_gem.c |   9 +-
> >   drivers/xen/Kconfig |   4 +
> >   drivers/xen/Makefile|   1 +
> >   drivers/xen/balloon.c   |   4 +-
> >   drivers/xen/grant-table.c   |   4 +-
> >   drivers/xen/privcmd.c   |   4 +-
> >   drivers/xen/unpopulated-alloc.c | 185 
> >   drivers/xen/xenbus/xenbus_client.c  |   6 +-
> >   drivers/xen/xlate_mmu.c |   4 +-
> >   include/xen/xen.h   |   9 ++
> >   10 files changed, 215 insertions(+), 15 deletions(-)
> >   create mode 100644 drivers/xen/unpopulated-alloc.c
> > 
> > diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> > index 1d339ef92422..018020b91baa 100644
> > --- a/drivers/xen/Kconfig
> > +++ b/drivers/xen/Kconfig
> > @@ -327,4 +327,8 @@ config XEN_HAVE_VPMU
> >   config XEN_FRONT_PGDIR_SHBUF
> > tristate
> > +config XEN_UNPOPULATED_ALLOC
> > +   bool
> > +   default y if ZONE_DEVICE && !ARM && !ARM64
> 
> There is a current effort to enable Xen on RISC-V. Do we expect this
> option to be usable for this architecture?

It will depend on whether the Xen RISC-V implementation mandates an
IOMMU, for example Arm doesn't and hence uses an identity p2m for
dom0. If RISC-V does the same then I would assume this won't be
suitable as-is (not that it couldn't be made to work).

IMO it wasn't clear from the community call what was the RISC-V port
position regarding this, but IIRC the IOMMU spec for RISC-V was behind
the virtualization one, so it's likely that quite a lot of hardware
will have hardware virtualization support but no IOMMU, in which case
I think it's likely the RISC-V port will follow Arm and implement an
identity p2m.

> If yes, I'm fine with the
> exclusion of Arm, otherwise I'd opt for defaulting to yes only for
> X86.
> 
> Either way you can have my:
> 
> Reviewed-by: Juergen Gross 

Thanks. Feel free to change the 'ZONE_DEVICE && !ARM && !ARM64' to
'ZONE_DEVICE && X86' if you think it's safer.

Roger.


Re: [PATCH v3 4/4] xen: add helpers to allocate unpopulated memory

2020-07-28 Thread Roger Pau Monné
On Tue, Jul 28, 2020 at 06:12:46PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 28/07/2020 17:59, Roger Pau Monné wrote:
> > On Tue, Jul 28, 2020 at 05:48:23PM +0100, Julien Grall wrote:
> > > Hi,
> > > 
> > > On 27/07/2020 10:13, Roger Pau Monne wrote:
> > > > To be used in order to create foreign mappings. This is based on the
> > > > ZONE_DEVICE facility which is used by persistent memory devices in
> > > > order to create struct pages and kernel virtual mappings for the IOMEM
> > > > areas of such devices. Note that on kernels without support for
> > > > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > > > create foreign mappings.
> > > > 
> > > > The newly added helpers use the same parameters as the existing
> > > > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > > > replacement of the callers. Once a memory region has been added to be
> > > > used as scratch mapping space it will no longer be released, and pages
> > > > returned are kept in a linked list. This allows to have a buffer of
> > > > pages and prevents resorting to frequent additions and removals of
> > > > regions.
> > > > 
> > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > functionality untangles Xen balloon and RAM hotplug from the usage of
> > > > unpopulated physical memory ranges to map foreign pages, which is the
> > > > correct thing to do in order to avoid mappings of foreign pages depend
> > > > on memory hotplug.
> > > I think this is going to break Dom0 on Arm if the kernel has been built 
> > > with
> > > hotplug. This is because you may end up to re-use region that will be used
> > > for the 1:1 mapping of a foreign map.
> > > 
> > > Note that I don't know whether hotplug has been tested on Xen on Arm yet. 
> > > So
> > > it might be possible to be already broken.
> > > 
> > > Meanwhile, my suggestion would be to make the use of hotplug in the 
> > > balloon
> > > code conditional (maybe using CONFIG_ARM64 and CONFIG_ARM)?
> > 
> > Right, this feature (allocation of unpopulated memory separated from
> > the balloon driver) is currently gated on CONFIG_ZONE_DEVICE, which I
> > think could be used on Arm.
> > 
> > IMO the right solution seems to be to subtract the physical memory
> > regions that can be used for the identity mappings of foreign pages
> > (all RAM on the system AFAICT) from iomem_resource, as that would make
> > this and the memory hotplug done in the balloon driver safe?
> 
> Dom0 doesn't know the regions used for the identity mappings as this is only
> managed by Xen. So there is nothing you can really do here.

OK, I will add the guards to prevent this being built on Arm.

> But don't you have the same issue on x86 with "magic pages"?

Those are marked as reserved on the memory map, and hence I would
expect them to never end up in iomem_resource.

Thanks, Roger.


Re: [PATCH v3 4/4] xen: add helpers to allocate unpopulated memory

2020-07-28 Thread Roger Pau Monné
On Tue, Jul 28, 2020 at 06:06:25PM +0100, Andrew Cooper wrote:
> On 28/07/2020 17:59, Roger Pau Monné wrote:
> > On Tue, Jul 28, 2020 at 05:48:23PM +0100, Julien Grall wrote:
> >> Hi,
> >>
> >> On 27/07/2020 10:13, Roger Pau Monne wrote:
> >>> To be used in order to create foreign mappings. This is based on the
> >>> ZONE_DEVICE facility which is used by persistent memory devices in
> >>> order to create struct pages and kernel virtual mappings for the IOMEM
> >>> areas of such devices. Note that on kernels without support for
> >>> ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> >>> create foreign mappings.
> >>>
> >>> The newly added helpers use the same parameters as the existing
> >>> {alloc/free}_xenballooned_pages functions, which allows for in-place
> >>> replacement of the callers. Once a memory region has been added to be
> >>> used as scratch mapping space it will no longer be released, and pages
> >>> returned are kept in a linked list. This allows to have a buffer of
> >>> pages and prevents resorting to frequent additions and removals of
> >>> regions.
> >>>
> >>> If enabled (because ZONE_DEVICE is supported) the usage of the new
> >>> functionality untangles Xen balloon and RAM hotplug from the usage of
> >>> unpopulated physical memory ranges to map foreign pages, which is the
> >>> correct thing to do in order to avoid mappings of foreign pages depend
> >>> on memory hotplug.
> >> I think this is going to break Dom0 on Arm if the kernel has been built 
> >> with
> >> hotplug. This is because you may end up to re-use region that will be used
> >> for the 1:1 mapping of a foreign map.
> >>
> >> Note that I don't know whether hotplug has been tested on Xen on Arm yet. 
> >> So
> >> it might be possible to be already broken.
> >>
> >> Meanwhile, my suggestion would be to make the use of hotplug in the balloon
> >> code conditional (maybe using CONFIG_ARM64 and CONFIG_ARM)?
> > Right, this feature (allocation of unpopulated memory separated from
> > the balloon driver) is currently gated on CONFIG_ZONE_DEVICE, which I
> > think could be used on Arm.
> >
> > IMO the right solution seems to be to subtract the physical memory
> > regions that can be used for the identity mappings of foreign pages
> > (all RAM on the system AFAICT) from iomem_resource, as that would make
> > this and the memory hotplug done in the balloon driver safe?
> 
> The right solution is a mechanism for translated guests to query Xen to
> find regions of guest physical address space which are unused, and can
> be safely be used for foreign/grant/other  mappings.
> 
> Please don't waste any more time applying more duct tape to a broken
> system, and instead spend the time organising some proper foundations.

The piece added here (using ZONE_DEVICE) will be relevant when Xen can
provide the space to map foreign pages, it's just that right now it
relies on iomem_resource instead of a Xen specific resource map that
should be provided by the hypervisor. It should indeed be fixed, but
right now this patch should allow a PVH dom0 to work slightly better.
When Xen provides such areas Linux just needs to populate a custom Xen
resource with them and use it instead of iomem_resurce.

The Arm stuff I'm certainly not familiar with, and can't provide much
insight on that. If it's best to just disable it and continue to rely
on ballooned out pages that's fine.

Roger.


Re: [PATCH v3 4/4] xen: add helpers to allocate unpopulated memory

2020-07-28 Thread Roger Pau Monné
On Tue, Jul 28, 2020 at 05:48:23PM +0100, Julien Grall wrote:
> Hi,
> 
> On 27/07/2020 10:13, Roger Pau Monne wrote:
> > To be used in order to create foreign mappings. This is based on the
> > ZONE_DEVICE facility which is used by persistent memory devices in
> > order to create struct pages and kernel virtual mappings for the IOMEM
> > areas of such devices. Note that on kernels without support for
> > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > create foreign mappings.
> > 
> > The newly added helpers use the same parameters as the existing
> > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > replacement of the callers. Once a memory region has been added to be
> > used as scratch mapping space it will no longer be released, and pages
> > returned are kept in a linked list. This allows to have a buffer of
> > pages and prevents resorting to frequent additions and removals of
> > regions.
> > 
> > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > functionality untangles Xen balloon and RAM hotplug from the usage of
> > unpopulated physical memory ranges to map foreign pages, which is the
> > correct thing to do in order to avoid mappings of foreign pages depend
> > on memory hotplug.
> I think this is going to break Dom0 on Arm if the kernel has been built with
> hotplug. This is because you may end up to re-use region that will be used
> for the 1:1 mapping of a foreign map.
> 
> Note that I don't know whether hotplug has been tested on Xen on Arm yet. So
> it might be possible to be already broken.
> 
> Meanwhile, my suggestion would be to make the use of hotplug in the balloon
> code conditional (maybe using CONFIG_ARM64 and CONFIG_ARM)?

Right, this feature (allocation of unpopulated memory separated from
the balloon driver) is currently gated on CONFIG_ZONE_DEVICE, which I
think could be used on Arm.

IMO the right solution seems to be to subtract the physical memory
regions that can be used for the identity mappings of foreign pages
(all RAM on the system AFAICT) from iomem_resource, as that would make
this and the memory hotplug done in the balloon driver safe?

Thanks, Roger.


Re: [PATCH v2 4/4] xen: add helpers to allocate unpopulated memory

2020-07-27 Thread Roger Pau Monné
On Fri, Jul 24, 2020 at 12:36:33PM -0400, Boris Ostrovsky wrote:
> On 7/24/20 10:34 AM, David Hildenbrand wrote:
> > CCing Dan
> >
> > On 24.07.20 14:42, Roger Pau Monne wrote:
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#include 
> >> +
> >> +#include 
> >> +#include 
> >> +
> >> +static DEFINE_MUTEX(lock);
> >> +static LIST_HEAD(list);
> >> +static unsigned int count;
> >> +
> >> +static int fill(unsigned int nr_pages)
> 
> 
> Less generic names? How about  list_lock, pg_list, pg_count,
> fill_pglist()? (But these are bad too, so maybe you can come up with
> something better)

OK, I have to admit I like using such short names when the code allows
to, for example this code is so simple that it didn't seem to warrant
using longer names. Will rename on next version.

> >> +{
> >> +  struct dev_pagemap *pgmap;
> >> +  void *vaddr;
> >> +  unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
> >> +  int nid, ret;
> >> +
> >> +  pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
> >> +  if (!pgmap)
> >> +  return -ENOMEM;
> >> +
> >> +  pgmap->type = MEMORY_DEVICE_DEVDAX;
> >> +  pgmap->res.name = "XEN SCRATCH";
> 
> 
> Typically iomem resources only capitalize first letters.
> 
> 
> >> +  pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> >> +
> >> +  ret = allocate_resource(_resource, >res,
> >> +  alloc_pages * PAGE_SIZE, 0, -1,
> >> +  PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL);
> 
> 
> Are we not going to end up with a whole bunch of "Xen scratch" resource
> ranges for each miss in the page list? Or do we expect them to get merged?

PAGES_PER_SECTION is IMO big enough to prevent ending up with a lot of
separated ranges. I think the value is 32 or 64MiB on x86, so while we
are likely to end up with more than one resource added, I don't think
it's going to be massive.

> 
> >> +  if (ret < 0) {
> >> +  pr_err("Cannot allocate new IOMEM resource\n");
> >> +  kfree(pgmap);
> >> +  return ret;
> >> +  }
> >> +
> >> +  nid = memory_add_physaddr_to_nid(pgmap->res.start);
> 
> 
> Should we consider page range crossing node boundaries?

I'm not sure whether this is possible (I would think allocate_resource
should return a range from a single node), but then it would greatly
complicate the code to perform the memremap_pages, as we would have to
split the region into multiple dev_pagemap structs.

FWIW the current code in the balloon driver does exactly the same
(which doesn't mean it's correct, but that's where I got the logic
from).

> >> +
> >> +#ifdef CONFIG_XEN_HAVE_PVMMU
> >> +  /*
> >> +   * We don't support PV MMU when Linux and Xen is using
> >> +   * different page granularity.
> >> +   */
> >> +  BUILD_BUG_ON(XEN_PAGE_SIZE != PAGE_SIZE);
> >> +
> >> +/*
> >> + * memremap will build page tables for the new memory so
> >> + * the p2m must contain invalid entries so the correct
> >> + * non-present PTEs will be written.
> >> + *
> >> + * If a failure occurs, the original (identity) p2m entries
> >> + * are not restored since this region is now known not to
> >> + * conflict with any devices.
> >> + */
> >> +  if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> >> +  xen_pfn_t pfn = PFN_DOWN(pgmap->res.start);
> >> +
> >> +  for (i = 0; i < alloc_pages; i++) {
> >> +  if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) {
> >> +  pr_warn("set_phys_to_machine() failed, no 
> >> memory added\n");
> >> +  release_resource(>res);
> >> +  kfree(pgmap);
> >> +  return -ENOMEM;
> >> +  }
> >> +}
> >> +  }
> >> +#endif
> >> +
> >> +  vaddr = memremap_pages(pgmap, nid);
> >> +  if (IS_ERR(vaddr)) {
> >> +  pr_err("Cannot remap memory range\n");
> >> +  release_resource(>res);
> >> +  kfree(pgmap);
> >> +  return PTR_ERR(vaddr);
> >> +  }
> >> +
> >> +  for (i = 0; i < alloc_pages; i++) {
> >> +  struct page *pg = virt_to_page(vaddr + PAGE_SIZE * i);
> >> +
> >> +  BUG_ON(!virt_addr_valid(vaddr + PAGE_SIZE * i));
> >> +  list_add(>lru, );
> >> +  count++;
> >> +  }
> >> +
> >> +  return 0;
> >> +}
> >> +
> >> +/**
> >> + * xen_alloc_unpopulated_pages - alloc unpopulated pages
> >> + * @nr_pages: Number of pages
> >> + * @pages: pages returned
> >> + * @return 0 on success, error otherwise
> >> + */
> >> +int xen_alloc_unpopulated_pages(unsigned int nr_pages, struct page 
> >> **pages)
> >> +{
> >> +  unsigned int i;
> >> +  int ret = 0;
> >> +
> >> +  mutex_lock();
> >> +  if (count < nr_pages) {
> >> +  ret = fill(nr_pages);
> 
> 
> (nr_pages - count) ?

Yup, already fixed as Juergen also pointed it out.

> >> +
> >> +#ifdef CONFIG_XEN_PV
> >> +static 

Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 05:10:03PM +0200, Jürgen Groß wrote:
> On 23.07.20 15:59, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 03:22:49PM +0200, David Hildenbrand wrote:
> > > On 23.07.20 14:23, Roger Pau Monné wrote:
> > > > On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> > > > > On 23.07.20 10:45, Roger Pau Monne wrote:
> > > > > > Add an extra option to add_memory_resource that overrides the memory
> > > > > > hotplug online behavior in order to force onlining of memory from
> > > > > > add_memory_resource unconditionally.
> > > > > > 
> > > > > > This is required for the Xen balloon driver, that must run the
> > > > > > online page callback in order to correctly process the newly added
> > > > > > memory region, note this is an unpopulated region that is used by 
> > > > > > Linux
> > > > > > to either hotplug RAM or to map foreign pages from other domains, 
> > > > > > and
> > > > > > hence memory hotplug when running on Xen can be used even without 
> > > > > > the
> > > > > > user explicitly requesting it, as part of the normal operations of 
> > > > > > the
> > > > > > OS when attempting to map memory from a different domain.
> > > > > > 
> > > > > > Setting a different default value of memhp_default_online_type when
> > > > > > attaching the balloon driver is not a robust solution, as the user 
> > > > > > (or
> > > > > > distro init scripts) could still change it and thus break the Xen
> > > > > > balloon driver.
> > > > > 
> > > > > I think we discussed this a couple of times before (even triggered by 
> > > > > my
> > > > > request), and this is responsibility of user space to configure. 
> > > > > Usually
> > > > > distros have udev rules to online memory automatically. Especially, 
> > > > > user
> > > > > space should eb able to configure *how* to online memory.
> > > > 
> > > > Note (as per the commit message) that in the specific case I'm
> > > > referring to the memory hotplugged by the Xen balloon driver will be
> > > > an unpopulated range to be used internally by certain Xen subsystems,
> > > > like the xen-blkback or the privcmd drivers. The addition of such
> > > > blocks of (unpopulated) memory can happen without the user explicitly
> > > > requesting it, and hence not even aware such hotplug process is taking
> > > > place. To be clear: no actual RAM will be added to the system.
> > > 
> > > Okay, but there is also the case where XEN will actually hotplug memory
> > > using this same handler IIRC (at least I've read papers about it). Both
> > > are using the same handler, correct?
> > 
> > Yes, it's used by this dual purpose, which I have to admit I don't
> > like that much either.
> > 
> > One set of pages should be clearly used for RAM memory hotplug, and
> > the other to map foreign pages that are not related to memory hotplug,
> > it's just that we happen to need a physical region with backing struct
> > pages.
> > 
> > > > 
> > > > > It's the admin/distro responsibility to configure this properly. In 
> > > > > case
> > > > > this doesn't happen (or as you say, users change it), bad luck.
> > > > > 
> > > > > E.g., virtio-mem takes care to not add more memory in case it is not
> > > > > getting onlined. I remember hyper-v has similar code to at least wait 
> > > > > a
> > > > > bit for memory to get onlined.
> > > > 
> > > > I don't think VirtIO or Hyper-V use the hotplug system in the same way
> > > > as Xen, as said this is done to add unpopulated memory regions that
> > > > will be used to map foreign memory (from other domains) by Xen drivers
> > > > on the system.
> > > 
> > > Indeed, if the memory is never exposed to the buddy (and all you need is
> > > struct pages +  a kernel virtual mapping), I wonder if
> > > memremap/ZONE_DEVICE is what you want?
> > 
> > I'm certainly not familiar with the Linux memory subsystem, but if
> > that gets us a backing struct page and a kernel mapping then I would
> > say yes.
> > 
> > > Then you won't have user-visible
>

Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 03:22:49PM +0200, David Hildenbrand wrote:
> On 23.07.20 14:23, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> >> On 23.07.20 10:45, Roger Pau Monne wrote:
> >>> Add an extra option to add_memory_resource that overrides the memory
> >>> hotplug online behavior in order to force onlining of memory from
> >>> add_memory_resource unconditionally.
> >>>
> >>> This is required for the Xen balloon driver, that must run the
> >>> online page callback in order to correctly process the newly added
> >>> memory region, note this is an unpopulated region that is used by Linux
> >>> to either hotplug RAM or to map foreign pages from other domains, and
> >>> hence memory hotplug when running on Xen can be used even without the
> >>> user explicitly requesting it, as part of the normal operations of the
> >>> OS when attempting to map memory from a different domain.
> >>>
> >>> Setting a different default value of memhp_default_online_type when
> >>> attaching the balloon driver is not a robust solution, as the user (or
> >>> distro init scripts) could still change it and thus break the Xen
> >>> balloon driver.
> >>
> >> I think we discussed this a couple of times before (even triggered by my
> >> request), and this is responsibility of user space to configure. Usually
> >> distros have udev rules to online memory automatically. Especially, user
> >> space should eb able to configure *how* to online memory.
> > 
> > Note (as per the commit message) that in the specific case I'm
> > referring to the memory hotplugged by the Xen balloon driver will be
> > an unpopulated range to be used internally by certain Xen subsystems,
> > like the xen-blkback or the privcmd drivers. The addition of such
> > blocks of (unpopulated) memory can happen without the user explicitly
> > requesting it, and hence not even aware such hotplug process is taking
> > place. To be clear: no actual RAM will be added to the system.
> 
> Okay, but there is also the case where XEN will actually hotplug memory
> using this same handler IIRC (at least I've read papers about it). Both
> are using the same handler, correct?

Yes, it's used by this dual purpose, which I have to admit I don't
like that much either.

One set of pages should be clearly used for RAM memory hotplug, and
the other to map foreign pages that are not related to memory hotplug,
it's just that we happen to need a physical region with backing struct
pages.

> > 
> >> It's the admin/distro responsibility to configure this properly. In case
> >> this doesn't happen (or as you say, users change it), bad luck.
> >>
> >> E.g., virtio-mem takes care to not add more memory in case it is not
> >> getting onlined. I remember hyper-v has similar code to at least wait a
> >> bit for memory to get onlined.
> > 
> > I don't think VirtIO or Hyper-V use the hotplug system in the same way
> > as Xen, as said this is done to add unpopulated memory regions that
> > will be used to map foreign memory (from other domains) by Xen drivers
> > on the system.
> 
> Indeed, if the memory is never exposed to the buddy (and all you need is
> struct pages +  a kernel virtual mapping), I wonder if
> memremap/ZONE_DEVICE is what you want?

I'm certainly not familiar with the Linux memory subsystem, but if
that gets us a backing struct page and a kernel mapping then I would
say yes.

> Then you won't have user-visible
> memory blocks created with unclear online semantics, partially involving
> the buddy.

Seems like a fine solution.

Juergen: would you be OK to use a separate page-list for
alloc_xenballooned_pages on HVM/PVH using the logic described by
David?

I guess I would leave PV as-is, since it already has this reserved
region to map foreign pages.

Thanks, Roger.


Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 03:20:55PM +0200, Jürgen Groß wrote:
> On 23.07.20 15:08, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 02:28:13PM +0200, Jürgen Groß wrote:
> > > On 23.07.20 14:23, Roger Pau Monné wrote:
> > > > On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> > > > > On 23.07.20 10:45, Roger Pau Monne wrote:
> > > > > > Add an extra option to add_memory_resource that overrides the memory
> > > > > > hotplug online behavior in order to force onlining of memory from
> > > > > > add_memory_resource unconditionally.
> > > > > > 
> > > > > > This is required for the Xen balloon driver, that must run the
> > > > > > online page callback in order to correctly process the newly added
> > > > > > memory region, note this is an unpopulated region that is used by 
> > > > > > Linux
> > > > > > to either hotplug RAM or to map foreign pages from other domains, 
> > > > > > and
> > > > > > hence memory hotplug when running on Xen can be used even without 
> > > > > > the
> > > > > > user explicitly requesting it, as part of the normal operations of 
> > > > > > the
> > > > > > OS when attempting to map memory from a different domain.
> > > > > > 
> > > > > > Setting a different default value of memhp_default_online_type when
> > > > > > attaching the balloon driver is not a robust solution, as the user 
> > > > > > (or
> > > > > > distro init scripts) could still change it and thus break the Xen
> > > > > > balloon driver.
> > > > > 
> > > > > I think we discussed this a couple of times before (even triggered by 
> > > > > my
> > > > > request), and this is responsibility of user space to configure. 
> > > > > Usually
> > > > > distros have udev rules to online memory automatically. Especially, 
> > > > > user
> > > > > space should eb able to configure *how* to online memory.
> > > > 
> > > > Note (as per the commit message) that in the specific case I'm
> > > > referring to the memory hotplugged by the Xen balloon driver will be
> > > > an unpopulated range to be used internally by certain Xen subsystems,
> > > > like the xen-blkback or the privcmd drivers. The addition of such
> > > > blocks of (unpopulated) memory can happen without the user explicitly
> > > > requesting it, and hence not even aware such hotplug process is taking
> > > > place. To be clear: no actual RAM will be added to the system.
> > > > 
> > > > Failure to online such blocks using the Xen specific online handler
> > > > (which does not handle back the memory to the allocator in any way)
> > > > will result in the system getting stuck and malfunctioning.
> > > > 
> > > > > It's the admin/distro responsibility to configure this properly. In 
> > > > > case
> > > > > this doesn't happen (or as you say, users change it), bad luck.
> > > > > 
> > > > > E.g., virtio-mem takes care to not add more memory in case it is not
> > > > > getting onlined. I remember hyper-v has similar code to at least wait 
> > > > > a
> > > > > bit for memory to get onlined.
> > > > 
> > > > I don't think VirtIO or Hyper-V use the hotplug system in the same way
> > > > as Xen, as said this is done to add unpopulated memory regions that
> > > > will be used to map foreign memory (from other domains) by Xen drivers
> > > > on the system.
> > > > 
> > > > Maybe this should somehow use a different mechanism to hotplug such
> > > > empty memory blocks? I don't mind doing this differently, but I would
> > > > need some pointers. Allowing user-space to change a (seemingly
> > > > unrelated) parameter and as a result produce failures on Xen drivers
> > > > is not an acceptable solution IMO.
> > > 
> > > Maybe we can use the same approach as Xen PV-domains: pre-allocate a
> > > region in the memory map to be used for mapping foreign pages. For the
> > > kernel it will look like pre-ballooned memory, so it will create struct
> > > page for the region (which is what we are after), but it won't give the
> > > memory to the allocator.
> > 
> > IMO u

Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 03:14:31PM +0200, David Hildenbrand wrote:
> On 23.07.20 15:08, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 02:28:13PM +0200, Jürgen Groß wrote:
> >> On 23.07.20 14:23, Roger Pau Monné wrote:
> >>> On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> >>>> On 23.07.20 10:45, Roger Pau Monne wrote:
> >>>>> Add an extra option to add_memory_resource that overrides the memory
> >>>>> hotplug online behavior in order to force onlining of memory from
> >>>>> add_memory_resource unconditionally.
> >>>>>
> >>>>> This is required for the Xen balloon driver, that must run the
> >>>>> online page callback in order to correctly process the newly added
> >>>>> memory region, note this is an unpopulated region that is used by Linux
> >>>>> to either hotplug RAM or to map foreign pages from other domains, and
> >>>>> hence memory hotplug when running on Xen can be used even without the
> >>>>> user explicitly requesting it, as part of the normal operations of the
> >>>>> OS when attempting to map memory from a different domain.
> >>>>>
> >>>>> Setting a different default value of memhp_default_online_type when
> >>>>> attaching the balloon driver is not a robust solution, as the user (or
> >>>>> distro init scripts) could still change it and thus break the Xen
> >>>>> balloon driver.
> >>>>
> >>>> I think we discussed this a couple of times before (even triggered by my
> >>>> request), and this is responsibility of user space to configure. Usually
> >>>> distros have udev rules to online memory automatically. Especially, user
> >>>> space should eb able to configure *how* to online memory.
> >>>
> >>> Note (as per the commit message) that in the specific case I'm
> >>> referring to the memory hotplugged by the Xen balloon driver will be
> >>> an unpopulated range to be used internally by certain Xen subsystems,
> >>> like the xen-blkback or the privcmd drivers. The addition of such
> >>> blocks of (unpopulated) memory can happen without the user explicitly
> >>> requesting it, and hence not even aware such hotplug process is taking
> >>> place. To be clear: no actual RAM will be added to the system.
> >>>
> >>> Failure to online such blocks using the Xen specific online handler
> >>> (which does not handle back the memory to the allocator in any way)
> >>> will result in the system getting stuck and malfunctioning.
> >>>
> >>>> It's the admin/distro responsibility to configure this properly. In case
> >>>> this doesn't happen (or as you say, users change it), bad luck.
> >>>>
> >>>> E.g., virtio-mem takes care to not add more memory in case it is not
> >>>> getting onlined. I remember hyper-v has similar code to at least wait a
> >>>> bit for memory to get onlined.
> >>>
> >>> I don't think VirtIO or Hyper-V use the hotplug system in the same way
> >>> as Xen, as said this is done to add unpopulated memory regions that
> >>> will be used to map foreign memory (from other domains) by Xen drivers
> >>> on the system.
> >>>
> >>> Maybe this should somehow use a different mechanism to hotplug such
> >>> empty memory blocks? I don't mind doing this differently, but I would
> >>> need some pointers. Allowing user-space to change a (seemingly
> >>> unrelated) parameter and as a result produce failures on Xen drivers
> >>> is not an acceptable solution IMO.
> >>
> >> Maybe we can use the same approach as Xen PV-domains: pre-allocate a
> >> region in the memory map to be used for mapping foreign pages. For the
> >> kernel it will look like pre-ballooned memory, so it will create struct
> >> page for the region (which is what we are after), but it won't give the
> >> memory to the allocator.
> > 
> > IMO using something similar to memory hotplug would give us more
> > flexibility, and TBH the logic is already there in the balloon driver.
> > It seems quite wasteful to allocate such region(s) beforehand for all
> > domains, even when most of them won't end up using foreign mappings at
> > all.
> 
> I do wonder why these issues you describe start to pop up now, literally
> years after this stuff has been imp

Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 02:28:13PM +0200, Jürgen Groß wrote:
> On 23.07.20 14:23, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> > > On 23.07.20 10:45, Roger Pau Monne wrote:
> > > > Add an extra option to add_memory_resource that overrides the memory
> > > > hotplug online behavior in order to force onlining of memory from
> > > > add_memory_resource unconditionally.
> > > > 
> > > > This is required for the Xen balloon driver, that must run the
> > > > online page callback in order to correctly process the newly added
> > > > memory region, note this is an unpopulated region that is used by Linux
> > > > to either hotplug RAM or to map foreign pages from other domains, and
> > > > hence memory hotplug when running on Xen can be used even without the
> > > > user explicitly requesting it, as part of the normal operations of the
> > > > OS when attempting to map memory from a different domain.
> > > > 
> > > > Setting a different default value of memhp_default_online_type when
> > > > attaching the balloon driver is not a robust solution, as the user (or
> > > > distro init scripts) could still change it and thus break the Xen
> > > > balloon driver.
> > > 
> > > I think we discussed this a couple of times before (even triggered by my
> > > request), and this is responsibility of user space to configure. Usually
> > > distros have udev rules to online memory automatically. Especially, user
> > > space should eb able to configure *how* to online memory.
> > 
> > Note (as per the commit message) that in the specific case I'm
> > referring to the memory hotplugged by the Xen balloon driver will be
> > an unpopulated range to be used internally by certain Xen subsystems,
> > like the xen-blkback or the privcmd drivers. The addition of such
> > blocks of (unpopulated) memory can happen without the user explicitly
> > requesting it, and hence not even aware such hotplug process is taking
> > place. To be clear: no actual RAM will be added to the system.
> > 
> > Failure to online such blocks using the Xen specific online handler
> > (which does not handle back the memory to the allocator in any way)
> > will result in the system getting stuck and malfunctioning.
> > 
> > > It's the admin/distro responsibility to configure this properly. In case
> > > this doesn't happen (or as you say, users change it), bad luck.
> > > 
> > > E.g., virtio-mem takes care to not add more memory in case it is not
> > > getting onlined. I remember hyper-v has similar code to at least wait a
> > > bit for memory to get onlined.
> > 
> > I don't think VirtIO or Hyper-V use the hotplug system in the same way
> > as Xen, as said this is done to add unpopulated memory regions that
> > will be used to map foreign memory (from other domains) by Xen drivers
> > on the system.
> > 
> > Maybe this should somehow use a different mechanism to hotplug such
> > empty memory blocks? I don't mind doing this differently, but I would
> > need some pointers. Allowing user-space to change a (seemingly
> > unrelated) parameter and as a result produce failures on Xen drivers
> > is not an acceptable solution IMO.
> 
> Maybe we can use the same approach as Xen PV-domains: pre-allocate a
> region in the memory map to be used for mapping foreign pages. For the
> kernel it will look like pre-ballooned memory, so it will create struct
> page for the region (which is what we are after), but it won't give the
> memory to the allocator.

IMO using something similar to memory hotplug would give us more
flexibility, and TBH the logic is already there in the balloon driver.
It seems quite wasteful to allocate such region(s) beforehand for all
domains, even when most of them won't end up using foreign mappings at
all.

Anyway, I'm going to take a look at how to do that, I guess it's going
to involve playing with the memory map and reserving some space.

I suggest we should remove the Xen balloon hotplug logic, as it's not
working properly and we don't have a plan to fix it.

Thanks, Roger.


Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> On 23.07.20 10:45, Roger Pau Monne wrote:
> > Add an extra option to add_memory_resource that overrides the memory
> > hotplug online behavior in order to force onlining of memory from
> > add_memory_resource unconditionally.
> > 
> > This is required for the Xen balloon driver, that must run the
> > online page callback in order to correctly process the newly added
> > memory region, note this is an unpopulated region that is used by Linux
> > to either hotplug RAM or to map foreign pages from other domains, and
> > hence memory hotplug when running on Xen can be used even without the
> > user explicitly requesting it, as part of the normal operations of the
> > OS when attempting to map memory from a different domain.
> > 
> > Setting a different default value of memhp_default_online_type when
> > attaching the balloon driver is not a robust solution, as the user (or
> > distro init scripts) could still change it and thus break the Xen
> > balloon driver.
> 
> I think we discussed this a couple of times before (even triggered by my
> request), and this is responsibility of user space to configure. Usually
> distros have udev rules to online memory automatically. Especially, user
> space should eb able to configure *how* to online memory.

Note (as per the commit message) that in the specific case I'm
referring to the memory hotplugged by the Xen balloon driver will be
an unpopulated range to be used internally by certain Xen subsystems,
like the xen-blkback or the privcmd drivers. The addition of such
blocks of (unpopulated) memory can happen without the user explicitly
requesting it, and hence not even aware such hotplug process is taking
place. To be clear: no actual RAM will be added to the system.

Failure to online such blocks using the Xen specific online handler
(which does not handle back the memory to the allocator in any way)
will result in the system getting stuck and malfunctioning.

> It's the admin/distro responsibility to configure this properly. In case
> this doesn't happen (or as you say, users change it), bad luck.
> 
> E.g., virtio-mem takes care to not add more memory in case it is not
> getting onlined. I remember hyper-v has similar code to at least wait a
> bit for memory to get onlined.

I don't think VirtIO or Hyper-V use the hotplug system in the same way
as Xen, as said this is done to add unpopulated memory regions that
will be used to map foreign memory (from other domains) by Xen drivers
on the system.

Maybe this should somehow use a different mechanism to hotplug such
empty memory blocks? I don't mind doing this differently, but I would
need some pointers. Allowing user-space to change a (seemingly
unrelated) parameter and as a result produce failures on Xen drivers
is not an acceptable solution IMO.

Thanks, Roger.


Re: [PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-22 Thread Roger Pau Monné
On Tue, Jul 21, 2020 at 07:55:09PM +, Anchal Agarwal wrote:
> On Tue, Jul 21, 2020 at 10:30:18AM +0200, Roger Pau Monné wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > 
> > 
> > 
> > Marek: I'm adding you in case you could be able to give this a try and
> > make sure it doesn't break suspend for dom0.
> > 
> > On Tue, Jul 21, 2020 at 12:17:36AM +, Anchal Agarwal wrote:
> > > On Mon, Jul 20, 2020 at 11:37:05AM +0200, Roger Pau Monné wrote:
> > > > CAUTION: This email originated from outside of the organization. Do not 
> > > > click links or open attachments unless you can confirm the sender and 
> > > > know the content is safe.
> > > >
> > > >
> > > >
> > > > On Sat, Jul 18, 2020 at 09:47:04PM -0400, Boris Ostrovsky wrote:
> > > > > (Roger, question for you at the very end)
> > > > >
> > > > > On 7/17/20 3:10 PM, Anchal Agarwal wrote:
> > > > > > On Wed, Jul 15, 2020 at 05:18:08PM -0400, Boris Ostrovsky wrote:
> > > > > >> CAUTION: This email originated from outside of the organization. 
> > > > > >> Do not click links or open attachments unless you can confirm the 
> > > > > >> sender and know the content is safe.
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 7/15/20 4:49 PM, Anchal Agarwal wrote:
> > > > > >>> On Mon, Jul 13, 2020 at 11:52:01AM -0400, Boris Ostrovsky wrote:
> > > > > >>>> CAUTION: This email originated from outside of the organization. 
> > > > > >>>> Do not click links or open attachments unless you can confirm 
> > > > > >>>> the sender and know the content is safe.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 7/2/20 2:21 PM, Anchal Agarwal wrote:
> > > > > >>>> And PVH dom0.
> > > > > >>> That's another good use case to make it work with however, I still
> > > > > >>> think that should be tested/worked upon separately as the feature 
> > > > > >>> itself
> > > > > >>> (PVH Dom0) is very new.
> > > > > >>
> > > > > >> Same question here --- will this break PVH dom0?
> > > > > >>
> > > > > > I haven't tested it as a part of this series. Is that a blocker 
> > > > > > here?
> > > > >
> > > > >
> > > > > I suspect dom0 will not do well now as far as hibernation goes, in 
> > > > > which
> > > > > case you are not breaking anything.
> > > > >
> > > > >
> > > > > Roger?
> > > >
> > > > I sadly don't have any box ATM that supports hibernation where I
> > > > could test it. We have hibernation support for PV dom0, so while I
> > > > haven't done anything specific to support or test hibernation on PVH
> > > > dom0 I would at least aim to not make this any worse, and hence the
> > > > check should at least also fail for a PVH dom0?
> > > >
> > > > if (!xen_hvm_domain() || xen_initial_domain())
> > > > return -ENODEV;
> > > >
> > > > Ie: none of this should be applied to a PVH dom0, as it doesn't have
> > > > PV devices and hence should follow the bare metal device suspend.
> > > >
> > > So from what I understand you meant for any guest running on pvh dom0 
> > > should not
> > > hibernate if hibernation is triggered from within the guest or should 
> > > they?
> > 
> > Er no to both I think. What I meant is that a PVH dom0 should be able
> > to properly suspend, and we should make sure this work doesn't make
> > this any harder (or breaks it if it's currently working).
> > 
> > Or at least that's how I understood the question raised by Boris.
> > 
> > You are adding code to the generic suspend path that's also used by dom0
> > in order to perform bare metal suspension. This is fine now for a PV
> > dom0 because the code is gated on xen_hvm_domain, but you should also
> > take into account that a PVH dom0 is considered a HVM domain, and
> > hence will get the notifier registered.
> >
> Ok that makes sense now. This is good to be safe, but my patch series is only 
> to
> support domU hibernation, so I am not sure if this will affect pvh dom0.
> However, since I do not have a good way of testing sure I will add the check.
> 
> Moreover, in Patch-0004, I do register suspend/resume syscore_ops 
> specifically for domU
> hibernation only if its xen_hvm_domain.

So if the hooks are only registered for domU, do you still need this
xen_hvm_domain check here?

I have to admit I'm not familiar with Linux PM suspend.

> I don't see any reason that should not
> be registered for domU's running on pvh dom0.

To be clear: it should be registered for all HVM domUs, regardless of
whether they are running on a PV or a PVH dom0. My intention was never
to suggest otherwise. It should be enabled for all HVM domUs, but
shouldn't be enabled for HVM dom0.

> Those suspend/resume callbacks will
> only be invoked in case hibernation and will be skipped if generic suspend 
> path
> is in progress. Do you see any issue with that?

No, I think it's fine.

Roger.


Re: [PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-21 Thread Roger Pau Monné
Marek: I'm adding you in case you could be able to give this a try and
make sure it doesn't break suspend for dom0.

On Tue, Jul 21, 2020 at 12:17:36AM +, Anchal Agarwal wrote:
> On Mon, Jul 20, 2020 at 11:37:05AM +0200, Roger Pau Monné wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > 
> > 
> > 
> > On Sat, Jul 18, 2020 at 09:47:04PM -0400, Boris Ostrovsky wrote:
> > > (Roger, question for you at the very end)
> > >
> > > On 7/17/20 3:10 PM, Anchal Agarwal wrote:
> > > > On Wed, Jul 15, 2020 at 05:18:08PM -0400, Boris Ostrovsky wrote:
> > > >> CAUTION: This email originated from outside of the organization. Do 
> > > >> not click links or open attachments unless you can confirm the sender 
> > > >> and know the content is safe.
> > > >>
> > > >>
> > > >>
> > > >> On 7/15/20 4:49 PM, Anchal Agarwal wrote:
> > > >>> On Mon, Jul 13, 2020 at 11:52:01AM -0400, Boris Ostrovsky wrote:
> > > >>>> CAUTION: This email originated from outside of the organization. Do 
> > > >>>> not click links or open attachments unless you can confirm the 
> > > >>>> sender and know the content is safe.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 7/2/20 2:21 PM, Anchal Agarwal wrote:
> > > >>>> And PVH dom0.
> > > >>> That's another good use case to make it work with however, I still
> > > >>> think that should be tested/worked upon separately as the feature 
> > > >>> itself
> > > >>> (PVH Dom0) is very new.
> > > >>
> > > >> Same question here --- will this break PVH dom0?
> > > >>
> > > > I haven't tested it as a part of this series. Is that a blocker here?
> > >
> > >
> > > I suspect dom0 will not do well now as far as hibernation goes, in which
> > > case you are not breaking anything.
> > >
> > >
> > > Roger?
> > 
> > I sadly don't have any box ATM that supports hibernation where I
> > could test it. We have hibernation support for PV dom0, so while I
> > haven't done anything specific to support or test hibernation on PVH
> > dom0 I would at least aim to not make this any worse, and hence the
> > check should at least also fail for a PVH dom0?
> > 
> > if (!xen_hvm_domain() || xen_initial_domain())
> > return -ENODEV;
> > 
> > Ie: none of this should be applied to a PVH dom0, as it doesn't have
> > PV devices and hence should follow the bare metal device suspend.
> >
> So from what I understand you meant for any guest running on pvh dom0 should 
> not 
> hibernate if hibernation is triggered from within the guest or should they?

Er no to both I think. What I meant is that a PVH dom0 should be able
to properly suspend, and we should make sure this work doesn't make
this any harder (or breaks it if it's currently working).

Or at least that's how I understood the question raised by Boris.

You are adding code to the generic suspend path that's also used by dom0
in order to perform bare metal suspension. This is fine now for a PV
dom0 because the code is gated on xen_hvm_domain, but you should also
take into account that a PVH dom0 is considered a HVM domain, and
hence will get the notifier registered.

> > Also I would contact the QubesOS guys, they rely heavily on the
> > suspend feature for dom0, and that's something not currently tested by
> > osstest so any breakages there go unnoticed.
> > 
> Was this for me or Boris? If its the former then I have no idea how to?

I've now added Marek.

Roger.


Re: [PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-20 Thread Roger Pau Monné
On Sat, Jul 18, 2020 at 09:47:04PM -0400, Boris Ostrovsky wrote:
> (Roger, question for you at the very end)
> 
> On 7/17/20 3:10 PM, Anchal Agarwal wrote:
> > On Wed, Jul 15, 2020 at 05:18:08PM -0400, Boris Ostrovsky wrote:
> >> CAUTION: This email originated from outside of the organization. Do not 
> >> click links or open attachments unless you can confirm the sender and know 
> >> the content is safe.
> >>
> >>
> >>
> >> On 7/15/20 4:49 PM, Anchal Agarwal wrote:
> >>> On Mon, Jul 13, 2020 at 11:52:01AM -0400, Boris Ostrovsky wrote:
>  CAUTION: This email originated from outside of the organization. Do not 
>  click links or open attachments unless you can confirm the sender and 
>  know the content is safe.
> 
> 
> 
>  On 7/2/20 2:21 PM, Anchal Agarwal wrote:
>  And PVH dom0.
> >>> That's another good use case to make it work with however, I still
> >>> think that should be tested/worked upon separately as the feature itself
> >>> (PVH Dom0) is very new.
> >>
> >> Same question here --- will this break PVH dom0?
> >>
> > I haven't tested it as a part of this series. Is that a blocker here?
> 
> 
> I suspect dom0 will not do well now as far as hibernation goes, in which
> case you are not breaking anything.
> 
> 
> Roger?

I sadly don't have any box ATM that supports hibernation where I
could test it. We have hibernation support for PV dom0, so while I
haven't done anything specific to support or test hibernation on PVH
dom0 I would at least aim to not make this any worse, and hence the
check should at least also fail for a PVH dom0?

if (!xen_hvm_domain() || xen_initial_domain())
return -ENODEV;

Ie: none of this should be applied to a PVH dom0, as it doesn't have
PV devices and hence should follow the bare metal device suspend.

Also I would contact the QubesOS guys, they rely heavily on the
suspend feature for dom0, and that's something not currently tested by
osstest so any breakages there go unnoticed.

Thanks, Roger.


Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-30 Thread Roger Pau Monné
On Mon, Jun 29, 2020 at 07:20:35PM +, Anchal Agarwal wrote:
> On Fri, Jun 26, 2020 at 11:12:39AM +0200, Roger Pau Monné wrote:
> > So the frontend should do:
> > 
> > - Switch to Closed state (and cleanup everything required).
> > - Wait for backend to switch to Closed state (must be done
> >   asynchronously, handled in blkback_changed).
> > - Switch frontend to XenbusStateInitialising, that will in turn force
> >   the backend to switch to XenbusStateInitWait.
> > - After that it should just follow the normal connection procedure.
> > 
> > I think the part that's missing is the frontend doing the state change
> > to XenbusStateInitialising when the backend switches to the Closed
> > state.
> > 
> > > I was of the view we may just want to mark frontend closed which should do
> > > the job of freeing resources and then following the same flow as
> > > blkfront_restore. That does not seems to work correctly 100% of the time.
> > 
> > I think the missing part is that you must wait for the backend to
> > switch to the Closed state, or else the switch to
> > XenbusStateInitialising won't be picked up correctly by the backend
> > (because it's still doing it's cleanup).
> > 
> > Using blkfront_restore might be an option, but you need to assert the
> > backend is in the initial state before using that path.
> >
> Yes, I agree and I make sure that XenbusStateInitialising only triggers
> on frontend once backend is disconnected. msleep in a loop not that graceful 
> but
> works.
> Frontend only switches to XenbusStateInitialising once it sees backend
> as Closed. The issue here is and may require more debugging is:
> 1. Hibernate instance->Closing failed, artificially created situation by not
> marking frontend Closed in the first place during freezing.
> 2. System comes back up fine restored to 'backend connected'.

I'm not sure I'm following what is happening here, what should happen
IMO is that the backend will eventually reach the Closed state? Ie:
the frontend has initiated the disconnection from the backend by
setting the Closing state, and the backend will have to eventually
reach the Closed state.

At that point the frontend can initiate a reconnection by switching to
the Initialising state.

> 3. Re-run (1) again without reboot
> 4. (4) fails to recover basically freezing does not fail at all which is weird
>because it should timeout as it passes through same path. It hits a BUG in
>talk_to_blkback() and instance crashes.

It's hard to tell exactly. I guess you would have to figure what makes
the frontend not get stuck at the same place as the first attempt.

Roger.


Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-26 Thread Roger Pau Monné
On Thu, Jun 25, 2020 at 06:36:59PM +, Anchal Agarwal wrote:
> On Tue, Jun 23, 2020 at 10:19:03AM +0200, Roger Pau Monné wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > 
> > 
> > 
> > On Tue, Jun 23, 2020 at 12:43:14AM +, Anchal Agarwal wrote:
> > > On Mon, Jun 22, 2020 at 10:38:46AM +0200, Roger Pau Monné wrote:
> > > > CAUTION: This email originated from outside of the organization. Do not 
> > > > click links or open attachments unless you can confirm the sender and 
> > > > know the content is safe.
> > > >
> > > >
> > > >
> > > > On Fri, Jun 19, 2020 at 11:43:12PM +, Anchal Agarwal wrote:
> > > > > On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monné wrote:
> > > > > > CAUTION: This email originated from outside of the organization. Do 
> > > > > > not click links or open attachments unless you can confirm the 
> > > > > > sender and know the content is safe.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 16, 2020 at 09:49:25PM +, Anchal Agarwal wrote:
> > > > > > > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > > > > > > CAUTION: This email originated from outside of the 
> > > > > > > > organization. Do not click links or open attachments unless you 
> > > > > > > > can confirm the sender and know the content is safe.
> > > > > > > > On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
> > > > > > > > >  CAUTION: This email originated from outside of the 
> > > > > > > > > organization. Do not click links or open attachments unless 
> > > > > > > > > you can confirm the sender and know the content is safe.
> > > > > > > > > > + xenbus_dev_error(dev, err, "Freezing 
> > > > > > > > > timed out;"
> > > > > > > > > > +  "the device may become 
> > > > > > > > > inconsistent state");
> > > > > > > > >
> > > > > > > > > Leaving the device in this state is quite bad, as it's in 
> > > > > > > > > a closed
> > > > > > > > > state and with the queues frozen. You should make an 
> > > > > > > > > attempt to
> > > > > > > > > restore things to a working state.
> > > > > > > > >
> > > > > > > > > You mean if backend closed after timeout? Is there a way to 
> > > > > > > > > know that? I understand it's not good to
> > > > > > > > > leave it in this state however, I am still trying to find if 
> > > > > > > > > there is a good way to know if backend is still connected 
> > > > > > > > > after timeout.
> > > > > > > > > Hence the message " the device may become inconsistent 
> > > > > > > > > state".  I didn't see a timeout not even once on my end so 
> > > > > > > > > that's why
> > > > > > > > > I may be looking for an alternate perspective here. may be 
> > > > > > > > > need to thaw everything back intentionally is one thing I 
> > > > > > > > > could think of.
> > > > > > > >
> > > > > > > > You can manually force this state, and then check that it will 
> > > > > > > > behave
> > > > > > > > correctly. I would expect that on a failure to disconnect from 
> > > > > > > > the
> > > > > > > > backend you should switch the frontend to the 'Init' state in 
> > > > > > > > order to
> > > > > > > > try to reconnect to the backend when possible.
> > > > > > > >
> > > > > > > From what I understand forcing manually is, failing the freeze 
> > > > > > > without
> > > > > > > disconnect and try to revive the connection by unfreezing the
> > > > > > > queues->r

Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-23 Thread Roger Pau Monné
On Tue, Jun 23, 2020 at 12:43:14AM +, Anchal Agarwal wrote:
> On Mon, Jun 22, 2020 at 10:38:46AM +0200, Roger Pau Monné wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > 
> > 
> > 
> > On Fri, Jun 19, 2020 at 11:43:12PM +, Anchal Agarwal wrote:
> > > On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monné wrote:
> > > > CAUTION: This email originated from outside of the organization. Do not 
> > > > click links or open attachments unless you can confirm the sender and 
> > > > know the content is safe.
> > > >
> > > >
> > > >
> > > > On Tue, Jun 16, 2020 at 09:49:25PM +, Anchal Agarwal wrote:
> > > > > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > > > > CAUTION: This email originated from outside of the organization. Do 
> > > > > > not click links or open attachments unless you can confirm the 
> > > > > > sender and know the content is safe.
> > > > > > On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
> > > > > > >  CAUTION: This email originated from outside of the organization. 
> > > > > > > Do not click links or open attachments unless you can confirm the 
> > > > > > > sender and know the content is safe.
> > > > > > > > + xenbus_dev_error(dev, err, "Freezing timed 
> > > > > > > out;"
> > > > > > > > +  "the device may become 
> > > > > > > inconsistent state");
> > > > > > >
> > > > > > > Leaving the device in this state is quite bad, as it's in a 
> > > > > > > closed
> > > > > > > state and with the queues frozen. You should make an attempt 
> > > > > > > to
> > > > > > > restore things to a working state.
> > > > > > >
> > > > > > > You mean if backend closed after timeout? Is there a way to know 
> > > > > > > that? I understand it's not good to
> > > > > > > leave it in this state however, I am still trying to find if 
> > > > > > > there is a good way to know if backend is still connected after 
> > > > > > > timeout.
> > > > > > > Hence the message " the device may become inconsistent state".  I 
> > > > > > > didn't see a timeout not even once on my end so that's why
> > > > > > > I may be looking for an alternate perspective here. may be need 
> > > > > > > to thaw everything back intentionally is one thing I could think 
> > > > > > > of.
> > > > > >
> > > > > > You can manually force this state, and then check that it will 
> > > > > > behave
> > > > > > correctly. I would expect that on a failure to disconnect from the
> > > > > > backend you should switch the frontend to the 'Init' state in order 
> > > > > > to
> > > > > > try to reconnect to the backend when possible.
> > > > > >
> > > > > From what I understand forcing manually is, failing the freeze without
> > > > > disconnect and try to revive the connection by unfreezing the
> > > > > queues->reconnecting to backend [which never got diconnected]. May be 
> > > > > even
> > > > > tearing down things manually because I am not sure what state will 
> > > > > frontend
> > > > > see if backend fails to to disconnect at any point in time. I assumed 
> > > > > connected.
> > > > > Then again if its "CONNECTED" I may not need to tear down everything 
> > > > > and start
> > > > > from Initialising state because that may not work.
> > > > >
> > > > > So I am not so sure about backend's state so much, lets say if  
> > > > > xen_blkif_disconnect fail,
> > > > > I don't see it getting handled in the backend then what will be 
> > > > > backend's state?
> > > > > Will it still switch xenbus state to 'Closed'? If not what will 
> > > > > frontend see,
> > > > > if it tries to read bac

Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-22 Thread Roger Pau Monné
On Fri, Jun 19, 2020 at 11:43:12PM +, Anchal Agarwal wrote:
> On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monné wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > 
> > 
> > 
> > On Tue, Jun 16, 2020 at 09:49:25PM +, Anchal Agarwal wrote:
> > > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > > CAUTION: This email originated from outside of the organization. Do not 
> > > > click links or open attachments unless you can confirm the sender and 
> > > > know the content is safe.
> > > > On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
> > > > >  CAUTION: This email originated from outside of the organization. Do 
> > > > > not click links or open attachments unless you can confirm the sender 
> > > > > and know the content is safe.
> > > > > > + xenbus_dev_error(dev, err, "Freezing timed out;"
> > > > > > +  "the device may become 
> > > > > inconsistent state");
> > > > >
> > > > > Leaving the device in this state is quite bad, as it's in a closed
> > > > > state and with the queues frozen. You should make an attempt to
> > > > > restore things to a working state.
> > > > >
> > > > > You mean if backend closed after timeout? Is there a way to know 
> > > > > that? I understand it's not good to
> > > > > leave it in this state however, I am still trying to find if there is 
> > > > > a good way to know if backend is still connected after timeout.
> > > > > Hence the message " the device may become inconsistent state".  I 
> > > > > didn't see a timeout not even once on my end so that's why
> > > > > I may be looking for an alternate perspective here. may be need to 
> > > > > thaw everything back intentionally is one thing I could think of.
> > > >
> > > > You can manually force this state, and then check that it will behave
> > > > correctly. I would expect that on a failure to disconnect from the
> > > > backend you should switch the frontend to the 'Init' state in order to
> > > > try to reconnect to the backend when possible.
> > > >
> > > From what I understand forcing manually is, failing the freeze without
> > > disconnect and try to revive the connection by unfreezing the
> > > queues->reconnecting to backend [which never got diconnected]. May be even
> > > tearing down things manually because I am not sure what state will 
> > > frontend
> > > see if backend fails to to disconnect at any point in time. I assumed 
> > > connected.
> > > Then again if its "CONNECTED" I may not need to tear down everything and 
> > > start
> > > from Initialising state because that may not work.
> > >
> > > So I am not so sure about backend's state so much, lets say if  
> > > xen_blkif_disconnect fail,
> > > I don't see it getting handled in the backend then what will be backend's 
> > > state?
> > > Will it still switch xenbus state to 'Closed'? If not what will frontend 
> > > see,
> > > if it tries to read backend's state through xenbus_read_driver_state ?
> > >
> > > So the flow be like:
> > > Front end marks XenbusStateClosing
> > > Backend marks its state as XenbusStateClosing
> > > Frontend marks XenbusStateClosed
> > > Backend disconnects calls xen_blkif_disconnect
> > >Backend fails to disconnect, the above function returns EBUSY
> > >What will be state of backend here?
> > 
> > Backend should stay in state 'Closing' then, until it can finish
> > tearing down.
> > 
> It disconnects the ring after switching to connected state too. 
> > >Frontend did not tear down the rings if backend does not switches 
> > > the
> > >state to 'Closed' in case of failure.
> > >
> > > If backend stays in CONNECTED state, then even if we mark it Initialised 
> > > in frontend, backend
> > 
> > Backend will stay in state 'Closing' I think.
> > 
> > > won't be calling connect(). {From reading code in frontend_changed}
> > > IMU, Initialising will fail since backend dev->state != XenbusStateClosed

Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-17 Thread Roger Pau Monné
On Tue, Jun 16, 2020 at 10:30:03PM +, Anchal Agarwal wrote:
> On Tue, Jun 16, 2020 at 09:49:25PM +, Anchal Agarwal wrote:
> > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
> > > > On Tue, May 19, 2020 at 11:27:50PM +, Anchal Agarwal wrote:
> > > > > From: Munehisa Kamata 
> > > > > + xenbus_dev_error(dev, err, "Freezing timed out;"
> > > > > +  "the device may become 
> > > > inconsistent state");
> > > >
> > > > Leaving the device in this state is quite bad, as it's in a closed
> > > > state and with the queues frozen. You should make an attempt to
> > > > restore things to a working state.
> > > >
> > > > You mean if backend closed after timeout? Is there a way to know that? 
> > > > I understand it's not good to
> > > > leave it in this state however, I am still trying to find if there is a 
> > > > good way to know if backend is still connected after timeout.
> > > > Hence the message " the device may become inconsistent state".  I 
> > > > didn't see a timeout not even once on my end so that's why
> > > > I may be looking for an alternate perspective here. may be need to thaw 
> > > > everything back intentionally is one thing I could think of.
> > > 
> > > You can manually force this state, and then check that it will behave
> > > correctly. I would expect that on a failure to disconnect from the
> > > backend you should switch the frontend to the 'Init' state in order to
> > > try to reconnect to the backend when possible.
> > > 
> > From what I understand forcing manually is, failing the freeze without
> > disconnect and try to revive the connection by unfreezing the
> > queues->reconnecting to backend [which never got diconnected]. May be even
> > tearing down things manually because I am not sure what state will frontend
> > see if backend fails to to disconnect at any point in time. I assumed 
> > connected.
> > Then again if its "CONNECTED" I may not need to tear down everything and 
> > start
> > from Initialising state because that may not work.
> > 
> > So I am not so sure about backend's state so much, lets say if  
> > xen_blkif_disconnect fail,
> > I don't see it getting handled in the backend then what will be backend's 
> > state?
> > Will it still switch xenbus state to 'Closed'? If not what will frontend 
> > see, 
> > if it tries to read backend's state through xenbus_read_driver_state ?
> > 
> > So the flow be like:
> > Front end marks XenbusStateClosing
> > Backend marks its state as XenbusStateClosing
> > Frontend marks XenbusStateClosed
> > Backend disconnects calls xen_blkif_disconnect
> >Backend fails to disconnect, the above function returns EBUSY
> >What will be state of backend here? 
> >Frontend did not tear down the rings if backend does not switches the
> >state to 'Closed' in case of failure.
> > 
> > If backend stays in CONNECTED state, then even if we mark it Initialised in 
> > frontend, backend
> > won't be calling connect(). {From reading code in frontend_changed}
> > IMU, Initialising will fail since backend dev->state != XenbusStateClosed 
> > plus
> > we did not tear down anything so calling talk_to_blkback may not be needed
> > 
> > Does that sound correct?
> Send that too quickly, I also meant to add XenBusIntialised state should be ok
> only if we expect backend will stay in "Connected" state. Also, I experimented
> with that notion. I am little worried about the correctness here. 
> Can the backend  come to an Unknown state somehow?

Not really, there's no such thing as an Unknown state.

There are no guarantees about what a backend can do really, so it
could indeed switch to a not recognized state, but that would be a
bug in the backend.

Roger.


Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-17 Thread Roger Pau Monné
On Tue, Jun 16, 2020 at 09:49:25PM +, Anchal Agarwal wrote:
> On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
> > >  CAUTION: This email originated from outside of the organization. Do not 
> > > click links or open attachments unless you can confirm the sender and 
> > > know the content is safe.
> > > > + xenbus_dev_error(dev, err, "Freezing timed out;"
> > > > +  "the device may become inconsistent 
> > > state");
> > >
> > > Leaving the device in this state is quite bad, as it's in a closed
> > > state and with the queues frozen. You should make an attempt to
> > > restore things to a working state.
> > >
> > > You mean if backend closed after timeout? Is there a way to know that? I 
> > > understand it's not good to
> > > leave it in this state however, I am still trying to find if there is a 
> > > good way to know if backend is still connected after timeout.
> > > Hence the message " the device may become inconsistent state".  I didn't 
> > > see a timeout not even once on my end so that's why
> > > I may be looking for an alternate perspective here. may be need to thaw 
> > > everything back intentionally is one thing I could think of.
> > 
> > You can manually force this state, and then check that it will behave
> > correctly. I would expect that on a failure to disconnect from the
> > backend you should switch the frontend to the 'Init' state in order to
> > try to reconnect to the backend when possible.
> > 
> From what I understand forcing manually is, failing the freeze without
> disconnect and try to revive the connection by unfreezing the
> queues->reconnecting to backend [which never got diconnected]. May be even
> tearing down things manually because I am not sure what state will frontend
> see if backend fails to to disconnect at any point in time. I assumed 
> connected.
> Then again if its "CONNECTED" I may not need to tear down everything and start
> from Initialising state because that may not work.
> 
> So I am not so sure about backend's state so much, lets say if  
> xen_blkif_disconnect fail,
> I don't see it getting handled in the backend then what will be backend's 
> state?
> Will it still switch xenbus state to 'Closed'? If not what will frontend see, 
> if it tries to read backend's state through xenbus_read_driver_state ?
> 
> So the flow be like:
> Front end marks XenbusStateClosing
> Backend marks its state as XenbusStateClosing
> Frontend marks XenbusStateClosed
> Backend disconnects calls xen_blkif_disconnect
>Backend fails to disconnect, the above function returns EBUSY
>What will be state of backend here?

Backend should stay in state 'Closing' then, until it can finish
tearing down.

>Frontend did not tear down the rings if backend does not switches the
>state to 'Closed' in case of failure.
> 
> If backend stays in CONNECTED state, then even if we mark it Initialised in 
> frontend, backend

Backend will stay in state 'Closing' I think.

> won't be calling connect(). {From reading code in frontend_changed}
> IMU, Initialising will fail since backend dev->state != XenbusStateClosed plus
> we did not tear down anything so calling talk_to_blkback may not be needed
> 
> Does that sound correct?

I think switching to the initial state in order to try to attempt a
reconnection would be our best bet here.

Thanks, Roger.


Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-04 Thread Roger Pau Monné
Hello,

On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
>  CAUTION: This email originated from outside of the organization. Do not 
> click links or open attachments unless you can confirm the sender and know 
> the content is safe.
> 
> 
> 
> On Tue, May 19, 2020 at 11:27:50PM +, Anchal Agarwal wrote:
> > From: Munehisa Kamata 
> > 
> > S4 power transition states are much different than xen
> > suspend/resume. Former is visible to the guest and frontend drivers 
> should
> > be aware of the state transitions and should be able to take appropriate
> > actions when needed. In transition to S4 we need to make sure that at 
> least
> > all the in-flight blkif requests get completed, since they probably 
> contain
> > bits of the guest's memory image and that's not going to get saved any
> > other way. Hence, re-issuing of in-flight requests as in case of xen 
> resume
> > will not work here. This is in contrast to xen-suspend where we need to
> > freeze with as little processing as possible to avoid dirtying RAM late 
> in
> > the migration cycle and we know that in-flight data can wait.
> > 
> > Add freeze, thaw and restore callbacks for PM suspend and hibernation
> > support. All frontend drivers that needs to use 
> PM_HIBERNATION/PM_SUSPEND
> > events, need to implement these xenbus_driver callbacks. The freeze 
> handler
> > stops block-layer queue and disconnect the frontend from the backend 
> while
> > freeing ring_info and associated resources. Before disconnecting from 
> the
> > backend, we need to prevent any new IO from being queued and wait for 
> existing
> > IO to complete. Freeze/unfreeze of the queues will guarantee that there 
> are no
> > requests in use on the shared ring. However, for sanity we should check
> > state of the ring before disconnecting to make sure that there are no
> > outstanding requests to be processed on the ring. The restore handler
> > re-allocates ring_info, unquiesces and unfreezes the queue and 
> re-connect to
> > the backend, so that rest of the kernel can continue to use the block 
> device
> > transparently.
> > 
> > Note:For older backends,if a backend doesn't have commit'12ea729645ace'
> > xen/blkback: unmap all persistent grants when frontend gets 
> disconnected,
> > the frontend may see massive amount of grant table warning when freeing
> > resources.
> > [   36.852659] deferring g.e. 0xf9 (pfn 0x)
> > [   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!
> > 
> > In this case, persistent grants would need to be disabled.
> > 
> > [Anchal Changelog: Removed timeout/request during blkfront freeze.
> > Reworked the whole patch to work with blk-mq and incorporate upstream's
> > comments]
> 
> Please tag versions using vX and it would be helpful if you could list
> the specific changes that you performed between versions. There where
> 3 RFC versions IIRC, and there's no log of the changes between them.
> 
> I will elaborate on "upstream's comments" in my changelog in my next round of 
> patches.

Sorry for being picky, but can you please make sure your email client
properly quotes previous emails on reply. Note the lack of '>' added
to the quoted parts of your reply.

> > + }
> > +
> >   break;
> > + }
> > +
> > + /*
> > +  * We may somehow receive backend's Closed again while 
> thawing
> > +  * or restoring and it causes thawing or restoring to 
> fail.
> > +  * Ignore such unexpected state regardless of the backend 
> state.
> > +  */
> > + if (info->connected == BLKIF_STATE_FROZEN) {
> 
> I think you can join this with the previous dev->state == 
> XenbusStateClosed?
> 
> Also, won't the device be in the Closed state already if it's in state
> frozen?
> Yes but I think this mostly due to a hypothetical case if during thawing 
> backend switches to Closed state.
> I am not entirely sure if that could happen. Could use some expertise here.

I think the frontend seeing the backend in the closed state during
restore would be a bug that should prevent the frontend from
resuming.

> > + /* Kick the backend to disconnect */
> > + xenbus_switch_state(dev, XenbusStateClosing);
> > +
> > + /*
> > +  * We don't want to move forward before the frontend is 
> diconnected
> > +  * from the backend cleanly.
> > +  */
> > + timeout = 
> wait_for_completion_timeout(>wait_backend_disconnected,
> > +   timeout);
> > + if (!timeout) {
> > + err = -EBUSY;
> 
> Note err is only used here, and I think could just be dropped.
> 
> This err is what's being 

Re: [BOOTLOADER SPECIFICATION RFC] The bootloader log format for TrenchBoot and others

2020-06-01 Thread Roger Pau Monné
On Fri, May 29, 2020 at 01:27:35PM +0200, Daniel Kiper wrote:
> Hey,
> 
> Below you can find my rough idea of the bootloader log format which is
> generic thing but initially will be used for TrenchBoot work. I discussed
> this proposal with Ross and Daniel S. So, the idea went through initial
> sanitization. Now I would like to take feedback from other folks too.
> So, please take a look and complain...
> 
> In general we want to pass the messages produced by the bootloader to the OS
> kernel and finally to the userspace for further processing and analysis. Below
> is the description of the structures which will be used for this thing.
> 
>   struct bootloader_log_msgs
>   {
> grub_uint32_t level;
> grub_uint32_t facility;

Nit: if this is aimed at cross-OS and cross-bootloader compatibility
uint32_t should be used here instead of a grub specific alias.

> char type[];
> char msg[];

I think you want char * here instead? Or are the above arrays expected
to have a fixed size in the final spec?

>   }
> 
>   struct bootloader_log
>   {
> grub_uint32_t version;
> grub_uint32_t producer;
> grub_uint32_t size;
> grub_uint32_t next_off;
> bootloader_log_msgs msgs[];

As I understand it this is not a pointer to an array of
bootloader_log_msgs but rather in-place?

>   }
> 
> The members of struct bootloader_log:
>   - version: the bootloader log format version number, 1 for now,
>   - producer: the producer/bootloader type; we can steal some values from
> linux/Documentation/x86/boot.rst:type_of_loader,
>   - size: total size of the log buffer including the bootloader_log struct,
>   - next_off: offset in bytes, from start of the bootloader_log struct,
> of the next byte after the last log message in the msgs[];
> i.e. the offset of the next available log message slot,

So this will be a memory area that's shared between the OS and the
bootloader and needs to be updated at runtime?

If this is something that's created by the bootloader during loading
and passed to the OS for consumption (but it's not further updated), I
don't see much point in the next_off field. The size field could be
updated to reflect the actual size of the produced messages?

Roger.


Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-05-28 Thread Roger Pau Monné
On Tue, May 19, 2020 at 11:27:50PM +, Anchal Agarwal wrote:
> From: Munehisa Kamata 
> 
> S4 power transition states are much different than xen
> suspend/resume. Former is visible to the guest and frontend drivers should
> be aware of the state transitions and should be able to take appropriate
> actions when needed. In transition to S4 we need to make sure that at least
> all the in-flight blkif requests get completed, since they probably contain
> bits of the guest's memory image and that's not going to get saved any
> other way. Hence, re-issuing of in-flight requests as in case of xen resume
> will not work here. This is in contrast to xen-suspend where we need to
> freeze with as little processing as possible to avoid dirtying RAM late in
> the migration cycle and we know that in-flight data can wait.
> 
> Add freeze, thaw and restore callbacks for PM suspend and hibernation
> support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
> events, need to implement these xenbus_driver callbacks. The freeze handler
> stops block-layer queue and disconnect the frontend from the backend while
> freeing ring_info and associated resources. Before disconnecting from the
> backend, we need to prevent any new IO from being queued and wait for existing
> IO to complete. Freeze/unfreeze of the queues will guarantee that there are no
> requests in use on the shared ring. However, for sanity we should check
> state of the ring before disconnecting to make sure that there are no
> outstanding requests to be processed on the ring. The restore handler
> re-allocates ring_info, unquiesces and unfreezes the queue and re-connect to
> the backend, so that rest of the kernel can continue to use the block device
> transparently.
> 
> Note:For older backends,if a backend doesn't have commit'12ea729645ace'
> xen/blkback: unmap all persistent grants when frontend gets disconnected,
> the frontend may see massive amount of grant table warning when freeing
> resources.
> [   36.852659] deferring g.e. 0xf9 (pfn 0x)
> [   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!
> 
> In this case, persistent grants would need to be disabled.
> 
> [Anchal Changelog: Removed timeout/request during blkfront freeze.
> Reworked the whole patch to work with blk-mq and incorporate upstream's
> comments]

Please tag versions using vX and it would be helpful if you could list
the specific changes that you performed between versions. There where
3 RFC versions IIRC, and there's no log of the changes between them.

> 
> Signed-off-by: Anchal Agarwal 
> Signed-off-by: Munehisa Kamata 
> ---
>  drivers/block/xen-blkfront.c | 122 +--
>  1 file changed, 115 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 3b889ea950c2..464863ed7093 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -48,6 +48,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -80,6 +82,8 @@ enum blkif_state {
>   BLKIF_STATE_DISCONNECTED,
>   BLKIF_STATE_CONNECTED,
>   BLKIF_STATE_SUSPENDED,
> + BLKIF_STATE_FREEZING,
> + BLKIF_STATE_FROZEN

Nit: adding a terminating ',' would prevent further additions from
having to modify this line.

>  };
>  
>  struct grant {
> @@ -219,6 +223,7 @@ struct blkfront_info
>   struct list_head requests;
>   struct bio_list bio_list;
>   struct list_head info_list;
> + struct completion wait_backend_disconnected;
>  };
>  
>  static unsigned int nr_minors;
> @@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
> sector_size,
>   info->sector_size = sector_size;
>   info->physical_sector_size = physical_sector_size;
>   blkif_set_queue_limits(info);
> + init_completion(>wait_backend_disconnected);
>  
>   return 0;
>  }
> @@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, 
> unsigned int *offset)
>   case XEN_SCSI_DISK5_MAJOR:
>   case XEN_SCSI_DISK6_MAJOR:
>   case XEN_SCSI_DISK7_MAJOR:
> - *offset = (*minor / PARTS_PER_DISK) + 
> + *offset = (*minor / PARTS_PER_DISK) +
>   ((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
>   EMULATED_SD_DISK_NAME_OFFSET;
>   *minor = *minor +
> @@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, 
> unsigned int *offset)
>   case XEN_SCSI_DISK13_MAJOR:
>   case XEN_SCSI_DISK14_MAJOR:
>   case XEN_SCSI_DISK15_MAJOR:
> - *offset = (*minor / PARTS_PER_DISK) + 
> + *offset = (*minor / PARTS_PER_DISK) +

Unrelated changes, please split to a pre-patch.

>   ((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
>  

Re: [Xen-devel] xen/evtchn and forced threaded irq

2019-02-26 Thread Roger Pau Monné
On Tue, Feb 26, 2019 at 10:26:21AM +, Julien Grall wrote:
> Hi,
> 
> On 26/02/2019 10:17, Roger Pau Monné wrote:
> > On Tue, Feb 26, 2019 at 10:03:38AM +, Julien Grall wrote:
> > > Hi Roger,
> > > 
> > > On 26/02/2019 09:44, Roger Pau Monné wrote:
> > > > On Tue, Feb 26, 2019 at 09:30:07AM +, Andrew Cooper wrote:
> > > > > On 26/02/2019 09:14, Roger Pau Monné wrote:
> > > > > > On Mon, Feb 25, 2019 at 01:55:42PM +, Julien Grall wrote:
> > > > > > > Hi Oleksandr,
> > > > > > > 
> > > > > > > On 25/02/2019 13:24, Oleksandr Andrushchenko wrote:
> > > > > > > > On 2/22/19 3:33 PM, Julien Grall wrote:
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > On 22/02/2019 12:38, Oleksandr Andrushchenko wrote:
> > > > > > > > > > On 2/20/19 10:46 PM, Julien Grall wrote:
> > > > > > > > > > > Discussing with my team, a solution that came up would be 
> > > > > > > > > > > to
> > > > > > > > > > > introduce one atomic field per event to record the number 
> > > > > > > > > > > of
> > > > > > > > > > > event received. I will explore that solution tomorrow.
> > > > > > > > > > How will this help if events have some payload?
> > > > > > > > > What payload? The event channel does not carry any payload. 
> > > > > > > > > It only
> > > > > > > > > notify you that something happen. Then this is up to the user 
> > > > > > > > > to
> > > > > > > > > decide what to you with it.
> > > > > > > > Sorry, I was probably not precise enough. I mean that an event 
> > > > > > > > might have
> > > > > > > > associated payload in the ring buffer, for example [1]. So, 
> > > > > > > > counting events
> > > > > > > > may help somehow, but the ring's data may still be lost
> > > > > > >   From my understanding of event channels are edge interrupts. By 
> > > > > > > definition,
> > > > > > IMO event channels are active high level interrupts.
> > > > > > 
> > > > > > Let's take into account the following situation: you have an event
> > > > > > channel masked and the event channel pending bit (akin to the line 
> > > > > > on
> > > > > > bare metal) goes from low to high (0 -> 1), then you unmask the
> > > > > > interrupt and you get an event injected. If it was an edge interrupt
> > > > > > you wont get an event injected after unmasking, because you would
> > > > > > have lost the edge. I think the problem here is that Linux treats
> > > > > > event channels as edge interrupts, when they are actually level.
> > > > > 
> > > > > Event channels are edge interrupts.  There are several very subtle 
> > > > > bugs
> > > > > to be had by software which treats them as line interrupts.
> > > > > 
> > > > > Most critically, if you fail to ack them, rebind them to a new vcpu, 
> > > > > and
> > > > > reenable interrupts, you don't get a new interrupt notification.  This
> > > > > was the source of a 4 month bug when XenServer was moving from
> > > > > classic-xen to PVOps where using irqbalance would cause dom0 to
> > > > > occasionally lose interrupts.
> > > > 
> > > > I would argue that you need to mask them first, rebind to a new vcpu
> > > > and unmask, and then you will get an interrupt notification, or this
> > > > should be fixed in Xen to work as you expect: trigger an interrupt
> > > > notification when moving an asserted event channel between CPUs.
> > > > 
> > > > Is there any document that describes how such non trivial things (like
> > > > moving between CPUs) work for event/level interrupts?
> > > > 
> > > > Maybe I'm being obtuse, but from the example I gave above it's quite
> > > > clear to me event channels don't get triggered based on edge changes,
> > > > but rather on the line level.
> > > 
> > > Your example above is not enough to give the semantics of level. You would
> > > only use the MASK bit if your interrupt handler is threaded to avoid the
> > > interrupt coming up again.
> > > 
> > > So if you remove the mask from the equation, then the interrupt flow 
> > > should be:
> > > 
> > > 1) handle interrupt
> > > 2) EOI
> > 
> > This is bogus if you don't mask the interrupt source. You should
> > instead do
> > 
> > 1) EOI
> > 2) Handle interrupt
> > 
> > And loop over this.
> So that's not a level semantics. It is a edge one :). In the level case, you
> would clear the state once you are done with the interrupt.
> 
> Also, it would be ACK and not EOI.

For level triggered interrupts you have to somehow signal the device
to stop asserting the line, which doesn't happen for Xen devices
because they just signal interrupts to Xen, but don't have a way to
keep event channels asserted, so I agree that this is different from
traditional level interrupts because devices using event channels
don't have a way to keep lines asserted.

I guess the most similar native interrupt is MSI with masking
support?

Roger.


Re: [Xen-devel] xen/evtchn and forced threaded irq

2019-02-26 Thread Roger Pau Monné
On Tue, Feb 26, 2019 at 10:03:38AM +, Julien Grall wrote:
> Hi Roger,
> 
> On 26/02/2019 09:44, Roger Pau Monné wrote:
> > On Tue, Feb 26, 2019 at 09:30:07AM +, Andrew Cooper wrote:
> > > On 26/02/2019 09:14, Roger Pau Monné wrote:
> > > > On Mon, Feb 25, 2019 at 01:55:42PM +, Julien Grall wrote:
> > > > > Hi Oleksandr,
> > > > > 
> > > > > On 25/02/2019 13:24, Oleksandr Andrushchenko wrote:
> > > > > > On 2/22/19 3:33 PM, Julien Grall wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > On 22/02/2019 12:38, Oleksandr Andrushchenko wrote:
> > > > > > > > On 2/20/19 10:46 PM, Julien Grall wrote:
> > > > > > > > > Discussing with my team, a solution that came up would be to
> > > > > > > > > introduce one atomic field per event to record the number of
> > > > > > > > > event received. I will explore that solution tomorrow.
> > > > > > > > How will this help if events have some payload?
> > > > > > > What payload? The event channel does not carry any payload. It 
> > > > > > > only
> > > > > > > notify you that something happen. Then this is up to the user to
> > > > > > > decide what to you with it.
> > > > > > Sorry, I was probably not precise enough. I mean that an event 
> > > > > > might have
> > > > > > associated payload in the ring buffer, for example [1]. So, 
> > > > > > counting events
> > > > > > may help somehow, but the ring's data may still be lost
> > > > >  From my understanding of event channels are edge interrupts. By 
> > > > > definition,
> > > > IMO event channels are active high level interrupts.
> > > > 
> > > > Let's take into account the following situation: you have an event
> > > > channel masked and the event channel pending bit (akin to the line on
> > > > bare metal) goes from low to high (0 -> 1), then you unmask the
> > > > interrupt and you get an event injected. If it was an edge interrupt
> > > > you wont get an event injected after unmasking, because you would
> > > > have lost the edge. I think the problem here is that Linux treats
> > > > event channels as edge interrupts, when they are actually level.
> > > 
> > > Event channels are edge interrupts.  There are several very subtle bugs
> > > to be had by software which treats them as line interrupts.
> > > 
> > > Most critically, if you fail to ack them, rebind them to a new vcpu, and
> > > reenable interrupts, you don't get a new interrupt notification.  This
> > > was the source of a 4 month bug when XenServer was moving from
> > > classic-xen to PVOps where using irqbalance would cause dom0 to
> > > occasionally lose interrupts.
> > 
> > I would argue that you need to mask them first, rebind to a new vcpu
> > and unmask, and then you will get an interrupt notification, or this
> > should be fixed in Xen to work as you expect: trigger an interrupt
> > notification when moving an asserted event channel between CPUs.
> > 
> > Is there any document that describes how such non trivial things (like
> > moving between CPUs) work for event/level interrupts?
> > 
> > Maybe I'm being obtuse, but from the example I gave above it's quite
> > clear to me event channels don't get triggered based on edge changes,
> > but rather on the line level.
> 
> Your example above is not enough to give the semantics of level. You would
> only use the MASK bit if your interrupt handler is threaded to avoid the
> interrupt coming up again.
> 
> So if you remove the mask from the equation, then the interrupt flow should 
> be:
> 
> 1) handle interrupt
> 2) EOI

This is bogus if you don't mask the interrupt source. You should
instead do

1) EOI
2) Handle interrupt

And loop over this.

> The EOI in our case would be clearing the PENDING state. In a proper level
> interrupt, the state would stay PENDING if there were more to come. This is
> not the case with the events and you will lose the interrupt.
>
> So I don't think they are proper level interrupts. They have more a
> semantics of edge interrupts with some property of level (i.e for the
> mask/unmask).

OK, I guess it depends on how you look at it, to me event channels are
maybe quirky level interrupts, but are definitely closer to level than
edge interrupts, specially taking into account the interrupt injection
that happens on unmask of a pending line, there's no such thing at all
with edge interrupts.

Thanks, Roger.


Re: [Xen-devel] xen/evtchn and forced threaded irq

2019-02-26 Thread Roger Pau Monné
On Tue, Feb 26, 2019 at 09:30:07AM +, Andrew Cooper wrote:
> On 26/02/2019 09:14, Roger Pau Monné wrote:
> > On Mon, Feb 25, 2019 at 01:55:42PM +, Julien Grall wrote:
> >> Hi Oleksandr,
> >>
> >> On 25/02/2019 13:24, Oleksandr Andrushchenko wrote:
> >>> On 2/22/19 3:33 PM, Julien Grall wrote:
> >>>> Hi,
> >>>>
> >>>> On 22/02/2019 12:38, Oleksandr Andrushchenko wrote:
> >>>>> On 2/20/19 10:46 PM, Julien Grall wrote:
> >>>>>> Discussing with my team, a solution that came up would be to
> >>>>>> introduce one atomic field per event to record the number of
> >>>>>> event received. I will explore that solution tomorrow.
> >>>>> How will this help if events have some payload?
> >>>> What payload? The event channel does not carry any payload. It only
> >>>> notify you that something happen. Then this is up to the user to
> >>>> decide what to you with it.
> >>> Sorry, I was probably not precise enough. I mean that an event might have
> >>> associated payload in the ring buffer, for example [1]. So, counting 
> >>> events
> >>> may help somehow, but the ring's data may still be lost
> >> From my understanding of event channels are edge interrupts. By definition,
> > IMO event channels are active high level interrupts.
> >
> > Let's take into account the following situation: you have an event
> > channel masked and the event channel pending bit (akin to the line on
> > bare metal) goes from low to high (0 -> 1), then you unmask the
> > interrupt and you get an event injected. If it was an edge interrupt
> > you wont get an event injected after unmasking, because you would
> > have lost the edge. I think the problem here is that Linux treats
> > event channels as edge interrupts, when they are actually level.
> 
> Event channels are edge interrupts.  There are several very subtle bugs
> to be had by software which treats them as line interrupts.
> 
> Most critically, if you fail to ack them, rebind them to a new vcpu, and
> reenable interrupts, you don't get a new interrupt notification.  This
> was the source of a 4 month bug when XenServer was moving from
> classic-xen to PVOps where using irqbalance would cause dom0 to
> occasionally lose interrupts.

I would argue that you need to mask them first, rebind to a new vcpu
and unmask, and then you will get an interrupt notification, or this
should be fixed in Xen to work as you expect: trigger an interrupt
notification when moving an asserted event channel between CPUs.

Is there any document that describes how such non trivial things (like
moving between CPUs) work for event/level interrupts?

Maybe I'm being obtuse, but from the example I gave above it's quite
clear to me event channels don't get triggered based on edge changes,
but rather on the line level.

Thanks, Roger.


Re: [Xen-devel] xen/evtchn and forced threaded irq

2019-02-26 Thread Roger Pau Monné
On Mon, Feb 25, 2019 at 01:55:42PM +, Julien Grall wrote:
> Hi Oleksandr,
> 
> On 25/02/2019 13:24, Oleksandr Andrushchenko wrote:
> > On 2/22/19 3:33 PM, Julien Grall wrote:
> > > Hi,
> > > 
> > > On 22/02/2019 12:38, Oleksandr Andrushchenko wrote:
> > > > On 2/20/19 10:46 PM, Julien Grall wrote:
> > > > > Discussing with my team, a solution that came up would be to
> > > > > introduce one atomic field per event to record the number of
> > > > > event received. I will explore that solution tomorrow.
> > > > How will this help if events have some payload?
> > > 
> > > What payload? The event channel does not carry any payload. It only
> > > notify you that something happen. Then this is up to the user to
> > > decide what to you with it.
> > Sorry, I was probably not precise enough. I mean that an event might have
> > associated payload in the ring buffer, for example [1]. So, counting events
> > may help somehow, but the ring's data may still be lost
> 
> From my understanding of event channels are edge interrupts. By definition,

IMO event channels are active high level interrupts.

Let's take into account the following situation: you have an event
channel masked and the event channel pending bit (akin to the line on
bare metal) goes from low to high (0 -> 1), then you unmask the
interrupt and you get an event injected. If it was an edge interrupt
you wont get an event injected after unmasking, because you would
have lost the edge. I think the problem here is that Linux treats
event channels as edge interrupts, when they are actually level.

Roger.


Re: [Xen-devel] xen/evtchn and forced threaded irq

2019-02-21 Thread Roger Pau Monné
On Thu, Feb 21, 2019 at 08:38:39AM +, Julien Grall wrote:
> Hi Roger,
> 
> On Thu, 21 Feb 2019, 08:08 Roger Pau Monné,  wrote:
> 
> > FWIW, you can also mask the interrupt while waiting for the thread to
> > execute the interrupt handler. Ie:
> >
> 
> Thank you for providing steps, however where would the masking be done? By
> the irqchip or a custom solution?

I'm not familiar with the irqchip infrastructure in Linux, what I
proposed below is what FreeBSD does when running interrupt handlers in
deferred threads IIRC.

If irqchip has a specific handler to dispatch to a thread, then that's
the place where the masking should happen. Likely, the unmasking
should be done by the irq handling infrastructure after the thread
executing the interrupt handler has finished.

Isn't there a similar way to handle interrupts in threads for Linux?

> 
> > 1. Interrupt injected
> > 2. Execute guest event channel callback
> > 3. Scan for pending interrupts
> > 4. Mask interrupt
> > 5. Clear pending field
> > 6. Queue threaded handler
> > 7. Go to 3 until all interrupts are drained
> > [...]
> > 8. Execute interrupt handler in thread
> > 9. Unmask interrupt
> >
> > That should prevent you from stacking interrupts?
> >
> > Roger.
> >


Re: [Xen-devel] xen/evtchn and forced threaded irq

2019-02-21 Thread Roger Pau Monné
On Wed, Feb 20, 2019 at 10:03:57PM +, Julien Grall wrote:
> Hi Boris,
> 
> On 2/20/19 9:46 PM, Boris Ostrovsky wrote:
> > On 2/20/19 3:46 PM, Julien Grall wrote:
> > > (+ Andrew and Jan for feedback on the event channel interrupt)
> > > 
> > > Hi Boris,
> > > 
> > > Thank you for the your feedback.
> > > 
> > > On 2/20/19 8:04 PM, Boris Ostrovsky wrote:
> > > > On 2/20/19 1:05 PM, Julien Grall wrote:
> > > > > Hi,
> > > > > 
> > > > > On 20/02/2019 17:07, Boris Ostrovsky wrote:
> > > > > > On 2/20/19 9:15 AM, Julien Grall wrote:
> > > > > > > Hi Boris,
> > > > > > > 
> > > > > > > Thank you for your answer.
> > > > > > > 
> > > > > > > On 20/02/2019 00:02, Boris Ostrovsky wrote:
> > > > > > > > On Tue, Feb 19, 2019 at 05:31:10PM +, Julien Grall wrote:
> > > > > > > > > Hi all,
> > > > > > > > > 
> > > > > > > > > I have been looking at using Linux RT in Dom0. Once the guest 
> > > > > > > > > is
> > > > > > > > > started,
> > > > > > > > > the console is ending to have a lot of warning (see trace 
> > > > > > > > > below).
> > > > > > > > > 
> > > > > > > > > After some investigation, this is because the irq handler 
> > > > > > > > > will now
> > > > > > > > > be threaded.
> > > > > > > > > I can reproduce the same error with the vanilla Linux when 
> > > > > > > > > passing
> > > > > > > > > the option
> > > > > > > > > 'threadirqs' on the command line (the trace below is from 
> > > > > > > > > 5.0.0-rc7
> > > > > > > > > that has
> > > > > > > > > not RT support).
> > > > > > > > > 
> > > > > > > > > FWIW, the interrupt for port 6 is used to for the guest to
> > > > > > > > > communicate with
> > > > > > > > > xenstore.
> > > > > > > > > 
> > > > > > > > >     From my understanding, this is happening because the 
> > > > > > > > > interrupt
> > > > > > > > > handler is now
> > > > > > > > > run in a thread. So we can have the following happening.
> > > > > > > > > 
> > > > > > > > >    Interrupt context    | Interrupt thread
> > > > > > > > >     |
> > > > > > > > >    receive interrupt port 6 |
> > > > > > > > >    clear the evtchn port    |
> > > > > > > > >    set IRQF_RUNTHREAD    |
> > > > > > > > >    kick interrupt thread    |
> > > > > > > > >     |    clear IRQF_RUNTHREAD
> > > > > > > > >     |    call evtchn_interrupt
> > > > > > > > >    receive interrupt port 6 |
> > > > > > > > >    clear the evtchn port    |
> > > > > > > > >    set IRQF_RUNTHREAD   |
> > > > > > > > >    kick interrupt thread    |
> > > > > > > > >     |    disable interrupt 
> > > > > > > > > port 6
> > > > > > > > >     |    evtchn->enabled = 
> > > > > > > > > false
> > > > > > > > >     |    []
> > > > > > > > >     |
> > > > > > > > >     |    *** Handling the 
> > > > > > > > > second
> > > > > > > > > interrupt ***
> > > > > > > > >     |    clear IRQF_RUNTHREAD
> > > > > > > > >     |    call evtchn_interrupt
> > > > > > > > >     |    WARN(...)
> > > > > > > > > 
> > > > > > > > > I am not entirely sure how to fix this. I have two solutions 
> > > > > > > > > in
> > > > > > > > > mind:
> > > > > > > > > 
> > > > > > > > > 1) Prevent the interrupt handler to be threaded. We would also
> > > > > > > > > need to
> > > > > > > > > switch from spin_lock to raw_spin_lock as the former may 
> > > > > > > > > sleep on
> > > > > > > > > RT-Linux.
> > > > > > > > > 
> > > > > > > > > 2) Remove the warning
> > > > > > > > 
> > > > > > > > I think access to evtchn->enabled is racy so (with or without 
> > > > > > > > the
> > > > > > > > warning) we can't use it reliably.
> > > > > > > 
> > > > > > > Thinking about it, it would not be the only issue. The ring is 
> > > > > > > sized
> > > > > > > to contain only one instance of the same event. So if you receive
> > > > > > > twice the event, you may overflow the ring.
> > > > > > 
> > > > > > Hm... That's another argument in favor of "unthreading" the handler.
> > > > > 
> > > > > I first thought it would be possible to unthread it. However,
> > > > > wake_up_interruptible is using a spin_lock. On RT spin_lock can sleep,
> > > > > so this cannot be used in an interrupt context.
> > > > > 
> > > > > So I think "unthreading" the handler is not an option here.
> > > > 
> > > > That sounds like a different problem. I.e. there are two issues:
> > > > * threaded interrupts don't work properly (races, ring overflow)
> > > > * evtchn_interrupt() (threaded or not) has spin_lock(), which is not
> > > > going to work for RT
> > > 
> > > I am afraid that's not correct, you can use spin_lock() in threaded
> > > interrupt handler.
> > 
> > In 

Re: [Xen-devel] [PATCH 1/1] xen-blkback: do not wake up shutdown_wq after xen_blkif_schedule() is stopped

2019-01-17 Thread Roger Pau Monné
On Thu, Jan 17, 2019 at 10:10:00AM +0800, Dongli Zhang wrote:
> Hi Roger,
> 
> On 2019/1/16 下午10:52, Roger Pau Monné wrote:
> > On Wed, Jan 16, 2019 at 09:47:41PM +0800, Dongli Zhang wrote:
> >> There is no need to wake up xen_blkif_schedule() as kthread_stop() is able
> >> to already wake up the kernel thread.
> >>
> >> Signed-off-by: Dongli Zhang 
> >> ---
> >>  drivers/block/xen-blkback/xenbus.c | 4 +---
> >>  1 file changed, 1 insertion(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkback/xenbus.c 
> >> b/drivers/block/xen-blkback/xenbus.c
> >> index a4bc74e..37ac59e 100644
> >> --- a/drivers/block/xen-blkback/xenbus.c
> >> +++ b/drivers/block/xen-blkback/xenbus.c
> >> @@ -254,10 +254,8 @@ static int xen_blkif_disconnect(struct xen_blkif 
> >> *blkif)
> >>if (!ring->active)
> >>continue;
> >>  
> >> -  if (ring->xenblkd) {
> >> +  if (ring->xenblkd)
> >>kthread_stop(ring->xenblkd);
> >> -  wake_up(>shutdown_wq);
> > 
> > I've now realized that shutdown_wq is basically useless, and should be
> > removed, could you please do it in this patch?
> 
> I do not think shutdown_wq is useless.
> 
> It is used to halt the xen_blkif_schedule() kthread when
> RING_REQUEST_PROD_OVERFLOW() returns true in __do_block_io_op():
> 
> 1121 static int
> 1122 __do_block_io_op(struct xen_blkif_ring *ring)
> ... ...
> 1134 if (RING_REQUEST_PROD_OVERFLOW(_rings->common, rp)) {
> 1135 rc = blk_rings->common.rsp_prod_pvt;
> 1136 pr_warn("Frontend provided bogus ring requests (%d - %d =
> %d). Halting ring processing on dev=%04x\n",
> 1137 rp, rc, rp - rc, ring->blkif->vbd.pdevice);
> 1138 return -EACCES;
> 1139 }
> 
> 
> If there is bogus/invalid ring requests, __do_block_io_op() would return 
> -EACCES
> without modifying prod/cons index.
> 
> Without shutdown_wq (just simply assuming we remove the below code without
> handling -EACCES in xen_blkif_schedule()), the kernel thread would continue 
> the
> while loop.
> 
> 648 if (ret == -EACCES)
> 649 wait_event_interruptible(ring->shutdown_wq,
> 650  kthread_should_stop());
> 
> 
> If xen_blkif_be_int() is triggered again (let's assume there is no 
> optimization
> on guest part and guest would send event for every request it puts on ring
> buffer), we may come to do_block_io_op() again.
> 
> 
> As the prod/cons index are not modified last time the code runs into
> do_block_io_op() to process bogus request, we would hit the bogus request 
> issue
> again.
> 
> 
> With shutdown_wq, the kernel kthread is blocked forever until such queue/ring 
> is
> destroyed. If we remove shutdown_wq, we are changing the policy to handle 
> bogus
> requests on ring buffer?

AFAICT the only wakeup call to shutdown_wq is removed in this patch,
hence waiting on it seems useless. I would replace the
wait_event_interruptible call in xen_blkif_schedule with a break, so
that the kthread ends as soon as a bogus request is found. I think
there's no point in waiting for xen_blkif_disconnect to stop the
kthread.

Thanks, Roger.


Re: [PATCH 1/1] xen-blkback: do not wake up shutdown_wq after xen_blkif_schedule() is stopped

2019-01-16 Thread Roger Pau Monné
On Wed, Jan 16, 2019 at 09:47:41PM +0800, Dongli Zhang wrote:
> There is no need to wake up xen_blkif_schedule() as kthread_stop() is able
> to already wake up the kernel thread.
> 
> Signed-off-by: Dongli Zhang 
> ---
>  drivers/block/xen-blkback/xenbus.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index a4bc74e..37ac59e 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -254,10 +254,8 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
>   if (!ring->active)
>   continue;
>  
> - if (ring->xenblkd) {
> + if (ring->xenblkd)
>   kthread_stop(ring->xenblkd);
> - wake_up(>shutdown_wq);

I've now realized that shutdown_wq is basically useless, and should be
removed, could you please do it in this patch?

Thanks, Roger.


Re: [PATCH 1/1] xen-blkback: do not wake up shutdown_wq after xen_blkif_schedule() is stopped

2019-01-16 Thread Roger Pau Monné
On Wed, Jan 16, 2019 at 09:47:41PM +0800, Dongli Zhang wrote:
> There is no need to wake up xen_blkif_schedule() as kthread_stop() is able
> to already wake up the kernel thread.
> 
> Signed-off-by: Dongli Zhang 

Reviewed-by: Roger Pau Monné 

kthread_stop waits for the thread to exit, so it must obviously wake
it up.

Thanks, Roger.


Re: [Xen-devel] [PATCH v6 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-15 Thread Roger Pau Monné
On Tue, Jan 15, 2019 at 12:41:44AM +0800, Dongli Zhang wrote:
> The xenstore 'ring-page-order' is used globally for each blkback queue and
> therefore should be read from xenstore only once. However, it is obtained
> in read_per_ring_refs() which might be called multiple times during the
> initialization of each blkback queue.
> 
> If the blkfront is malicious and the 'ring-page-order' is set in different
> value by blkfront every time before blkback reads it, this may end up at
> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> xen_blkif_disconnect() when frontend is destroyed.
> 
> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> once.
> 
> Signed-off-by: Dongli Zhang 

LGTM:

Reviewed-by: Roger Pau Monné 

Thanks!


Re: [Xen-devel] [PATCH v4 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-11 Thread Roger Pau Monné
On Tue, Jan 8, 2019 at 10:53 AM Dongli Zhang  wrote:
>
> Hi Roger,
>
> On 01/07/2019 11:27 PM, Roger Pau Monné wrote:
> > On Mon, Jan 07, 2019 at 10:07:34PM +0800, Dongli Zhang wrote:
> >>
> >>
> >> On 01/07/2019 10:05 PM, Dongli Zhang wrote:
> >>>
> >>>
> >>> On 01/07/2019 08:01 PM, Roger Pau Monné wrote:
> >>>> On Mon, Jan 07, 2019 at 01:35:59PM +0800, Dongli Zhang wrote:
> >>>>> The xenstore 'ring-page-order' is used globally for each blkback queue 
> >>>>> and
> >>>>> therefore should be read from xenstore only once. However, it is 
> >>>>> obtained
> >>>>> in read_per_ring_refs() which might be called multiple times during the
> >>>>> initialization of each blkback queue.
> >>>>>
> >>>>> If the blkfront is malicious and the 'ring-page-order' is set in 
> >>>>> different
> >>>>> value by blkfront every time before blkback reads it, this may end up at
> >>>>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> >>>>> xen_blkif_disconnect() when frontend is destroyed.
> >>>>>
> >>>>> This patch reworks connect_ring() to read xenstore 'ring-page-order' 
> >>>>> only
> >>>>> once.
> >>>>>
> >>>>> Signed-off-by: Dongli Zhang 
> >>>>> ---
> >>>>> Changed since v1:
> >>>>>   * change the order of xenstore read in read_per_ring_refs
> >>>>>   * use xenbus_read_unsigned() in connect_ring()
> >>>>>
> >>>>> Changed since v2:
> >>>>>   * simplify the condition check as "(err != 1 && nr_grefs > 1)"
> >>>>>   * avoid setting err as -EINVAL to remove extra one line of code
> >>>>>
> >>>>> Changed since v3:
> >>>>>   * exit at the beginning if !nr_grefs
> >>>>>   * change the if statements to avoid test (err != 1) twice
> >>>>>   * initialize a 'blkif' stack variable (refer to PATCH 1/2)
> >>>>>
> >>>>>  drivers/block/xen-blkback/xenbus.c | 76 
> >>>>> +-
> >>>>>  1 file changed, 43 insertions(+), 33 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/block/xen-blkback/xenbus.c 
> >>>>> b/drivers/block/xen-blkback/xenbus.c
> >>>>> index a4aadac..a2acbc9 100644
> >>>>> --- a/drivers/block/xen-blkback/xenbus.c
> >>>>> +++ b/drivers/block/xen-blkback/xenbus.c
> >>>>> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> >>>>> *ring, const char *dir)
> >>>>>   int err, i, j;
> >>>>>   struct xen_blkif *blkif = ring->blkif;
> >>>>>   struct xenbus_device *dev = blkif->be->dev;
> >>>>> - unsigned int ring_page_order, nr_grefs, evtchn;
> >>>>> + unsigned int nr_grefs, evtchn;
> >>>>>
> >>>>>   err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> >>>>> );
> >>>>> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct 
> >>>>> xen_blkif_ring *ring, const char *dir)
> >>>>>   return err;
> >>>>>   }
> >>>>>
> >>>>> - err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> >>>>> -   _page_order);
> >>>>> - if (err != 1) {
> >>>>> - err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> >>>>> _ref[0]);
> >>>>> + nr_grefs = blkif->nr_ring_pages;
> >>>>> +
> >>>>> + if (unlikely(!nr_grefs))
> >>>>> + return -EINVAL;
> >>>>
> >>>> Is this even possible? AFAICT read_per_ring_refs will always be called
> >>>> with blkif->nr_ring_pages != 0?
> >>>>
> >>>> If so, I would consider turning this into a BUG_ON/WARN_ON.
> >>>
> >>> It used to be "WARN_ON(!nr_grefs);" in the v3 of the patch.
> >>>
> >>> I would turn it into WARN_ON if it is fine with both Paul and you.
> >>
> >> To clarify, I would use WARN_ON() before exit with -EINVAL (when
> >> blkif->nr_ring_pages is 0).
> >
> > Given that this function will never be called with nr_ring_pages == 0
> > I would be fine with just using a BUG_ON, getting here with
> > nr_ring_pages == 0 would imply memory corruption or some other severe
> > issue has happened, and there's no possible recovery.
> >
> > If you want to instead keep the return, please use plain WARN instead
> > of WARN_ON.
> >
> > Thanks, Roger.
> >
>
> Is there any reason using WARN than WARN_ON? Because of the message printed by
> WARN? something like below?

Oh, so WARN also takes a condition, I was expecting WARN to not take
any parameters. Just use WARN_ON(true); then, there's no need to
re-evaluate !nr_grefs.


Re: [Xen-devel] [PATCH v4 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-07 Thread Roger Pau Monné
On Mon, Jan 07, 2019 at 10:07:34PM +0800, Dongli Zhang wrote:
> 
> 
> On 01/07/2019 10:05 PM, Dongli Zhang wrote:
> > 
> > 
> > On 01/07/2019 08:01 PM, Roger Pau Monné wrote:
> >> On Mon, Jan 07, 2019 at 01:35:59PM +0800, Dongli Zhang wrote:
> >>> The xenstore 'ring-page-order' is used globally for each blkback queue and
> >>> therefore should be read from xenstore only once. However, it is obtained
> >>> in read_per_ring_refs() which might be called multiple times during the
> >>> initialization of each blkback queue.
> >>>
> >>> If the blkfront is malicious and the 'ring-page-order' is set in different
> >>> value by blkfront every time before blkback reads it, this may end up at
> >>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> >>> xen_blkif_disconnect() when frontend is destroyed.
> >>>
> >>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> >>> once.
> >>>
> >>> Signed-off-by: Dongli Zhang 
> >>> ---
> >>> Changed since v1:
> >>>   * change the order of xenstore read in read_per_ring_refs
> >>>   * use xenbus_read_unsigned() in connect_ring()
> >>>
> >>> Changed since v2:
> >>>   * simplify the condition check as "(err != 1 && nr_grefs > 1)"
> >>>   * avoid setting err as -EINVAL to remove extra one line of code
> >>>
> >>> Changed since v3:
> >>>   * exit at the beginning if !nr_grefs
> >>>   * change the if statements to avoid test (err != 1) twice
> >>>   * initialize a 'blkif' stack variable (refer to PATCH 1/2)
> >>>
> >>>  drivers/block/xen-blkback/xenbus.c | 76 
> >>> +-
> >>>  1 file changed, 43 insertions(+), 33 deletions(-)
> >>>
> >>> diff --git a/drivers/block/xen-blkback/xenbus.c 
> >>> b/drivers/block/xen-blkback/xenbus.c
> >>> index a4aadac..a2acbc9 100644
> >>> --- a/drivers/block/xen-blkback/xenbus.c
> >>> +++ b/drivers/block/xen-blkback/xenbus.c
> >>> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> >>> *ring, const char *dir)
> >>>   int err, i, j;
> >>>   struct xen_blkif *blkif = ring->blkif;
> >>>   struct xenbus_device *dev = blkif->be->dev;
> >>> - unsigned int ring_page_order, nr_grefs, evtchn;
> >>> + unsigned int nr_grefs, evtchn;
> >>>  
> >>>   err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> >>> );
> >>> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> >>> *ring, const char *dir)
> >>>   return err;
> >>>   }
> >>>  
> >>> - err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> >>> -   _page_order);
> >>> - if (err != 1) {
> >>> - err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> >>> _ref[0]);
> >>> + nr_grefs = blkif->nr_ring_pages;
> >>> +
> >>> + if (unlikely(!nr_grefs))
> >>> + return -EINVAL;
> >>
> >> Is this even possible? AFAICT read_per_ring_refs will always be called
> >> with blkif->nr_ring_pages != 0?
> >>
> >> If so, I would consider turning this into a BUG_ON/WARN_ON.
> > 
> > It used to be "WARN_ON(!nr_grefs);" in the v3 of the patch.
> > 
> > I would turn it into WARN_ON if it is fine with both Paul and you.
> 
> To clarify, I would use WARN_ON() before exit with -EINVAL (when
> blkif->nr_ring_pages is 0).

Given that this function will never be called with nr_ring_pages == 0
I would be fine with just using a BUG_ON, getting here with
nr_ring_pages == 0 would imply memory corruption or some other severe
issue has happened, and there's no possible recovery.

If you want to instead keep the return, please use plain WARN instead
of WARN_ON.

Thanks, Roger.


Re: [PATCH v4 2/2] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2019-01-07 Thread Roger Pau Monné
On Mon, Jan 07, 2019 at 01:35:59PM +0800, Dongli Zhang wrote:
> The xenstore 'ring-page-order' is used globally for each blkback queue and
> therefore should be read from xenstore only once. However, it is obtained
> in read_per_ring_refs() which might be called multiple times during the
> initialization of each blkback queue.
> 
> If the blkfront is malicious and the 'ring-page-order' is set in different
> value by blkfront every time before blkback reads it, this may end up at
> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> xen_blkif_disconnect() when frontend is destroyed.
> 
> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> once.
> 
> Signed-off-by: Dongli Zhang 
> ---
> Changed since v1:
>   * change the order of xenstore read in read_per_ring_refs
>   * use xenbus_read_unsigned() in connect_ring()
> 
> Changed since v2:
>   * simplify the condition check as "(err != 1 && nr_grefs > 1)"
>   * avoid setting err as -EINVAL to remove extra one line of code
> 
> Changed since v3:
>   * exit at the beginning if !nr_grefs
>   * change the if statements to avoid test (err != 1) twice
>   * initialize a 'blkif' stack variable (refer to PATCH 1/2)
> 
>  drivers/block/xen-blkback/xenbus.c | 76 
> +-
>  1 file changed, 43 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index a4aadac..a2acbc9 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   int err, i, j;
>   struct xen_blkif *blkif = ring->blkif;
>   struct xenbus_device *dev = blkif->be->dev;
> - unsigned int ring_page_order, nr_grefs, evtchn;
> + unsigned int nr_grefs, evtchn;
>  
>   err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> );
> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   return err;
>   }
>  
> - err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> -   _page_order);
> - if (err != 1) {
> - err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> _ref[0]);
> + nr_grefs = blkif->nr_ring_pages;
> +
> + if (unlikely(!nr_grefs))
> + return -EINVAL;

Is this even possible? AFAICT read_per_ring_refs will always be called
with blkif->nr_ring_pages != 0?

If so, I would consider turning this into a BUG_ON/WARN_ON.

> +
> + for (i = 0; i < nr_grefs; i++) {
> + char ring_ref_name[RINGREF_NAME_LEN];
> +
> + snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
> + err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> +"%u", _ref[i]);
> +
>   if (err != 1) {
> - err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
> - return err;
> - }
> - nr_grefs = 1;
> - } else {
> - unsigned int i;
> -
> - if (ring_page_order > xen_blkif_max_ring_order) {
> - err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "%s/request %d ring page 
> order exceed max:%d",
> -  dir, ring_page_order,
> -  xen_blkif_max_ring_order);
> - return err;
> + if (nr_grefs == 1)
> + break;
> +

You need to either set err to EINVAL before calling xenbus_dev_fatal,
or call xenbus_dev_fatal with EINVAL as the second parameter.

> + xenbus_dev_fatal(dev, err, "reading %s/%s",
> +  dir, ring_ref_name);
> + return -EINVAL;
>   }
> + }
>  
> - nr_grefs = 1 << ring_page_order;
> - for (i = 0; i < nr_grefs; i++) {
> - char ring_ref_name[RINGREF_NAME_LEN];
> -
> - snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
> i);
> - err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> -"%u", _ref[i]);
> - if (err != 1) {
> - err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "reading %s/%s",
> -  dir, ring_ref_name);
> - return err;
> - }
> + if (err != 1) {
> + WARN_ON(nr_grefs != 1);
> +
> + err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u",
> +_ref[0]);
> + if (err != 1) {
> + xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);

Second 

Re: [PATCH v4 1/2] xen/blkback: add stack variable 'blkif' in connect_ring()

2019-01-07 Thread Roger Pau Monné
On Mon, Jan 07, 2019 at 01:35:58PM +0800, Dongli Zhang wrote:
> As 'be->blkif' is used for many times in connect_ring(), the stack variable
> 'blkif' is added to substitute 'be-blkif'.
> 
> Suggested-by: Paul Durrant 
> Signed-off-by: Dongli Zhang 

Reviewed-by: Roger Pau Monné 


Re: [Xen-devel] [PATCH v2 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-19 Thread Roger Pau Monné
On Tue, Dec 18, 2018 at 11:29:16PM +0800, Dongli Zhang wrote:
> 
> 
> On 12/18/2018 11:13 PM, Roger Pau Monné wrote:
> > On Tue, Dec 18, 2018 at 07:31:59PM +0800, Dongli Zhang wrote:
> >> Hi Roger,
> >>
> >> On 12/18/2018 05:33 PM, Roger Pau Monné wrote:
> >>> On Tue, Dec 18, 2018 at 08:55:38AM +0800, Dongli Zhang wrote:
> >>>> The xenstore 'ring-page-order' is used globally for each blkback queue 
> >>>> and
> >>>> therefore should be read from xenstore only once. However, it is obtained
> >>>> in read_per_ring_refs() which might be called multiple times during the
> >>>> initialization of each blkback queue.
> >>>>
> >>>> If the blkfront is malicious and the 'ring-page-order' is set in 
> >>>> different
> >>>> value by blkfront every time before blkback reads it, this may end up at
> >>>> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> >>>> xen_blkif_disconnect() when frontend is destroyed.
> >>>>
> >>>> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> >>>> once.
> >>>>
> >>>> Signed-off-by: Dongli Zhang 
> >>>> ---
> >>>> Changed since v1:
> >>>>   * change the order of xenstore read in read_per_ring_refs(suggested by 
> >>>> Roger Pau Monne)
> >>>>   * use xenbus_read_unsigned() in connect_ring() (suggested by Roger Pau 
> >>>> Monne)
> >>>>
> >>>>  drivers/block/xen-blkback/xenbus.c | 70 
> >>>> ++
> >>>>  1 file changed, 40 insertions(+), 30 deletions(-)
> >>>>
> >>>> diff --git a/drivers/block/xen-blkback/xenbus.c 
> >>>> b/drivers/block/xen-blkback/xenbus.c
> >>>> index a4bc74e..7178f0f 100644
> >>>> --- a/drivers/block/xen-blkback/xenbus.c
> >>>> +++ b/drivers/block/xen-blkback/xenbus.c
> >>>> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> >>>> *ring, const char *dir)
> >>>>  int err, i, j;
> >>>>  struct xen_blkif *blkif = ring->blkif;
> >>>>  struct xenbus_device *dev = blkif->be->dev;
> >>>> -unsigned int ring_page_order, nr_grefs, evtchn;
> >>>> +unsigned int nr_grefs, evtchn;
> >>>>  
> >>>>  err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> >>>>);
> >>>> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct 
> >>>> xen_blkif_ring *ring, const char *dir)
> >>>>  return err;
> >>>>  }
> >>>>  
> >>>> -err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", 
> >>>> "%u",
> >>>> -  _page_order);
> >>>> -if (err != 1) {
> >>>> -err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> >>>> _ref[0]);
> >>>> -if (err != 1) {
> >>>> +nr_grefs = blkif->nr_ring_pages;
> >>>> +WARN_ON(!nr_grefs);
> >>>> +
> >>>> +for (i = 0; i < nr_grefs; i++) {
> >>>> +char ring_ref_name[RINGREF_NAME_LEN];
> >>>> +
> >>>> +snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", 
> >>>> i);
> >>>> +err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> >>>> +   "%u", _ref[i]);
> >>>> +
> >>>> +if (err != 1 && (i || (!i && nr_grefs > 1))) {
> >>>
> >>> AFAICT the above condition can be simplified as "err != 1 &&
> >>> nr_grefs".
> >>>
> >>>>  err = -EINVAL;
> >>>
> >>> There's no point in setting err here...
> >>>
> >>>> -xenbus_dev_fatal(dev, err, "reading 
> >>>> %s/ring-ref", dir);
> >>>> +xenbus_dev_fatal(dev, err, "reading %s/%s",
> >>>> + dir, ring_ref_name);
> >>>>  return err;
> >>>
> >>> ...since you can just return -EINVAL (same applies to the other
> >>> instance below).
> >>
> >> I would like to confirm if I would keep the err = -EINVAL in below because 
> >> most
> >> of the below code is copied from original implementation without 
> >> modification.
> >>
> >> There is no err set by xenbus_read_unsigned().
> > 
> > Right, but instead of doing:
> > 
> > err = -EINVAL;
> > return err;
> > 
> > You can just do:
> > 
> > return -EINVAL;
> > 
> > Which is one line shorter :).
> 
> However, for the "ring-page-order" case, the err used in xenbus_dev_fatal() is
> not set as xenbus_read_unsigned() does not return any err?
> 
> For "ring-page-order", I would still need to set err = -EINVAL with extra one
> line of code?

Given this, I don't have a strong opinion, so do as you please.

Thanks, Roger.


Re: [Xen-devel] [PATCH v2 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-18 Thread Roger Pau Monné
On Tue, Dec 18, 2018 at 07:31:59PM +0800, Dongli Zhang wrote:
> Hi Roger,
> 
> On 12/18/2018 05:33 PM, Roger Pau Monné wrote:
> > On Tue, Dec 18, 2018 at 08:55:38AM +0800, Dongli Zhang wrote:
> >> The xenstore 'ring-page-order' is used globally for each blkback queue and
> >> therefore should be read from xenstore only once. However, it is obtained
> >> in read_per_ring_refs() which might be called multiple times during the
> >> initialization of each blkback queue.
> >>
> >> If the blkfront is malicious and the 'ring-page-order' is set in different
> >> value by blkfront every time before blkback reads it, this may end up at
> >> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> >> xen_blkif_disconnect() when frontend is destroyed.
> >>
> >> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> >> once.
> >>
> >> Signed-off-by: Dongli Zhang 
> >> ---
> >> Changed since v1:
> >>   * change the order of xenstore read in read_per_ring_refs(suggested by 
> >> Roger Pau Monne)
> >>   * use xenbus_read_unsigned() in connect_ring() (suggested by Roger Pau 
> >> Monne)
> >>
> >>  drivers/block/xen-blkback/xenbus.c | 70 
> >> ++
> >>  1 file changed, 40 insertions(+), 30 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkback/xenbus.c 
> >> b/drivers/block/xen-blkback/xenbus.c
> >> index a4bc74e..7178f0f 100644
> >> --- a/drivers/block/xen-blkback/xenbus.c
> >> +++ b/drivers/block/xen-blkback/xenbus.c
> >> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> >> *ring, const char *dir)
> >>int err, i, j;
> >>struct xen_blkif *blkif = ring->blkif;
> >>struct xenbus_device *dev = blkif->be->dev;
> >> -  unsigned int ring_page_order, nr_grefs, evtchn;
> >> +  unsigned int nr_grefs, evtchn;
> >>  
> >>err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> >>  );
> >> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> >> *ring, const char *dir)
> >>return err;
> >>}
> >>  
> >> -  err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> >> -_page_order);
> >> -  if (err != 1) {
> >> -  err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> >> _ref[0]);
> >> -  if (err != 1) {
> >> +  nr_grefs = blkif->nr_ring_pages;
> >> +  WARN_ON(!nr_grefs);
> >> +
> >> +  for (i = 0; i < nr_grefs; i++) {
> >> +  char ring_ref_name[RINGREF_NAME_LEN];
> >> +
> >> +  snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
> >> +  err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> >> + "%u", _ref[i]);
> >> +
> >> +  if (err != 1 && (i || (!i && nr_grefs > 1))) {
> > 
> > AFAICT the above condition can be simplified as "err != 1 &&
> > nr_grefs".
> > 
> >>err = -EINVAL;
> > 
> > There's no point in setting err here...
> > 
> >> -  xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
> >> +  xenbus_dev_fatal(dev, err, "reading %s/%s",
> >> +   dir, ring_ref_name);
> >>return err;
> > 
> > ...since you can just return -EINVAL (same applies to the other
> > instance below).
> 
> I would like to confirm if I would keep the err = -EINVAL in below because 
> most
> of the below code is copied from original implementation without modification.
> 
> There is no err set by xenbus_read_unsigned().

Right, but instead of doing:

err = -EINVAL;
return err;

You can just do:

return -EINVAL;

Which is one line shorter :).

> +   ring_page_order = xenbus_read_unsigned(dev->otherend,
> +  "ring-page-order", 0);
> +
> +   if (ring_page_order > xen_blkif_max_ring_order) {
> +   err = -EINVAL;
> +   xenbus_dev_fatal(dev, err,
> +"requested ring page order %d exceed max:%d",
> +ring_page_order,
> +xen_blkif_max_ring_order);
> +   return err;
> +   }
> +
> +   be->blkif->nr_ring_pages = 1 << ring_page_order;
> 
> 
> For the rest, I would do something like:
> 
> +   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> +  "%u", _ref[i]);
> +
> +   if (err != 1 && nr_grefs > 1) {
> +   xenbus_dev_fatal(dev, err, "reading %s/%s",
> +dir, ring_ref_name);
> +   return -EINVAL;
> +   }
> 
> 
> Thank you very much!

Thanks!


Re: [Xen-devel] [PATCH v2 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-18 Thread Roger Pau Monné
On Tue, Dec 18, 2018 at 10:33:00AM +0100, Roger Pau Monné wrote:
> On Tue, Dec 18, 2018 at 08:55:38AM +0800, Dongli Zhang wrote:
> > +   for (i = 0; i < nr_grefs; i++) {
> > +   char ring_ref_name[RINGREF_NAME_LEN];
> > +
> > +   snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
> > +   err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> > +  "%u", _ref[i]);
> > +
> > +   if (err != 1 && (i || (!i && nr_grefs > 1))) {
> 
> AFAICT the above condition can be simplified as "err != 1 &&
> nr_grefs".

Sorry, this should be "err != 1 && nr_grefs > 1", since it's not order
but rather the number of grefs.

Roger.


Re: [PATCH v2 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-18 Thread Roger Pau Monné
On Tue, Dec 18, 2018 at 08:55:38AM +0800, Dongli Zhang wrote:
> The xenstore 'ring-page-order' is used globally for each blkback queue and
> therefore should be read from xenstore only once. However, it is obtained
> in read_per_ring_refs() which might be called multiple times during the
> initialization of each blkback queue.
> 
> If the blkfront is malicious and the 'ring-page-order' is set in different
> value by blkfront every time before blkback reads it, this may end up at
> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> xen_blkif_disconnect() when frontend is destroyed.
> 
> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> once.
> 
> Signed-off-by: Dongli Zhang 
> ---
> Changed since v1:
>   * change the order of xenstore read in read_per_ring_refs(suggested by 
> Roger Pau Monne)
>   * use xenbus_read_unsigned() in connect_ring() (suggested by Roger Pau 
> Monne)
> 
>  drivers/block/xen-blkback/xenbus.c | 70 
> ++
>  1 file changed, 40 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index a4bc74e..7178f0f 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -926,7 +926,7 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   int err, i, j;
>   struct xen_blkif *blkif = ring->blkif;
>   struct xenbus_device *dev = blkif->be->dev;
> - unsigned int ring_page_order, nr_grefs, evtchn;
> + unsigned int nr_grefs, evtchn;
>  
>   err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> );
> @@ -936,43 +936,38 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   return err;
>   }
>  
> - err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> -   _page_order);
> - if (err != 1) {
> - err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> _ref[0]);
> - if (err != 1) {
> + nr_grefs = blkif->nr_ring_pages;
> + WARN_ON(!nr_grefs);
> +
> + for (i = 0; i < nr_grefs; i++) {
> + char ring_ref_name[RINGREF_NAME_LEN];
> +
> + snprintf(ring_ref_name, RINGREF_NAME_LEN, "ring-ref%u", i);
> + err = xenbus_scanf(XBT_NIL, dir, ring_ref_name,
> +"%u", _ref[i]);
> +
> + if (err != 1 && (i || (!i && nr_grefs > 1))) {

AFAICT the above condition can be simplified as "err != 1 &&
nr_grefs".

>   err = -EINVAL;

There's no point in setting err here...

> - xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
> + xenbus_dev_fatal(dev, err, "reading %s/%s",
> +  dir, ring_ref_name);
>   return err;

...since you can just return -EINVAL (same applies to the other
instance below).

The rest LGTM, Thanks.


Re: [PATCH 1/1] xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront

2018-12-10 Thread Roger Pau Monné
On Fri, Dec 07, 2018 at 12:18:04PM +0800, Dongli Zhang wrote:
> The xenstore 'ring-page-order' is used globally for each blkback queue and
> therefore should be read from xenstore only once. However, it is obtained
> in read_per_ring_refs() which might be called multiple times during the
> initialization of each blkback queue.
> 
> If the blkfront is malicious and the 'ring-page-order' is set in different
> value by blkfront every time before blkback reads it, this may end up at
> the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
> xen_blkif_disconnect() when frontend is destroyed.
> 
> This patch reworks connect_ring() to read xenstore 'ring-page-order' only
> once.
> 
> Signed-off-by: Dongli Zhang 
> ---
>  drivers/block/xen-blkback/xenbus.c | 49 
> --
>  1 file changed, 31 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index a4bc74e..4a8ce20 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -919,14 +919,15 @@ static void connect(struct backend_info *be)
>  /*
>   * Each ring may have multi pages, depends on "ring-page-order".
>   */
> -static int read_per_ring_refs(struct xen_blkif_ring *ring, const char *dir)
> +static int read_per_ring_refs(struct xen_blkif_ring *ring, const char *dir,
> +   bool use_ring_page_order)

Why not change the order of read_per_ring_refs so that it tries to
fetch the grant references from "ring-refXX" first and then fallback
to reading the legacy grant reference if nr_ring_pages == 1?

This will avoid having to pass an extra parameter to it.

>  {
>   unsigned int ring_ref[XENBUS_MAX_RING_GRANTS];
>   struct pending_req *req, *n;
>   int err, i, j;
>   struct xen_blkif *blkif = ring->blkif;
>   struct xenbus_device *dev = blkif->be->dev;
> - unsigned int ring_page_order, nr_grefs, evtchn;
> + unsigned int nr_grefs, evtchn;
>  
>   err = xenbus_scanf(XBT_NIL, dir, "event-channel", "%u",
> );
> @@ -936,28 +937,18 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   return err;
>   }
>  
> - err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> -   _page_order);
> - if (err != 1) {
> + nr_grefs = blkif->nr_ring_pages;
> +
> + if (!use_ring_page_order) {
>   err = xenbus_scanf(XBT_NIL, dir, "ring-ref", "%u", 
> _ref[0]);
>   if (err != 1) {
>   err = -EINVAL;
>   xenbus_dev_fatal(dev, err, "reading %s/ring-ref", dir);
>   return err;
>   }
> - nr_grefs = 1;
>   } else {
>   unsigned int i;
>  
> - if (ring_page_order > xen_blkif_max_ring_order) {
> - err = -EINVAL;
> - xenbus_dev_fatal(dev, err, "%s/request %d ring page 
> order exceed max:%d",
> -  dir, ring_page_order,
> -  xen_blkif_max_ring_order);
> - return err;
> - }
> -
> - nr_grefs = 1 << ring_page_order;
>   for (i = 0; i < nr_grefs; i++) {
>   char ring_ref_name[RINGREF_NAME_LEN];
>  
> @@ -972,7 +963,6 @@ static int read_per_ring_refs(struct xen_blkif_ring 
> *ring, const char *dir)
>   }
>   }
>   }
> - blkif->nr_ring_pages = nr_grefs;
>  
>   for (i = 0; i < nr_grefs * XEN_BLKIF_REQS_PER_PAGE; i++) {
>   req = kzalloc(sizeof(*req), GFP_KERNEL);
> @@ -1030,6 +1020,8 @@ static int connect_ring(struct backend_info *be)
>   size_t xspathsize;
>   const size_t xenstore_path_ext_size = 11; /* sufficient for 
> "/queue-NNN" */
>   unsigned int requested_num_queues = 0;
> + bool use_ring_page_order = false;
> + unsigned int ring_page_order;
>  
>   pr_debug("%s %s\n", __func__, dev->otherend);
>  
> @@ -1075,8 +1067,28 @@ static int connect_ring(struct backend_info *be)
>be->blkif->nr_rings, be->blkif->blk_protocol, protocol,
>pers_grants ? "persistent grants" : "");
>  
> + err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
> +_page_order);

You can use xenbus_read_unsigned IMO, which will simplify the logic
below a little bit.

Thanks, Roger.


Re: [Xen-devel] [PATCH] drivers/block/xen-blkback/common.h: use DIV_ROUND_UP instead of reimplementing its function

2018-09-12 Thread Roger Pau Monné
Adding Julien how did the work to support XEN_PAGE_SIZE != PAGE_SIZE.

On Wed, Sep 12, 2018 at 02:14:26AM -0600, Jan Beulich wrote:
> >>> On 12.09.18 at 07:45,  wrote:
> > --- a/drivers/block/xen-blkback/common.h
> > +++ b/drivers/block/xen-blkback/common.h
> > @@ -65,7 +65,7 @@
> > (XEN_PAGES_PER_INDIRECT_FRAME / XEN_PAGES_PER_SEGMENT)
> >  
> >  #define MAX_INDIRECT_PAGES \
> > -   ((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 
> > 1)/SEGS_PER_INDIRECT_FRAME)
> > +   DIV_ROUND_UP(MAX_INDIRECT_SEGMENTS, SEGS_PER_INDIRECT_FRAME)
> >  #define INDIRECT_PAGES(_segs) DIV_ROUND_UP(_segs, 
> > XEN_PAGES_PER_INDIRECT_FRAME)
> 
> My first reaction was to suggest
> 
> #define MAX_INDIRECT_PAGES INDIRECT_PAGES(MAX_INDIRECT_SEGMENTS)
> 
> but that wouldn't match what's there currently (note the two different
> divisors). I can't really decide whether that's just unfortunate naming
> of the two macros, or an actual bug.

I think there's indeed a bug here.

AFAICT, MAX_INDIRECT_PAGES should use XEN_PAGES_PER_INDIRECT_FRAME and
then it could be changed as Jan suggested.

Current MAX_INDIRECT_PAGES is misnamed and should instead be
MAX_INDIRECT_SEGS (which on x86 is exactly the same because PAGE_SIZE
== XEN_PAGE_SIZE).

Roger.


Re: [Xen-devel] [PATCH] drivers/block/xen-blkback/common.h: use DIV_ROUND_UP instead of reimplementing its function

2018-09-12 Thread Roger Pau Monné
Adding Julien how did the work to support XEN_PAGE_SIZE != PAGE_SIZE.

On Wed, Sep 12, 2018 at 02:14:26AM -0600, Jan Beulich wrote:
> >>> On 12.09.18 at 07:45,  wrote:
> > --- a/drivers/block/xen-blkback/common.h
> > +++ b/drivers/block/xen-blkback/common.h
> > @@ -65,7 +65,7 @@
> > (XEN_PAGES_PER_INDIRECT_FRAME / XEN_PAGES_PER_SEGMENT)
> >  
> >  #define MAX_INDIRECT_PAGES \
> > -   ((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 
> > 1)/SEGS_PER_INDIRECT_FRAME)
> > +   DIV_ROUND_UP(MAX_INDIRECT_SEGMENTS, SEGS_PER_INDIRECT_FRAME)
> >  #define INDIRECT_PAGES(_segs) DIV_ROUND_UP(_segs, 
> > XEN_PAGES_PER_INDIRECT_FRAME)
> 
> My first reaction was to suggest
> 
> #define MAX_INDIRECT_PAGES INDIRECT_PAGES(MAX_INDIRECT_SEGMENTS)
> 
> but that wouldn't match what's there currently (note the two different
> divisors). I can't really decide whether that's just unfortunate naming
> of the two macros, or an actual bug.

I think there's indeed a bug here.

AFAICT, MAX_INDIRECT_PAGES should use XEN_PAGES_PER_INDIRECT_FRAME and
then it could be changed as Jan suggested.

Current MAX_INDIRECT_PAGES is misnamed and should instead be
MAX_INDIRECT_SEGS (which on x86 is exactly the same because PAGE_SIZE
== XEN_PAGE_SIZE).

Roger.


Re: [Xen-devel] [PATCH v2 1/5] xen/blkback: don't keep persistent grants too long

2018-08-08 Thread Roger Pau Monné
On Wed, Aug 08, 2018 at 04:25:24PM +0200, Juergen Gross wrote:
> Persistent grants are allocated until a threshold per ring is being
> reached. Those grants won't be freed until the ring is being destroyed
> meaning there will be resources kept busy which might no longer be
> used.
> 
> Instead of freeing only persistent grants until the threshold is
> reached add a timestamp and remove all persistent grants not having
> been in use for a minute.
> 
> Signed-off-by: Juergen Gross 

Reviewed-by: Roger Pau Monné 

Thanks, Roger.


Re: [Xen-devel] [PATCH v2 1/5] xen/blkback: don't keep persistent grants too long

2018-08-08 Thread Roger Pau Monné
On Wed, Aug 08, 2018 at 04:25:24PM +0200, Juergen Gross wrote:
> Persistent grants are allocated until a threshold per ring is being
> reached. Those grants won't be freed until the ring is being destroyed
> meaning there will be resources kept busy which might no longer be
> used.
> 
> Instead of freeing only persistent grants until the threshold is
> reached add a timestamp and remove all persistent grants not having
> been in use for a minute.
> 
> Signed-off-by: Juergen Gross 

Reviewed-by: Roger Pau Monné 

Thanks, Roger.


Re: [PATCH] xen/blkfront: remove unused macros

2018-07-25 Thread Roger Pau Monné
On Wed, Jul 25, 2018 at 09:42:07AM +0200, Juergen Gross wrote:
> Remove some macros not used anywhere.
> 
> Signed-off-by: Juergen Gross 

Acked-by: Roger Pau Monné 

Thanks, Roger.


Re: [PATCH] xen/blkfront: remove unused macros

2018-07-25 Thread Roger Pau Monné
On Wed, Jul 25, 2018 at 09:42:07AM +0200, Juergen Gross wrote:
> Remove some macros not used anywhere.
> 
> Signed-off-by: Juergen Gross 

Acked-by: Roger Pau Monné 

Thanks, Roger.


Re: [PATCH 4.4 119/268] xen/pirq: fix error path cleanup when binding MSIs

2018-06-14 Thread Roger Pau Monné
On Wed, Jun 13, 2018 at 07:48:50PM +0100, Ben Hutchings wrote:
> On Wed, 2018-02-28 at 09:19 +, Roger Pau Monne wrote:
> > From: Roger Pau Monne 
> > 
> > [ Upstream commit 910f8befdf5bccf25287d9f1743e3e546bcb7ce0 ]
> > 
> > Current cleanup in the error path of xen_bind_pirq_msi_to_irq is
> > wrong. First of all there's an off-by-one in the cleanup loop, which
> > can lead to unbinding wrong IRQs.
> > 
> > Secondly IRQs not bound won't be freed, thus leaking IRQ numbers.
> > 
> > Note that there's no need to differentiate between bound and unbound
> > IRQs when freeing them, __unbind_from_irq will deal with both of them
> > correctly.
> 
> It appears to me that it is safe to call __unbind_from_irq() after
> xen_irq_info_common_setup() fails, but *not* if the latter hasn't been
> called at all.  In that case the IRQ type will still be set to
> IRQT_UNBOUND and this will trigger the BUG_ON() in __unbind_from_irq().
> 
> [...]
> > --- a/drivers/xen/events/events_base.c
> > +++ b/drivers/xen/events/events_base.c
> > @@ -764,8 +764,8 @@ out:
> >     mutex_unlock(_mapping_update_lock);
> >     return irq;
> >  error_irq:
> > -   for (; i >= 0; i--)
> > -   __unbind_from_irq(irq + i);
> > +   while (nvec--)
> > +   __unbind_from_irq(irq + nvec);
> 
> If nvec > 1, and xen_irq_info_pirq_setup() fails for i != nvec - 1,
> then we reach here without having called xen_irq_info_common_setup()
> for all these IRQs.
> 
> In that case, I think we will still want to call xen_free_irq() for all
> IRQs.  So maybe the fix would be to remove the BUG_ON() in
> __unbind_from_irq()?

I think your analysis is right, and I agree that removing the BUG_ON
from __unbind_from_irq seems like the right solution.

I can't see any issues from calling xen_free_irq with type ==
IRQT_UNBOUND, but I've already attempted to fix this once and failed,
so I would like to get second opinions. Also I'm not sure of the
reason behind that BUG_ON.

Roger.


Re: [PATCH 4.4 119/268] xen/pirq: fix error path cleanup when binding MSIs

2018-06-14 Thread Roger Pau Monné
On Wed, Jun 13, 2018 at 07:48:50PM +0100, Ben Hutchings wrote:
> On Wed, 2018-02-28 at 09:19 +, Roger Pau Monne wrote:
> > From: Roger Pau Monne 
> > 
> > [ Upstream commit 910f8befdf5bccf25287d9f1743e3e546bcb7ce0 ]
> > 
> > Current cleanup in the error path of xen_bind_pirq_msi_to_irq is
> > wrong. First of all there's an off-by-one in the cleanup loop, which
> > can lead to unbinding wrong IRQs.
> > 
> > Secondly IRQs not bound won't be freed, thus leaking IRQ numbers.
> > 
> > Note that there's no need to differentiate between bound and unbound
> > IRQs when freeing them, __unbind_from_irq will deal with both of them
> > correctly.
> 
> It appears to me that it is safe to call __unbind_from_irq() after
> xen_irq_info_common_setup() fails, but *not* if the latter hasn't been
> called at all.  In that case the IRQ type will still be set to
> IRQT_UNBOUND and this will trigger the BUG_ON() in __unbind_from_irq().
> 
> [...]
> > --- a/drivers/xen/events/events_base.c
> > +++ b/drivers/xen/events/events_base.c
> > @@ -764,8 +764,8 @@ out:
> >     mutex_unlock(_mapping_update_lock);
> >     return irq;
> >  error_irq:
> > -   for (; i >= 0; i--)
> > -   __unbind_from_irq(irq + i);
> > +   while (nvec--)
> > +   __unbind_from_irq(irq + nvec);
> 
> If nvec > 1, and xen_irq_info_pirq_setup() fails for i != nvec - 1,
> then we reach here without having called xen_irq_info_common_setup()
> for all these IRQs.
> 
> In that case, I think we will still want to call xen_free_irq() for all
> IRQs.  So maybe the fix would be to remove the BUG_ON() in
> __unbind_from_irq()?

I think your analysis is right, and I agree that removing the BUG_ON
from __unbind_from_irq seems like the right solution.

I can't see any issues from calling xen_free_irq with type ==
IRQT_UNBOUND, but I've already attempted to fix this once and failed,
so I would like to get second opinions. Also I'm not sure of the
reason behind that BUG_ON.

Roger.


Re: [Xen-devel] [PATCH v2 1/3] xen/pvh: enable and set default MTRR type

2018-05-09 Thread Roger Pau Monné
On Wed, May 09, 2018 at 12:30:16PM +0100, Roger Pau Monné wrote:
> On Wed, May 09, 2018 at 11:56:40AM +0100, Andrew Cooper wrote:
> > On 09/05/18 11:21, Roger Pau Monne wrote:
> > I'm not sure that setting the default MTRR type is going to be a
> > clever idea in hindsight when we come to doing PCI Passthrough support.
> 
> Setting the default type to WB is also set by hvmloader, it's just
> that hvmloader also sets some of the fixed and variable ranges to UC
> in order to cover the iomem areas.
> 
> The expectations when doing pci-passthrough is that the guest will
> always use paging and PAT in order to set the appropriate cache
> attributes, or else the guest itself will have to program the UC MTRR
> ranges, I admit that's not very nice however.
> 
> What about enabling the default MTRR type and setting it to WB in the
> toolstack for PVH? IMO doing it Xen itself would be wrong.

I have the following patch to set the default MTRR type, but I think
if we go down this road then we will also have to set UC MTRRs for
MMIO areas, which again seems fine to me.

---8<---
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index e33a28847d..3cb1a1720f 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -938,6 +938,8 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 HVM_SAVE_TYPE(HEADER) header;
 struct hvm_save_descriptor cpu_d;
 HVM_SAVE_TYPE(CPU) cpu;
+struct hvm_save_descriptor mtrr_d;
+HVM_SAVE_TYPE(MTRR) mtrr;
 struct hvm_save_descriptor end_d;
 HVM_SAVE_TYPE(END) end;
 } bsp_ctx;
@@ -1014,6 +1016,15 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 if ( dom->start_info_seg.pfn )
 bsp_ctx.cpu.rbx = dom->start_info_seg.pfn << PAGE_SHIFT;
 
+/* Set the MTRR. */
+bsp_ctx.mtrr_d.typecode = HVM_SAVE_CODE(MTRR);
+bsp_ctx.mtrr_d.instance = 0;
+bsp_ctx.mtrr_d.length = HVM_SAVE_LENGTH(MTRR);
+/* XXX: maybe this should be a firmware option instead? */
+if ( !dom->device_model )
+/* Enable MTRR (bit 11) and set the default type to WB (6). */
+bsp_ctx.mtrr.msr_mtrr_def_type = (1u << 11) | 6;
+
 /* Set the end descriptor. */
 bsp_ctx.end_d.typecode = HVM_SAVE_CODE(END);
 bsp_ctx.end_d.instance = 0;



Re: [Xen-devel] [PATCH v2 1/3] xen/pvh: enable and set default MTRR type

2018-05-09 Thread Roger Pau Monné
On Wed, May 09, 2018 at 12:30:16PM +0100, Roger Pau Monné wrote:
> On Wed, May 09, 2018 at 11:56:40AM +0100, Andrew Cooper wrote:
> > On 09/05/18 11:21, Roger Pau Monne wrote:
> > I'm not sure that setting the default MTRR type is going to be a
> > clever idea in hindsight when we come to doing PCI Passthrough support.
> 
> Setting the default type to WB is also set by hvmloader, it's just
> that hvmloader also sets some of the fixed and variable ranges to UC
> in order to cover the iomem areas.
> 
> The expectations when doing pci-passthrough is that the guest will
> always use paging and PAT in order to set the appropriate cache
> attributes, or else the guest itself will have to program the UC MTRR
> ranges, I admit that's not very nice however.
> 
> What about enabling the default MTRR type and setting it to WB in the
> toolstack for PVH? IMO doing it Xen itself would be wrong.

I have the following patch to set the default MTRR type, but I think
if we go down this road then we will also have to set UC MTRRs for
MMIO areas, which again seems fine to me.

---8<---
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index e33a28847d..3cb1a1720f 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -938,6 +938,8 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 HVM_SAVE_TYPE(HEADER) header;
 struct hvm_save_descriptor cpu_d;
 HVM_SAVE_TYPE(CPU) cpu;
+struct hvm_save_descriptor mtrr_d;
+HVM_SAVE_TYPE(MTRR) mtrr;
 struct hvm_save_descriptor end_d;
 HVM_SAVE_TYPE(END) end;
 } bsp_ctx;
@@ -1014,6 +1016,15 @@ static int vcpu_hvm(struct xc_dom_image *dom)
 if ( dom->start_info_seg.pfn )
 bsp_ctx.cpu.rbx = dom->start_info_seg.pfn << PAGE_SHIFT;
 
+/* Set the MTRR. */
+bsp_ctx.mtrr_d.typecode = HVM_SAVE_CODE(MTRR);
+bsp_ctx.mtrr_d.instance = 0;
+bsp_ctx.mtrr_d.length = HVM_SAVE_LENGTH(MTRR);
+/* XXX: maybe this should be a firmware option instead? */
+if ( !dom->device_model )
+/* Enable MTRR (bit 11) and set the default type to WB (6). */
+bsp_ctx.mtrr.msr_mtrr_def_type = (1u << 11) | 6;
+
 /* Set the end descriptor. */
 bsp_ctx.end_d.typecode = HVM_SAVE_CODE(END);
 bsp_ctx.end_d.instance = 0;



Re: [Xen-devel] [PATCH v2 1/3] xen/pvh: enable and set default MTRR type

2018-05-09 Thread Roger Pau Monné
On Wed, May 09, 2018 at 11:56:40AM +0100, Andrew Cooper wrote:
> On 09/05/18 11:21, Roger Pau Monne wrote:
> > On PVH MTRR is not initialized by the firmware (because there's no
> > firmware), so the kernel is started with MTRR disabled which means all
> > memory accesses are UC.
> >
> > So far there have been no issues (ie: slowdowns) caused by this
> > because PVH only supported DomU mode without passed-through devices,
> > so Xen was using WB as the default memory type instead of UC.
> >
> > Fix this by enabling MTRR and setting the default type to WB. Linux
> > will use PAT to set the actual memory cache attributes.
> >
> > Signed-off-by: Boris Ostrovsky <boris.ostrov...@oracle.com>
> > Signed-off-by: Roger Pau Monné <roger@citrix.com>
> 
> I'd argue that this is a bug in PVH starting state.

Do you mean that MTRR should be setup before starting the guest?

> Do you know what mechanism is used to bodge things to WB in the first
> place?

If you mean when passthorugh is not used (ie: no IOMMU), then it's at
epte_get_entry_emt, grep for need_iommu(d) (line ~801).

> I'm not sure that setting the default MTRR type is going to be a
> clever idea in hindsight when we come to doing PCI Passthrough support.

Setting the default type to WB is also set by hvmloader, it's just
that hvmloader also sets some of the fixed and variable ranges to UC
in order to cover the iomem areas.

The expectations when doing pci-passthrough is that the guest will
always use paging and PAT in order to set the appropriate cache
attributes, or else the guest itself will have to program the UC MTRR
ranges, I admit that's not very nice however.

What about enabling the default MTRR type and setting it to WB in the
toolstack for PVH? IMO doing it Xen itself would be wrong.

Thanks, Roger.


Re: [Xen-devel] [PATCH v2 1/3] xen/pvh: enable and set default MTRR type

2018-05-09 Thread Roger Pau Monné
On Wed, May 09, 2018 at 11:56:40AM +0100, Andrew Cooper wrote:
> On 09/05/18 11:21, Roger Pau Monne wrote:
> > On PVH MTRR is not initialized by the firmware (because there's no
> > firmware), so the kernel is started with MTRR disabled which means all
> > memory accesses are UC.
> >
> > So far there have been no issues (ie: slowdowns) caused by this
> > because PVH only supported DomU mode without passed-through devices,
> > so Xen was using WB as the default memory type instead of UC.
> >
> > Fix this by enabling MTRR and setting the default type to WB. Linux
> > will use PAT to set the actual memory cache attributes.
> >
> > Signed-off-by: Boris Ostrovsky 
> > Signed-off-by: Roger Pau Monné 
> 
> I'd argue that this is a bug in PVH starting state.

Do you mean that MTRR should be setup before starting the guest?

> Do you know what mechanism is used to bodge things to WB in the first
> place?

If you mean when passthorugh is not used (ie: no IOMMU), then it's at
epte_get_entry_emt, grep for need_iommu(d) (line ~801).

> I'm not sure that setting the default MTRR type is going to be a
> clever idea in hindsight when we come to doing PCI Passthrough support.

Setting the default type to WB is also set by hvmloader, it's just
that hvmloader also sets some of the fixed and variable ranges to UC
in order to cover the iomem areas.

The expectations when doing pci-passthrough is that the guest will
always use paging and PAT in order to set the appropriate cache
attributes, or else the guest itself will have to program the UC MTRR
ranges, I admit that's not very nice however.

What about enabling the default MTRR type and setting it to WB in the
toolstack for PVH? IMO doing it Xen itself would be wrong.

Thanks, Roger.


Re: [Xen-devel] [PATCH 6/6] xen-blkfront: prepare request locally, only then put it on the shared ring

2018-05-01 Thread Roger Pau Monné
On Tue, May 01, 2018 at 09:22:31AM +0100, Roger Pau Monné wrote:
> On Mon, Apr 30, 2018 at 11:01:50PM +0200, Marek Marczykowski-Górecki wrote:
> > struct request *req,
> > -   struct blkif_request **ring_req)
> > +   struct blkif_request *ring_req)
> >  {
> > unsigned long id;
> >  
> > -   *ring_req = RING_GET_REQUEST(>ring, rinfo->ring.req_prod_pvt);
> > -   rinfo->ring.req_prod_pvt++;
> > -
> > id = get_id_from_freelist(rinfo);
> > rinfo->shadow[id].request = req;
> > rinfo->shadow[id].status = REQ_WAITING;
> > rinfo->shadow[id].associated_id = NO_ASSOCIATED_ID;
> >  
> > -   (*ring_req)->u.rw.id = id;
> > +   ring_req->u.rw.id = id;
> >  
> > return id;
> >  }
> > @@ -545,23 +542,28 @@ static unsigned long blkif_ring_get_request(struct 
> > blkfront_ring_info *rinfo,
> >  static int blkif_queue_discard_req(struct request *req, struct 
> > blkfront_ring_info *rinfo)
> >  {
> > struct blkfront_info *info = rinfo->dev_info;
> > -   struct blkif_request *ring_req;
> > +   struct blkif_request ring_req = { 0 };
> > unsigned long id;
> >  
> > /* Fill out a communications ring structure. */
> > id = blkif_ring_get_request(rinfo, req, _req);
> 
> Maybe I'm missing something obvious here, but you are adding a struct
> allocated on the stack to the shadow ring copy, isn't this dangerous?

The above comment is wrong, you are storing a pointer to 'req' in the
shadow ring copy, which is fine and is not the ring request.

Roger.


Re: [Xen-devel] [PATCH 6/6] xen-blkfront: prepare request locally, only then put it on the shared ring

2018-05-01 Thread Roger Pau Monné
On Tue, May 01, 2018 at 09:22:31AM +0100, Roger Pau Monné wrote:
> On Mon, Apr 30, 2018 at 11:01:50PM +0200, Marek Marczykowski-Górecki wrote:
> > struct request *req,
> > -   struct blkif_request **ring_req)
> > +   struct blkif_request *ring_req)
> >  {
> > unsigned long id;
> >  
> > -   *ring_req = RING_GET_REQUEST(>ring, rinfo->ring.req_prod_pvt);
> > -   rinfo->ring.req_prod_pvt++;
> > -
> > id = get_id_from_freelist(rinfo);
> > rinfo->shadow[id].request = req;
> > rinfo->shadow[id].status = REQ_WAITING;
> > rinfo->shadow[id].associated_id = NO_ASSOCIATED_ID;
> >  
> > -   (*ring_req)->u.rw.id = id;
> > +   ring_req->u.rw.id = id;
> >  
> > return id;
> >  }
> > @@ -545,23 +542,28 @@ static unsigned long blkif_ring_get_request(struct 
> > blkfront_ring_info *rinfo,
> >  static int blkif_queue_discard_req(struct request *req, struct 
> > blkfront_ring_info *rinfo)
> >  {
> > struct blkfront_info *info = rinfo->dev_info;
> > -   struct blkif_request *ring_req;
> > +   struct blkif_request ring_req = { 0 };
> > unsigned long id;
> >  
> > /* Fill out a communications ring structure. */
> > id = blkif_ring_get_request(rinfo, req, _req);
> 
> Maybe I'm missing something obvious here, but you are adding a struct
> allocated on the stack to the shadow ring copy, isn't this dangerous?

The above comment is wrong, you are storing a pointer to 'req' in the
shadow ring copy, which is fine and is not the ring request.

Roger.


Re: [PATCH 6/6] xen-blkfront: prepare request locally, only then put it on the shared ring

2018-05-01 Thread Roger Pau Monné
On Mon, Apr 30, 2018 at 11:01:50PM +0200, Marek Marczykowski-Górecki wrote:
> Do not reuse data which theoretically might be already modified by the
> backend. This is mostly about private copy of the request
> (info->shadow[id].req) - make sure the request saved there is really the
> one just filled.
>
> This is complementary to XSA155.
> 
> CC: sta...@vger.kernel.org
> Signed-off-by: Marek Marczykowski-Górecki 
> ---
>  drivers/block/xen-blkfront.c | 76 +
>  1 file changed, 44 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 3926811..b100b55 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -525,19 +525,16 @@ static int blkif_ioctl(struct block_device *bdev, 
> fmode_t mode,
>  
>  static unsigned long blkif_ring_get_request(struct blkfront_ring_info *rinfo,

The name of this function should be changed IMO, since you are no
longer getting a request from the ring, but just initializing a
request struct.

>   struct request *req,
> - struct blkif_request **ring_req)
> + struct blkif_request *ring_req)
>  {
>   unsigned long id;
>  
> - *ring_req = RING_GET_REQUEST(>ring, rinfo->ring.req_prod_pvt);
> - rinfo->ring.req_prod_pvt++;
> -
>   id = get_id_from_freelist(rinfo);
>   rinfo->shadow[id].request = req;
>   rinfo->shadow[id].status = REQ_WAITING;
>   rinfo->shadow[id].associated_id = NO_ASSOCIATED_ID;
>  
> - (*ring_req)->u.rw.id = id;
> + ring_req->u.rw.id = id;
>  
>   return id;
>  }
> @@ -545,23 +542,28 @@ static unsigned long blkif_ring_get_request(struct 
> blkfront_ring_info *rinfo,
>  static int blkif_queue_discard_req(struct request *req, struct 
> blkfront_ring_info *rinfo)
>  {
>   struct blkfront_info *info = rinfo->dev_info;
> - struct blkif_request *ring_req;
> + struct blkif_request ring_req = { 0 };
>   unsigned long id;
>  
>   /* Fill out a communications ring structure. */
>   id = blkif_ring_get_request(rinfo, req, _req);

Maybe I'm missing something obvious here, but you are adding a struct
allocated on the stack to the shadow ring copy, isn't this dangerous?

The pointer stored in the shadow ring copy is going to be invalid
after returning from this function.

The same comment applies to the other calls to blkif_ring_get_request
below that pass a ring_reg allocated on the stack.

Thanks, Roger.


Re: [PATCH 6/6] xen-blkfront: prepare request locally, only then put it on the shared ring

2018-05-01 Thread Roger Pau Monné
On Mon, Apr 30, 2018 at 11:01:50PM +0200, Marek Marczykowski-Górecki wrote:
> Do not reuse data which theoretically might be already modified by the
> backend. This is mostly about private copy of the request
> (info->shadow[id].req) - make sure the request saved there is really the
> one just filled.
>
> This is complementary to XSA155.
> 
> CC: sta...@vger.kernel.org
> Signed-off-by: Marek Marczykowski-Górecki 
> ---
>  drivers/block/xen-blkfront.c | 76 +
>  1 file changed, 44 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 3926811..b100b55 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -525,19 +525,16 @@ static int blkif_ioctl(struct block_device *bdev, 
> fmode_t mode,
>  
>  static unsigned long blkif_ring_get_request(struct blkfront_ring_info *rinfo,

The name of this function should be changed IMO, since you are no
longer getting a request from the ring, but just initializing a
request struct.

>   struct request *req,
> - struct blkif_request **ring_req)
> + struct blkif_request *ring_req)
>  {
>   unsigned long id;
>  
> - *ring_req = RING_GET_REQUEST(>ring, rinfo->ring.req_prod_pvt);
> - rinfo->ring.req_prod_pvt++;
> -
>   id = get_id_from_freelist(rinfo);
>   rinfo->shadow[id].request = req;
>   rinfo->shadow[id].status = REQ_WAITING;
>   rinfo->shadow[id].associated_id = NO_ASSOCIATED_ID;
>  
> - (*ring_req)->u.rw.id = id;
> + ring_req->u.rw.id = id;
>  
>   return id;
>  }
> @@ -545,23 +542,28 @@ static unsigned long blkif_ring_get_request(struct 
> blkfront_ring_info *rinfo,
>  static int blkif_queue_discard_req(struct request *req, struct 
> blkfront_ring_info *rinfo)
>  {
>   struct blkfront_info *info = rinfo->dev_info;
> - struct blkif_request *ring_req;
> + struct blkif_request ring_req = { 0 };
>   unsigned long id;
>  
>   /* Fill out a communications ring structure. */
>   id = blkif_ring_get_request(rinfo, req, _req);

Maybe I'm missing something obvious here, but you are adding a struct
allocated on the stack to the shadow ring copy, isn't this dangerous?

The pointer stored in the shadow ring copy is going to be invalid
after returning from this function.

The same comment applies to the other calls to blkif_ring_get_request
below that pass a ring_reg allocated on the stack.

Thanks, Roger.


  1   2   3   4   5   6   7   >