Re: [PATCH 10/11] block: xen-blkfront: Demote kernel-doc abuses

2021-04-06 Thread Roger Pau Monné
On Fri, Mar 12, 2021 at 10:55:29AM +, Lee Jones wrote:
> Fixes the following W=1 kernel build warning(s):
> 
>  drivers/block/xen-blkfront.c:1960: warning: Function parameter or member 'dev' not described in 'blkfront_probe'
>  drivers/block/xen-blkfront.c:1960: warning: Function parameter or member 'id' not described in 'blkfront_probe'
>  drivers/block/xen-blkfront.c:1960: warning: expecting prototype for Allocate the basic(). Prototype was for blkfront_probe() instead
>  drivers/block/xen-blkfront.c:2085: warning: Function parameter or member 'dev' not described in 'blkfront_resume'
>  drivers/block/xen-blkfront.c:2085: warning: expecting prototype for or a backend(). Prototype was for blkfront_resume() instead
>  drivers/block/xen-blkfront.c:2444: warning: wrong kernel-doc identifier on line:
> 
> Cc: Konrad Rzeszutek Wilk 
> Cc: "Roger Pau Monné" 
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Cc: Jens Axboe 
> Cc: xen-de...@lists.xenproject.org
> Cc: linux-bl...@vger.kernel.org
> Signed-off-by: Lee Jones 

Acked-by: Roger Pau Monné 

Thanks, Roger.


[PATCH v3] intel/pinctrl: check REVID register value for device presence

2021-03-25 Thread Roger Pau Monne
Use the value read from the REVID register in order to check for the
presence of the device. A read of all ones is treated as if the device
is not present, and hence probing is ended.

This fixes an issue when running as a Xen PVH dom0, where the ACPI
DSDT table is provided unmodified to dom0 and hence contains the
pinctrl devices, but the MMIO region(s) containing the device
registers might not be mapped in the guest physical memory map if such
region(s) are not exposed on a PCI device BAR or marked as reserved in
the host memory map.

Fixes: 91d898e51e60 ('pinctrl: intel: Convert capability list to features')
Suggested-by: Andy Shevchenko 
Signed-off-by: Roger Pau Monné 
---
Changes since v2:
 - Return ENODEV.
 - Adjust code comment.

Changes since v1:
 - New in this version.
---
Cc: Mika Westerberg 
Cc: Andy Shevchenko 
Cc: Linus Walleij 
Cc: linux-g...@vger.kernel.org
---
 drivers/pinctrl/intel/pinctrl-intel.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pinctrl/intel/pinctrl-intel.c b/drivers/pinctrl/intel/pinctrl-intel.c
index 8085782cd8f9..9fc5bba514ea 100644
--- a/drivers/pinctrl/intel/pinctrl-intel.c
+++ b/drivers/pinctrl/intel/pinctrl-intel.c
@@ -1491,8 +1491,13 @@ static int intel_pinctrl_probe(struct platform_device *pdev,
if (IS_ERR(regs))
return PTR_ERR(regs);
 
-   /* Determine community features based on the revision */
+   /*
+* Determine community features based on the revision.
+* A value of all ones means the device is not present.
+*/
value = readl(regs + REVID);
+   if (value == ~0u)
+   return -ENODEV;
if (((value & REVID_MASK) >> REVID_SHIFT) >= 0x94) {
community->features |= PINCTRL_FEATURE_DEBOUNCE;
community->features |= PINCTRL_FEATURE_1K_PD;
-- 
2.30.1



Re: [PATCH RESEND] intel/pinctrl: check capability offset is between MMIO region

2021-03-25 Thread Roger Pau Monné
On Wed, Mar 24, 2021 at 06:57:12PM +0200, Andy Shevchenko wrote:
> On Wed, Mar 24, 2021 at 04:13:59PM +0100, Roger Pau Monné wrote:
> > On Wed, Mar 24, 2021 at 04:22:44PM +0200, Andy Shevchenko wrote:
> > > On Wed, Mar 24, 2021 at 02:55:15PM +0100, Roger Pau Monné wrote:
> > > > On Wed, Mar 24, 2021 at 02:58:07PM +0200, Andy Shevchenko wrote:
> > > > > On Wed, Mar 24, 2021 at 01:31:18PM +0100, Roger Pau Monne wrote:
> > > > > Moreover, it seems you are bailing out and basically denying driver 
> > > > > to load.
> > > > > This does look that capability is simply the first register that 
> > > > > blows the setup.
> > > > > I think you have to fix something into Xen to avoid loading these 
> > > > > drivers or
> > > > > check with something like pci_device_is_present() approach.
> > > > 
> > > > Is there a backing PCI device BAR for those MMIO regions that the
> > > > pinctrl driver is trying to access? AFAICT those regions are only
> > > > reported in the ACPI DSDT table on the _CRS method of the object (at
> > > > least on my system).
> > > 
> > > Unfortunately it does not expose PCI configuration space.
> > 
> > Are those regions supposed to be marked as reserved in the memory map,
> > or is that left to the discretion of the hardware vendor?
> 
> I didn't get. The OS doesn't see them and an internal backbone simply drops 
> any
> IO access to that region.

I'm not sure I understand the above reply. My question was whether the
MMIO regions used by the pinctrl device (as fetched from the ACPI DSDT
table) are supposed to belong to regions marked as RESERVED in the
firmware memory map (i.e. either the e820 or the EFI one).

> > > > Doing something like pci_device_is_present would require a register
> > > > that we know will never return ~0 unless the device is not present. As
> > > > said above, maybe we could use REVID to that end?
> > > 
> > > Yes, that's good, see above.
> > > 
> > > WRT capabilities, if we crash we will see the report immediately on the
> > > hardware which has such an issue. (It's quite unlikely we will ever have 
> > > one,
> > > that's why I consider it's not critical)
> > 
> > I would rather prefer not to crash, because I think the kernel should
> > only resort to crashing when there's no alternative, and here it's
> > perfectly fine to just print an error message and not load the
> > driver.
> 
> Are we speaking about real hardware that has an issue? I eagerly want to know
> what is that beast.

OK, I'm not going to resend this anymore. I'm happy with just getting
the first patch in.

I think you trust the hardware more than I would, and I also think
the check added here is very minimal and unintrusive and serves as a
way to sanitize the data fetched from the hardware in order to prevent
a kernel page fault if such data turns out to be wrong.

Taking a reactive approach of requiring a broken piece of hardware to
exist in order to sanitize a fetched value seems too risky. I could
add a WARN_ON or similar if you want some kind of splat that's very
noticeable when this goes wrong but that doesn't end up in a fatal
kernel page fault.
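
[Illustration: a minimal sketch of the WARN_ON-style alternative Roger offers above, reusing the bounds test from the capability-offset patch elsewhere in this thread. The helper name and its placement are hypothetical.]

#include <linux/bug.h>
#include <linux/device.h>
#include <linux/types.h>

/*
 * Sketch only: keep the bounds check, but make a broken capability list
 * produce a loud WARN splat instead of a silent error return, while still
 * avoiding the fatal page fault.
 */
static int sketch_check_cap_offset(struct device *dev, u32 offset, size_t size)
{
	if (WARN_ON(offset >= size)) {
		dev_err(dev, "wrong capability offset: %#x\n", offset);
		return -ENOENT;
	}
	return 0;
}

WARN_ON() returns the condition it tests, so the splat and the error path stay a single branch.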

Thanks, Roger.


Re: [PATCH v2 1/2] intel/pinctrl: check REVID register value for device presence

2021-03-25 Thread Roger Pau Monné
On Wed, Mar 24, 2021 at 07:01:18PM +0200, Andy Shevchenko wrote:
> On Wed, Mar 24, 2021 at 04:43:11PM +0100, Roger Pau Monne wrote:
> 
> Thanks for a fix! My comments below.
> 
> > Use the value read from the REVID register in order to check for the
> > presence of the device. A read of all ones is treated as if the device
> > is not present, and hence probing is ended.
> > 
> > This fixes an issue when running as a Xen PVH dom0, where the ACPI
> > DSDT table is provided unmodified to dom0 and hence contains the
> > pinctrl devices, but the MMIO region(s) containing the device
> > registers might not be mapped in the guest physical memory map if such
> > region(s) are not exposed on a PCI device BAR or marked as reserved in
> > the host memory map.
> 
> Any particular point that we can use in the Fixes tag?

Hm, I haven't seen those issues up until 91d898e51e60 ('pinctrl:
intel: Convert capability list to features'), but the device wasn't
working properly for sure, as the registers were not accessible; it
just didn't lead to a kernel crash.

> ...
> 
> > Suggested-by: Andy Shevchenko 
> 
> Hmm... was it that address I have used? In any case I think my 
> @linux.intel.com
> is better.

I just used the same as the one that's in the MAINTAINERS file,
because I already had that in my Cc list. I can change to the @intel
one if that's your preference.

> ...
> 
> > /* Determine community features based on the revision */
> > value = readl(regs + REVID);
> > +   if (value == ~0u)
> > +   return -ENODATA;
> 
> I think -ENODEV is more appropriate here.
> Also comment above should be adjusted to explain this check.

Right, will change and send v3.

Thanks.


[PATCH v2 2/2] intel/pinctrl: check capability offset is between MMIO region

2021-03-24 Thread Roger Pau Monne
When parsing the capability list make sure the offset is between the
MMIO region mapped in 'regs', or else the kernel hits a page fault.

Adding the check is harmless, and prevents buggy or broken systems
from crashing the kernel if the capability linked list is somehow
broken.

Fixes: 91d898e51e60 ('pinctrl: intel: Convert capability list to features')
Signed-off-by: Roger Pau Monné 
---
Changes since v1:
 - Adjust commit message.
---
Cc: Mika Westerberg 
Cc: Andy Shevchenko 
Cc: Linus Walleij 
Cc: linux-g...@vger.kernel.org
---
 drivers/pinctrl/intel/pinctrl-intel.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/pinctrl/intel/pinctrl-intel.c b/drivers/pinctrl/intel/pinctrl-intel.c
index 59d13342caf6..d45a6994b2a3 100644
--- a/drivers/pinctrl/intel/pinctrl-intel.c
+++ b/drivers/pinctrl/intel/pinctrl-intel.c
@@ -1481,16 +1481,22 @@ static int intel_pinctrl_probe(struct platform_device *pdev,
 
for (i = 0; i < pctrl->ncommunities; i++) {
struct intel_community *community = &pctrl->communities[i];
+   struct resource *res;
void __iomem *regs;
+   size_t size;
u32 offset;
u32 value;
 
*community = pctrl->soc->communities[i];
 
-   regs = devm_platform_ioremap_resource(pdev, community->barno);
+   regs = devm_platform_get_and_ioremap_resource(pdev,
+ community->barno,
+ &res);
if (IS_ERR(regs))
return PTR_ERR(regs);
 
+   size = res->end - res->start;
+
/* Determine community features based on the revision */
value = readl(regs + REVID);
if (value == ~0u)
@@ -1521,6 +1527,12 @@ static int intel_pinctrl_probe(struct platform_device *pdev,
break;
}
offset = (value & CAPLIST_NEXT_MASK) >> CAPLIST_NEXT_SHIFT;
+   if (offset >= size) {
+   dev_err(&pdev->dev,
+   "wrong capability offset: %#x\n",
+   offset);
+   return -ENOENT;
+   }
} while (offset);
 
dev_dbg(&pdev->dev, "Community%d features: %#08x\n", i, community->features);
-- 
2.30.1



[PATCH v2 1/2] intel/pinctrl: check REVID register value for device presence

2021-03-24 Thread Roger Pau Monne
Use the value read from the REVID register in order to check for the
presence of the device. A read of all ones is treated as if the device
is not present, and hence probing is ended.

This fixes an issue when running as a Xen PVH dom0, where the ACPI
DSDT table is provided unmodified to dom0 and hence contains the
pinctrl devices, but the MMIO region(s) containing the device
registers might not be mapped in the guest physical memory map if such
region(s) are not exposed on a PCI device BAR or marked as reserved in
the host memory map.

Suggested-by: Andy Shevchenko 
Signed-off-by: Roger Pau Monné 
---
Changes since v1:
 - New in this version.
---
Cc: Mika Westerberg 
Cc: Andy Shevchenko 
Cc: Linus Walleij 
Cc: linux-g...@vger.kernel.org
---
 drivers/pinctrl/intel/pinctrl-intel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pinctrl/intel/pinctrl-intel.c b/drivers/pinctrl/intel/pinctrl-intel.c
index 8085782cd8f9..59d13342caf6 100644
--- a/drivers/pinctrl/intel/pinctrl-intel.c
+++ b/drivers/pinctrl/intel/pinctrl-intel.c
@@ -1493,6 +1493,8 @@ static int intel_pinctrl_probe(struct platform_device *pdev,
 
/* Determine community features based on the revision */
value = readl(regs + REVID);
+   if (value == ~0u)
+   return -ENODATA;
if (((value & REVID_MASK) >> REVID_SHIFT) >= 0x94) {
community->features |= PINCTRL_FEATURE_DEBOUNCE;
community->features |= PINCTRL_FEATURE_1K_PD;
-- 
2.30.1



[PATCH v2 0/2] intel/pinctrl: check some registers reads

2021-03-24 Thread Roger Pau Monne
Hello,

The following series adds some consistency checks to the values returned
by some of the MMIO registers of the Intel pinctrl device.

That done to avoid a crash when running as a PVH dom0. See patch #1 for
more details.

Thanks, Roger.

Roger Pau Monne (2):
  intel/pinctrl: check REVID register value for device presence
  intel/pinctrl: check capability offset is between MMIO region

 drivers/pinctrl/intel/pinctrl-intel.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

-- 
2.30.1



Re: [PATCH RESEND] intel/pinctrl: check capability offset is between MMIO region

2021-03-24 Thread Roger Pau Monné
On Wed, Mar 24, 2021 at 04:22:44PM +0200, Andy Shevchenko wrote:
> On Wed, Mar 24, 2021 at 02:55:15PM +0100, Roger Pau Monné wrote:
> > On Wed, Mar 24, 2021 at 02:58:07PM +0200, Andy Shevchenko wrote:
> > > On Wed, Mar 24, 2021 at 01:31:18PM +0100, Roger Pau Monne wrote:
> > > Moreover, it seems you are bailing out and basically denying driver to 
> > > load.
> > > This does look that capability is simply the first register that blows 
> > > the setup.
> > > I think you have to fix something into Xen to avoid loading these drivers 
> > > or
> > > check with something like pci_device_is_present() approach.
> > 
> > Is there a backing PCI device BAR for those MMIO regions that the
> > pinctrl driver is trying to access? AFAICT those regions are only
> > reported in the ACPI DSDT table on the _CRS method of the object (at
> > least on my system).
> 
> Unfortunately it does not expose PCI configuration space.

Are those regions supposed to be marked as reserved in the memory map,
or is that left to the discretion of the hardware vendor?

> > Doing something like pci_device_is_present would require a register
> > that we know will never return ~0 unless the device is not present. As
> > said above, maybe we could use REVID to that end?
> 
> Yes, that's good, see above.
> 
> WRT capabilities, if we crash we will see the report immediately on the
> hardware which has such an issue. (It's quite unlikely we will ever have one,
> that's why I consider it's not critical)

I would rather prefer not to crash, because I think the kernel should
only resort to crashing when there's no alternative, and here it's
perfectly fine to just print an error message and not load the
driver. IMO I would rather boot without pinctrl than get a panic if
it turns out the pinctrl capabilities list is somehow corrupted. It's a
long shot, but the check added in order to prevent this scenario is
minimal.

In any case I will send a new version with the REVID check and this
current patch.

Thanks, Roger.


Re: [PATCH RESEND] intel/pinctrl: check capability offset is between MMIO region

2021-03-24 Thread Roger Pau Monné
On Wed, Mar 24, 2021 at 02:58:07PM +0200, Andy Shevchenko wrote:
> On Wed, Mar 24, 2021 at 01:31:18PM +0100, Roger Pau Monne wrote:
> > When parsing the capability list make sure the offset is between the
> > MMIO region mapped in 'regs', or else the kernel hits a page fault.
> > 
> > This fault has been seen when running as a Xen PVH dom0, which doesn't
> > have the MMIO regions mapped into the domain physical memory map,
> > despite having the device reported in the ACPI DSDT table. This
> > results in reporting an all-ones capability offset (because the kernel
> > is accessing unpopulated memory), and such an offset is outside of the
> > mapped region.
> > 
> > Adding the check is harmless, and prevents buggy or broken systems
> > from crashing the kernel if the MMIO region is not properly reported.
> 
> Thanks for the report.
> 
> Looking into the code I would rather like to see the explicit comparison to
> 0xffffffff or ~0 against the entire register, because it's (one of) the
> standard ways for devices to tell that something is not supported.

That can be done also. I think what I've proposed is slightly more
robust, as it will prevent a kernel page fault if somehow the offset
to the next capability is below ~0, but past the end of the MMIO
region. Unlikely I know, but it's not worth a kernel panic.

What could be done is check whether reading REVID returns ~0 and exit
at that point, if ~0 will never be a valid value returned by that
register. I think that should be a separate patch however.

> Moreover, it seems you are bailing out and basically denying driver to load.
> This does look that capability is simply the first register that blows the 
> setup.
> I think you have to fix something into Xen to avoid loading these drivers or
> check with something like pci_device_is_present() approach.

Is there a backing PCI device BAR for those MMIO regions that the
pinctrl driver is trying to access? AFAICT those regions are only
reported in the ACPI DSDT table on the _CRS method of the object (at
least on my system).

Doing something like pci_device_is_present would require a register
that we know will never return ~0 unless the device is not present. As
said above, maybe we could use REVID to that end?

> > Fixes: 91d898e51e60 ('pinctrl: intel: Convert capability list to features')
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Cc: Mika Westerberg 
> > Cc: Andy Shevchenko 
> > Cc: Linus Walleij 
> > Cc: linux-g...@vger.kernel.org
> > ---
> > Resend because I've missed adding the maintainers, sorry for the spam.
> 
> I have a script to make it easier: 
> https://github.com/andy-shev/home-bin-tools/blob/master/ge2maintainer.sh

Thanks!


[PATCH RESEND] intel/pinctrl: check capability offset is between MMIO region

2021-03-24 Thread Roger Pau Monne
When parsing the capability list make sure the offset is between the
MMIO region mapped in 'regs', or else the kernel hits a page fault.

This fault has been seen when running as a Xen PVH dom0, which doesn't
have the MMIO regions mapped into the domain physical memory map,
despite having the device reported in the ACPI DSDT table. This
results in reporting an all-ones capability offset (because the kernel
is accessing unpopulated memory), and such an offset is outside of the
mapped region.

Adding the check is harmless, and prevents buggy or broken systems
from crashing the kernel if the MMIO region is not properly reported.

Fixes: 91d898e51e60 ('pinctrl: intel: Convert capability list to features')
Signed-off-by: Roger Pau Monné 
---
Cc: Mika Westerberg 
Cc: Andy Shevchenko 
Cc: Linus Walleij 
Cc: linux-g...@vger.kernel.org
---
Resend because I've missed adding the maintainers, sorry for the spam.
---
 drivers/pinctrl/intel/pinctrl-intel.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/pinctrl/intel/pinctrl-intel.c b/drivers/pinctrl/intel/pinctrl-intel.c
index 8085782cd8f9..bc8b990d8021 100644
--- a/drivers/pinctrl/intel/pinctrl-intel.c
+++ b/drivers/pinctrl/intel/pinctrl-intel.c
@@ -1481,16 +1481,22 @@ static int intel_pinctrl_probe(struct platform_device *pdev,
 
for (i = 0; i < pctrl->ncommunities; i++) {
struct intel_community *community = &pctrl->communities[i];
+   struct resource *res;
void __iomem *regs;
+   size_t size;
u32 offset;
u32 value;
 
*community = pctrl->soc->communities[i];
 
-   regs = devm_platform_ioremap_resource(pdev, community->barno);
+   regs = devm_platform_get_and_ioremap_resource(pdev,
+ community->barno,
+ &res);
if (IS_ERR(regs))
return PTR_ERR(regs);
 
+   size = res->end - res->start;
+
/* Determine community features based on the revision */
value = readl(regs + REVID);
if (((value & REVID_MASK) >> REVID_SHIFT) >= 0x94) {
@@ -1519,6 +1525,12 @@ static int intel_pinctrl_probe(struct platform_device *pdev,
break;
}
offset = (value & CAPLIST_NEXT_MASK) >> CAPLIST_NEXT_SHIFT;
+   if (offset >= size) {
+   dev_err(&pdev->dev,
+   "wrong capability offset: %#x\n",
+   offset);
+   return -ENOENT;
+   }
} while (offset);
 
dev_dbg(&pdev->dev, "Community%d features: %#08x\n", i, community->features);
-- 
2.30.1



[PATCH v2 1/2] xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG

2021-03-24 Thread Roger Pau Monne
The Xen memory hotplug limit should depend on the memory hotplug
generic option, rather than the Xen balloon configuration. It's
possible to have a kernel with generic memory hotplug enabled, but
without Xen balloon enabled, at which point memory hotplug won't work
correctly due to the size limitation of the p2m.

Rename the option to XEN_MEMORY_HOTPLUG_LIMIT since it's no longer
tied to ballooning.

Fixes: 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated memory")
Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Jan Beulich 
Cc: xen-de...@lists.xenproject.org
---
 arch/x86/xen/p2m.c  | 4 ++--
 drivers/xen/Kconfig | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 17d80f751fcb..a33902d05e45 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -98,8 +98,8 @@ EXPORT_SYMBOL_GPL(xen_p2m_size);
 unsigned long xen_max_p2m_pfn __read_mostly;
 EXPORT_SYMBOL_GPL(xen_max_p2m_pfn);
 
-#ifdef CONFIG_XEN_BALLOON_MEMORY_HOTPLUG_LIMIT
-#define P2M_LIMIT CONFIG_XEN_BALLOON_MEMORY_HOTPLUG_LIMIT
+#ifdef CONFIG_XEN_MEMORY_HOTPLUG_LIMIT
+#define P2M_LIMIT CONFIG_XEN_MEMORY_HOTPLUG_LIMIT
 #else
 #define P2M_LIMIT 0
 #endif
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 41645fe6ad48..ea0efd290c37 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -50,11 +50,11 @@ config XEN_BALLOON_MEMORY_HOTPLUG
 
  SUBSYSTEM=="memory", ACTION=="add", RUN+="/bin/sh -c '[ -f /sys$devpath/state ] && echo online > /sys$devpath/state'"
 
-config XEN_BALLOON_MEMORY_HOTPLUG_LIMIT
+config XEN_MEMORY_HOTPLUG_LIMIT
int "Hotplugged memory limit (in GiB) for a PV guest"
default 512
depends on XEN_HAVE_PVMMU
-   depends on XEN_BALLOON_MEMORY_HOTPLUG
+   depends on MEMORY_HOTPLUG
help
  Maxmium amount of memory (in GiB) that a PV guest can be
  expanded to when using memory hotplug.
-- 
2.30.1



[PATCH v2 2/2] Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"

2021-03-24 Thread Roger Pau Monne
This partially reverts commit 882213990d32fd224340a4533f6318dd152be4b2.

There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order
to correctly size the p2m. The generic memory hotplug option has
already been tied together with the Xen hotplug limit, so enabling
memory hotplug should already trigger a properly sized p2m on Xen PV.

Note that XEN_UNPOPULATED_ALLOC depends on ZONE_DEVICE which pulls in
MEMORY_HOTPLUG.

Leave the check added to __set_phys_to_machine and the adjusted
comment about EXTRA_MEM_RATIO.

Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Jan Beulich 
Cc: xen-de...@lists.xenproject.org
---
Changes since v1:
 - Expand commit message.
 - Do not revert code comment.
---
 arch/x86/include/asm/xen/page.h | 12 
 arch/x86/xen/p2m.c  |  3 ---
 arch/x86/xen/setup.c| 16 ++--
 3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 7068e4bb057d..1a162e559753 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -86,18 +86,6 @@ clear_foreign_p2m_mapping(struct gnttab_unmap_grant_ref *unmap_ops,
 }
 #endif
 
-/*
- * The maximum amount of extra memory compared to the base size.  The
- * main scaling factor is the size of struct page.  At extreme ratios
- * of base:extra, all the base memory can be filled with page
- * structures for the extra memory, leaving no space for anything
- * else.
- *
- * 10x seems like a reasonable balance between scaling flexibility and
- * leaving a practically usable system.
- */
-#define XEN_EXTRA_MEM_RATIO (10)
-
 /*
  * Helper functions to write or read unsigned long values to/from
  * memory, when the access may fault.
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index a33902d05e45..ac06ca32e9ef 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -416,9 +416,6 @@ void __init xen_vmalloc_p2m_tree(void)
xen_p2m_last_pfn = xen_max_p2m_pfn;
 
p2m_limit = (phys_addr_t)P2M_LIMIT * 1024 * 1024 * 1024 / PAGE_SIZE;
-   if (!p2m_limit && IS_ENABLED(CONFIG_XEN_UNPOPULATED_ALLOC))
-   p2m_limit = xen_start_info->nr_pages * XEN_EXTRA_MEM_RATIO;
-
vm.flags = VM_ALLOC;
vm.size = ALIGN(sizeof(unsigned long) * max(xen_max_p2m_pfn, p2m_limit),
PMD_SIZE * PMDS_PER_MID_PAGE);
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 1a3b75652fa4..99ef476dc702 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -59,6 +59,18 @@ static struct {
 } xen_remap_buf __initdata __aligned(PAGE_SIZE);
 static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
 
+/* 
+ * The maximum amount of extra memory compared to the base size.  The
+ * main scaling factor is the size of struct page.  At extreme ratios
+ * of base:extra, all the base memory can be filled with page
+ * structures for the extra memory, leaving no space for anything
+ * else.
+ * 
+ * 10x seems like a reasonable balance between scaling flexibility and
+ * leaving a practically usable system.
+ */
+#define EXTRA_MEM_RATIO (10)
+
 static bool xen_512gb_limit __initdata = IS_ENABLED(CONFIG_XEN_512GB);
 
 static void __init xen_parse_512gb(void)
@@ -778,13 +790,13 @@ char * __init xen_memory_setup(void)
extra_pages += max_pages - max_pfn;
 
/*
-* Clamp the amount of extra memory to a XEN_EXTRA_MEM_RATIO
+* Clamp the amount of extra memory to a EXTRA_MEM_RATIO
 * factor the base size.
 *
 * Make sure we have no memory above max_pages, as this area
 * isn't handled by the p2m management.
 */
-   extra_pages = min3(XEN_EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
+   extra_pages = min3(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
   extra_pages, max_pages - max_pfn);
i = 0;
addr = xen_e820_table.entries[0].addr;
-- 
2.30.1



[PATCH v2 0/2] xen/x86: alternative fix for XSA-369

2021-03-24 Thread Roger Pau Monne
Hello,

This is a proposal for an alternative fix for XSA-369 that instead of
special casing XEN_UNPOPULATED_ALLOC to size the p2m relies on making
XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on the generic MEMORY_HOTPLUG
option rather than XEN_BALLOON_MEMORY_HOTPLUG.

I think this is safer, as we don't want to be special casing any option
that pulls in generic MEMORY_HOTPLUG without XEN_BALLOON_MEMORY_HOTPLUG.
Without this we would also need to at least special case ZONE_DEVICE
which also relies on MEMORY_HOTPLUG, and is what pulls the generic
MEMORY_HOTPLUG option even when XEN_BALLOON_MEMORY_HOTPLUG is disabled
with XEN_UNPOPULATED_ALLOC.

Thanks, Roger.

Roger Pau Monne (2):
  xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on
MEMORY_HOTPLUG
  Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"

 arch/x86/include/asm/xen/page.h | 12 
 arch/x86/xen/p2m.c  |  7 ++-
 arch/x86/xen/setup.c| 16 ++--
 drivers/xen/Kconfig |  4 ++--
 4 files changed, 18 insertions(+), 21 deletions(-)

-- 
2.30.1



[PATCH] intel/pinctrl: check capability offset is between MMIO region

2021-03-24 Thread Roger Pau Monne
When parsing the capability list make sure the offset is between the
MMIO region mapped in 'regs', or else the kernel hits a page fault.

This fault has been seen when running as a Xen PVH dom0, which doesn't
have the MMIO regions mapped into the domain physical memory map,
despite having the device reported in the ACPI DSDT table. This
results in reporting an all-ones capability offset (because the kernel
is accessing unpopulated memory), and such an offset is outside of the
mapped region.

Adding the check is harmless, and prevents buggy or broken systems
from crashing the kernel if the MMIO region is not properly reported.

Fixes: 91d898e51e60 ('pinctrl: intel: Convert capability list to features')
Signed-off-by: Roger Pau Monné 
---
 drivers/pinctrl/intel/pinctrl-intel.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/pinctrl/intel/pinctrl-intel.c b/drivers/pinctrl/intel/pinctrl-intel.c
index 8085782cd8f9..bc8b990d8021 100644
--- a/drivers/pinctrl/intel/pinctrl-intel.c
+++ b/drivers/pinctrl/intel/pinctrl-intel.c
@@ -1481,16 +1481,22 @@ static int intel_pinctrl_probe(struct platform_device *pdev,
 
for (i = 0; i < pctrl->ncommunities; i++) {
struct intel_community *community = &pctrl->communities[i];
+   struct resource *res;
void __iomem *regs;
+   size_t size;
u32 offset;
u32 value;
 
*community = pctrl->soc->communities[i];
 
-   regs = devm_platform_ioremap_resource(pdev, community->barno);
+   regs = devm_platform_get_and_ioremap_resource(pdev,
+ community->barno,
+ &res);
if (IS_ERR(regs))
return PTR_ERR(regs);
 
+   size = res->end - res->start;
+
/* Determine community features based on the revision */
value = readl(regs + REVID);
if (((value & REVID_MASK) >> REVID_SHIFT) >= 0x94) {
@@ -1519,6 +1525,12 @@ static int intel_pinctrl_probe(struct platform_device *pdev,
break;
}
offset = (value & CAPLIST_NEXT_MASK) >> CAPLIST_NEXT_SHIFT;
+   if (offset >= size) {
+   dev_err(&pdev->dev,
+   "wrong capability offset: %#x\n",
+   offset);
+   return -ENOENT;
+   }
} while (offset);
 
dev_dbg(&pdev->dev, "Community%d features: %#08x\n", i, community->features);
-- 
2.30.1



[PATCH 2/2] Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"

2021-03-17 Thread Roger Pau Monne
This partially reverts commit 882213990d32fd224340a4533f6318dd152be4b2.

There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order
to correctly size the p2m. The generic memory hotplug option has
already been tied together with the Xen hotplug limit, so enabling
memory hotplug should already trigger a properly sized p2m on Xen PV.

Leave the check added to __set_phys_to_machine.

Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Jan Beulich 
Cc: xen-de...@lists.xenproject.org
---
 arch/x86/include/asm/xen/page.h | 12 
 arch/x86/xen/p2m.c  |  3 ---
 arch/x86/xen/setup.c| 25 ++---
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 7068e4bb057d..1a162e559753 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -86,18 +86,6 @@ clear_foreign_p2m_mapping(struct gnttab_unmap_grant_ref *unmap_ops,
 }
 #endif
 
-/*
- * The maximum amount of extra memory compared to the base size.  The
- * main scaling factor is the size of struct page.  At extreme ratios
- * of base:extra, all the base memory can be filled with page
- * structures for the extra memory, leaving no space for anything
- * else.
- *
- * 10x seems like a reasonable balance between scaling flexibility and
- * leaving a practically usable system.
- */
-#define XEN_EXTRA_MEM_RATIO (10)
-
 /*
  * Helper functions to write or read unsigned long values to/from
  * memory, when the access may fault.
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index a33902d05e45..ac06ca32e9ef 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -416,9 +416,6 @@ void __init xen_vmalloc_p2m_tree(void)
xen_p2m_last_pfn = xen_max_p2m_pfn;
 
p2m_limit = (phys_addr_t)P2M_LIMIT * 1024 * 1024 * 1024 / PAGE_SIZE;
-   if (!p2m_limit && IS_ENABLED(CONFIG_XEN_UNPOPULATED_ALLOC))
-   p2m_limit = xen_start_info->nr_pages * XEN_EXTRA_MEM_RATIO;
-
vm.flags = VM_ALLOC;
vm.size = ALIGN(sizeof(unsigned long) * max(xen_max_p2m_pfn, p2m_limit),
PMD_SIZE * PMDS_PER_MID_PAGE);
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 1a3b75652fa4..7eab14d56369 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -59,6 +59,18 @@ static struct {
 } xen_remap_buf __initdata __aligned(PAGE_SIZE);
 static unsigned long xen_remap_mfn __initdata = INVALID_P2M_ENTRY;
 
+/* 
+ * The maximum amount of extra memory compared to the base size.  The
+ * main scaling factor is the size of struct page.  At extreme ratios
+ * of base:extra, all the base memory can be filled with page
+ * structures for the extra memory, leaving no space for anything
+ * else.
+ * 
+ * 10x seems like a reasonable balance between scaling flexibility and
+ * leaving a practically usable system.
+ */
+#define EXTRA_MEM_RATIO (10)
+
 static bool xen_512gb_limit __initdata = IS_ENABLED(CONFIG_XEN_512GB);
 
 static void __init xen_parse_512gb(void)
@@ -778,13 +790,20 @@ char * __init xen_memory_setup(void)
extra_pages += max_pages - max_pfn;
 
/*
-* Clamp the amount of extra memory to a XEN_EXTRA_MEM_RATIO
-* factor the base size.
+* Clamp the amount of extra memory to a EXTRA_MEM_RATIO
+* factor the base size.  On non-highmem systems, the base
+* size is the full initial memory allocation; on highmem it
+* is limited to the max size of lowmem, so that it doesn't
+* get completely filled.
 *
 * Make sure we have no memory above max_pages, as this area
 * isn't handled by the p2m management.
+*
+* In principle there could be a problem in lowmem systems if
+* the initial memory is also very large with respect to
+* lowmem, but we won't try to deal with that here.
 */
-   extra_pages = min3(XEN_EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
+   extra_pages = min3(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)),
   extra_pages, max_pages - max_pfn);
i = 0;
addr = xen_e820_table.entries[0].addr;
-- 
2.30.1



[PATCH 1/2] xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG

2021-03-17 Thread Roger Pau Monne
The Xen memory hotplug limit should depend on the memory hotplug
generic option, rather than the Xen balloon configuration. It's
possible to have a kernel with generic memory hotplug enabled, but
without Xen balloon enabled, at which point memory hotplug won't work
correctly due to the size limitation of the p2m.

Rename the option to XEN_MEMORY_HOTPLUG_LIMIT since it's no longer
tied to ballooning.

Fixes: 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated memory")
Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Jan Beulich 
Cc: xen-de...@lists.xenproject.org
---
 arch/x86/xen/p2m.c  | 4 ++--
 drivers/xen/Kconfig | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 17d80f751fcb..a33902d05e45 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -98,8 +98,8 @@ EXPORT_SYMBOL_GPL(xen_p2m_size);
 unsigned long xen_max_p2m_pfn __read_mostly;
 EXPORT_SYMBOL_GPL(xen_max_p2m_pfn);
 
-#ifdef CONFIG_XEN_BALLOON_MEMORY_HOTPLUG_LIMIT
-#define P2M_LIMIT CONFIG_XEN_BALLOON_MEMORY_HOTPLUG_LIMIT
+#ifdef CONFIG_XEN_MEMORY_HOTPLUG_LIMIT
+#define P2M_LIMIT CONFIG_XEN_MEMORY_HOTPLUG_LIMIT
 #else
 #define P2M_LIMIT 0
 #endif
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 41645fe6ad48..ea0efd290c37 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -50,11 +50,11 @@ config XEN_BALLOON_MEMORY_HOTPLUG
 
  SUBSYSTEM=="memory", ACTION=="add", RUN+="/bin/sh -c '[ -f /sys$devpath/state ] && echo online > /sys$devpath/state'"
 
-config XEN_BALLOON_MEMORY_HOTPLUG_LIMIT
+config XEN_MEMORY_HOTPLUG_LIMIT
int "Hotplugged memory limit (in GiB) for a PV guest"
default 512
depends on XEN_HAVE_PVMMU
-   depends on XEN_BALLOON_MEMORY_HOTPLUG
+   depends on MEMORY_HOTPLUG
help
  Maxmium amount of memory (in GiB) that a PV guest can be
  expanded to when using memory hotplug.
-- 
2.30.1



[PATCH 0/2] xen/x86: alternative fix for XSA-369

2021-03-17 Thread Roger Pau Monne
Hello,

This is a proposal for an alternative fix for XSA-369 that instead of
special casing XEN_UNPOPULATED_ALLOC to size the p2m relies on making
XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on the generic MEMORY_HOTPLUG
option rather than XEN_BALLOON_MEMORY_HOTPLUG.

I think this is safer, as we don't want to be special casing any option
that pulls in generic MEMORY_HOTPLUG without XEN_BALLOON_MEMORY_HOTPLUG.
Without this we would also need to at least special case ZONE_DEVICE
which also relies on MEMORY_HOTPLUG, and is what pulls the generic
MEMORY_HOTPLUG option even when XEN_BALLOON_MEMORY_HOTPLUG is disabled
with XEN_UNPOPULATED_ALLOC.

Thanks, Roger.

Roger Pau Monne (2):
  xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on
MEMORY_HOTPLUG
  Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"

 arch/x86/include/asm/xen/page.h | 12 
 arch/x86/xen/p2m.c  |  7 ++-
 arch/x86/xen/setup.c| 25 ++---
 drivers/xen/Kconfig |  4 ++--
 4 files changed, 26 insertions(+), 22 deletions(-)

-- 
2.30.1



Re: [PATCH v3 5/8] xen/events: link interdomain events to associated xenbus device

2021-02-22 Thread Roger Pau Monné
On Fri, Feb 19, 2021 at 04:40:27PM +0100, Juergen Gross wrote:
> In order to support the possibility of per-device event channel
> settings (e.g. lateeoi spurious event thresholds) add a xenbus device
> pointer to struct irq_info() and modify the related event channel
> binding interfaces to take the pointer to the xenbus device as a
> parameter instead of the domain id of the other side.
> 
> While at it remove the stale prototype of bind_evtchn_to_irq_lateeoi().
> 
> Signed-off-by: Juergen Gross 
> Reviewed-by: Boris Ostrovsky 
> Reviewed-by: Wei Liu 
> Reviewed-by: Paul Durrant 
> ---
>  drivers/block/xen-blkback/xenbus.c  |  2 +-

Reviewed-by: Roger Pau Monné 

Thanks, Roger.


Re: [PATCH v2] xen-blkback: fix compatibility bug with single page rings

2021-02-02 Thread Roger Pau Monné
On Thu, Jan 28, 2021 at 01:04:41PM +, Paul Durrant wrote:
> From: Paul Durrant 
> 
> Prior to commit 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid
> inconsistent xenstore 'ring-page-order' set by malicious blkfront"), the
> behaviour of xen-blkback when connecting to a frontend was:
> 
> - read 'ring-page-order'
> - if not present then expect a single page ring specified by 'ring-ref'
> - else expect a ring specified by 'ring-refX' where X is between 0 and
>   1 << ring-page-order
> 
> This was correct behaviour, but was broken by the aforementioned commit to
> become:
> 
> - read 'ring-page-order'
> - if not present then expect a single page ring (i.e. ring-page-order = 0)
> - expect a ring specified by 'ring-refX' where X is between 0 and
>   1 << ring-page-order
> - if that didn't work then see if there's a single page ring specified by
>   'ring-ref'
> 
> This incorrect behaviour works most of the time but fails when a frontend
> that sets 'ring-page-order' is unloaded and replaced by one that does not
> because, instead of reading 'ring-ref', xen-blkback will read the stale
> 'ring-ref0' left around by the previous frontend and will try to map the
> wrong grant reference.
> 
> This patch restores the original behaviour.
> 
> Fixes: 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid 
> inconsistent xenstore 'ring-page-order' set by malicious blkfront")
> Signed-off-by: Paul Durrant 
> ---
> Cc: Konrad Rzeszutek Wilk 
> Cc: "Roger Pau Monné" 
> Cc: Jens Axboe 
> Cc: Dongli Zhang 
> 
> v2:
>  - Remove now-spurious error path special-case when nr_grefs == 1
> ---
>  drivers/block/xen-blkback/common.h |  1 +
>  drivers/block/xen-blkback/xenbus.c | 38 +-
>  2 files changed, 17 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index b0c71d3a81a0..524a79f10de6 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -313,6 +313,7 @@ struct xen_blkif {
>  
>   struct work_struct  free_work;
> 	unsigned int		nr_ring_pages;
> +	bool            multi_ref;

You seem to have used spaces between the type and the variable name
here, while the neighbouring fields use hard tabs.

The rest LGTM:

Reviewed-by: Roger Pau Monné 

We should have forbidden the usage of ring-page-order = 0 and we could
have avoided having to add the multi_ref variable, but that's too late
now.

Thanks, Roger.
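
[Illustration: the negotiation order described in the quoted commit message above, written out as a simplified sketch; it is not the actual xen-blkback patch, which is only partially quoted in this message. xenbus_scanf() and the xenstore node names are real; the function name is a placeholder and the ring-ref reads are elided.]

#include <xen/xenbus.h>

/*
 * Sketch only: read ring-page-order first and let its presence decide
 * which ring-ref node names to look for, as the pre-4a8c31a1c6f5 code did.
 */
static unsigned int sketch_nr_ring_pages(struct xenbus_device *dev)
{
	unsigned int ring_page_order;

	if (xenbus_scanf(XBT_NIL, dev->otherend, "ring-page-order", "%u",
			 &ring_page_order) != 1)
		return 1;	/* node absent: single page ring named "ring-ref" */

	/* node present: ring specified by "ring-ref0".."ring-ref<N-1>" */
	return 1 << ring_page_order;
}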


Re: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-20 Thread Roger Pau Monné
On Wed, Jan 20, 2021 at 03:37:57PM +0100, Jan Beulich wrote:
> On 20.01.2021 15:35, Roger Pau Monné wrote:
> > On Wed, Jan 20, 2021 at 03:23:30PM +0100, Arthur Borsboom wrote:
> >> Hi Roger,
> >>
> >> I have set up a test environment based on Linux 5.11.0-rc4.
> >> The patch did not apply clean, so I copied/pasted the patch manually.
> >>
> >> Without the patch the call trace (as reported) is visible in dmesg.
> >> With the patch the call trace in dmesg is gone, but ... (there is always a
> >> but) ...
> >>
> >> Now the discard action returns the following.
> >>
> >> [arthur@test-arch ~]$ sudo fstrim -v /
> >> fstrim: /: the discard operation is not supported
> >>
> >> It might be correct, but of course I was hoping the Xen VM guest would pass
> >> on the discard request to the block device in the Xen VM host, which is a
> >> disk partition.
> >> Any suggestions?
> > 
> > Hm, that's not what I saw in my testing: the operation worked OK,
> > and that's what I would expect to happen in your case also, since I
> > know the xenstore keys.
> 
> Does discard work on partitions (and not just on entire disks)?
> It's been a while since I last had anything to do with this code,
> so I may well misremember.

The command provided did work for me, ie:

# fstrim -v /
/: 19.8 GiB (21190717440 bytes) trimmed

AFAICT the blkif discard interface we provide operates at physical
block granularity, so it's possible to discard a single block, and
hence should work against partitions.

Roger.


Re: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-20 Thread Roger Pau Monné
On Wed, Jan 20, 2021 at 03:23:30PM +0100, Arthur Borsboom wrote:
> Hi Roger,
> 
> I have set up a test environment based on Linux 5.11.0-rc4.
> The patch did not apply clean, so I copied/pasted the patch manually.
> 
> Without the patch the call trace (as reported) is visible in dmesg.
> With the patch the call trace in dmesg is gone, but ... (there is always a
> but) ...
> 
> Now the discard action returns the following.
> 
> [arthur@test-arch ~]$ sudo fstrim -v /
> fstrim: /: the discard operation is not supported
> 
> It might be correct, but of course I was hoping the Xen VM guest would pass
> on the discard request to the block device in the Xen VM host, which is a
> disk partition.
> Any suggestions?

Hm, that's not what I saw in my testing: the operation worked OK,
and that's what I would expect to happen in your case also, since I
know the xenstore keys.

I think it's possible your email client has mangled the patch, I'm
attaching the same patch to this email, could you try to apply it
again and report back? (this time it should apply cleanly)

Thanks, Roger.
>From f09acedb2e40fa84f31cfe4c960dea2037d06e3f Mon Sep 17 00:00:00 2001
From: Roger Pau Monne 
Date: Mon, 18 Jan 2021 11:47:05 +0100
Subject: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This is in line with the specification described in blkif.h:

 * discard-granularity: should be set to the physical block size if
   node is not present.
 * discard-alignment, discard-secure: should be set to 0 if node not
   present.

This was detected as QEMU would only create the discard-granularity
node but not discard-alignment, and thus the setup done in
blkfront_setup_discard would fail.

Fix blkfront_setup_discard to not fail on missing nodes, and also fix
blkif_set_queue_limits to set the discard granularity to the physical
block size if none is specified in xenbus.

Fixes: ed30bf317c5ce ('xen-blkfront: Handle discard requests.')
Reported-by: Arthur Borsboom 
Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Jens Axboe 
Cc: xen-de...@lists.xenproject.org
Cc: linux-bl...@vger.kernel.org
Cc: Arthur Borsboom 
---
Changes since v2:
 - Allow all discard-* nodes to be optional.
---
 drivers/block/xen-blkfront.c | 20 +++-
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5265975b3fba..e1c6798889f4 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -945,7 +945,8 @@ static void blkif_set_queue_limits(struct blkfront_info *info)
if (info->feature_discard) {
blk_queue_flag_set(QUEUE_FLAG_DISCARD, rq);
blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_granularity = info->discard_granularity ?:
+info->physical_sector_size;
rq->limits.discard_alignment = info->discard_alignment;
if (info->feature_secdiscard)
blk_queue_flag_set(QUEUE_FLAG_SECERASE, rq);
@@ -2179,19 +2180,12 @@ static void blkfront_closing(struct blkfront_info *info)
 
 static void blkfront_setup_discard(struct blkfront_info *info)
 {
-   int err;
-   unsigned int discard_granularity;
-   unsigned int discard_alignment;
-
info->feature_discard = 1;
-   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
-   "discard-granularity", "%u", &discard_granularity,
-   "discard-alignment", "%u", &discard_alignment,
-   NULL);
-   if (!err) {
-   info->discard_granularity = discard_granularity;
-   info->discard_alignment = discard_alignment;
-   }
+   info->discard_granularity = xenbus_read_unsigned(info->xbdev->otherend,
+"discard-granularity",
+0);
+   info->discard_alignment = xenbus_read_unsigned(info->xbdev->otherend,
+  "discard-alignment", 0);
info->feature_secdiscard =
!!xenbus_read_unsigned(info->xbdev->otherend, "discard-secure",
   0);
-- 
2.29.2



Re: [PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-19 Thread Roger Pau Monné
Forgot to Cc stable for the Fixes tag. Doing it now.

On Tue, Jan 19, 2021 at 11:57:27AM +0100, Roger Pau Monne wrote:
> This is in line with the specification described in blkif.h:
> 
>  * discard-granularity: should be set to the physical block size if
>node is not present.
>  * discard-alignment, discard-secure: should be set to 0 if node not
>present.
> 
> This was detected as QEMU would only create the discard-granularity
> node but not discard-alignment, and thus the setup done in
> blkfront_setup_discard would fail.
> 
> Fix blkfront_setup_discard to not fail on missing nodes, and also fix
> blkif_set_queue_limits to set the discard granularity to the physical
> block size if none is specified in xenbus.
> 
> Fixes: ed30bf317c5ce ('xen-blkfront: Handle discard requests.')
> Reported-by: Arthur Borsboom 
> Signed-off-by: Roger Pau Monné 
> ---
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Cc: Konrad Rzeszutek Wilk 
> Cc: "Roger Pau Monné" 
> Cc: Jens Axboe 
> Cc: xen-de...@lists.xenproject.org
> Cc: linux-bl...@vger.kernel.org
> Cc: Arthur Borsboom 
> ---
> Changes since v2:
>  - Allow all discard-* nodes to be optional.
> ---
>  drivers/block/xen-blkfront.c | 20 +++-
>  1 file changed, 7 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 5265975b3fba..e1c6798889f4 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -945,7 +945,8 @@ static void blkif_set_queue_limits(struct blkfront_info *info)
>   if (info->feature_discard) {
>   blk_queue_flag_set(QUEUE_FLAG_DISCARD, rq);
>   blk_queue_max_discard_sectors(rq, get_capacity(gd));
> - rq->limits.discard_granularity = info->discard_granularity;
> + rq->limits.discard_granularity = info->discard_granularity ?:
> +  info->physical_sector_size;
>   rq->limits.discard_alignment = info->discard_alignment;
>   if (info->feature_secdiscard)
>   blk_queue_flag_set(QUEUE_FLAG_SECERASE, rq);
> @@ -2179,19 +2180,12 @@ static void blkfront_closing(struct blkfront_info *info)
>  
>  static void blkfront_setup_discard(struct blkfront_info *info)
>  {
> - int err;
> - unsigned int discard_granularity;
> - unsigned int discard_alignment;
> -
>   info->feature_discard = 1;
> - err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> - "discard-granularity", "%u", &discard_granularity,
> - "discard-alignment", "%u", &discard_alignment,
> - NULL);
> - if (!err) {
> - info->discard_granularity = discard_granularity;
> - info->discard_alignment = discard_alignment;
> - }
> + info->discard_granularity = xenbus_read_unsigned(info->xbdev->otherend,
> +  "discard-granularity",
> +  0);
> + info->discard_alignment = xenbus_read_unsigned(info->xbdev->otherend,
> +"discard-alignment", 0);
>   info->feature_secdiscard =
>   !!xenbus_read_unsigned(info->xbdev->otherend, "discard-secure",
>  0);
> -- 
> 2.29.2
> 


[PATCH v2] xen-blkfront: allow discard-* nodes to be optional

2021-01-19 Thread Roger Pau Monne
This is in line with the specification described in blkif.h:

 * discard-granularity: should be set to the physical block size if
   node is not present.
 * discard-alignment, discard-secure: should be set to 0 if node not
   present.

This was detected as QEMU would only create the discard-granularity
node but not discard-alignment, and thus the setup done in
blkfront_setup_discard would fail.

Fix blkfront_setup_discard to not fail on missing nodes, and also fix
blkif_set_queue_limits to set the discard granularity to the physical
block size if none is specified in xenbus.

Fixes: ed30bf317c5ce ('xen-blkfront: Handle discard requests.')
Reported-by: Arthur Borsboom 
Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Jens Axboe 
Cc: xen-de...@lists.xenproject.org
Cc: linux-bl...@vger.kernel.org
Cc: Arthur Borsboom 
---
Changes since v2:
 - Allow all discard-* nodes to be optional.
---
 drivers/block/xen-blkfront.c | 20 +++-
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5265975b3fba..e1c6798889f4 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -945,7 +945,8 @@ static void blkif_set_queue_limits(struct blkfront_info *info)
if (info->feature_discard) {
blk_queue_flag_set(QUEUE_FLAG_DISCARD, rq);
blk_queue_max_discard_sectors(rq, get_capacity(gd));
-   rq->limits.discard_granularity = info->discard_granularity;
+   rq->limits.discard_granularity = info->discard_granularity ?:
+info->physical_sector_size;
rq->limits.discard_alignment = info->discard_alignment;
if (info->feature_secdiscard)
blk_queue_flag_set(QUEUE_FLAG_SECERASE, rq);
@@ -2179,19 +2180,12 @@ static void blkfront_closing(struct blkfront_info *info)
 
 static void blkfront_setup_discard(struct blkfront_info *info)
 {
-   int err;
-   unsigned int discard_granularity;
-   unsigned int discard_alignment;
-
info->feature_discard = 1;
-   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
-   "discard-granularity", "%u", &discard_granularity,
-   "discard-alignment", "%u", &discard_alignment,
-   NULL);
-   if (!err) {
-   info->discard_granularity = discard_granularity;
-   info->discard_alignment = discard_alignment;
-   }
+   info->discard_granularity = xenbus_read_unsigned(info->xbdev->otherend,
+"discard-granularity",
+0);
+   info->discard_alignment = xenbus_read_unsigned(info->xbdev->otherend,
+  "discard-alignment", 0);
info->feature_secdiscard =
!!xenbus_read_unsigned(info->xbdev->otherend, "discard-secure",
   0);
-- 
2.29.2



Re: [PATCH] xen-blkfront: don't make discard-alignment mandatory

2021-01-19 Thread Roger Pau Monné
On Tue, Jan 19, 2021 at 11:11:26AM +0100, Jürgen Groß wrote:
> On 19.01.21 11:06, Roger Pau Monné wrote:
> > On Tue, Jan 19, 2021 at 08:43:01AM +0100, Jürgen Groß wrote:
> > > On 18.01.21 16:15, Roger Pau Monne wrote:
> > > > Don't require the discard-alignment xenstore node to be present in
> > > > order to correctly setup the feature. This can happen with versions of
> > > > QEMU that only write the discard-granularity but not the
> > > > discard-alignment node.
> > > > 
> > > > Assume discard-alignment is 0 if not present. While there also fix the
> > > > logic to not enable the discard feature if discard-granularity is not
> > > > present.
> > > > 
> > > > Reported-by: Arthur Borsboom 
> > > > Signed-off-by: Roger Pau Monné 
> > > > ---
> > > > Cc: Boris Ostrovsky 
> > > > Cc: Juergen Gross 
> > > > Cc: Stefano Stabellini 
> > > > Cc: Konrad Rzeszutek Wilk 
> > > > Cc: "Roger Pau Monné" 
> > > > Cc: Jens Axboe 
> > > > Cc: xen-de...@lists.xenproject.org
> > > > Cc: linux-bl...@vger.kernel.org
> > > > Cc: Arthur Borsboom 
> > > > ---
> > > >drivers/block/xen-blkfront.c | 25 +
> > > >1 file changed, 13 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > > > index 5265975b3fba..5a93f7cc2939 100644
> > > > --- a/drivers/block/xen-blkfront.c
> > > > +++ b/drivers/block/xen-blkfront.c
> > > > @@ -2179,22 +2179,23 @@ static void blkfront_closing(struct blkfront_info *info)
> > > >static void blkfront_setup_discard(struct blkfront_info *info)
> > > >{
> > > > -   int err;
> > > > -   unsigned int discard_granularity;
> > > > -   unsigned int discard_alignment;
> > > > +   unsigned int discard_granularity = 0;
> > > > +   unsigned int discard_alignment = 0;
> > > > +   unsigned int discard_secure = 0;
> > > > -   info->feature_discard = 1;
> > > > -   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > > > +   xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > > > "discard-granularity", "%u", _granularity,
> > > > "discard-alignment", "%u", _alignment,
> > > > +   "discard-secure", "%u", _secure,
> > > > NULL);
> > > 
> > > This would mean that "discard-secure" will be evaluated only if the
> > > other two items are set in Xenstore. From blkif.h I can't see this is
> > > required, and your patch is modifying today's behavior in this regard.
> > > 
> > > You might want to have three xenbus_read_unsigned() calls instead.
> > 
> > You are right, discard-secure should be fetched regardless of whether
> > discard-alignment exists.
> > 
> > I can fetch discard-granularity and discard-alignment using
> > xenbus_gather and keep discard-secure using xenbus_read_unsigned. Let
> > me send a new version.
> 
> I'm still not convinced this is correct. blkif.h doesn't mention that
> discard-alignment will be valid only with discard-granularity being
> present.

No, in fact I think I need to rework this a little further. Just
having feature-discard = 1 should enable the discard functionality, by
setting discard-granularity = physical block size and
discard-alignment = 0.

Roger.


Re: [PATCH] xen-blkfront: don't make discard-alignment mandatory

2021-01-19 Thread Roger Pau Monné
On Tue, Jan 19, 2021 at 08:43:01AM +0100, Jürgen Groß wrote:
> On 18.01.21 16:15, Roger Pau Monne wrote:
> > Don't require the discard-alignment xenstore node to be present in
> > order to correctly setup the feature. This can happen with versions of
> > QEMU that only write the discard-granularity but not the
> > discard-alignment node.
> > 
> > Assume discard-alignment is 0 if not present. While there also fix the
> > logic to not enable the discard feature if discard-granularity is not
> > present.
> > 
> > Reported-by: Arthur Borsboom 
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Cc: Boris Ostrovsky 
> > Cc: Juergen Gross 
> > Cc: Stefano Stabellini 
> > Cc: Konrad Rzeszutek Wilk 
> > Cc: "Roger Pau Monné" 
> > Cc: Jens Axboe 
> > Cc: xen-de...@lists.xenproject.org
> > Cc: linux-bl...@vger.kernel.org
> > Cc: Arthur Borsboom 
> > ---
> >   drivers/block/xen-blkfront.c | 25 +
> >   1 file changed, 13 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > index 5265975b3fba..5a93f7cc2939 100644
> > --- a/drivers/block/xen-blkfront.c
> > +++ b/drivers/block/xen-blkfront.c
> > @@ -2179,22 +2179,23 @@ static void blkfront_closing(struct blkfront_info *info)
> >   static void blkfront_setup_discard(struct blkfront_info *info)
> >   {
> > -   int err;
> > -   unsigned int discard_granularity;
> > -   unsigned int discard_alignment;
> > +   unsigned int discard_granularity = 0;
> > +   unsigned int discard_alignment = 0;
> > +   unsigned int discard_secure = 0;
> > -   info->feature_discard = 1;
> > -   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > +   xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > "discard-granularity", "%u", &discard_granularity,
> > "discard-alignment", "%u", &discard_alignment,
> > +   "discard-secure", "%u", &discard_secure,
> > NULL);
> 
> This would mean that "discard-secure" will be evaluated only if the
> other two items are set in Xenstore. From blkif.h I can't see this is
> required, and your patch is modifying today's behavior in this regard.
> 
> You might want to have three xenbus_read_unsigned() calls instead.

You are right, discard-secure should be fetched regardless of whether
discard-alignment exists.

I can fetch discard-granularity and discard-alignment using
xenbus_gather and keep discard-secure using xenbus_read_unsigned. Let
me send a new version.

Thanks, Roger.


[PATCH] xen-blkfront: don't make discard-alignment mandatory

2021-01-18 Thread Roger Pau Monne
Don't require the discard-alignment xenstore node to be present in
order to correctly setup the feature. This can happen with versions of
QEMU that only write the discard-granularity but not the
discard-alignment node.

Assume discard-alignment is 0 if not present. While there also fix the
logic to not enable the discard feature if discard-granularity is not
present.

Reported-by: Arthur Borsboom 
Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Konrad Rzeszutek Wilk 
Cc: "Roger Pau Monné" 
Cc: Jens Axboe 
Cc: xen-de...@lists.xenproject.org
Cc: linux-bl...@vger.kernel.org
Cc: Arthur Borsboom 
---
 drivers/block/xen-blkfront.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5265975b3fba..5a93f7cc2939 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2179,22 +2179,23 @@ static void blkfront_closing(struct blkfront_info *info)
 
 static void blkfront_setup_discard(struct blkfront_info *info)
 {
-   int err;
-   unsigned int discard_granularity;
-   unsigned int discard_alignment;
+   unsigned int discard_granularity = 0;
+   unsigned int discard_alignment = 0;
+   unsigned int discard_secure = 0;
 
-   info->feature_discard = 1;
-   err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+   xenbus_gather(XBT_NIL, info->xbdev->otherend,
"discard-granularity", "%u", _granularity,
"discard-alignment", "%u", _alignment,
+   "discard-secure", "%u", _secure,
NULL);
-   if (!err) {
-   info->discard_granularity = discard_granularity;
-   info->discard_alignment = discard_alignment;
-   }
-   info->feature_secdiscard =
-   !!xenbus_read_unsigned(info->xbdev->otherend, "discard-secure",
-  0);
+
+   if (!discard_granularity)
+   return;
+
+   info->feature_discard = 1;
+   info->discard_granularity = discard_granularity;
+   info->discard_alignment = discard_alignment;
+   info->feature_secdiscard = !!discard_secure;
 }
 
 static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo)
-- 
2.29.2



[PATCH v2] xen/privcmd: allow fetching resource sizes

2021-01-12 Thread Roger Pau Monne
Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
addr = 0 in order to fetch the size of a specific resource.

Add a shortcut to the default map resource path, since fetching the
size requires no address to be passed in, and thus no VMA to setup.

This is missing from the initial implementation, and causes issues
when mapping resources that don't have fixed or known sizes.

Signed-off-by: Roger Pau Monné 
Cc: sta...@vger.kernel.org # >= 4.18
---
NB: fetching the size of a resource shouldn't trigger an hypercall
preemption, and hence I've dropped the preempt indications.
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Paul Durrant 
Cc: am...@cam.ac.uk
Cc: andrew.coop...@citrix.com
Cc: xen-de...@lists.xenproject.org
---
Changes since v1:
 - Remove Fixes tag, add backport.
 - Make sure both addr and num are set or unset.
---
 drivers/xen/privcmd.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index b0c73c58f987..720a7b7abd46 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -717,14 +717,15 @@ static long privcmd_ioctl_restrict(struct file *file, 
void __user *udata)
return 0;
 }
 
-static long privcmd_ioctl_mmap_resource(struct file *file, void __user *udata)
+static long privcmd_ioctl_mmap_resource(struct file *file,
+   struct privcmd_mmap_resource __user *udata)
 {
struct privcmd_data *data = file->private_data;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct privcmd_mmap_resource kdata;
xen_pfn_t *pfns = NULL;
-   struct xen_mem_acquire_resource xdata;
+   struct xen_mem_acquire_resource xdata = { };
int rc;
 
if (copy_from_user(&kdata, udata, sizeof(kdata)))
@@ -734,6 +735,22 @@ static long privcmd_ioctl_mmap_resource(struct file *file, 
void __user *udata)
if (data->domid != DOMID_INVALID && data->domid != kdata.dom)
return -EPERM;
 
+   /* Both fields must be set or unset */
+   if (!!kdata.addr != !!kdata.num)
+   return -EINVAL;
+
+   xdata.domid = kdata.dom;
+   xdata.type = kdata.type;
+   xdata.id = kdata.id;
+
+   if (!kdata.addr && !kdata.num) {
+   /* Query the size of the resource. */
+   rc = HYPERVISOR_memory_op(XENMEM_acquire_resource, &xdata);
+   if (rc)
+   return rc;
+   return __put_user(xdata.nr_frames, &udata->num);
+   }
+
mmap_write_lock(mm);
 
vma = find_vma(mm, kdata.addr);
@@ -768,10 +785,6 @@ static long privcmd_ioctl_mmap_resource(struct file *file, 
void __user *udata)
} else
vma->vm_private_data = PRIV_VMA_LOCKED;
 
-   memset(&xdata, 0, sizeof(xdata));
-   xdata.domid = kdata.dom;
-   xdata.type = kdata.type;
-   xdata.id = kdata.id;
xdata.frame = kdata.idx;
xdata.nr_frames = kdata.num;
set_xen_guest_handle(xdata.frame_list, pfns);
-- 
2.29.2
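
For illustration, a userspace caller could use the new query path roughly
as below; the device node path, header location and helper name are
assumptions of this sketch, not part of the patch:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xen/privcmd.h>	/* IOCTL_PRIVCMD_MMAP_RESOURCE */

/* Returns the number of frames of the resource, or -1 on error. */
static long query_resource_frames(int domid, unsigned int type, unsigned int id)
{
	struct privcmd_mmap_resource req = {
		.dom = domid,
		.type = type,
		.id = id,
		/* .idx, .num and .addr left at 0: query the size, map nothing */
	};
	int fd = open("/dev/xen/privcmd", O_RDWR);
	long ret = -1;

	if (fd < 0)
		return -1;
	if (!ioctl(fd, IOCTL_PRIVCMD_MMAP_RESOURCE, &req))
		ret = (long)req.num;
	close(fd);
	return ret;
}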



Re: [PATCH] xen/privcmd: allow fetching resource sizes

2021-01-12 Thread Roger Pau Monné
On Tue, Jan 12, 2021 at 06:57:30AM +0100, Jürgen Groß wrote:
> On 11.01.21 16:29, Roger Pau Monne wrote:
> > Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
> > addr = 0 in order to fetch the size of a specific resource.
> > 
> > Add a shortcut to the default map resource path, since fetching the
> > size requires no address to be passed in, and thus no VMA to setup.
> > 
> > Fixes: 3ad0876554caf ('xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE')
> 
> I don't think this addition is a reason to add a "Fixes:" tag. This is
> clearly new functionality.

It could be argued that not being able to query the resource size was
a shortcoming of the original implementation, but a backport request to
stable would be more appropriate than a fixes tag I think. Will drop
on next version and add a backport request if you agree.

Thanks, Roger.


[PATCH] xen/privcmd: allow fetching resource sizes

2021-01-11 Thread Roger Pau Monne
Allow issuing an IOCTL_PRIVCMD_MMAP_RESOURCE ioctl with num = 0 and
addr = 0 in order to fetch the size of a specific resource.

Add a shortcut to the default map resource path, since fetching the
size requires no address to be passed in, and thus no VMA to setup.

Fixes: 3ad0876554caf ('xen/privcmd: add IOCTL_PRIVCMD_MMAP_RESOURCE')
Signed-off-by: Roger Pau Monné 
---
NB: fetching the size of a resource shouldn't trigger an hypercall
preemption, and hence I've dropped the preempt indications.
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Paul Durrant 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/privcmd.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index b0c73c58f987..a6e7e6e4286f 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -717,14 +717,15 @@ static long privcmd_ioctl_restrict(struct file *file, 
void __user *udata)
return 0;
 }
 
-static long privcmd_ioctl_mmap_resource(struct file *file, void __user *udata)
+static long privcmd_ioctl_mmap_resource(struct file *file,
+   struct privcmd_mmap_resource __user *udata)
 {
struct privcmd_data *data = file->private_data;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
struct privcmd_mmap_resource kdata;
xen_pfn_t *pfns = NULL;
-   struct xen_mem_acquire_resource xdata;
+   struct xen_mem_acquire_resource xdata = { };
int rc;
 
if (copy_from_user(&kdata, udata, sizeof(kdata)))
@@ -734,6 +735,18 @@ static long privcmd_ioctl_mmap_resource(struct file *file, 
void __user *udata)
if (data->domid != DOMID_INVALID && data->domid != kdata.dom)
return -EPERM;
 
+   xdata.domid = kdata.dom;
+   xdata.type = kdata.type;
+   xdata.id = kdata.id;
+
+   if (!kdata.addr && !kdata.num) {
+   /* Query the size of the resource. */
+   rc = HYPERVISOR_memory_op(XENMEM_acquire_resource, &xdata);
+   if (rc)
+   return rc;
+   return __put_user(xdata.nr_frames, &udata->num);
+   }
+
mmap_write_lock(mm);
 
vma = find_vma(mm, kdata.addr);
@@ -768,10 +781,6 @@ static long privcmd_ioctl_mmap_resource(struct file *file, 
void __user *udata)
} else
vma->vm_private_data = PRIV_VMA_LOCKED;
 
-   memset(&xdata, 0, sizeof(xdata));
-   xdata.domid = kdata.dom;
-   xdata.type = kdata.type;
-   xdata.id = kdata.id;
xdata.frame = kdata.idx;
xdata.nr_frames = kdata.num;
set_xen_guest_handle(xdata.frame_list, pfns);
-- 
2.29.2



Re: [PATCH] xen: fix: use WARN_ON instead of if condition followed by BUG.

2020-12-30 Thread Roger Pau Monné
On Wed, Dec 30, 2020 at 02:38:06PM +0800, YANG LI wrote:
> Use WARN_ON instead of if condition followed by BUG in
> gnttab_batch_map() and gnttab_batch_copy().

But those are not equivalent as far as I'm aware. BUG will stop
execution, while WARN_ON will print a splat and continue executing.

If switching to WARN_ON is indeed fine it needs to be explained in the
commit message that returning to the caller(s) with
HYPERVISOR_grant_table_op having returned an error code is fine, and
that it's not going to create other issues, like memory corruption or
leaks.

Thanks, Roger.
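
For reference, the two patterns differ roughly as follows (sketch modelled
on the shape of gnttab_batch_map(), not the literal diff):

/* Current code: if the hypercall itself fails, halt the kernel. */
if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, batch, count))
	BUG();

/*
 * Proposed change: print a warning plus backtrace and keep going, which
 * means returning to the caller even though the batch was never mapped.
 */
WARN_ON(HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, batch, count));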


Re: [PATCH 2/2] xen: don't use page->lru for ZONE_DEVICE memory

2020-12-10 Thread Roger Pau Monné
On Tue, Dec 08, 2020 at 07:45:00AM +0100, Jürgen Groß wrote:
> On 07.12.20 21:48, Jason Andryuk wrote:
> > On Mon, Dec 7, 2020 at 8:30 AM Juergen Gross  wrote:
> > > 
> > > Commit 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated
> > > memory") introduced usage of ZONE_DEVICE memory for foreign memory
> > > mappings.
> > > 
> > > Unfortunately this collides with using page->lru for Xen backend
> > > private page caches.
> > > 
> > > Fix that by using page->zone_device_data instead.
> > > 
> > > Fixes: 9e2369c06c8a18 ("xen: add helpers to allocate unpopulated memory")
> > > Signed-off-by: Juergen Gross 
> > 
> > Would it make sense to add BUG_ON(is_zone_device_page(page)) and the
> > opposite as appropriate to cache_enq?
> 
> No, I don't think so. At least in the CONFIG_ZONE_DEVICE case the
> initial list in a PV dom0 is populated from extra memory (basically
> the same, but not marked as zone device memory explicitly).

I assume it's fine for us to then use page->zone_device_data even if
the page is not explicitly marked as ZONE_DEVICE memory?

If that's fine, add my:

Acked-by: Roger Pau Monné 

To both patches, and thank you very much for fixing this!

Roger.


Re: [PATCH v3 3/6] mm: support THP migration to device private memory

2020-12-05 Thread Roger Pau Monné
On Wed, Dec 02, 2020 at 11:08:54AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 20, 2020 at 04:01:33PM -0400, Jason Gunthorpe wrote:
> > On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> > 
> > > MEMORY_DEVICE_GENERIC:
> > > Struct pages are created in dev_dax_probe() and represent non-volatile 
> > > memory.
> > > The device can be mmap()'ed which calls dax_mmap() which sets
> > > vma->vm_flags | VM_HUGEPAGE.
> > > A CPU page fault will result in a PTE, PMD, or PUD sized page
> > > (but not compound) to be inserted by vmf_insert_mixed() which will call 
> > > either
> > > insert_pfn() or insert_page().
> > > Neither insert_pfn() nor insert_page() increments the page reference
> > > count.
> > 
> > But why was this done? It seems very strange to put a pfn with a
> > struct page into a VMA and then deliberately not take the refcount for
> > the duration of that pfn being in the VMA?
> > 
> > What prevents memunmap_pages() from progressing while VMAs still point
> > at the memory?
> 
> Agreed.  Adding Roger who added MEMORY_DEVICE_GENERIC and the only
> user.

MEMORY_DEVICE_GENERIC is just a rename of the previous
MEMORY_DEVICE_DEVDAX, and seems to be used by the DAX device apart
from Xen?

Its main purpose is to be able to allocate unused physical memory
ranges and have a backing struct page for them, so they can be used to
map foreign memory when running on Xen.

I'm currently on leave and won't be back until the end of the month,
could you please Cc the Xen maintainers if you modify the logic here
in order to make sure it will work for Xen?

Thanks, Roger.


Re: [PATCH 058/141] xen-blkfront: Fix fall-through warnings for Clang

2020-11-23 Thread Roger Pau Monné
On Fri, Nov 20, 2020 at 12:32:58PM -0600, Gustavo A. R. Silva wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
> by explicitly adding a break statement instead of letting the code fall
> through to the next case.
> 
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/block/xen-blkfront.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 48629d3433b4..34b028be78ab 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -2462,6 +2462,7 @@ static void blkback_changed(struct xenbus_device *dev,
>   break;
>   if (talk_to_blkback(dev, info))
>   break;
> + break;

I would have added a fallthrough like it's done below in
XenbusStateClosed.

Also, FWIW, I think clang's fallthrough warnings are a bit too verbose.
Falling through to a break like the case here shouldn't cause a
warning IMO; falling through to anything other than a break should cause
those warnings to appear.

Thanks, Roger.
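
A small standalone illustration of the two options (all names here are
made up; in kernel code the fallthrough pseudo-keyword comes from
<linux/compiler_attributes.h>):

#ifndef fallthrough
#define fallthrough __attribute__((__fallthrough__))
#endif

enum dev_state { DEV_CONNECTED, DEV_CLOSED, DEV_CLOSING };

static void state_changed(enum dev_state state, int busy)
{
	switch (state) {
	case DEV_CONNECTED:
		if (busy)
			break;
		break;		/* option 1: end the case with an explicit break */
	case DEV_CLOSED:
		if (busy)
			break;
		fallthrough;	/* option 2: annotate the intended fall-through */
	case DEV_CLOSING:
		/* tear the device down */
		break;
	}
}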


Re: [PATCH 1/2] xen: Remove Xen PVH/PVHVM dependency on PCI

2020-10-15 Thread Roger Pau Monné
On Thu, Oct 15, 2020 at 05:02:21PM +0200, Jan Beulich wrote:
> On 15.10.2020 16:59, Jason Andryuk wrote:
> > On Thu, Oct 15, 2020 at 4:10 AM Jan Beulich  wrote:
> >>
> >> On 14.10.2020 19:53, Jason Andryuk wrote:
> >>> @@ -76,7 +80,9 @@ config XEN_DEBUG_FS
> >>> Enabling this option may incur a significant performance overhead.
> >>>
> >>>  config XEN_PVH
> >>> - bool "Support for running as a Xen PVH guest"
> >>> + bool "Xen PVH guest support"
> >>
> >> Tangential question: Is "guest" here still appropriate, i.e.
> >> isn't this option also controlling whether the kernel can be
> >> used in a PVH Dom0?
> > 
> > Would you like something more generic like "Xen PVH support" and
> > "Support for running in Xen PVH mode"?
> 
> Yeah, just dropping "guest" would be fine with me. No idea how
> to reflect that PVH Dom0 isn't supported, yet.

The fact that it isn't supported by Xen shouldn't be reflected on the
Linux configuration, as it's independent. Ie: you could run this Linux
kernel on a future version of Xen where PVH dom0 is supported.

There's already a warning printed by Xen when booting PVH dom0 about
not being a supported mode.

Thanks, Roger.


Re: [PATCH 2/2] xen/blkback: turn the cache purge percent into a parameter

2020-10-15 Thread Roger Pau Monné
On Thu, Oct 15, 2020 at 04:37:52PM +0200, Jürgen Groß wrote:
> On 15.10.20 16:24, Roger Pau Monne wrote:
> > Assume that reads and writes to the variable will be atomic. The worst
> > that could happen is that one of the purges removes a partially
> > written percentage of grants, but the cache itself will recover.
> > 
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Cc: Konrad Rzeszutek Wilk 
> > Cc: Jens Axboe 
> > Cc: Boris Ostrovsky 
> > Cc: SeongJae Park 
> > Cc: xen-de...@lists.xenproject.org
> > Cc: linux-bl...@vger.kernel.org
> > Cc: J. Roeleveld 
> > Cc: Jürgen Groß 
> > ---
> >   Documentation/ABI/testing/sysfs-driver-xen-blkback | 9 +
> >   drivers/block/xen-blkback/blkback.c| 7 +--
> >   2 files changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> > b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > index 776f25d335ca..7de791ad61f9 100644
> > --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > @@ -45,3 +45,12 @@ Description:
> >   to be executed periodically. This parameter controls the time
> >   interval, in ms, between consecutive executions of the purge
> >   mechanism.
> > +
> > +What:   /sys/module/xen_blkback/parameters/lru_percent_clean
> > +Date:   October 2020
> > +KernelVersion:  5.10
> > +Contact:Roger Pau Monné 
> > +Description:
> > +When the persistent grants list is full we will remove 
> > unused
> > +grants from the list. The percent number of grants to be
> > +removed at each LRU execution.
> > diff --git a/drivers/block/xen-blkback/blkback.c 
> > b/drivers/block/xen-blkback/blkback.c
> > index 6ad9b76fdb2b..772852d45a5a 100644
> > --- a/drivers/block/xen-blkback/blkback.c
> > +++ b/drivers/block/xen-blkback/blkback.c
> > @@ -127,7 +127,10 @@ MODULE_PARM_DESC(lru_interval,
> >* from the list. The percent number of grants to be removed at each LRU
> >* execution.
> >*/
> > -#define LRU_PERCENT_CLEAN 5
> > +static unsigned int lru_percent_clean = 5;
> > +module_param_named(lru_percent_clean, lru_percent_clean, uint, 0644);
> > +MODULE_PARM_DESC(lru_percent_clean,
> > +"Percentage of persistent grants to remove from the cache when 
> > full");
> >   /* Run-time switchable: /sys/module/blkback/parameters/ */
> >   static unsigned int log_stats;
> > @@ -404,7 +407,7 @@ static void purge_persistent_gnt(struct xen_blkif_ring 
> > *ring)
> > !ring->blkif->vbd.overflow_max_grants)) {
> > num_clean = 0;
> > } else {
> > -   num_clean = (max_pgrants / 100) * LRU_PERCENT_CLEAN;
> > +   num_clean = (max_pgrants / 100) * lru_percent_clean;
> 
> Hmm, wouldn't it be better to use (max_grants * lru_percent_clean) / 100
> here in order to support max_grants values less than 100?

Yes, we should have done that when moving max_pgrants to a parameter;
it used to be a fixed value before.

Will do in next version since I'm already changing the line.

Thanks, Roger.
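
A quick worked example of why the ordering matters with integer
arithmetic (values chosen arbitrarily):

unsigned int max_pgrants = 60, lru_percent_clean = 5;

/* divide first: (60 / 100) * 5 == 0, so nothing would ever be purged */
unsigned int a = (max_pgrants / 100) * lru_percent_clean;

/* multiply first: (60 * 5) / 100 == 3 grants purged per pass */
unsigned int b = (max_pgrants * lru_percent_clean) / 100;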


[PATCH 2/2] xen/blkback: turn the cache purge percent into a parameter

2020-10-15 Thread Roger Pau Monne
Assume that reads and writes to the variable will be atomic. The worst
that could happen is that one of the purges removes a partially
written percentage of grants, but the cache itself will recover.

Signed-off-by: Roger Pau Monné 
---
Cc: Konrad Rzeszutek Wilk 
Cc: Jens Axboe 
Cc: Boris Ostrovsky 
Cc: SeongJae Park 
Cc: xen-de...@lists.xenproject.org
Cc: linux-bl...@vger.kernel.org
Cc: J. Roeleveld 
Cc: Jürgen Groß 
---
 Documentation/ABI/testing/sysfs-driver-xen-blkback | 9 +
 drivers/block/xen-blkback/blkback.c| 7 +--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
b/Documentation/ABI/testing/sysfs-driver-xen-blkback
index 776f25d335ca..7de791ad61f9 100644
--- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
+++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
@@ -45,3 +45,12 @@ Description:
 to be executed periodically. This parameter controls the time
 interval, in ms, between consecutive executions of the purge
 mechanism.
+
+What:   /sys/module/xen_blkback/parameters/lru_percent_clean
+Date:   October 2020
+KernelVersion:  5.10
+Contact:Roger Pau Monné 
+Description:
+When the persistent grants list is full we will remove unused
+grants from the list. The percent number of grants to be
+removed at each LRU execution.
diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index 6ad9b76fdb2b..772852d45a5a 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -127,7 +127,10 @@ MODULE_PARM_DESC(lru_interval,
  * from the list. The percent number of grants to be removed at each LRU
  * execution.
  */
-#define LRU_PERCENT_CLEAN 5
+static unsigned int lru_percent_clean = 5;
+module_param_named(lru_percent_clean, lru_percent_clean, uint, 0644);
+MODULE_PARM_DESC(lru_percent_clean,
+"Percentage of persistent grants to remove from the cache when 
full");
 
 /* Run-time switchable: /sys/module/blkback/parameters/ */
 static unsigned int log_stats;
@@ -404,7 +407,7 @@ static void purge_persistent_gnt(struct xen_blkif_ring 
*ring)
!ring->blkif->vbd.overflow_max_grants)) {
num_clean = 0;
} else {
-   num_clean = (max_pgrants / 100) * LRU_PERCENT_CLEAN;
+   num_clean = (max_pgrants / 100) * lru_percent_clean;
num_clean = ring->persistent_gnt_c - max_pgrants + num_clean;
num_clean = min(ring->persistent_gnt_c, num_clean);
pr_debug("Going to purge at least %u persistent grants\n",
-- 
2.28.0



[PATCH 1/2] xen/blkback: turn the cache purge LRU interval into a parameter

2020-10-15 Thread Roger Pau Monne
Assume that reads and writes to the variable will be atomic. The worst
that could happen is that one of the LRU intervals is not calculated
properly if a partially written value is read, but that would only be
a transient issue.

Signed-off-by: Roger Pau Monné 
---
Cc: Konrad Rzeszutek Wilk 
Cc: Jens Axboe 
Cc: Boris Ostrovsky 
Cc: SeongJae Park 
Cc: xen-de...@lists.xenproject.org
Cc: linux-bl...@vger.kernel.org
Cc: J. Roeleveld 
Cc: Jürgen Groß 
---
 Documentation/ABI/testing/sysfs-driver-xen-blkback | 10 ++
 drivers/block/xen-blkback/blkback.c|  9 ++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
b/Documentation/ABI/testing/sysfs-driver-xen-blkback
index ecb7942ff146..776f25d335ca 100644
--- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
+++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
@@ -35,3 +35,13 @@ Description:
 controls the duration in milliseconds that blkback will not
 cache any page not backed by a grant mapping.
 The default is 10ms.
+
+What:   /sys/module/xen_blkback/parameters/lru_interval
+Date:   October 2020
+KernelVersion:  5.10
+Contact:Roger Pau Monné 
+Description:
+The LRU mechanism to clean the lists of persistent grants needs
+to be executed periodically. This parameter controls the time
+interval, in ms, between consecutive executions of the purge
+mechanism.
diff --git a/drivers/block/xen-blkback/blkback.c 
b/drivers/block/xen-blkback/blkback.c
index adfc9352351d..6ad9b76fdb2b 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -117,7 +117,10 @@ MODULE_PARM_DESC(max_ring_page_order, "Maximum order of 
pages to be used for the
  * be executed periodically. The time interval between consecutive executions
  * of the purge mechanism is set in ms.
  */
-#define LRU_INTERVAL 100
+static unsigned int lru_interval = 100;
+module_param_named(lru_interval, lru_interval, uint, 0644);
+MODULE_PARM_DESC(lru_interval,
+"Time interval between consecutive executions of the cache 
purge mechanism (in ms)");
 
 /*
  * When the persistent grants list is full we will remove unused grants
@@ -620,7 +623,7 @@ int xen_blkif_schedule(void *arg)
if (unlikely(vbd->size != vbd_sz(vbd)))
xen_vbd_resize(blkif);
 
-   timeout = msecs_to_jiffies(LRU_INTERVAL);
+   timeout = msecs_to_jiffies(lru_interval);
 
timeout = wait_event_interruptible_timeout(
ring->wq,
@@ -650,7 +653,7 @@ int xen_blkif_schedule(void *arg)
if (blkif->vbd.feature_gnt_persistent &&
time_after(jiffies, ring->next_lru)) {
purge_persistent_gnt(ring);
-   ring->next_lru = jiffies + 
msecs_to_jiffies(LRU_INTERVAL);
+   ring->next_lru = jiffies + 
msecs_to_jiffies(lru_interval);
}
 
/* Shrink the free pages pool if it is too large. */
-- 
2.28.0



[PATCH 0/2] xen/blkback: add LRU purge parameters

2020-10-15 Thread Roger Pau Monne
Add the LRU parameters as tunables.

Roger Pau Monne (2):
  xen/blkback: turn the cache purge LRU interval into a parameter
  xen/blkback: turn the cache purge percent into a parameter

 .../ABI/testing/sysfs-driver-xen-blkback  | 19 +++
 drivers/block/xen-blkback/blkback.c   | 16 +++-
 2 files changed, 30 insertions(+), 5 deletions(-)

-- 
2.28.0



Re: [PATCH] xen-blkback: add a parameter for disabling of persistent grants

2020-09-24 Thread Roger Pau Monné
On Thu, Sep 24, 2020 at 12:27:14PM +0200, SeongJae Park wrote:
> On Thu, 24 Sep 2020 12:13:44 +0200 "Roger Pau Monné"  
> wrote:
> 
> > On Wed, Sep 23, 2020 at 04:09:30PM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Sep 22, 2020 at 09:01:25AM +0200, SeongJae Park wrote:
> > > > From: SeongJae Park 
> > > > 
> > > > Persistent grants feature provides high scalability.  On some small
> > > > systems, however, it could incur data copy overhead[1] and thus it is
> > > > required to be disabled.  But, there is no option to disable it.  For
> > > > the reason, this commit adds a module parameter for disabling of the
> > > > feature.
> > > 
> > > Would it be better suited to have it per guest?
> > 
> > I think having a per-backend policy that could be specified at the
> > toolstack level would be nice, but I see that as a further
> > improvement.
> 
> Agreed.
> 
> > 
> > Having a global backend domain policy of whether persistent grants are
> > enabled or not seems desirable, and if someone wants even more fine
> > grained control this change is AFAICT not incompatible with a
> > per-backend option anyway.
> 
> I think we could extend this design by receiving list of exceptional domains.
> For example, if 'feature_persistent' is True and exceptions list has '123,
> 456', domains of domid 123 and 456 will not use persistent grants, and vice
> versa.

I think that would be quite fragile; I wouldn't recommend relying
on domain IDs.

What I would do instead is add a new attribute to
xl-disk-configuration [0] that allows setting the persistent grants
usage on a per-disk basis, and that should be passed to blkback in a
xenstore node.

> I could implement this, but... to be honest, I don't really understand the
> needs of the fine-grained control.  AFAIU, the problem is 'scalability' vs
> 'data copy overhead'.  So, only small systems would want to turn persistent
> grants off.  In such a small system, why would we need fine-grained control?
> I'm worrying if I would implement and maintain a feature without real use 
> case.
> 
> For the reason, I'd like to suggest to keep this as is for now and expand it
> with the 'exceptions list' idea or something better, if a real use case comes
> out later.

I agree. I'm happy to take patches to implement more fine grained
control, but that shouldn't prevent us from having a global policy if
that's useful to users.

Roger.

[0] https://xenbits.xen.org/docs/unstable/man/xl-disk-configuration.5.html
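
As a rough sketch of how such a per-disk toolstack knob could reach
blkback (the xenstore node name used here is purely an assumption, and
dev, xbt and err come from the surrounding connect() context):

/*
 * Per-disk policy written by the toolstack in the backend directory,
 * falling back to the module-wide default when the node is absent.
 */
bool persistent = xenbus_read_unsigned(dev->nodename,
				       "persistent-grants-policy",
				       feature_persistent);

if (persistent)
	err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);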


Re: [PATCH] xen-blkback: add a parameter for disabling of persistent grants

2020-09-24 Thread Roger Pau Monné
On Wed, Sep 23, 2020 at 04:09:30PM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Sep 22, 2020 at 09:01:25AM +0200, SeongJae Park wrote:
> > From: SeongJae Park 
> > 
> > Persistent grants feature provides high scalability.  On some small
> > systems, however, it could incur data copy overhead[1] and thus it is
> > required to be disabled.  But, there is no option to disable it.  For
> > the reason, this commit adds a module parameter for disabling of the
> > feature.
> 
> Would it be better suited to have it per guest?

I think having a per-backend policy that could be specified at the
toolstack level would be nice, but I see that as a further
improvement.

Having a global backend domain policy of whether persistent grants are
enabled or not seems desirable, and if someone wants even more fine
grained control this change is AFAICT not incompatible with a
per-backend option anyway.

Roger.


Re: [PATCH v3 0/3] xen-blk(back|front): Let users disable persistent grants

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 04:15:46PM +0200, SeongJae Park wrote:
> From: SeongJae Park 
> 
> Persistent grants feature provides high scalability.  On some small
> systems, however, it could incur data copy overheads[1] and thus it is
> required to be disabled.  But, there is no option to disable it.  For
> the reason, this commit adds module parameters for disabling of the
> feature.

With the changes requested by Jürgen you can add my:

Acked-by: Roger Pau Monné 

Thanks, Roger.


Re: [PATCH v3 3/3] xen-blkfront: Apply changed parameter name to the document

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 04:27:39PM +0200, Jürgen Groß wrote:
> On 22.09.20 16:15, SeongJae Park wrote:
> > From: SeongJae Park 
> > 
> > Commit 14e710fe7897 ("xen-blkfront: rename indirect descriptor
> > parameter") changed the name of the module parameter for the maximum
> > amount of segments in indirect requests but missed updating the
> > document.  This commit updates the document.
> > 
> > Signed-off-by: SeongJae Park 
> 
> Reviewed-by: Juergen Gross 

Does this need to be backported to stable branches?

Thanks, Roger.


Re: [PATCH v2 1/3] xen-blkback: add a parameter for disabling of persistent grants

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 01:26:38PM +0200, SeongJae Park wrote:
> On Tue, 22 Sep 2020 13:12:59 +0200 "Roger Pau Monné"  
> wrote:
> 
> > On Tue, Sep 22, 2020 at 12:52:07PM +0200, SeongJae Park wrote:
> > > From: SeongJae Park 
> > > 
> > > Persistent grants feature provides high scalability.  On some small
> > > systems, however, it could incur data copy overheads[1] and thus it is
> > > required to be disabled.  But, there is no option to disable it.  For
> > > the reason, this commit adds a module parameter for disabling of the
> > > feature.
> > > 
> > > [1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
> > > 
> > > Signed-off-by: Anthony Liguori 
> > > Signed-off-by: SeongJae Park 
> > > ---
> > >  .../ABI/testing/sysfs-driver-xen-blkback  |  9 ++
> > >  drivers/block/xen-blkback/xenbus.c| 28 ++-
> > >  2 files changed, 30 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> > > b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > > index ecb7942ff146..ac2947b98950 100644
> > > --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > > +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > > @@ -35,3 +35,12 @@ Description:
> > >  controls the duration in milliseconds that blkback will 
> > > not
> > >  cache any page not backed by a grant mapping.
> > >  The default is 10ms.
> > > +
> > > +What:   /sys/module/xen_blkback/parameters/feature_persistent
> > > +Date:   September 2020
> > > +KernelVersion:  5.10
> > > +Contact:SeongJae Park 
> > > +Description:
> > > +Whether to enable the persistent grants feature or not.  
> > > Note
> > > +that this option only takes effect on newly created 
> > > backends.
> > > +The default is Y (enable).
> > > diff --git a/drivers/block/xen-blkback/xenbus.c 
> > > b/drivers/block/xen-blkback/xenbus.c
> > > index b9aa5d1ac10b..8a95ddd08b13 100644
> > > --- a/drivers/block/xen-blkback/xenbus.c
> > > +++ b/drivers/block/xen-blkback/xenbus.c
> > > @@ -879,6 +879,12 @@ static void reclaim_memory(struct xenbus_device *dev)
> > >  
> > >  /* ** Connection ** */
> > >  
> > > +/* Enable the persistent grants feature. */
> > > +static bool feature_persistent = true;
> > > +module_param(feature_persistent, bool, 0644);
> > > +MODULE_PARM_DESC(feature_persistent,
> > > + "Enables the persistent grants feature");
> > > +
> > >  /*
> > >   * Write the physical details regarding the block device to the store, 
> > > and
> > >   * switch to Connected state.
> > > @@ -906,11 +912,15 @@ static void connect(struct backend_info *be)
> > >  
> > >   xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);
> > >  
> > > - err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
> > > - if (err) {
> > > - xenbus_dev_fatal(dev, err, "writing %s/feature-persistent",
> > > -  dev->nodename);
> > > - goto abort;
> > > + if (feature_persistent) {
> > > + err = xenbus_printf(xbt, dev->nodename, "feature-persistent",
> > > + "%u", feature_persistent);
> > > + if (err) {
> > > + xenbus_dev_fatal(dev, err,
> > > + "writing %s/feature-persistent",
> > > + dev->nodename);
> > > + goto abort;
> > > + }
> > >   }
> > >  
> > >   err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
> > > @@ -1093,8 +1103,12 @@ static int connect_ring(struct backend_info *be)
> > >   xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
> > >   return -ENOSYS;
> > >   }
> > > - pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
> > > -0);
> > > + if (feature_persistent)
> > > + pers_grants = xenbus_read_unsigned(dev->otherend,
> > > +

Re: [PATCH v2 1/3] xen-blkback: add a parameter for disabling of persistent grants

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 12:52:07PM +0200, SeongJae Park wrote:
> From: SeongJae Park 
> 
> Persistent grants feature provides high scalability.  On some small
> systems, however, it could incur data copy overheads[1] and thus it is
> required to be disabled.  But, there is no option to disable it.  For
> the reason, this commit adds a module parameter for disabling of the
> feature.
> 
> [1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
> 
> Signed-off-by: Anthony Liguori 
> Signed-off-by: SeongJae Park 
> ---
>  .../ABI/testing/sysfs-driver-xen-blkback  |  9 ++
>  drivers/block/xen-blkback/xenbus.c| 28 ++-
>  2 files changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> index ecb7942ff146..ac2947b98950 100644
> --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> @@ -35,3 +35,12 @@ Description:
>  controls the duration in milliseconds that blkback will not
>  cache any page not backed by a grant mapping.
>  The default is 10ms.
> +
> +What:   /sys/module/xen_blkback/parameters/feature_persistent
> +Date:   September 2020
> +KernelVersion:  5.10
> +Contact:SeongJae Park 
> +Description:
> +Whether to enable the persistent grants feature or not.  Note
> +that this option only takes effect on newly created backends.
> +The default is Y (enable).
> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index b9aa5d1ac10b..8a95ddd08b13 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -879,6 +879,12 @@ static void reclaim_memory(struct xenbus_device *dev)
>  
>  /* ** Connection ** */
>  
> +/* Enable the persistent grants feature. */
> +static bool feature_persistent = true;
> +module_param(feature_persistent, bool, 0644);
> +MODULE_PARM_DESC(feature_persistent,
> + "Enables the persistent grants feature");
> +
>  /*
>   * Write the physical details regarding the block device to the store, and
>   * switch to Connected state.
> @@ -906,11 +912,15 @@ static void connect(struct backend_info *be)
>  
>   xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);
>  
> - err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
> - if (err) {
> - xenbus_dev_fatal(dev, err, "writing %s/feature-persistent",
> -  dev->nodename);
> - goto abort;
> + if (feature_persistent) {
> + err = xenbus_printf(xbt, dev->nodename, "feature-persistent",
> + "%u", feature_persistent);
> + if (err) {
> + xenbus_dev_fatal(dev, err,
> + "writing %s/feature-persistent",
> + dev->nodename);
> + goto abort;
> + }
>   }
>  
>   err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
> @@ -1093,8 +1103,12 @@ static int connect_ring(struct backend_info *be)
>   xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
>   return -ENOSYS;
>   }
> - pers_grants = xenbus_read_unsigned(dev->otherend, "feature-persistent",
> -0);
> + if (feature_persistent)
> + pers_grants = xenbus_read_unsigned(dev->otherend,
> + "feature-persistent", 0);
> + else
> + pers_grants = 0;
> +

Sorry for not realizing earlier, but looking at it again I think you
need to cache the value of feature_persistent when it's first used in
the blkback state data, so that it's consistent.

What would happen for example with the following flow (assume a
persistent grants enabled frontend):

feature_persistent = false

connect(...)
feature-persistent is not written to xenstore

User changes feature_persistent = true

connect_ring(...)
pers_grants = true, because feature-persistent is set unconditionally
by the frontend and feature_persistent variable is now true.

Then blkback will try to use persistent grants and the whole
connection will malfunction because the frontend won't.

The other option is to prevent changing the variable when there are
blkback instances already running.

Thanks, Roger.
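
A sketch of the caching idea; the new field name is made up, the point
being that connect_ring() must honour the value snapshotted into the
backend state rather than re-reading the live module parameter:

/* Snapshot the module parameter the first time it is consumed. */
be->blkif->vbd.feature_gnt_persistent_parm = feature_persistent;
if (be->blkif->vbd.feature_gnt_persistent_parm)
	err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);

/* connect_ring(): use the snapshot, not the live parameter. */
pers_grants = be->blkif->vbd.feature_gnt_persistent_parm &&
	      xenbus_read_unsigned(dev->otherend, "feature-persistent", 0);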


Re: [PATCH] xen-blkback: add a parameter for disabling of persistent grants

2020-09-22 Thread Roger Pau Monné
On Tue, Sep 22, 2020 at 09:01:25AM +0200, SeongJae Park wrote:
> From: SeongJae Park 
> 
> Persistent grants feature provides high scalability.  On some small
> systems, however, it could incur data copy overhead[1] and thus it is
> required to be disabled.  But, there is no option to disable it.  For
> the reason, this commit adds a module parameter for disabling of the
> feature.

Have you considered adding a similar option for blkfront?

> 
> [1] https://wiki.xen.org/wiki/Xen_4.3_Block_Protocol_Scalability
> 
> Signed-off-by: Anthony Liguori 
> Signed-off-by: SeongJae Park 
> ---
>  .../ABI/testing/sysfs-driver-xen-blkback|  8 
>  drivers/block/xen-blkback/xenbus.c  | 17 ++---
>  2 files changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback 
> b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> index ecb7942ff146..0c42285c75ee 100644
> --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> @@ -35,3 +35,11 @@ Description:
>  controls the duration in milliseconds that blkback will not
>  cache any page not backed by a grant mapping.
>  The default is 10ms.
> +
> +What:   /sys/module/xen_blkback/parameters/feature_persistent
> +Date:   September 2020
> +KernelVersion:  5.10
> +Contact:SeongJae Park 
> +Description:
> +Whether to enable the persistent grants feature or not.
> +The default is 1 (enable).

I would add that this option only takes effect on newly created
backends; backends that already exist when the option is changed will
continue to use persistent grants.

For already running backends you could drain the buffer of persistent
grants and flip the option, but that's more complex and not required.

> diff --git a/drivers/block/xen-blkback/xenbus.c 
> b/drivers/block/xen-blkback/xenbus.c
> index b9aa5d1ac10b..9c03d70469f4 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -879,6 +879,12 @@ static void reclaim_memory(struct xenbus_device *dev)
>  
>  /* ** Connection ** */
>  
> +/* Enable the persistent grants feature. */
> +static unsigned int feature_persistent = 1;
> +module_param_named(feature_persistent, feature_persistent, int, 0644);
> +MODULE_PARM_DESC(feature_persistent,
> + "Enables the persistent grants feature");
> +
>  /*
>   * Write the physical details regarding the block device to the store, and
>   * switch to Connected state.
> @@ -906,7 +912,8 @@ static void connect(struct backend_info *be)
>  
>   xen_blkbk_barrier(xbt, be, be->blkif->vbd.flush_support);
>  
> - err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u", 1);
> + err = xenbus_printf(xbt, dev->nodename, "feature-persistent", "%u",
> + feature_persistent ? 1 : 0);

You can avoid writing the feature altogether if it's not enabled,
there's no need to set feature-persistent = 0.

Thanks, Roger.


Re: linux-next: build failure after merge of the akpm-current tree

2020-09-09 Thread Roger Pau Monné
On Tue, Sep 08, 2020 at 08:09:50PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the akpm-current tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> drivers/xen/unpopulated-alloc.c: In function 'fill_list':
> drivers/xen/unpopulated-alloc.c:30:9: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>30 |  pgmap->res.name = "Xen scratch";
>   | ^~~
>   | ref
> drivers/xen/unpopulated-alloc.c:31:9: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>31 |  pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
>   | ^~~
>   | ref
> drivers/xen/unpopulated-alloc.c:33:51: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>33 |  ret = allocate_resource(_resource, >res,
>   |   ^~~
>   |   ref
> In file included from include/asm-generic/memory_model.h:5,
>  from arch/x86/include/asm/page.h:76,
>  from arch/x86/include/asm/thread_info.h:12,
>  from include/linux/thread_info.h:38,
>  from arch/x86/include/asm/preempt.h:7,
>  from include/linux/preempt.h:78,
>  from include/linux/spinlock.h:51,
>  from include/linux/mmzone.h:8,
>  from include/linux/gfp.h:6,
>  from drivers/xen/unpopulated-alloc.c:3:
> drivers/xen/unpopulated-alloc.c:53:35: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>53 |   xen_pfn_t pfn = PFN_DOWN(pgmap->res.start);
>   |   ^~~
> include/linux/pfn.h:20:23: note: in definition of macro 'PFN_DOWN'
>20 | #define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
>   |   ^
> drivers/xen/unpopulated-alloc.c:58:30: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>58 | release_resource(&pgmap->res);
>   |  ^~~
>   |  ref
> drivers/xen/unpopulated-alloc.c:69:28: error: 'struct dev_pagemap' has no 
> member named 'res'; did you mean 'ref'?
>69 |   release_resource(&pgmap->res);
>   |^~~
>   |ref
> fs/fuse/virtio_fs.c: In function 'virtio_fs_setup_dax':
> fs/fuse/virtio_fs.c:838:9: error: 'struct dev_pagemap' has no member named 
> 'res'; did you mean 'ref'?
>   838 |  pgmap->res = (struct resource){
>   | ^~~
>   | ref
> 
> Caused by commit
> 
>   b3e022c5a68c ("mm/memremap_pages: convert to 'struct range'")
> 
> interacting with commit
> 
>   9e2369c06c8a ("xen: add helpers to allocate unpopulated memory")
> 
> from Linus' tree (in v5.9-rc4) and commit
> 
>   7e833303db20 ("virtiofs: set up virtio_fs dax_device")
> 
> from the fuse tree.
> 
> I have added the following patch which may require more work but at
> least makes it all build.
> 
> From: Stephen Rothwell 
> Date: Tue, 8 Sep 2020 20:00:20 +1000
> Subject: [PATCH] merge fix up for "mm/memremap_pages: convert to 'struct
>  range'"
> 
> Signed-off-by: Stephen Rothwell 

Thanks, LGTM.

> ---
>  drivers/xen/unpopulated-alloc.c | 15 +--
>  fs/fuse/virtio_fs.c |  3 +--
>  2 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/xen/unpopulated-alloc.c b/drivers/xen/unpopulated-alloc.c
> index 3b98dc921426..9fa7ce330628 100644
> --- a/drivers/xen/unpopulated-alloc.c
> +++ b/drivers/xen/unpopulated-alloc.c
> @@ -18,6 +18,7 @@ static unsigned int list_count;
>  static int fill_list(unsigned int nr_pages)
>  {
>   struct dev_pagemap *pgmap;
> + struct resource res;
>   void *vaddr;
>   unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
>   int ret;
> @@ -27,10 +28,10 @@ static int fill_list(unsigned int nr_pages)
>   return -ENOMEM;
>  
>   pgmap->type = MEMORY_DEVICE_GENERIC;
> - pgmap->res.name = "Xen scratch";
> - pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> + res.name = "Xen scratch";
> + res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;

You could init the fields at res definition time now, since it's no
longer dynamically allocated.

Roger.
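
i.e. something along these lines (sketch only):

	struct resource res = {
		.name  = "Xen scratch",
		.flags = IORESOURCE_MEM | IORESOURCE_BUSY,
	};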


Re: [PATCH v5 3/3] xen: add helpers to allocate unpopulated memory

2020-09-04 Thread Roger Pau Monné
On Fri, Sep 04, 2020 at 09:00:18AM +0200, Jürgen Groß wrote:
> On 03.09.20 18:38, Roger Pau Monné wrote:
> > On Thu, Sep 03, 2020 at 05:30:07PM +0200, Jürgen Groß wrote:
> > > On 01.09.20 10:33, Roger Pau Monne wrote:
> > > > To be used in order to create foreign mappings. This is based on the
> > > > ZONE_DEVICE facility which is used by persistent memory devices in
> > > > order to create struct pages and kernel virtual mappings for the IOMEM
> > > > areas of such devices. Note that on kernels without support for
> > > > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > > > create foreign mappings.
> > > > 
> > > > The newly added helpers use the same parameters as the existing
> > > > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > > > replacement of the callers. Once a memory region has been added to be
> > > > used as scratch mapping space it will no longer be released, and pages
> > > > returned are kept in a linked list. This allows to have a buffer of
> > > > pages and prevents resorting to frequent additions and removals of
> > > > regions.
> > > > 
> > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > functionality untangles Xen balloon and RAM hotplug from the usage of
> > > > unpopulated physical memory ranges to map foreign pages, which is the
> > > > correct thing to do in order to avoid mappings of foreign pages depend
> > > > on memory hotplug.
> > > > 
> > > > Note the driver is currently not enabled on Arm platforms because it
> > > > would interfere with the identity mapping required on some platforms.
> > > > 
> > > > Signed-off-by: Roger Pau Monné 
> > > 
> > > Sorry, I just got a build error for x86 32-bit build:
> > > 
> > > WARNING: unmet direct dependencies detected for ZONE_DEVICE
> > >Depends on [n]: MEMORY_HOTPLUG [=n] && MEMORY_HOTREMOVE [=n] &&
> > > SPARSEMEM_VMEMMAP [=n] && ARCH_HAS_PTE_DEVMAP [=n]
> > >Selected by [y]:
> > >- XEN_UNPOPULATED_ALLOC [=y] && XEN [=y] && X86 [=y]
> > >GEN Makefile
> > >CC  kernel/bounds.s
> > >CALL/home/gross/korg/src/scripts/atomic/check-atomics.sh
> > >UPD include/generated/bounds.h
> > >CC  arch/x86/kernel/asm-offsets.s
> > > In file included from /home/gross/korg/src/include/linux/mmzone.h:19:0,
> > >   from /home/gross/korg/src/include/linux/gfp.h:6,
> > >   from /home/gross/korg/src/include/linux/slab.h:15,
> > >   from /home/gross/korg/src/include/linux/crypto.h:19,
> > >   from 
> > > /home/gross/korg/src/arch/x86/kernel/asm-offsets.c:9:
> > > /home/gross/korg/src/include/linux/page-flags-layout.h:95:2: error: #error
> > > "Not enough bits in page flags"
> > >   #error "Not enough bits in page flags"
> > >^
> > > make[2]: *** [/home/gross/korg/src/scripts/Makefile.build:114:
> > > arch/x86/kernel/asm-offsets.s] Error 1
> > > make[1]: *** [/home/gross/korg/src/Makefile:1175: prepare0] Error 2
> > > make[1]: Leaving directory '/home/gross/korg/x8632'
> > > make: *** [Makefile:185: __sub-make] Error 2
> > 
> > Sorry for this. I've tested a 32bit build but I think it was before
> > the last Kconfig changes. I'm a little unsure how to solve this, as
> > ZONE_DEVICE doesn't select the required options for it to run, but
> > rather depends on them to be available.
> > 
> > You can trigger something similar on x86-64 by doing:
> > 
> > $ make ARCH=x86_64 xen.config
> > Using .config as base
> > Merging ./kernel/configs/xen.config
> > Merging ./arch/x86/configs/xen.config
> > #
> > # merged configuration written to .config (needs make)
> > #
> > scripts/kconfig/conf  --olddefconfig Kconfig
> > 
> > WARNING: unmet direct dependencies detected for ZONE_DEVICE
> >Depends on [n]: MEMORY_HOTPLUG [=y] && MEMORY_HOTREMOVE [=n] && 
> > SPARSEMEM_VMEMMAP [=y] && ARCH_HAS_PTE_DEVMAP [=y]
> >Selected by [y]:
> >- XEN_UNPOPULATED_ALLOC [=y] && XEN [=y] && X86_64 [=y]
> > #
> > # configuration written to .config
> > #
> > 
> > I think the only solution is to have XEN_UNPOPULATED_ALLOC depend on
> > ZONE_DEVICE rather than select it?
> 
> Yes, I think so.
> 
> I've folded that in and now build is fine.

Thanks, I assume no further action is needed on my side.

Roger.


Re: [PATCH v5 3/3] xen: add helpers to allocate unpopulated memory

2020-09-03 Thread Roger Pau Monné
On Thu, Sep 03, 2020 at 05:30:07PM +0200, Jürgen Groß wrote:
> On 01.09.20 10:33, Roger Pau Monne wrote:
> > To be used in order to create foreign mappings. This is based on the
> > ZONE_DEVICE facility which is used by persistent memory devices in
> > order to create struct pages and kernel virtual mappings for the IOMEM
> > areas of such devices. Note that on kernels without support for
> > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > create foreign mappings.
> > 
> > The newly added helpers use the same parameters as the existing
> > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > replacement of the callers. Once a memory region has been added to be
> > used as scratch mapping space it will no longer be released, and pages
> > returned are kept in a linked list. This allows to have a buffer of
> > pages and prevents resorting to frequent additions and removals of
> > regions.
> > 
> > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > functionality untangles Xen balloon and RAM hotplug from the usage of
> > unpopulated physical memory ranges to map foreign pages, which is the
> > correct thing to do in order to avoid mappings of foreign pages depend
> > on memory hotplug.
> > 
> > Note the driver is currently not enabled on Arm platforms because it
> > would interfere with the identity mapping required on some platforms.
> > 
> > Signed-off-by: Roger Pau Monné 
> 
> Sorry, I just got a build error for x86 32-bit build:
> 
> WARNING: unmet direct dependencies detected for ZONE_DEVICE
>   Depends on [n]: MEMORY_HOTPLUG [=n] && MEMORY_HOTREMOVE [=n] &&
> SPARSEMEM_VMEMMAP [=n] && ARCH_HAS_PTE_DEVMAP [=n]
>   Selected by [y]:
>   - XEN_UNPOPULATED_ALLOC [=y] && XEN [=y] && X86 [=y]
>   GEN Makefile
>   CC  kernel/bounds.s
>   CALL/home/gross/korg/src/scripts/atomic/check-atomics.sh
>   UPD include/generated/bounds.h
>   CC  arch/x86/kernel/asm-offsets.s
> In file included from /home/gross/korg/src/include/linux/mmzone.h:19:0,
>  from /home/gross/korg/src/include/linux/gfp.h:6,
>  from /home/gross/korg/src/include/linux/slab.h:15,
>  from /home/gross/korg/src/include/linux/crypto.h:19,
>  from /home/gross/korg/src/arch/x86/kernel/asm-offsets.c:9:
> /home/gross/korg/src/include/linux/page-flags-layout.h:95:2: error: #error
> "Not enough bits in page flags"
>  #error "Not enough bits in page flags"
>   ^
> make[2]: *** [/home/gross/korg/src/scripts/Makefile.build:114:
> arch/x86/kernel/asm-offsets.s] Error 1
> make[1]: *** [/home/gross/korg/src/Makefile:1175: prepare0] Error 2
> make[1]: Leaving directory '/home/gross/korg/x8632'
> make: *** [Makefile:185: __sub-make] Error 2

Sorry for this. I've tested a 32bit build but I think it was before
the last Kconfig changes. I'm a little unsure how to solve this, as
ZONE_DEVICE doesn't select the required options for it to run, but
rather depends on them to be available.

You can trigger something similar on x86-64 by doing:

$ make ARCH=x86_64 xen.config
Using .config as base
Merging ./kernel/configs/xen.config
Merging ./arch/x86/configs/xen.config
#
# merged configuration written to .config (needs make)
#
scripts/kconfig/conf  --olddefconfig Kconfig

WARNING: unmet direct dependencies detected for ZONE_DEVICE
  Depends on [n]: MEMORY_HOTPLUG [=y] && MEMORY_HOTREMOVE [=n] && 
SPARSEMEM_VMEMMAP [=y] && ARCH_HAS_PTE_DEVMAP [=y]
  Selected by [y]:
  - XEN_UNPOPULATED_ALLOC [=y] && XEN [=y] && X86_64 [=y]
#
# configuration written to .config
#

I think the only solution is to have XEN_UNPOPULATED_ALLOC depend on
ZONE_DEVICE rather than select it?

Thanks, Roger.
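
i.e. turning the select into a dependency, roughly (the exact Kconfig
wording is an assumption; the default line mirrors how v5 decided when
to enable the driver):

config XEN_UNPOPULATED_ALLOC
	bool "Use unpopulated memory ranges for guest mappings"
	depends on X86 && ZONE_DEVICE
	default XEN_BACKEND || XEN_GNTDEV || XEN_DOM0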


Re: [PATCH v5 3/3] xen: add helpers to allocate unpopulated memory

2020-09-01 Thread Roger Pau Monné
On Tue, Sep 01, 2020 at 10:33:26AM +0200, Roger Pau Monne wrote:
> +static int fill_list(unsigned int nr_pages)
> +{
> + struct dev_pagemap *pgmap;
> + void *vaddr;
> + unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
> + int nid, ret;
> +
> + pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
> + if (!pgmap)
> + return -ENOMEM;
> +
> + pgmap->type = MEMORY_DEVICE_GENERIC;
> + pgmap->res.name = "Xen scratch";
> + pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> +
> + ret = allocate_resource(&iomem_resource, &pgmap->res,
> + alloc_pages * PAGE_SIZE, 0, -1,
> + PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL);
> + if (ret < 0) {
> + pr_err("Cannot allocate new IOMEM resource\n");
> + kfree(pgmap);
> + return ret;
> + }
> +
> + nid = memory_add_physaddr_to_nid(pgmap->res.start);

I think this is not needed ...

> +
> +#ifdef CONFIG_XEN_HAVE_PVMMU
> +/*
> + * memremap will build page tables for the new memory so
> + * the p2m must contain invalid entries so the correct
> + * non-present PTEs will be written.
> + *
> + * If a failure occurs, the original (identity) p2m entries
> + * are not restored since this region is now known not to
> + * conflict with any devices.
> + */
> + if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> + xen_pfn_t pfn = PFN_DOWN(pgmap->res.start);
> +
> + for (i = 0; i < alloc_pages; i++) {
> + if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) {
> + pr_warn("set_phys_to_machine() failed, no 
> memory added\n");
> + release_resource(>res);
> + kfree(pgmap);
> + return -ENOMEM;
> + }
> +}
> + }
> +#endif
> +
> + vaddr = memremap_pages(pgmap, nid);

... and NUMA_NO_NODE should be used here instead, as this memory is just
fictitious space to map foreign memory, and shouldn't be related to
any NUMA node.

The following chunk should be folded in, or I can resend.

Thanks, Roger.
---8<---
diff --git a/drivers/xen/unpopulated-alloc.c b/drivers/xen/unpopulated-alloc.c
index 1b5d157c6977..3b98dc921426 100644
--- a/drivers/xen/unpopulated-alloc.c
+++ b/drivers/xen/unpopulated-alloc.c
@@ -20,7 +20,7 @@ static int fill_list(unsigned int nr_pages)
struct dev_pagemap *pgmap;
void *vaddr;
unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
-   int nid, ret;
+   int ret;
 
pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
if (!pgmap)
@@ -39,8 +39,6 @@ static int fill_list(unsigned int nr_pages)
return ret;
}
 
-   nid = memory_add_physaddr_to_nid(pgmap->res.start);
-
 #ifdef CONFIG_XEN_HAVE_PVMMU
 /*
  * memremap will build page tables for the new memory so
@@ -65,7 +63,7 @@ static int fill_list(unsigned int nr_pages)
}
 #endif
 
-   vaddr = memremap_pages(pgmap, nid);
+   vaddr = memremap_pages(pgmap, NUMA_NO_NODE);
if (IS_ERR(vaddr)) {
pr_err("Cannot remap memory range\n");
release_resource(&pgmap->res);



[PATCH v5 2/3] memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC

2020-09-01 Thread Roger Pau Monne
This is in preparation for the logic behind MEMORY_DEVICE_DEVDAX also
being used by non DAX devices.

No functional change intended.

Signed-off-by: Roger Pau Monné 
Reviewed-by: Ira Weiny 
Acked-by: Andrew Morton 
---
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: Ira Weiny 
Cc: "Aneesh Kumar K.V" 
Cc: Johannes Thumshirn 
Cc: Logan Gunthorpe 
Cc: Juergen Gross 
Cc: linux-nvd...@lists.01.org
Cc: xen-de...@lists.xenproject.org
Cc: linux...@kvack.org
---
 drivers/dax/device.c | 2 +-
 include/linux/memremap.h | 9 -
 mm/memremap.c| 2 +-
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 4c0af2eb7e19..1e89513f3c59 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -429,7 +429,7 @@ int dev_dax_probe(struct device *dev)
return -EBUSY;
}
 
-   dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
+   dev_dax->pgmap.type = MEMORY_DEVICE_GENERIC;
addr = devm_memremap_pages(dev, &dev_dax->pgmap);
if (IS_ERR(addr))
return PTR_ERR(addr);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 5f5b2df06e61..e5862746751b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -46,11 +46,10 @@ struct vmem_altmap {
  * wakeup is used to coordinate physical address space management (ex:
  * fs truncate/hole punch) vs pinned pages (ex: device dma).
  *
- * MEMORY_DEVICE_DEVDAX:
+ * MEMORY_DEVICE_GENERIC:
  * Host memory that has similar access semantics as System RAM i.e. DMA
- * coherent and supports page pinning. In contrast to
- * MEMORY_DEVICE_FS_DAX, this memory is access via a device-dax
- * character device.
+ * coherent and supports page pinning. This is for example used by DAX devices
+ * that expose memory using a character device.
  *
  * MEMORY_DEVICE_PCI_P2PDMA:
  * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
@@ -60,7 +59,7 @@ enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
MEMORY_DEVICE_FS_DAX,
-   MEMORY_DEVICE_DEVDAX,
+   MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
 };
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 03e38b7a38f1..006dace60b1a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -216,7 +216,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
return ERR_PTR(-EINVAL);
}
break;
-   case MEMORY_DEVICE_DEVDAX:
+   case MEMORY_DEVICE_GENERIC:
need_devmap_managed = false;
break;
case MEMORY_DEVICE_PCI_P2PDMA:
-- 
2.28.0



[PATCH v5 3/3] xen: add helpers to allocate unpopulated memory

2020-09-01 Thread Roger Pau Monne
To be used in order to create foreign mappings. This is based on the
ZONE_DEVICE facility which is used by persistent memory devices in
order to create struct pages and kernel virtual mappings for the IOMEM
areas of such devices. Note that on kernels without support for
ZONE_DEVICE Xen will fallback to use ballooned pages in order to
create foreign mappings.

The newly added helpers use the same parameters as the existing
{alloc/free}_xenballooned_pages functions, which allows for in-place
replacement of the callers. Once a memory region has been added to be
used as scratch mapping space it will no longer be released, and pages
returned are kept in a linked list. This allows to have a buffer of
pages and prevents resorting to frequent additions and removals of
regions.

If enabled (because ZONE_DEVICE is supported) the usage of the new
functionality untangles Xen balloon and RAM hotplug from the usage of
unpopulated physical memory ranges to map foreign pages, which is the
correct thing to do in order to avoid having mappings of foreign pages
depend on memory hotplug.

Note the driver is currently not enabled on Arm platforms because it
would interfere with the identity mapping required on some platforms.
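
For illustration only (not part of the diff below), a caller currently
using ballooned pages would switch to the new helpers along these lines;
the array size is arbitrary:

    struct page *pages[8];
    int ret;

    ret = xen_alloc_unpopulated_pages(ARRAY_SIZE(pages), pages);
    if (ret < 0)
            return ret;

    /* ... create foreign/grant mappings using the returned pages ... */

    xen_free_unpopulated_pages(ARRAY_SIZE(pages), pages);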

Signed-off-by: Roger Pau Monné 
---
Cc: Oleksandr Andrushchenko 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Dan Carpenter 
Cc: Roger Pau Monne 
Cc: Wei Liu 
Cc: Yan Yankovskyi 
Cc: dri-de...@lists.freedesktop.org
Cc: xen-de...@lists.xenproject.org
Cc: linux...@kvack.org
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Dan Williams 
---
Changes since v4:
 - Introduce a description for the option.
 - Force selection of ZONE_DEVICE on X86 and select
   XEN_UNPOPULATED_ALLOC if running on dom0 mode or having any
   backends.

Changes since v3:
 - Introduce a Kconfig option that gates the addition of the
   unpopulated alloc driver. This allows to easily disable it on Arm
   platforms.
 - Dropped Juergen RB due to the addition of the Kconfig option.
 - Switched from MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC.

Changes since v2:
 - Drop BUILD_BUG_ON regarding PVMMU page sizes.
 - Use a SPDX license identifier.
 - Call fill with only the minimum required number of pages.
 - Include xen.h header in xen_drm_front_gem.c.
 - Use less generic function names.
 - Exit early from the init function if not a PV guest.
 - Don't use all caps for region name.
---
 drivers/gpu/drm/xen/xen_drm_front_gem.c |   9 +-
 drivers/xen/Kconfig |  11 ++
 drivers/xen/Makefile|   1 +
 drivers/xen/balloon.c   |   4 +-
 drivers/xen/grant-table.c   |   4 +-
 drivers/xen/privcmd.c   |   4 +-
 drivers/xen/unpopulated-alloc.c | 185 
 drivers/xen/xenbus/xenbus_client.c  |   6 +-
 drivers/xen/xlate_mmu.c |   4 +-
 include/xen/xen.h   |   9 ++
 10 files changed, 222 insertions(+), 15 deletions(-)
 create mode 100644 drivers/xen/unpopulated-alloc.c

diff --git a/drivers/gpu/drm/xen/xen_drm_front_gem.c 
b/drivers/gpu/drm/xen/xen_drm_front_gem.c
index 39ff95b75357..534daf37c97e 100644
--- a/drivers/gpu/drm/xen/xen_drm_front_gem.c
+++ b/drivers/gpu/drm/xen/xen_drm_front_gem.c
@@ -18,6 +18,7 @@
 #include 
 
 #include 
+#include 
 
 #include "xen_drm_front.h"
 #include "xen_drm_front_gem.h"
@@ -99,8 +100,8 @@ static struct xen_gem_object *gem_create(struct drm_device 
*dev, size_t size)
 * allocate ballooned pages which will be used to map
 * grant references provided by the backend
 */
-   ret = alloc_xenballooned_pages(xen_obj->num_pages,
-  xen_obj->pages);
+   ret = xen_alloc_unpopulated_pages(xen_obj->num_pages,
+ xen_obj->pages);
if (ret < 0) {
DRM_ERROR("Cannot allocate %zu ballooned pages: %d\n",
  xen_obj->num_pages, ret);
@@ -152,8 +153,8 @@ void xen_drm_front_gem_free_object_unlocked(struct 
drm_gem_object *gem_obj)
} else {
if (xen_obj->pages) {
if (xen_obj->be_alloc) {
-   free_xenballooned_pages(xen_obj->num_pages,
-   xen_obj->pages);
+   xen_free_unpopulated_pages(xen_obj->num_pages,
+  xen_obj->pages);
gem_free_pages_array(xen_obj);
} else {
drm_gem_put_pages(&xen_obj->base,
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index ea6c1e7e3e42..e38c33558d0d 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -325,4 +325,15 @

[PATCH v5 0/3] xen/balloon: fixes for memory hotplug

2020-09-01 Thread Roger Pau Monne
Hello,

The following series contains some fixes in order to split Xen
unpopulated memory handling from the ballooning driver using the
ZONE_DEVICE functionality, so that physical memory regions used to map
foreign pages are not tied to memory hotplug.

Note this is currently only available for x86 due to Arm using an
identity map for dom0 p2m and thus needing special handling.

Thanks, Roger.

Roger Pau Monne (3):
  xen/balloon: add header guard
  memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC
  xen: add helpers to allocate unpopulated memory

 drivers/dax/device.c|   2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c |   9 +-
 drivers/xen/Kconfig |  11 ++
 drivers/xen/Makefile|   1 +
 drivers/xen/balloon.c   |   4 +-
 drivers/xen/grant-table.c   |   4 +-
 drivers/xen/privcmd.c   |   4 +-
 drivers/xen/unpopulated-alloc.c | 185 
 drivers/xen/xenbus/xenbus_client.c  |   6 +-
 drivers/xen/xlate_mmu.c |   4 +-
 include/linux/memremap.h|   9 +-
 include/xen/balloon.h   |   4 +
 include/xen/xen.h   |   9 ++
 mm/memremap.c   |   2 +-
 14 files changed, 232 insertions(+), 22 deletions(-)
 create mode 100644 drivers/xen/unpopulated-alloc.c

-- 
2.28.0



[PATCH v5 1/3] xen/balloon: add header guard

2020-09-01 Thread Roger Pau Monne
In order to protect against the header being included multiple times
in the same compilation unit.

Signed-off-by: Roger Pau Monné 
Reviewed-by: Boris Ostrovsky 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
This is required as a pre-patch to use ZONE_DEVICE, or else the
fallback of including the balloon header might not work properly.
---
 include/xen/balloon.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/xen/balloon.h b/include/xen/balloon.h
index 6fb95aa19405..6dbdb0b3fd03 100644
--- a/include/xen/balloon.h
+++ b/include/xen/balloon.h
@@ -2,6 +2,8 @@
 /**
  * Xen balloon functionality
  */
+#ifndef _XEN_BALLOON_H
+#define _XEN_BALLOON_H
 
#define RETRY_UNLIMITED 0
 
@@ -34,3 +36,5 @@ static inline void xen_balloon_init(void)
 {
 }
 #endif
+
+#endif /* _XEN_BALLOON_H */
-- 
2.28.0



Re: [PATCH v4 1/2] memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC

2020-08-31 Thread Roger Pau Monné
On Thu, Aug 20, 2020 at 01:37:41PM +0200, Roger Pau Monné wrote:
> On Tue, Aug 11, 2020 at 11:07:36PM +0200, David Hildenbrand wrote:
> > On 11.08.20 11:44, Roger Pau Monne wrote:
> > > This is in preparation for the logic behind MEMORY_DEVICE_DEVDAX also
> > > being used by non DAX devices.
> > > 
> > > No functional change intended.
> > > 
> > > Signed-off-by: Roger Pau Monné 
> > > ---
> > > Cc: Dan Williams 
> > > Cc: Vishal Verma 
> > > Cc: Dave Jiang 
> > > Cc: Andrew Morton 
> > > Cc: Jason Gunthorpe 
> > > Cc: Ira Weiny 
> > > Cc: "Aneesh Kumar K.V" 
> > > Cc: Johannes Thumshirn 
> > > Cc: Logan Gunthorpe 
> > > Cc: linux-nvd...@lists.01.org
> > > Cc: xen-de...@lists.xenproject.org
> > > Cc: linux...@kvack.org
> > > ---
> > >  drivers/dax/device.c | 2 +-
> > >  include/linux/memremap.h | 9 -
> > >  mm/memremap.c| 2 +-
> > >  3 files changed, 6 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > > index 4c0af2eb7e19..1e89513f3c59 100644
> > > --- a/drivers/dax/device.c
> > > +++ b/drivers/dax/device.c
> > > @@ -429,7 +429,7 @@ int dev_dax_probe(struct device *dev)
> > >   return -EBUSY;
> > >   }
> > >  
> > > - dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
> > > + dev_dax->pgmap.type = MEMORY_DEVICE_GENERIC;
> > >   addr = devm_memremap_pages(dev, &dev_dax->pgmap);
> > >   if (IS_ERR(addr))
> > >   return PTR_ERR(addr);
> > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > index 5f5b2df06e61..e5862746751b 100644
> > > --- a/include/linux/memremap.h
> > > +++ b/include/linux/memremap.h
> > > @@ -46,11 +46,10 @@ struct vmem_altmap {
> > >   * wakeup is used to coordinate physical address space management (ex:
> > >   * fs truncate/hole punch) vs pinned pages (ex: device dma).
> > >   *
> > > - * MEMORY_DEVICE_DEVDAX:
> > > + * MEMORY_DEVICE_GENERIC:
> > >   * Host memory that has similar access semantics as System RAM i.e. DMA
> > > - * coherent and supports page pinning. In contrast to
> > > - * MEMORY_DEVICE_FS_DAX, this memory is access via a device-dax
> > > - * character device.
> > > + * coherent and supports page pinning. This is for example used by DAX 
> > > devices
> > > + * that expose memory using a character device.
> > >   *
> > >   * MEMORY_DEVICE_PCI_P2PDMA:
> > >   * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
> > > @@ -60,7 +59,7 @@ enum memory_type {
> > >   /* 0 is reserved to catch uninitialized type fields */
> > >   MEMORY_DEVICE_PRIVATE = 1,
> > >   MEMORY_DEVICE_FS_DAX,
> > > - MEMORY_DEVICE_DEVDAX,
> > > + MEMORY_DEVICE_GENERIC,
> > >   MEMORY_DEVICE_PCI_P2PDMA,
> > >  };
> > >  
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 03e38b7a38f1..006dace60b1a 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -216,7 +216,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int 
> > > nid)
> > >   return ERR_PTR(-EINVAL);
> > >   }
> > >   break;
> > > - case MEMORY_DEVICE_DEVDAX:
> > > + case MEMORY_DEVICE_GENERIC:
> > >   need_devmap_managed = false;
> > >   break;
> > >   case MEMORY_DEVICE_PCI_P2PDMA:
> > > 
> > 
> > No strong opinion (@Dan?), I do wonder if a separate type would make sense.
> 
> Gentle ping.

Sorry to ping again, but I would rather get this out of my queue if
possible, seeing as the other patch is OK to go in but depends on this
one going in first.

Thanks, Roger.


Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-20 Thread Roger Pau Monné
On Wed, Aug 19, 2020 at 01:33:39PM +0200, Norbert Kaminski wrote:
> 
> On 19.08.2020 10:19, Roger Pau Monné wrote:
> > On Tue, Aug 18, 2020 at 08:40:18PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, Aug 18, 2020 at 07:21:14PM +0200, Roger Pau Monné wrote:
> > > > > Let me draw the picture from the beginning.
> > > > Thanks, greatly appreciated.
> > > > 
> > > > > EFI memory map contains various memory regions. Some of them are 
> > > > > marked
> > > > > as not needed after ExitBootServices() call (done in Xen before
> > > > > launching dom0). This includes EFI_BOOT_SERVICES_DATA and
> > > > > EFI_BOOT_SERVICES_CODE.
> > > > > 
> > > > > EFI SystemTable contains pointers to various ConfigurationTables -
> > > > > physical addresses (at least in this case). Xen does interpret some of
> > > > > them, but not ESRT. Xen pass the whole (address of) SystemTable to 
> > > > > Linux
> > > > > dom0 (at least in PV case). Xen doesn't do anything about tables it
> > > > > doesn't understand.
> > > > > 
> > > > > Now, the code in Linux takes the (ESRT) table address early and checks
> > > > > the memory map for it. We have 3 cases:
> > > > >   - it points at area marked as neither EFI_*_SERVICES_DATA, nor with
> > > > > EFI_MEMORY_RUNTIME attribute -> Linux refuse to use it
> > > > >   - it points to EFI_RUNTIME_SERVICES_DATA or with EFI_MEMORY_RUNTIME
> > > > > attribute - Linux uses the table; memory map already says the area
> > > > > belongs to EFI and the OS should not use it for something else
> > > > >   - it points to EFI_BOOT_SERVICES_DATA - Linux mark the area as 
> > > > > reserved
> > > > > to not release it after calling ExitBootServices()
> > > > > 
> > > > > The problematic is the third case - at the time when Linux dom0 is 
> > > > > run,
> > > > > ExitBootServices() was already called and EFI_BOOT_SERVICES_* memory 
> > > > > was
> > > > > already released. It could be already used for something else (for
> > > > > example Xen could overwrite it while loading dom0).
> > > > > 
> > > > > Note the problematic case should be the most common - UEFI 
> > > > > specification
> > > > > says "The ESRT shall be stored in memory of type EfiBootServicesData"
> > > > > (chapter 22.3 of UEFI Spec v2.6).
> > > > > 
> > > > > For this reason, to use ESRT in dom0, Xen should do something about it
> > > > > before ExitBootServices() call. While analyzing all the EFI tables is
> > > > > probably not a viable option, it can do some simple action:
> > > > >   - retains all the EFI_BOOT_SERVICES_* areas - there is already code
> > > > > for that, controlled with /mapbs boot switch (to xen.efi, would 
> > > > > need
> > > > > another option for multiboot2+efi)
> > > > >   - have a list of tables to retain - since Xen already do analyze 
> > > > > some
> > > > > of the ConfigurationTables, it can also have a list of those to
> > > > > preserve even if they live in EFI_BOOT_SERVICES_DATA. In this 
> > > > > case,
> > > > > while Xen doesn't need to parse the whole table, it need to parse 
> > > > > it's
> > > > > header to get the table size - to reserve that memory and not 
> > > > > reuse
> > > > > it after ExitBootServices().
> > > > Xen seems to already contain skeleton
> > > > XEN_EFI_query_capsule_capabilities and XEN_EFI_update_capsule
> > > > hypercalls which is what should be used in order to perform the
> > > > updates?
> > > I think those covers only runtime service calls similarly named. But you
> > > need also ESRT table to collect info about devices that you can even
> > > attempt to update.
> > Right, the ESRT must be available so that dom0 can discover the
> > resources.
> > 
> > > TBH, I'm not sure if those runtime services are really needed. I think
> > > Norbert succeeded UEFI update from within Xen PV dom0 with just access
> > > to the ESRT table, but without those services.
> > > 
> Marek is right here. I was able to successfully update and downgrade
> UEFI when 

Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-20 Thread Roger Pau Monné
On Thu, Aug 20, 2020 at 11:34:54AM +0200, Marek Marczykowski-Górecki wrote:
> On Thu, Aug 20, 2020 at 11:30:25AM +0200, Roger Pau Monné wrote:
> > Right, so you only need access to the ESRT table, that's all. Then I
> > think we need to make sure Xen doesn't use this memory for anything
> > else, which will require some changes in Xen (or at least some
> > checks?).
> > 
> > We also need to decide what to do if the table turns out to be placed
> > in a wrong region. How are we going to prevent dom0 from using it
> > then? My preference would be to completely hide it from dom0 in that
> > case, such that it believes there's no ESRT at all if possible.
> 
> Yes, that makes sense. As discussed earlier, that probably means
> re-constructing SystemTable before giving it to dom0. We'd need to do
> that in PVH case anyway, to adjust addresses, right?

Not really. On PVH dom0 we should be able to identity map the required
EFI regions in the dom0 p2m, so the only difference from a classic
PV dom0 is that we need to ensure those regions are correctly
identity mapped in the p2m, but that shouldn't require any change to
the SystemTable unless we need to craft custom tables (see below).

> Is there something
> like this in the Xen codebase already, or it needs to be written from
> scratch?

AFAICT it needs to be written for EFI. For the purposes here I think
you could copy the SystemTable and modify the NumberOfTableEntries and
ConfigurationTable fields in the copy in order to delete the ESRT if
it's found to be placed in an unsuitable region?
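
Something like the rough sketch below (Xen side; the helpers and the
placement check are made up for illustration, this is not existing code):

    EFI_SYSTEM_TABLE *st = copy_system_table();            /* hypothetical */
    EFI_CONFIGURATION_TABLE *cfg = copy_config_tables();   /* hypothetical */
    unsigned int i, n = 0;

    for ( i = 0; i < st->NumberOfTableEntries; i++ )
    {
        /* Drop the ESRT entry if its placement is not suitable for dom0. */
        if ( !memcmp(&cfg[i].VendorGuid, &esrt_guid, sizeof(esrt_guid)) &&
             !esrt_placement_ok )
            continue;
        cfg[n++] = cfg[i];
    }
    st->ConfigurationTable = cfg;
    st->NumberOfTableEntries = n;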

At that point we can remove the checks from Linux since Xen will
assert that whatever gets passed to dom0 is in a suitable region. It
would be nice to have a way to signal that the placement of the ESRT
has been checked, but I'm not sure how to do this; do you have any
ideas?

Roger.


Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-19 Thread Roger Pau Monné
On Tue, Aug 18, 2020 at 08:40:18PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Aug 18, 2020 at 07:21:14PM +0200, Roger Pau Monné wrote:
> > > Let me draw the picture from the beginning.
> > 
> > Thanks, greatly appreciated.
> > 
> > > EFI memory map contains various memory regions. Some of them are marked
> > > as not needed after ExitBootServices() call (done in Xen before
> > > launching dom0). This includes EFI_BOOT_SERVICES_DATA and
> > > EFI_BOOT_SERVICES_CODE.
> > > 
> > > EFI SystemTable contains pointers to various ConfigurationTables -
> > > physical addresses (at least in this case). Xen does interpret some of
> > > them, but not ESRT. Xen pass the whole (address of) SystemTable to Linux
> > > dom0 (at least in PV case). Xen doesn't do anything about tables it
> > > doesn't understand.
> > > 
> > > Now, the code in Linux takes the (ESRT) table address early and checks
> > > the memory map for it. We have 3 cases:
> > >  - it points at area marked as neither EFI_*_SERVICES_DATA, nor with
> > >EFI_MEMORY_RUNTIME attribute -> Linux refuse to use it
> > >  - it points to EFI_RUNTIME_SERVICES_DATA or with EFI_MEMORY_RUNTIME
> > >attribute - Linux uses the table; memory map already says the area
> > >belongs to EFI and the OS should not use it for something else
> > >  - it points to EFI_BOOT_SERVICES_DATA - Linux mark the area as reserved
> > >to not release it after calling ExitBootServices()
> > > 
> > > The problematic is the third case - at the time when Linux dom0 is run,
> > > ExitBootServices() was already called and EFI_BOOT_SERVICES_* memory was
> > > already released. It could be already used for something else (for
> > > example Xen could overwrite it while loading dom0).
> > > 
> > > Note the problematic case should be the most common - UEFI specification
> > > says "The ESRT shall be stored in memory of type EfiBootServicesData"
> > > (chapter 22.3 of UEFI Spec v2.6).
> > > 
> > > For this reason, to use ESRT in dom0, Xen should do something about it
> > > before ExitBootServices() call. While analyzing all the EFI tables is
> > > probably not a viable option, it can do some simple action:
> > >  - retains all the EFI_BOOT_SERVICES_* areas - there is already code
> > >for that, controlled with /mapbs boot switch (to xen.efi, would need
> > >another option for multiboot2+efi)
> > >  - have a list of tables to retain - since Xen already do analyze some
> > >of the ConfigurationTables, it can also have a list of those to
> > >preserve even if they live in EFI_BOOT_SERVICES_DATA. In this case,
> > >while Xen doesn't need to parse the whole table, it need to parse it's
> > >header to get the table size - to reserve that memory and not reuse
> > >it after ExitBootServices().
> > 
> > Xen seems to already contain skeleton
> > XEN_EFI_query_capsule_capabilities and XEN_EFI_update_capsule
> > hypercalls which is what should be used in order to perform the
> > updates?
> 
> I think those covers only runtime service calls similarly named. But you
> need also ESRT table to collect info about devices that you can even
> attempt to update.

Right, the ESRT must be available so that dom0 can discover the
resources.

> TBH, I'm not sure if those runtime services are really needed. I think
> Norbert succeeded UEFI update from within Xen PV dom0 with just access
> to the ESRT table, but without those services.

OK, by reading the UEFI spec I assumed that you needed access to
QueryCapsuleCapabilities and UpdateCapsule in order to perform the
updates, and those should be proxied using hypercalls. Maybe this is
not mandatory and there's a side-band mechanism for doing this?

I think we need more info here.

> > So yes, I agree Xen should make sure the region of the table is not
> > freed when exiting boot services, and that dom0 can access it. I
> > guess we should move the checks done by Linux to Xen, and then only
> > provide the ESRT table to dom0 if the checks (now done by Xen) pass.
> 
> Yes, something like this. But note currently in the (PV) dom0 case, Xen
> provides dom0 with a pointer to the whole SystemTable, not individual
> ConfigurationTables. Making it filter what dom0 gets would require Xen
> to re-construct the whole thing with just those elements that are
> desired. Not exactly sure if worth the effort given the privilege dom0
> has.

We already do this for ACPI in PVH dom0, where Xen rebuilds the RSDT
in order to filter out 

Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-18 Thread Roger Pau Monné
On Tue, Aug 18, 2020 at 05:00:20PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, Aug 18, 2020 at 02:47:10PM +0200, Roger Pau Monné wrote:
> > On Tue, Aug 18, 2020 at 02:01:35PM +0200, Marek Marczykowski-Górecki wrote:
> > > Do you mean PV dom0 should receive full EFI memory map? Jan already
> > > objected this as it would be a layering violation.
> > 
> > dom0 is already capable of getting the native e820 memory map using
> > XENMEM_machine_memory_map, I'm not sure why allowing to return the
> > memory map in EFI form would be any different (or a layering
> > violation in the PV dom0 case).
> >
> > Do you have a reference to that thread? I certainly missed it.
> 
> See this thread: http://markmail.org/message/nrrvuau5whebksy2
> 
> > For PVH dom0 we could consider something different, since in that case
> > there's a guest memory map which could likely be returned in EFI
> > format and with the EFI regions if required.
> > 
> > > > > Skip this part on Xen PV (let Xen do the right thing if it deems
> > > > > necessary) and use ESRT table normally.
> > > > 
> > > > Maybe it would be better to introduce a new hypercall (or add a
> > > > parameter to XENMEM_machine_memory_map) in order to be able to fetch
> > > > the EFI memory map?
> > > >
> > > > That should allow a PV dom0 to check the ESRT is correct and thus not
> > > > diverge from bate metal.
> > > 
> > > Note the EFI memory map there is used not just to check things, but to
> > > actually modify it to reserve the region. I think that's rather Xen
> > > responsibility, not dom0. See the comment from Ard.
> > 
> > But that would modify Linux copy of the memory map, which is fine? My
> > understanding of EFI is limited, but I don't think such changes are
> > feed back into EFI, so Linux is completely free to do whatever it
> > wants with it's copy of the EFI memory map.
> 
> Yes, but the thing is to make sure Xen doesn't use that memory, not only
> dom0. See below.
> 
> > > > > + if (efi_enabled(EFI_MEMMAP)) {
> > > > > + rc = efi_mem_desc_lookup(efi.esrt, &md);
> > > > > + if (rc < 0 ||
> > > > > + (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> > > > > +  md.type != EFI_BOOT_SERVICES_DATA &&
> > > > > +  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> > > > > + pr_warn("ESRT header is not in the memory 
> > > > > map.\n");
> > > > > + return;
> > > > > + }
> > > > 
> > > > Here you blindly trust the data in the ESRT in the PV case, without
> > > > checking it matches the regions on the memory map, which could lead to
> > > > errors if ESRT turns to be wrong.
> > > 
> > > I don't think checking merely if ESRT lives somewhere in
> > > EFI_{BOOT,RUNTIME}_SERVICES_DATA area guarantees its correctness.
> > > 
> > > On the other hand, this seems to be done to prevent overwriting that
> > > memory with something else (see that in case of EFI_BOOT_SERVICES_DATA
> > > it is later marked as reserved. I think it should be rather done by Xen,
> > > not dom0.
> > 
> > Forcing Xen to do all those checks seems quite a tedious work, and in
> > fact the memory map that dom0 has might be more complete than the one
> > Xen is able to construct, as Xen doesn't have an AML parser so it's
> > not able to get all the possible info from ACPI.
> 
> Let me draw the picture from the beginning.

Thanks, greatly appreciated.

> EFI memory map contains various memory regions. Some of them are marked
> as not needed after ExitBootServices() call (done in Xen before
> launching dom0). This includes EFI_BOOT_SERVICES_DATA and
> EFI_BOOT_SERVICES_CODE.
> 
> EFI SystemTable contains pointers to various ConfigurationTables -
> physical addresses (at least in this case). Xen does interpret some of
> them, but not ESRT. Xen pass the whole (address of) SystemTable to Linux
> dom0 (at least in PV case). Xen doesn't do anything about tables it
> doesn't understand.
> 
> Now, the code in Linux takes the (ESRT) table address early and checks
> the memory map for it. We have 3 cases:
>  - it points at area marked as neither EFI_*_SERVICES_DATA, nor with
>EFI_MEMORY_RUNTIME attribute -> Linux refuse to use it
>  - it points to EFI_RUNTIME_SERVICES_DATA or with EFI_MEMORY_RUNTIME
>attribute - Li

Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-18 Thread Roger Pau Monné
On Tue, Aug 18, 2020 at 02:01:35PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Aug 17, 2020 at 11:00:13AM +0200, Roger Pau Monné wrote:
> > On Sun, Aug 16, 2020 at 02:19:49AM +0200, Marek Marczykowski-Górecki wrote:
> > > In case of Xen PV dom0, Xen passes along info about system tables (see
> > > arch/x86/xen/efi.c), but not the memory map from EFI.
> > 
> > I think that's because the memory map returned by
> > XENMEM_machine_memory_map is in e820 form, and doesn't contain the
> > required information about the EFI regions due to the translation done
> > by efi_arch_process_memory_map in Xen?
> 
> Yes, I think so.
> 
> > > This makes sense
> > > as it is Xen responsible for managing physical memory address space.
> > > In this case, it doesn't make sense to condition using ESRT table on
> > > availability of EFI memory map, as it isn't Linux kernel responsible for
> > > it.
> > 
> > PV dom0 is kind of special in that regard as it can create mappings to
> > (almost) any MMIO regions, and hence can change it's memory map
> > substantially.
> 
> Do you mean PV dom0 should receive full EFI memory map? Jan already
> objected this as it would be a layering violation.

dom0 is already capable of getting the native e820 memory map using
XENMEM_machine_memory_map; I'm not sure why returning the
memory map in EFI form would be any different (or a layering
violation in the PV dom0 case).
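
For context, the e820 fetch from a PV dom0 looks roughly like the sketch
below (loosely based on the existing Linux code, trimmed for brevity):

    struct xen_memory_map memmap;
    int rc;

    memmap.nr_entries = ARRAY_SIZE(xen_e820_table.entries);
    set_xen_guest_handle(memmap.buffer, xen_e820_table.entries);
    rc = HYPERVISOR_memory_op(XENMEM_machine_memory_map, &memmap);

An EFI-formatted variant could presumably reuse the same plumbing.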

Do you have a reference to that thread? I certainly missed it.

For PVH dom0 we could consider something different, since in that case
there's a guest memory map which could likely be returned in EFI
format and with the EFI regions if required.

> > > Skip this part on Xen PV (let Xen do the right thing if it deems
> > > necessary) and use ESRT table normally.
> > 
> > Maybe it would be better to introduce a new hypercall (or add a
> > parameter to XENMEM_machine_memory_map) in order to be able to fetch
> > the EFI memory map?
> >
> > That should allow a PV dom0 to check the ESRT is correct and thus not
> > diverge from bate metal.
> 
> Note the EFI memory map there is used not just to check things, but to
> actually modify it to reserve the region. I think that's rather Xen
> responsibility, not dom0. See the comment from Ard.

But that would modify Linux's copy of the memory map, which is fine? My
understanding of EFI is limited, but I don't think such changes are
fed back into EFI, so Linux is completely free to do whatever it
wants with its copy of the EFI memory map.

> > > 
> > > This is a requirement for using fwupd in PV dom0 to update UEFI using
> > > capsules.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki 
> > > 
> > > ---
> > >  drivers/firmware/efi/esrt.c | 47 -
> > >  1 file changed, 25 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c
> > > index d5915272141f..5c49f2aaa4b1 100644
> > > --- a/drivers/firmware/efi/esrt.c
> > > +++ b/drivers/firmware/efi/esrt.c
> > > @@ -245,36 +245,38 @@ void __init efi_esrt_init(void)
> > >   int rc;
> > >   phys_addr_t end;
> > >  
> > > - if (!efi_enabled(EFI_MEMMAP))
> > > + if (!efi_enabled(EFI_MEMMAP) && !efi_enabled(EFI_PARAVIRT))
> > >   return;
> > >  
> > >   pr_debug("esrt-init: loading.\n");
> > >   if (!esrt_table_exists())
> > >   return;
> > >  
> > > - rc = efi_mem_desc_lookup(efi.esrt, &md);
> > > - if (rc < 0 ||
> > > - (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> > > -  md.type != EFI_BOOT_SERVICES_DATA &&
> > > -  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> > > - pr_warn("ESRT header is not in the memory map.\n");
> > > - return;
> > > - }
> > > + if (efi_enabled(EFI_MEMMAP)) {
> > > + rc = efi_mem_desc_lookup(efi.esrt, &md);
> > > + if (rc < 0 ||
> > > + (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> > > +  md.type != EFI_BOOT_SERVICES_DATA &&
> > > +  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> > > + pr_warn("ESRT header is not in the memory map.\n");
> > > + return;
> > > + }
> > 
> > Here you blindly trust the data in the ESRT in the PV case, without
> > checking it matches the regions on the memory map, which could lead to
> > errors if ESRT turns to be wrong.
> 
> I don't think checking merely if ESRT lives somewhere in
> EFI_{BOOT,RUNTIME}_SERVICES_DATA area guarantees its correctness.
> 
> On the other hand, this seems to be done to prevent overwriting that
> memory with something else (see that in case of EFI_BOOT_SERVICES_DATA
> it is later marked as reserved. I think it should be rather done by Xen,
> not dom0.

Forcing Xen to do all those checks seems like quite tedious work, and in
fact the memory map that dom0 has might be more complete than the one
Xen is able to construct, as Xen doesn't have an AML parser so it's
not able to get all the possible info from ACPI.

Thanks, Roger.


Re: [PATCH] efi: discover ESRT table on Xen PV too

2020-08-17 Thread Roger Pau Monné
On Sun, Aug 16, 2020 at 02:19:49AM +0200, Marek Marczykowski-Górecki wrote:
> In case of Xen PV dom0, Xen passes along info about system tables (see
> arch/x86/xen/efi.c), but not the memory map from EFI.

I think that's because the memory map returned by
XENMEM_machine_memory_map is in e820 form, and doesn't contain the
required information about the EFI regions due to the translation done
by efi_arch_process_memory_map in Xen?

> This makes sense
> as it is Xen responsible for managing physical memory address space.
> In this case, it doesn't make sense to condition using ESRT table on
> availability of EFI memory map, as it isn't Linux kernel responsible for
> it.

PV dom0 is kind of special in that regard as it can create mappings to
(almost) any MMIO region, and hence can change its memory map
substantially.

> Skip this part on Xen PV (let Xen do the right thing if it deems
> necessary) and use ESRT table normally.

Maybe it would be better to introduce a new hypercall (or add a
parameter to XENMEM_machine_memory_map) in order to be able to fetch
the EFI memory map?

That should allow a PV dom0 to check the ESRT is correct and thus not
diverge from bare metal.
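
To make that concrete, the interface could look something like the sketch
below; note this hypercall does not exist today and all of the names in it
are made up for discussion:

    /* Purely hypothetical interface. */
    struct xen_efi_memory_map {
        /* IN/OUT: entries the buffer can hold / entries Xen filled in. */
        unsigned int nr_entries;
        /* OUT: size of a single EFI memory descriptor. */
        unsigned int desc_size;
        /* IN: buffer to be filled with EFI_MEMORY_DESCRIPTOR entries. */
        GUEST_HANDLE(void) buffer;
    };

Either a new XENMEM_* subop or a flag to XENMEM_machine_memory_map
selecting EFI-formatted output would do.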

> 
> This is a requirement for using fwupd in PV dom0 to update UEFI using
> capsules.
> 
> Signed-off-by: Marek Marczykowski-Górecki 
> ---
>  drivers/firmware/efi/esrt.c | 47 -
>  1 file changed, 25 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c
> index d5915272141f..5c49f2aaa4b1 100644
> --- a/drivers/firmware/efi/esrt.c
> +++ b/drivers/firmware/efi/esrt.c
> @@ -245,36 +245,38 @@ void __init efi_esrt_init(void)
>   int rc;
>   phys_addr_t end;
>  
> - if (!efi_enabled(EFI_MEMMAP))
> + if (!efi_enabled(EFI_MEMMAP) && !efi_enabled(EFI_PARAVIRT))
>   return;
>  
>   pr_debug("esrt-init: loading.\n");
>   if (!esrt_table_exists())
>   return;
>  
> - rc = efi_mem_desc_lookup(efi.esrt, &md);
> - if (rc < 0 ||
> - (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> -  md.type != EFI_BOOT_SERVICES_DATA &&
> -  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> - pr_warn("ESRT header is not in the memory map.\n");
> - return;
> - }
> + if (efi_enabled(EFI_MEMMAP)) {
> + rc = efi_mem_desc_lookup(efi.esrt, &md);
> + if (rc < 0 ||
> + (!(md.attribute & EFI_MEMORY_RUNTIME) &&
> +  md.type != EFI_BOOT_SERVICES_DATA &&
> +  md.type != EFI_RUNTIME_SERVICES_DATA)) {
> + pr_warn("ESRT header is not in the memory map.\n");
> + return;
> + }

Here you blindly trust the data in the ESRT in the PV case, without
checking it matches the regions on the memory map, which could lead to
errors if the ESRT turns out to be wrong.

Thanks, Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-14 Thread Roger Pau Monné
On Fri, Aug 14, 2020 at 02:54:38PM +0200, Jürgen Groß wrote:
> On 14.08.20 14:47, Roger Pau Monné wrote:
> > On Fri, Aug 14, 2020 at 12:27:32PM +0200, Jürgen Groß wrote:
> > > On 14.08.20 11:56, Roger Pau Monné wrote:
> > > > On Fri, Aug 14, 2020 at 08:29:20AM +0100, Christoph Hellwig wrote:
> > > > > On Thu, Aug 13, 2020 at 09:54:20AM +0200, Roger Pau Monné wrote:
> > > > > > On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> > > > > > > On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > > > > > > > If enabled (because ZONE_DEVICE is supported) the usage of the 
> > > > > > > > new
> > > > > > > > functionality untangles Xen balloon and RAM hotplug from the 
> > > > > > > > usage of
> > > > > > > > unpopulated physical memory ranges to map foreign pages, which 
> > > > > > > > is the
> > > > > > > > correct thing to do in order to avoid mappings of foreign pages 
> > > > > > > > depend
> > > > > > > > on memory hotplug.
> > > > > > > 
> > > > > > > So please just select ZONE_DEVICE if this is so much better rather
> > > > > > > than maintaining two variants.
> > > > > > 
> > > > > > We still need to other variant for Arm at least, so both need to be
> > > > > > maintained anyway, even if we force ZONE_DEVICE on x86.
> > > > > 
> > > > > Well, it still really helps reproducability if you stick to one
> > > > > implementation of x86.
> > > > > 
> > > > > The alternative would be an explicit config option to opt into it,
> > > > > but just getting a different implementation based on a random
> > > > > kernel option is strange.
> > > > 
> > > > Would adding something like the chunk below to the patch be OK?
> > > > 
> > > > ---8<---
> > > > diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> > > > index 018020b91baa..5f321a1319e6 100644
> > > > --- a/drivers/xen/Kconfig
> > > > +++ b/drivers/xen/Kconfig
> > > > @@ -328,7 +328,14 @@ config XEN_FRONT_PGDIR_SHBUF
> > > > tristate
> > > >config XEN_UNPOPULATED_ALLOC
> > > > -   bool
> > > > -   default y if ZONE_DEVICE && !ARM && !ARM64
> > > > +   bool "Use unpopulated memory ranges for guest mappings"
> > > > +   depends on X86
> > > > +   select ZONE_DEVICE
> > > > +   default y
> > > 
> > > I'd rather use "default XEN_BACKEND" here, as mappings of other guest's
> > > memory is rarely used for non-backend guests.
> > 
> > There's also the privcmd and gnt devices which make heavy use of this,
> > so I'm not sure only selecting by default on XEN_BACKEND is the best
> > option.
> 
> I just want to avoid that kernels built for running as Xen guest, but
> not as dom0, will be forced to select ZONE_DEVICE.
> 
> As privcmd is dom0-only, this is no problem.

Oh, I didn't know that; I somehow assumed privcmd would be available to
all Xen guests regardless of whether dom0 support was selected.

> In case you are worrying about gnt devices, I'd be fine to switch to
> 
> default XEN_BACKEND || XEN_GNTDEV

Sure, maybe even:

default XEN_BACKEND || XEN_GNTDEV || XEN_DOM0

Do you want me to resend with this changed, or are you happy to fix it
up if there are no further comments?

Thanks, Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-14 Thread Roger Pau Monné
On Fri, Aug 14, 2020 at 12:27:32PM +0200, Jürgen Groß wrote:
> On 14.08.20 11:56, Roger Pau Monné wrote:
> > On Fri, Aug 14, 2020 at 08:29:20AM +0100, Christoph Hellwig wrote:
> > > On Thu, Aug 13, 2020 at 09:54:20AM +0200, Roger Pau Monné wrote:
> > > > On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> > > > > On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > > > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > > > functionality untangles Xen balloon and RAM hotplug from the usage 
> > > > > > of
> > > > > > unpopulated physical memory ranges to map foreign pages, which is 
> > > > > > the
> > > > > > correct thing to do in order to avoid mappings of foreign pages 
> > > > > > depend
> > > > > > on memory hotplug.
> > > > > 
> > > > > So please just select ZONE_DEVICE if this is so much better rather
> > > > > than maintaining two variants.
> > > > 
> > > > We still need to other variant for Arm at least, so both need to be
> > > > maintained anyway, even if we force ZONE_DEVICE on x86.
> > > 
> > > Well, it still really helps reproducability if you stick to one
> > > implementation of x86.
> > > 
> > > The alternative would be an explicit config option to opt into it,
> > > but just getting a different implementation based on a random
> > > kernel option is strange.
> > 
> > Would adding something like the chunk below to the patch be OK?
> > 
> > ---8<---
> > diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> > index 018020b91baa..5f321a1319e6 100644
> > --- a/drivers/xen/Kconfig
> > +++ b/drivers/xen/Kconfig
> > @@ -328,7 +328,14 @@ config XEN_FRONT_PGDIR_SHBUF
> > tristate
> >   config XEN_UNPOPULATED_ALLOC
> > -   bool
> > -   default y if ZONE_DEVICE && !ARM && !ARM64
> > +   bool "Use unpopulated memory ranges for guest mappings"
> > +   depends on X86
> > +   select ZONE_DEVICE
> > +   default y
> 
> I'd rather use "default XEN_BACKEND" here, as mappings of other guest's
> memory is rarely used for non-backend guests.

There are also the privcmd and gnt devices, which make heavy use of this,
so I'm not sure only selecting it by default on XEN_BACKEND is the best
option.

Thanks, Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-14 Thread Roger Pau Monné
On Fri, Aug 14, 2020 at 08:29:20AM +0100, Christoph Hellwig wrote:
> On Thu, Aug 13, 2020 at 09:54:20AM +0200, Roger Pau Monné wrote:
> > On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> > > On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > functionality untangles Xen balloon and RAM hotplug from the usage of
> > > > unpopulated physical memory ranges to map foreign pages, which is the
> > > > correct thing to do in order to avoid mappings of foreign pages depend
> > > > on memory hotplug.
> > > 
> > > So please just select ZONE_DEVICE if this is so much better rather
> > > than maintaining two variants.
> > 
> > We still need to other variant for Arm at least, so both need to be
> > maintained anyway, even if we force ZONE_DEVICE on x86.
> 
> Well, it still really helps reproducability if you stick to one
> implementation of x86.
> 
> The alternative would be an explicit config option to opt into it,
> but just getting a different implementation based on a random
> kernel option is strange.

Would adding something like the chunk below to the patch be OK?

---8<---
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 018020b91baa..5f321a1319e6 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -328,7 +328,14 @@ config XEN_FRONT_PGDIR_SHBUF
tristate
 
 config XEN_UNPOPULATED_ALLOC
-   bool
-   default y if ZONE_DEVICE && !ARM && !ARM64
+   bool "Use unpopulated memory ranges for guest mappings"
+   depends on X86
+   select ZONE_DEVICE
+   default y
+   help
+ Use unpopulated memory ranges in order to create mappings for guest
+ memory regions, including grants maps and foreign pages. This avoids
+ having to balloon out RAM regions in order to obtain physical memory
+ space to create such mappings.
 
 endmenu



Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-13 Thread Roger Pau Monné
Your email client seems to set 'Reply-to:' to point to everyone on the
'Cc:' field, but not yourself, which is kind of weird. I've manually
fixed it on this reply by moving everyone to the 'Cc:' field and
setting you on 'To:'.

On Thu, Aug 13, 2020 at 11:49:46AM +0200, Daniel Vetter wrote:
> On Thu, Aug 13, 2020 at 09:54:20AM +0200, Roger Pau Monné wrote:
> > On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> > > On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > functionality untangles Xen balloon and RAM hotplug from the usage of
> > > > unpopulated physical memory ranges to map foreign pages, which is the
> > > > correct thing to do in order to avoid mappings of foreign pages depend
> > > > on memory hotplug.
> > > 
> > > So please just select ZONE_DEVICE if this is so much better rather
> > > than maintaining two variants.
> > 
> > We still need to other variant for Arm at least, so both need to be
> > maintained anyway, even if we force ZONE_DEVICE on x86.
> 
> Why does arm not have ZONE_DEVICE?

It's not that Arm doesn't have ZONE_DEVICE, it's just that the
approach used here won't work correctly on an Arm Xen dom0 as-is.

This is due to the usage of an identity second stage translation in
order to workaround the lack of an IOMMU in some Arm boards.

It can be made to work on Arm, but will likely require someone from
the Arm side doing that.

Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-13 Thread Roger Pau Monné
On Thu, Aug 13, 2020 at 08:33:37AM +0100, Christoph Hellwig wrote:
> On Tue, Aug 11, 2020 at 11:44:47AM +0200, Roger Pau Monne wrote:
> > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > functionality untangles Xen balloon and RAM hotplug from the usage of
> > unpopulated physical memory ranges to map foreign pages, which is the
> > correct thing to do in order to avoid mappings of foreign pages depend
> > on memory hotplug.
> 
> So please just select ZONE_DEVICE if this is so much better rather
> than maintaining two variants.

We still need the other variant for Arm at least, so both need to be
maintained anyway, even if we force ZONE_DEVICE on x86.

Thanks, Roger.


Re: [PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-12 Thread Roger Pau Monné
On Wed, Aug 12, 2020 at 09:28:45AM +0200, Jürgen Groß wrote:
> On 11.08.20 11:44, Roger Pau Monne wrote:
> > To be used in order to create foreign mappings. This is based on the
> > ZONE_DEVICE facility which is used by persistent memory devices in
> > order to create struct pages and kernel virtual mappings for the IOMEM
> > areas of such devices. Note that on kernels without support for
> > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > create foreign mappings.
> > 
> > The newly added helpers use the same parameters as the existing
> > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > replacement of the callers. Once a memory region has been added to be
> > used as scratch mapping space it will no longer be released, and pages
> > returned are kept in a linked list. This allows to have a buffer of
> > pages and prevents resorting to frequent additions and removals of
> > regions.
> > 
> > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > functionality untangles Xen balloon and RAM hotplug from the usage of
> > unpopulated physical memory ranges to map foreign pages, which is the
> > correct thing to do in order to avoid mappings of foreign pages depend
> > on memory hotplug.
> > 
> > Note the driver is currently not enabled on Arm platforms because it
> > would interfere with the identity mapping required on some platforms.
> > 
> > Signed-off-by: Roger Pau Monné 
> > ---
> > Cc: Oleksandr Andrushchenko 
> > Cc: David Airlie 
> > Cc: Daniel Vetter 
> > Cc: Boris Ostrovsky 
> > Cc: Juergen Gross 
> > Cc: Stefano Stabellini 
> > Cc: Dan Carpenter 
> > Cc: Roger Pau Monne 
> > Cc: Wei Liu 
> > Cc: Yan Yankovskyi 
> > Cc: dri-de...@lists.freedesktop.org
> > Cc: xen-de...@lists.xenproject.org
> > Cc: linux...@kvack.org
> > Cc: David Hildenbrand 
> > Cc: Michal Hocko 
> > Cc: Dan Williams 
> > ---
> > Changes since v3:
> >   - Introduce a Kconfig option that gates the addition of the
> > unpopulated alloc driver. This allows to easily disable it on Arm
> > platforms.
> >   - Dropped Juergen RB due to the addition of the Kconfig option.
> >   - Switched from MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC.
> > 
> > Changes since v2:
> >   - Drop BUILD_BUG_ON regarding PVMMU page sizes.
> >   - Use a SPDX license identifier.
> >   - Call fill with only the minimum required number of pages.
> >   - Include xen.h header in xen_drm_front_gem.c.
> >   - Use less generic function names.
> >   - Exit early from the init function if not a PV guest.
> >   - Don't use all caps for region name.
> > ---
> >   drivers/gpu/drm/xen/xen_drm_front_gem.c |   9 +-
> >   drivers/xen/Kconfig |   4 +
> >   drivers/xen/Makefile|   1 +
> >   drivers/xen/balloon.c   |   4 +-
> >   drivers/xen/grant-table.c   |   4 +-
> >   drivers/xen/privcmd.c   |   4 +-
> >   drivers/xen/unpopulated-alloc.c | 185 
> >   drivers/xen/xenbus/xenbus_client.c  |   6 +-
> >   drivers/xen/xlate_mmu.c |   4 +-
> >   include/xen/xen.h   |   9 ++
> >   10 files changed, 215 insertions(+), 15 deletions(-)
> >   create mode 100644 drivers/xen/unpopulated-alloc.c
> > 
> > diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> > index 1d339ef92422..018020b91baa 100644
> > --- a/drivers/xen/Kconfig
> > +++ b/drivers/xen/Kconfig
> > @@ -327,4 +327,8 @@ config XEN_HAVE_VPMU
> >   config XEN_FRONT_PGDIR_SHBUF
> > tristate
> > +config XEN_UNPOPULATED_ALLOC
> > +   bool
> > +   default y if ZONE_DEVICE && !ARM && !ARM64
> 
> There is a current effort to enable Xen on RISC-V. Do we expect this
> option to be usable for this architecture?

It will depend on whether the Xen RISC-V implementation mandates an
IOMMU; for example, Arm doesn't and hence uses an identity p2m for
dom0. If RISC-V does the same then I would assume this won't be
suitable as-is (not that it couldn't be made to work).

IMO it wasn't clear from the community call what the RISC-V port's
position regarding this was, but IIRC the IOMMU spec for RISC-V was behind
the virtualization one, so it's likely that quite a lot of hardware
will have hardware virtualization support but no IOMMU, in which case
I think it's likely the RISC-V port will follow Arm and implement an
identity p2m.

> If yes, I'm fine with the
> exclusion of Arm, otherwise I'd opt for defaulting to yes only for
> X86.
> 
> Either way you can have my:
> 
> Reviewed-by: Juergen Gross 

Thanks. Feel free to change the 'ZONE_DEVICE && !ARM && !ARM64' to
'ZONE_DEVICE && X86' if you think it's safer.

Roger.


[PATCH v4 1/2] memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC

2020-08-11 Thread Roger Pau Monne
This is in preparation for the logic behind MEMORY_DEVICE_DEVDAX also
being used by non DAX devices.

No functional change intended.

Signed-off-by: Roger Pau Monné 
---
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: Ira Weiny 
Cc: "Aneesh Kumar K.V" 
Cc: Johannes Thumshirn 
Cc: Logan Gunthorpe 
Cc: linux-nvd...@lists.01.org
Cc: xen-de...@lists.xenproject.org
Cc: linux...@kvack.org
---
 drivers/dax/device.c | 2 +-
 include/linux/memremap.h | 9 -
 mm/memremap.c| 2 +-
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 4c0af2eb7e19..1e89513f3c59 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -429,7 +429,7 @@ int dev_dax_probe(struct device *dev)
return -EBUSY;
}
 
-   dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
+   dev_dax->pgmap.type = MEMORY_DEVICE_GENERIC;
addr = devm_memremap_pages(dev, &dev_dax->pgmap);
if (IS_ERR(addr))
return PTR_ERR(addr);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 5f5b2df06e61..e5862746751b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -46,11 +46,10 @@ struct vmem_altmap {
  * wakeup is used to coordinate physical address space management (ex:
  * fs truncate/hole punch) vs pinned pages (ex: device dma).
  *
- * MEMORY_DEVICE_DEVDAX:
+ * MEMORY_DEVICE_GENERIC:
  * Host memory that has similar access semantics as System RAM i.e. DMA
- * coherent and supports page pinning. In contrast to
- * MEMORY_DEVICE_FS_DAX, this memory is access via a device-dax
- * character device.
+ * coherent and supports page pinning. This is for example used by DAX devices
+ * that expose memory using a character device.
  *
  * MEMORY_DEVICE_PCI_P2PDMA:
  * Device memory residing in a PCI BAR intended for use with Peer-to-Peer
@@ -60,7 +59,7 @@ enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
MEMORY_DEVICE_FS_DAX,
-   MEMORY_DEVICE_DEVDAX,
+   MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
 };
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 03e38b7a38f1..006dace60b1a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -216,7 +216,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
return ERR_PTR(-EINVAL);
}
break;
-   case MEMORY_DEVICE_DEVDAX:
+   case MEMORY_DEVICE_GENERIC:
need_devmap_managed = false;
break;
case MEMORY_DEVICE_PCI_P2PDMA:
-- 
2.28.0



[PATCH v4 2/2] xen: add helpers to allocate unpopulated memory

2020-08-11 Thread Roger Pau Monne
To be used in order to create foreign mappings. This is based on the
ZONE_DEVICE facility which is used by persistent memory devices in
order to create struct pages and kernel virtual mappings for the IOMEM
areas of such devices. Note that on kernels without support for
ZONE_DEVICE Xen will fallback to use ballooned pages in order to
create foreign mappings.

The newly added helpers use the same parameters as the existing
{alloc/free}_xenballooned_pages functions, which allows for in-place
replacement of the callers. Once a memory region has been added to be
used as scratch mapping space it will no longer be released, and pages
returned are kept in a linked list. This allows to have a buffer of
pages and prevents resorting to frequent additions and removals of
regions.

If enabled (because ZONE_DEVICE is supported) the usage of the new
functionality untangles Xen balloon and RAM hotplug from the usage of
unpopulated physical memory ranges to map foreign pages, which is the
correct thing to do in order to avoid having mappings of foreign pages
depend on memory hotplug.

Note the driver is currently not enabled on Arm platforms because it
would interfere with the identity mapping required on some platforms.

Signed-off-by: Roger Pau Monné 
---
Cc: Oleksandr Andrushchenko 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Dan Carpenter 
Cc: Roger Pau Monne 
Cc: Wei Liu 
Cc: Yan Yankovskyi 
Cc: dri-de...@lists.freedesktop.org
Cc: xen-de...@lists.xenproject.org
Cc: linux...@kvack.org
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Dan Williams 
---
Changes since v3:
 - Introduce a Kconfig option that gates the addition of the
   unpopulated alloc driver. This allows to easily disable it on Arm
   platforms.
 - Dropped Juergen RB due to the addition of the Kconfig option.
 - Switched from MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC.

Changes since v2:
 - Drop BUILD_BUG_ON regarding PVMMU page sizes.
 - Use a SPDX license identifier.
 - Call fill with only the minimum required number of pages.
 - Include xen.h header in xen_drm_front_gem.c.
 - Use less generic function names.
 - Exit early from the init function if not a PV guest.
 - Don't use all caps for region name.
---
 drivers/gpu/drm/xen/xen_drm_front_gem.c |   9 +-
 drivers/xen/Kconfig |   4 +
 drivers/xen/Makefile|   1 +
 drivers/xen/balloon.c   |   4 +-
 drivers/xen/grant-table.c   |   4 +-
 drivers/xen/privcmd.c   |   4 +-
 drivers/xen/unpopulated-alloc.c | 185 
 drivers/xen/xenbus/xenbus_client.c  |   6 +-
 drivers/xen/xlate_mmu.c |   4 +-
 include/xen/xen.h   |   9 ++
 10 files changed, 215 insertions(+), 15 deletions(-)
 create mode 100644 drivers/xen/unpopulated-alloc.c

diff --git a/drivers/gpu/drm/xen/xen_drm_front_gem.c 
b/drivers/gpu/drm/xen/xen_drm_front_gem.c
index f0b85e094111..270e1bd3e4da 100644
--- a/drivers/gpu/drm/xen/xen_drm_front_gem.c
+++ b/drivers/gpu/drm/xen/xen_drm_front_gem.c
@@ -18,6 +18,7 @@
 #include 
 
 #include 
+#include 
 
 #include "xen_drm_front.h"
 #include "xen_drm_front_gem.h"
@@ -99,8 +100,8 @@ static struct xen_gem_object *gem_create(struct drm_device 
*dev, size_t size)
 * allocate ballooned pages which will be used to map
 * grant references provided by the backend
 */
-   ret = alloc_xenballooned_pages(xen_obj->num_pages,
-  xen_obj->pages);
+   ret = xen_alloc_unpopulated_pages(xen_obj->num_pages,
+ xen_obj->pages);
if (ret < 0) {
DRM_ERROR("Cannot allocate %zu ballooned pages: %d\n",
  xen_obj->num_pages, ret);
@@ -152,8 +153,8 @@ void xen_drm_front_gem_free_object_unlocked(struct 
drm_gem_object *gem_obj)
} else {
if (xen_obj->pages) {
if (xen_obj->be_alloc) {
-   free_xenballooned_pages(xen_obj->num_pages,
-   xen_obj->pages);
+   xen_free_unpopulated_pages(xen_obj->num_pages,
+  xen_obj->pages);
gem_free_pages_array(xen_obj);
} else {
drm_gem_put_pages(&xen_obj->base,
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 1d339ef92422..018020b91baa 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -327,4 +327,8 @@ config XEN_HAVE_VPMU
 config XEN_FRONT_PGDIR_SHBUF
tristate
 
+config XEN_UNPOPULATED_ALLOC
+   bool
+   default y if ZONE_DEVICE && !ARM && !ARM64
+

[PATCH v4 0/2] xen/balloon: fixes for memory hotplug

2020-08-11 Thread Roger Pau Monne
Hello,

The following series contains some fixes in order to split Xen
unpopulated memory handling from the ballooning driver if ZONE_DEVICE is
available, so that physical memory regions used to map foreign pages are
not tied to memory hotplug.

The main difference in this version is that MEMORY_DEVICE_DEVDAX is
renamed to MEMORY_DEVICE_GENERIC, as using DEVDAX in the Xen code to
allocate unpopulated memory felt wrong.

Thanks, Roger.

Roger Pau Monne (2):
  memremap: rename MEMORY_DEVICE_DEVDAX to MEMORY_DEVICE_GENERIC
  xen: add helpers to allocate unpopulated memory

 drivers/dax/device.c|   2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c |   9 +-
 drivers/xen/Kconfig |   4 +
 drivers/xen/Makefile|   1 +
 drivers/xen/balloon.c   |   4 +-
 drivers/xen/grant-table.c   |   4 +-
 drivers/xen/privcmd.c   |   4 +-
 drivers/xen/unpopulated-alloc.c | 185 
 drivers/xen/xenbus/xenbus_client.c  |   6 +-
 drivers/xen/xlate_mmu.c |   4 +-
 include/linux/memremap.h|   9 +-
 include/xen/xen.h   |   9 ++
 mm/memremap.c   |   2 +-
 13 files changed, 221 insertions(+), 22 deletions(-)
 create mode 100644 drivers/xen/unpopulated-alloc.c

-- 
2.28.0



Re: [PATCH v3 4/4] xen: add helpers to allocate unpopulated memory

2020-07-28 Thread Roger Pau Monné
On Tue, Jul 28, 2020 at 06:12:46PM +0100, Julien Grall wrote:
> Hi Roger,
> 
> On 28/07/2020 17:59, Roger Pau Monné wrote:
> > On Tue, Jul 28, 2020 at 05:48:23PM +0100, Julien Grall wrote:
> > > Hi,
> > > 
> > > On 27/07/2020 10:13, Roger Pau Monne wrote:
> > > > To be used in order to create foreign mappings. This is based on the
> > > > ZONE_DEVICE facility which is used by persistent memory devices in
> > > > order to create struct pages and kernel virtual mappings for the IOMEM
> > > > areas of such devices. Note that on kernels without support for
> > > > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > > > create foreign mappings.
> > > > 
> > > > The newly added helpers use the same parameters as the existing
> > > > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > > > replacement of the callers. Once a memory region has been added to be
> > > > used as scratch mapping space it will no longer be released, and pages
> > > > returned are kept in a linked list. This allows to have a buffer of
> > > > pages and prevents resorting to frequent additions and removals of
> > > > regions.
> > > > 
> > > > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > > > functionality untangles Xen balloon and RAM hotplug from the usage of
> > > > unpopulated physical memory ranges to map foreign pages, which is the
> > > > correct thing to do in order to avoid mappings of foreign pages depend
> > > > on memory hotplug.
> > > I think this is going to break Dom0 on Arm if the kernel has been built 
> > > with
> > > hotplug. This is because you may end up to re-use region that will be used
> > > for the 1:1 mapping of a foreign map.
> > > 
> > > Note that I don't know whether hotplug has been tested on Xen on Arm yet. 
> > > So
> > > it might be possible to be already broken.
> > > 
> > > Meanwhile, my suggestion would be to make the use of hotplug in the 
> > > balloon
> > > code conditional (maybe using CONFIG_ARM64 and CONFIG_ARM)?
> > 
> > Right, this feature (allocation of unpopulated memory separated from
> > the balloon driver) is currently gated on CONFIG_ZONE_DEVICE, which I
> > think could be used on Arm.
> > 
> > IMO the right solution seems to be to subtract the physical memory
> > regions that can be used for the identity mappings of foreign pages
> > (all RAM on the system AFAICT) from iomem_resource, as that would make
> > this and the memory hotplug done in the balloon driver safe?
> 
> Dom0 doesn't know the regions used for the identity mappings as this is only
> managed by Xen. So there is nothing you can really do here.

OK, I will add the guards to prevent this being built on Arm.

> But don't you have the same issue on x86 with "magic pages"?

Those are marked as reserved on the memory map, and hence I would
expect them to never end up in iomem_resource.

Thanks, Roger.


Re: [PATCH v3 4/4] xen: add helpers to allocate unpopulated memory

2020-07-28 Thread Roger Pau Monné
On Tue, Jul 28, 2020 at 06:06:25PM +0100, Andrew Cooper wrote:
> On 28/07/2020 17:59, Roger Pau Monné wrote:
> > On Tue, Jul 28, 2020 at 05:48:23PM +0100, Julien Grall wrote:
> >> Hi,
> >>
> >> On 27/07/2020 10:13, Roger Pau Monne wrote:
> >>> To be used in order to create foreign mappings. This is based on the
> >>> ZONE_DEVICE facility which is used by persistent memory devices in
> >>> order to create struct pages and kernel virtual mappings for the IOMEM
> >>> areas of such devices. Note that on kernels without support for
> >>> ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> >>> create foreign mappings.
> >>>
> >>> The newly added helpers use the same parameters as the existing
> >>> {alloc/free}_xenballooned_pages functions, which allows for in-place
> >>> replacement of the callers. Once a memory region has been added to be
> >>> used as scratch mapping space it will no longer be released, and pages
> >>> returned are kept in a linked list. This allows to have a buffer of
> >>> pages and prevents resorting to frequent additions and removals of
> >>> regions.
> >>>
> >>> If enabled (because ZONE_DEVICE is supported) the usage of the new
> >>> functionality untangles Xen balloon and RAM hotplug from the usage of
> >>> unpopulated physical memory ranges to map foreign pages, which is the
> >>> correct thing to do in order to avoid mappings of foreign pages depend
> >>> on memory hotplug.
> >> I think this is going to break Dom0 on Arm if the kernel has been built 
> >> with
> >> hotplug. This is because you may end up to re-use region that will be used
> >> for the 1:1 mapping of a foreign map.
> >>
> >> Note that I don't know whether hotplug has been tested on Xen on Arm yet. 
> >> So
> >> it might be possible to be already broken.
> >>
> >> Meanwhile, my suggestion would be to make the use of hotplug in the balloon
> >> code conditional (maybe using CONFIG_ARM64 and CONFIG_ARM)?
> > Right, this feature (allocation of unpopulated memory separated from
> > the balloon driver) is currently gated on CONFIG_ZONE_DEVICE, which I
> > think could be used on Arm.
> >
> > IMO the right solution seems to be to subtract the physical memory
> > regions that can be used for the identity mappings of foreign pages
> > (all RAM on the system AFAICT) from iomem_resource, as that would make
> > this and the memory hotplug done in the balloon driver safe?
> 
> The right solution is a mechanism for translated guests to query Xen to
> find regions of guest physical address space which are unused, and can
> be safely be used for foreign/grant/other  mappings.
> 
> Please don't waste any more time applying more duct tape to a broken
> system, and instead spend the time organising some proper foundations.

The piece added here (using ZONE_DEVICE) will be relevant when Xen can
provide the space to map foreign pages, it's just that right now it
relies on iomem_resource instead of a Xen specific resource map that
should be provided by the hypervisor. It should indeed be fixed, but
right now this patch should allow a PVH dom0 to work slightly better.
When Xen provides such areas Linux just needs to populate a custom Xen
resource with them and use it instead of iomem_resource.
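
Something along these lines (illustrative only; the hypervisor interface to
discover such ranges does not exist yet, and the locals below come from the
fill() helper in the last patch):

    static struct resource xen_foreign_map_resource = {
            .name  = "Xen foreign map space",
            .flags = IORESOURCE_MEM,
            /* .start/.end would be filled from the Xen-provided map */
    };

    /* unpopulated-alloc.c would then carve regions out of this resource
     * instead of iomem_resource: */
    ret = allocate_resource(&xen_foreign_map_resource, &pgmap->res,
                            alloc_pages * PAGE_SIZE, 0, -1,
                            PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL);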

The Arm stuff I'm certainly not familiar with, and can't provide much
insight on that. If it's best to just disable it and continue to rely
on ballooned out pages that's fine.

Roger.


Re: [PATCH v3 4/4] xen: add helpers to allocate unpopulated memory

2020-07-28 Thread Roger Pau Monné
On Tue, Jul 28, 2020 at 05:48:23PM +0100, Julien Grall wrote:
> Hi,
> 
> On 27/07/2020 10:13, Roger Pau Monne wrote:
> > To be used in order to create foreign mappings. This is based on the
> > ZONE_DEVICE facility which is used by persistent memory devices in
> > order to create struct pages and kernel virtual mappings for the IOMEM
> > areas of such devices. Note that on kernels without support for
> > ZONE_DEVICE Xen will fallback to use ballooned pages in order to
> > create foreign mappings.
> > 
> > The newly added helpers use the same parameters as the existing
> > {alloc/free}_xenballooned_pages functions, which allows for in-place
> > replacement of the callers. Once a memory region has been added to be
> > used as scratch mapping space it will no longer be released, and pages
> > returned are kept in a linked list. This allows to have a buffer of
> > pages and prevents resorting to frequent additions and removals of
> > regions.
> > 
> > If enabled (because ZONE_DEVICE is supported) the usage of the new
> > functionality untangles Xen balloon and RAM hotplug from the usage of
> > unpopulated physical memory ranges to map foreign pages, which is the
> > correct thing to do in order to avoid mappings of foreign pages depend
> > on memory hotplug.
> I think this is going to break Dom0 on Arm if the kernel has been built with
> hotplug. This is because you may end up to re-use region that will be used
> for the 1:1 mapping of a foreign map.
> 
> Note that I don't know whether hotplug has been tested on Xen on Arm yet. So
> it might be possible to be already broken.
> 
> Meanwhile, my suggestion would be to make the use of hotplug in the balloon
> code conditional (maybe using CONFIG_ARM64 and CONFIG_ARM)?

Right, this feature (allocation of unpopulated memory separated from
the balloon driver) is currently gated on CONFIG_ZONE_DEVICE, which I
think could be used on Arm.

IMO the right solution seems to be to subtract the physical memory
regions that can be used for the identity mappings of foreign pages
(all RAM on the system AFAICT) from iomem_resource, as that would make
this and the memory hotplug done in the balloon driver safe?

Thanks, Roger.


[PATCH] xen/balloon: add header guard

2020-07-28 Thread Roger Pau Monne
In order to protect against the header being included multiple times
on the same compilation unit.

Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
This is required as a pre-patch to use ZONE_DEVICE, or else the
fallback of including the balloon header might not work properly.
---
 include/xen/balloon.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/xen/balloon.h b/include/xen/balloon.h
index 6fb95aa19405..6dbdb0b3fd03 100644
--- a/include/xen/balloon.h
+++ b/include/xen/balloon.h
@@ -2,6 +2,8 @@
 /**
  * Xen balloon functionality
  */
+#ifndef _XEN_BALLOON_H
+#define _XEN_BALLOON_H
 
 #define RETRY_UNLIMITED	0
 
@@ -34,3 +36,5 @@ static inline void xen_balloon_init(void)
 {
 }
 #endif
+
+#endif /* _XEN_BALLOON_H */
-- 
2.27.0



[PATCH v3 4/4] xen: add helpers to allocate unpopulated memory

2020-07-27 Thread Roger Pau Monne
To be used in order to create foreign mappings. This is based on the
ZONE_DEVICE facility which is used by persistent memory devices in
order to create struct pages and kernel virtual mappings for the IOMEM
areas of such devices. Note that on kernels without support for
ZONE_DEVICE Xen will fall back to using ballooned pages in order to
create foreign mappings.

The newly added helpers use the same parameters as the existing
{alloc/free}_xenballooned_pages functions, which allows for in-place
replacement of the callers. Once a memory region has been added to be
used as scratch mapping space it will no longer be released, and pages
returned are kept in a linked list. This allows keeping a buffer of
pages and prevents resorting to frequent additions and removals of
regions.

If enabled (because ZONE_DEVICE is supported) the usage of the new
functionality untangles Xen balloon and RAM hotplug from the usage of
unpopulated physical memory ranges to map foreign pages, which is the
correct thing to do in order to avoid having mappings of foreign pages depend
on memory hotplug.
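
For reference, the intended in-place switch in a caller looks roughly like
this (mirroring the xen_drm_front_gem.c hunk below; nr_pages/pages and the
error handling are illustrative):

    #include <xen/xen.h>

    /* before: ret = alloc_xenballooned_pages(nr_pages, pages); */
    ret = xen_alloc_unpopulated_pages(nr_pages, pages);
    if (ret < 0)
            return ret;

    /* ... map foreign/grant pages into the returned struct pages ... */

    /* before: free_xenballooned_pages(nr_pages, pages); */
    xen_free_unpopulated_pages(nr_pages, pages);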

Signed-off-by: Roger Pau Monné 
---
I've not added a new memory_type type and just used
MEMORY_DEVICE_DEVDAX which seems to be what we want for such memory
regions. I'm unsure whether abusing this type is fine, or if I should
instead add a specific type, maybe MEMORY_DEVICE_GENERIC? I don't
think we should be using a specific Xen type at all.
---
Cc: Oleksandr Andrushchenko 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Dan Carpenter 
Cc: Roger Pau Monne 
Cc: Wei Liu 
Cc: Yan Yankovskyi 
Cc: dri-de...@lists.freedesktop.org
Cc: xen-de...@lists.xenproject.org
Cc: linux...@kvack.org
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Dan Williams 
---
Changes since v2:
 - Drop BUILD_BUG_ON regarding PVMMU page sizes.
 - Use a SPDX license identifier.
 - Call fill with only the minimum required number of pages.
 - Include xen.h header in xen_drm_front_gem.c.
 - Use less generic function names.
 - Exit early from the init function if not a PV guest.
 - Don't use all caps for region name.
---
 drivers/gpu/drm/xen/xen_drm_front_gem.c |   9 +-
 drivers/xen/Makefile|   1 +
 drivers/xen/balloon.c   |   4 +-
 drivers/xen/grant-table.c   |   4 +-
 drivers/xen/privcmd.c   |   4 +-
 drivers/xen/unpopulated-alloc.c | 185 
 drivers/xen/xenbus/xenbus_client.c  |   6 +-
 drivers/xen/xlate_mmu.c |   4 +-
 include/xen/xen.h   |   9 ++
 9 files changed, 211 insertions(+), 15 deletions(-)
 create mode 100644 drivers/xen/unpopulated-alloc.c

diff --git a/drivers/gpu/drm/xen/xen_drm_front_gem.c 
b/drivers/gpu/drm/xen/xen_drm_front_gem.c
index f0b85e094111..270e1bd3e4da 100644
--- a/drivers/gpu/drm/xen/xen_drm_front_gem.c
+++ b/drivers/gpu/drm/xen/xen_drm_front_gem.c
@@ -18,6 +18,7 @@
 #include 
 
 #include 
+#include <xen/xen.h>
 
 #include "xen_drm_front.h"
 #include "xen_drm_front_gem.h"
@@ -99,8 +100,8 @@ static struct xen_gem_object *gem_create(struct drm_device 
*dev, size_t size)
 * allocate ballooned pages which will be used to map
 * grant references provided by the backend
 */
-   ret = alloc_xenballooned_pages(xen_obj->num_pages,
-  xen_obj->pages);
+   ret = xen_alloc_unpopulated_pages(xen_obj->num_pages,
+ xen_obj->pages);
if (ret < 0) {
DRM_ERROR("Cannot allocate %zu ballooned pages: %d\n",
  xen_obj->num_pages, ret);
@@ -152,8 +153,8 @@ void xen_drm_front_gem_free_object_unlocked(struct 
drm_gem_object *gem_obj)
} else {
if (xen_obj->pages) {
if (xen_obj->be_alloc) {
-   free_xenballooned_pages(xen_obj->num_pages,
-   xen_obj->pages);
+   xen_free_unpopulated_pages(xen_obj->num_pages,
+  xen_obj->pages);
gem_free_pages_array(xen_obj);
} else {
drm_gem_put_pages(&xen_obj->base,
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index 0d322f3d90cd..788a5d9c8ef0 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -42,3 +42,4 @@ xen-gntdev-$(CONFIG_XEN_GNTDEV_DMABUF)+= 
gntdev-dmabuf.o
 xen-gntalloc-y := gntalloc.o
 xen-privcmd-y  := privcmd.o privcmd-buf.o
 obj-$(CONFIG_XEN_FRONT_PGDIR_SHBUF)+= xen-front-pgdir-shbuf.o
+obj-$(CONFIG_ZONE_DEVICE)  += unpopulated-alloc.o
diff --git a/drivers/xen/ball

[PATCH v3 2/4] xen/balloon: make the balloon wait interruptible

2020-07-27 Thread Roger Pau Monne
So it can be killed, or else processes can hang indefinitely
waiting for balloon pages.
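
The only caller-visible change is one more way to get -ENOMEM; roughly:

    ret = alloc_xenballooned_pages(nr_pages, pages);
    if (ret)
            /* includes the case where the wait was interrupted by a signal */
            return ret;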

Signed-off-by: Roger Pau Monné 
Reviewed-by: Juergen Gross 
Cc: sta...@vger.kernel.org
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/balloon.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 3cb10ed32557..292413b27575 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -568,11 +568,13 @@ static int add_ballooned_pages(int nr_pages)
if (xen_hotplug_unpopulated) {
st = reserve_additional_memory();
if (st != BP_ECANCELED) {
+   int rc;
+
mutex_unlock(&balloon_mutex);
-   wait_event(balloon_wq,
+   rc = wait_event_interruptible(balloon_wq,
   !list_empty(&ballooned_pages));
mutex_lock(&balloon_mutex);
-   return 0;
+   return rc ? -ENOMEM : 0;
}
}
 
-- 
2.27.0



[PATCH v3 3/4] Revert "xen/balloon: Fix crash when ballooning on x86 32 bit PAE"

2020-07-27 Thread Roger Pau Monne
This reverts commit dfd74a1edfaba5864276a2859190a8d242d18952.

This has been fixed by commit dca4436d1cf9e0d237c which added the out
of bounds check to __add_memory, so that trying to add blocks past
MAX_PHYSMEM_BITS will fail.

Note the check in the Xen balloon driver was bogus anyway, as it
checked the start address of the resource, but it should instead test
the end address to assert the whole resource falls below
MAX_PHYSMEM_BITS.
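
(For completeness, a guard that would actually have covered the problematic
case would have had to look at the last page of the range, something like
the sketch below; it is kept here only to illustrate why the removed check
was ineffective.)

    unsigned long limit = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);

    /* the end of the resource, not its start, must stay addressable */
    if ((res->end >> PAGE_SHIFT) >= limit) {
            release_memory_resource(res);
            return NULL;
    }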

Signed-off-by: Roger Pau Monné 
Reviewed-by: Juergen Gross 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/balloon.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 292413b27575..b1d8b028bf80 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -266,20 +266,6 @@ static struct resource 
*additional_memory_resource(phys_addr_t size)
return NULL;
}
 
-#ifdef CONFIG_SPARSEMEM
-   {
-   unsigned long limit = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);
-   unsigned long pfn = res->start >> PAGE_SHIFT;
-
-   if (pfn > limit) {
-   pr_err("New System RAM resource outside addressable RAM 
(%lu > %lu)\n",
-  pfn, limit);
-   release_memory_resource(res);
-   return NULL;
-   }
-   }
-#endif
-
return res;
 }
 
-- 
2.27.0



[PATCH v3 0/4] xen/balloon: fixes for memory hotplug

2020-07-27 Thread Roger Pau Monne
Hello,

The following series contains some fixes in order to split Xen
unpopulated memory handling from the ballooning driver if ZONE_DEVICE is
available, so that physical memory regions used to map foreign pages are
not tied to memory hotplug.

First two patches are bugfixes that IMO should be backported to stable
branches, third patch is a revert of a workaround applied to the balloon
driver and last patch introduces an interface based on ZONE_DEVICE in
order to manage regions to use for foreign mappings.

Thanks, Roger.

Roger Pau Monne (4):
  xen/balloon: fix accounting in alloc_xenballooned_pages error path
  xen/balloon: make the balloon wait interruptible
  Revert "xen/balloon: Fix crash when ballooning on x86 32 bit PAE"
  xen: add helpers to allocate unpopulated memory

 drivers/gpu/drm/xen/xen_drm_front_gem.c |   9 +-
 drivers/xen/Makefile|   1 +
 drivers/xen/balloon.c   |  30 ++--
 drivers/xen/grant-table.c   |   4 +-
 drivers/xen/privcmd.c   |   4 +-
 drivers/xen/unpopulated-alloc.c | 185 
 drivers/xen/xenbus/xenbus_client.c  |   6 +-
 drivers/xen/xlate_mmu.c |   4 +-
 include/xen/xen.h   |   9 ++
 9 files changed, 221 insertions(+), 31 deletions(-)
 create mode 100644 drivers/xen/unpopulated-alloc.c

-- 
2.27.0



[PATCH v3 1/4] xen/balloon: fix accounting in alloc_xenballooned_pages error path

2020-07-27 Thread Roger Pau Monne
target_unpopulated is incremented with nr_pages at the start of the
function, but the call to free_xenballooned_pages will only subtract
pgno number of pages, and thus the rest need to be subtracted before
returning or else accounting will be skewed.
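
A worked example with made-up numbers, nr_pages = 8 requested and pgno = 3
pages actually obtained before the failure:

    /*
     * target_unpopulated was bumped by 8 at entry, but
     * free_xenballooned_pages(3, pages) only drops 3, so the error path
     * has to drop the remaining 8 - 3 = 5 itself or the target stays
     * inflated forever.
     */
    balloon_stats.target_unpopulated -= nr_pages - pgno;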

Signed-off-by: Roger Pau Monné 
Reviewed-by: Juergen Gross 
Cc: sta...@vger.kernel.org
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/balloon.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 77c57568e5d7..3cb10ed32557 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -630,6 +630,12 @@ int alloc_xenballooned_pages(int nr_pages, struct page 
**pages)
  out_undo:
mutex_unlock(&balloon_mutex);
free_xenballooned_pages(pgno, pages);
+   /*
+* NB: free_xenballooned_pages will only subtract pgno pages, but since
+* target_unpopulated is incremented with nr_pages at the start we need
+* to remove the remaining ones also, or accounting will be screwed.
+*/
+   balloon_stats.target_unpopulated -= nr_pages - pgno;
return ret;
 }
 EXPORT_SYMBOL(alloc_xenballooned_pages);
-- 
2.27.0



Re: [PATCH v2 4/4] xen: add helpers to allocate unpopulated memory

2020-07-27 Thread Roger Pau Monné
On Fri, Jul 24, 2020 at 12:36:33PM -0400, Boris Ostrovsky wrote:
> On 7/24/20 10:34 AM, David Hildenbrand wrote:
> > CCing Dan
> >
> > On 24.07.20 14:42, Roger Pau Monne wrote:
> >> +
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +#include 
> >> +
> >> +#include 
> >> +
> >> +#include 
> >> +#include 
> >> +
> >> +static DEFINE_MUTEX(lock);
> >> +static LIST_HEAD(list);
> >> +static unsigned int count;
> >> +
> >> +static int fill(unsigned int nr_pages)
> 
> 
> Less generic names? How about  list_lock, pg_list, pg_count,
> fill_pglist()? (But these are bad too, so maybe you can come up with
> something better)

OK, I have to admit I like using such short names when the code allows
it; for example, this code is so simple that it didn't seem to warrant
using longer names. Will rename on next version.

> >> +{
> >> +  struct dev_pagemap *pgmap;
> >> +  void *vaddr;
> >> +  unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
> >> +  int nid, ret;
> >> +
> >> +  pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
> >> +  if (!pgmap)
> >> +  return -ENOMEM;
> >> +
> >> +  pgmap->type = MEMORY_DEVICE_DEVDAX;
> >> +  pgmap->res.name = "XEN SCRATCH";
> 
> 
> Typically iomem resources only capitalize first letters.
> 
> 
> >> +  pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
> >> +
> >> +  ret = allocate_resource(&iomem_resource, &pgmap->res,
> >> +  alloc_pages * PAGE_SIZE, 0, -1,
> >> +  PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL);
> 
> 
> Are we not going to end up with a whole bunch of "Xen scratch" resource
> ranges for each miss in the page list? Or do we expect them to get merged?

PAGES_PER_SECTION is IMO big enough to prevent ending up with a lot of
separated ranges. I think the resulting size is 32 or 64MiB on x86, so while we
are likely to end up with more than one resource added, I don't think
it's going to be massive.

> 
> >> +  if (ret < 0) {
> >> +  pr_err("Cannot allocate new IOMEM resource\n");
> >> +  kfree(pgmap);
> >> +  return ret;
> >> +  }
> >> +
> >> +  nid = memory_add_physaddr_to_nid(pgmap->res.start);
> 
> 
> Should we consider page range crossing node boundaries?

I'm not sure whether this is possible (I would think allocate_resource
should return a range from a single node), but then it would greatly
complicate the code to perform the memremap_pages, as we would have to
split the region into multiple dev_pagemap structs.

FWIW the current code in the balloon driver does exactly the same
(which doesn't mean it's correct, but that's where I got the logic
from).

> >> +
> >> +#ifdef CONFIG_XEN_HAVE_PVMMU
> >> +  /*
> >> +   * We don't support PV MMU when Linux and Xen is using
> >> +   * different page granularity.
> >> +   */
> >> +  BUILD_BUG_ON(XEN_PAGE_SIZE != PAGE_SIZE);
> >> +
> >> +/*
> >> + * memremap will build page tables for the new memory so
> >> + * the p2m must contain invalid entries so the correct
> >> + * non-present PTEs will be written.
> >> + *
> >> + * If a failure occurs, the original (identity) p2m entries
> >> + * are not restored since this region is now known not to
> >> + * conflict with any devices.
> >> + */
> >> +  if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> >> +  xen_pfn_t pfn = PFN_DOWN(pgmap->res.start);
> >> +
> >> +  for (i = 0; i < alloc_pages; i++) {
> >> +  if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) {
> >> +  pr_warn("set_phys_to_machine() failed, no 
> >> memory added\n");
> >> +  release_resource(&pgmap->res);
> >> +  kfree(pgmap);
> >> +  return -ENOMEM;
> >> +  }
> >> +}
> >> +  }
> >> +#endif
> >> +
> >> +  vaddr = memremap_pages(pgmap, nid);
> >> +  if (IS_ERR(vaddr)) {
> >> +  pr_err("Cannot remap memory range\n");
> >> +  release_resource(&pgmap->res);
> >> +  kfree(pgmap);
> >> +  return PTR_ERR(vaddr);
>

[PATCH v2 4/4] xen: add helpers to allocate unpopulated memory

2020-07-24 Thread Roger Pau Monne
To be used in order to create foreign mappings. This is based on the
ZONE_DEVICE facility which is used by persistent memory devices in
order to create struct pages and kernel virtual mappings for the IOMEM
areas of such devices. Note that on kernels without support for
ZONE_DEVICE Xen will fall back to using ballooned pages in order to
create foreign mappings.

The newly added helpers use the same parameters as the existing
{alloc/free}_xenballooned_pages functions, which allows for in-place
replacement of the callers. Once a memory region has been added to be
used as scratch mapping space it will no longer be released, and pages
returned are kept in a linked list. This allows keeping a buffer of
pages and prevents resorting to frequent additions and removals of
regions.

If enabled (because ZONE_DEVICE is supported) the usage of the new
functionality untangles Xen balloon and RAM hotplug from the usage of
unpopulated physical memory ranges to map foreign pages, which is the
correct thing to do in order to avoid having mappings of foreign pages depend
on memory hotplug.

Signed-off-by: Roger Pau Monné 
---
I've not added a new memory_type type and just used
MEMORY_DEVICE_DEVDAX which seems to be what we want for such memory
regions. I'm unsure whether abusing this type is fine, or if I should
instead add a specific type, maybe MEMORY_DEVICE_GENERIC? I don't
think we should be using a specific Xen type at all.
---
Cc: Oleksandr Andrushchenko 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Dan Carpenter 
Cc: Roger Pau Monne 
Cc: Wei Liu 
Cc: Yan Yankovskyi 
Cc: dri-de...@lists.freedesktop.org
Cc: xen-de...@lists.xenproject.org
Cc: linux...@kvack.org
Cc: David Hildenbrand 
Cc: Michal Hocko 
---
 drivers/gpu/drm/xen/xen_drm_front_gem.c |   8 +-
 drivers/xen/Makefile|   1 +
 drivers/xen/balloon.c   |   4 +-
 drivers/xen/grant-table.c   |   4 +-
 drivers/xen/privcmd.c   |   4 +-
 drivers/xen/unpopulated-alloc.c | 222 
 drivers/xen/xenbus/xenbus_client.c  |   6 +-
 drivers/xen/xlate_mmu.c |   4 +-
 include/xen/xen.h   |   8 +
 9 files changed, 246 insertions(+), 15 deletions(-)
 create mode 100644 drivers/xen/unpopulated-alloc.c

diff --git a/drivers/gpu/drm/xen/xen_drm_front_gem.c 
b/drivers/gpu/drm/xen/xen_drm_front_gem.c
index f0b85e094111..9dd06eae767a 100644
--- a/drivers/gpu/drm/xen/xen_drm_front_gem.c
+++ b/drivers/gpu/drm/xen/xen_drm_front_gem.c
@@ -99,8 +99,8 @@ static struct xen_gem_object *gem_create(struct drm_device 
*dev, size_t size)
 * allocate ballooned pages which will be used to map
 * grant references provided by the backend
 */
-   ret = alloc_xenballooned_pages(xen_obj->num_pages,
-  xen_obj->pages);
+   ret = xen_alloc_unpopulated_pages(xen_obj->num_pages,
+ xen_obj->pages);
if (ret < 0) {
DRM_ERROR("Cannot allocate %zu ballooned pages: %d\n",
  xen_obj->num_pages, ret);
@@ -152,8 +152,8 @@ void xen_drm_front_gem_free_object_unlocked(struct 
drm_gem_object *gem_obj)
} else {
if (xen_obj->pages) {
if (xen_obj->be_alloc) {
-   free_xenballooned_pages(xen_obj->num_pages,
-   xen_obj->pages);
+   xen_free_unpopulated_pages(xen_obj->num_pages,
+  xen_obj->pages);
gem_free_pages_array(xen_obj);
} else {
drm_gem_put_pages(&xen_obj->base,
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index 0d322f3d90cd..788a5d9c8ef0 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -42,3 +42,4 @@ xen-gntdev-$(CONFIG_XEN_GNTDEV_DMABUF)+= 
gntdev-dmabuf.o
 xen-gntalloc-y := gntalloc.o
 xen-privcmd-y  := privcmd.o privcmd-buf.o
 obj-$(CONFIG_XEN_FRONT_PGDIR_SHBUF)+= xen-front-pgdir-shbuf.o
+obj-$(CONFIG_ZONE_DEVICE)  += unpopulated-alloc.o
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index b1d8b028bf80..815ef10eb2ff 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -654,7 +654,7 @@ void free_xenballooned_pages(int nr_pages, struct page 
**pages)
 }
 EXPORT_SYMBOL(free_xenballooned_pages);
 
-#ifdef CONFIG_XEN_PV
+#if defined(CONFIG_XEN_PV) && !defined(CONFIG_ZONE_DEVICE)
 static void __init balloon_add_region(unsigned long start_pfn,
  unsigned long pages)
 {
@@ -708,7 +708,7 @@ static int __init 

[PATCH v2 3/4] Revert "xen/balloon: Fix crash when ballooning on x86 32 bit PAE"

2020-07-24 Thread Roger Pau Monne
This reverts commit dfd74a1edfaba5864276a2859190a8d242d18952.

This has been fixed by commit dca4436d1cf9e0d237c which added the out
of bounds check to __add_memory, so that trying to add blocks past
MAX_PHYSMEM_BITS will fail.

Note the check in the Xen balloon driver was bogus anyway, as it
checked the start address of the resource, but it should instead test
the end address to assert the whole resource falls below
MAX_PHYSMEM_BITS.

Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/balloon.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 292413b27575..b1d8b028bf80 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -266,20 +266,6 @@ static struct resource 
*additional_memory_resource(phys_addr_t size)
return NULL;
}
 
-#ifdef CONFIG_SPARSEMEM
-   {
-   unsigned long limit = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);
-   unsigned long pfn = res->start >> PAGE_SHIFT;
-
-   if (pfn > limit) {
-   pr_err("New System RAM resource outside addressable RAM 
(%lu > %lu)\n",
-  pfn, limit);
-   release_memory_resource(res);
-   return NULL;
-   }
-   }
-#endif
-
return res;
 }
 
-- 
2.27.0



[PATCH v2 0/4] xen/balloon: fixes for memory hotplug

2020-07-24 Thread Roger Pau Monne
Hello,

The following series contains some fixes in order to split Xen
unpopulated memory handling from the ballooning driver if ZONE_DEVICE is
available, so that physical memory regions used to map foreign pages are
not tied to memory hotplug.

First two patches are bugfixes that IMO should be backported to stable
branches, third patch is a revert of a workaround applied to the balloon
driver and last patch introduces an interface based on ZONE_DEVICE in
order to manage regions to use for foreign mappings.

Thanks, Roger.

Roger Pau Monne (4):
  xen/balloon: fix accounting in alloc_xenballooned_pages error path
  xen/balloon: make the balloon wait interruptible
  Revert "xen/balloon: Fix crash when ballooning on x86 32 bit PAE"
  xen: add helpers to allocate unpopulated memory

 drivers/gpu/drm/xen/xen_drm_front_gem.c |   8 +-
 drivers/xen/Makefile|   1 +
 drivers/xen/balloon.c   |  30 ++--
 drivers/xen/grant-table.c   |   4 +-
 drivers/xen/privcmd.c   |   4 +-
 drivers/xen/unpopulated-alloc.c | 222 
 drivers/xen/xenbus/xenbus_client.c  |   6 +-
 drivers/xen/xlate_mmu.c |   4 +-
 include/xen/xen.h   |   8 +
 9 files changed, 256 insertions(+), 31 deletions(-)
 create mode 100644 drivers/xen/unpopulated-alloc.c

-- 
2.27.0



[PATCH v2 1/4] xen/balloon: fix accounting in alloc_xenballooned_pages error path

2020-07-24 Thread Roger Pau Monne
target_unpopulated is incremented with nr_pages at the start of the
function, but the call to free_xenballooned_pages will only subtract
pgno number of pages, and thus the rest need to be subtracted before
returning or else accounting will be skewed.

Signed-off-by: Roger Pau Monné 
Cc: sta...@vger.kernel.org
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/balloon.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 77c57568e5d7..3cb10ed32557 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -630,6 +630,12 @@ int alloc_xenballooned_pages(int nr_pages, struct page 
**pages)
  out_undo:
mutex_unlock(&balloon_mutex);
free_xenballooned_pages(pgno, pages);
+   /*
+* NB: free_xenballooned_pages will only subtract pgno pages, but since
+* target_unpopulated is incremented with nr_pages at the start we need
+* to remove the remaining ones also, or accounting will be screwed.
+*/
+   balloon_stats.target_unpopulated -= nr_pages - pgno;
return ret;
 }
 EXPORT_SYMBOL(alloc_xenballooned_pages);
-- 
2.27.0



[PATCH v2 2/4] xen/balloon: make the balloon wait interruptible

2020-07-24 Thread Roger Pau Monne
So it can be killed, or else processes can hang indefinitely
waiting for balloon pages.

Signed-off-by: Roger Pau Monné 
Cc: sta...@vger.kernel.org
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/balloon.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 3cb10ed32557..292413b27575 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -568,11 +568,13 @@ static int add_ballooned_pages(int nr_pages)
if (xen_hotplug_unpopulated) {
st = reserve_additional_memory();
if (st != BP_ECANCELED) {
+   int rc;
+
mutex_unlock(&balloon_mutex);
-   wait_event(balloon_wq,
+   rc = wait_event_interruptible(balloon_wq,
   !list_empty(&ballooned_pages));
mutex_lock(&balloon_mutex);
-   return 0;
+   return rc ? -ENOMEM : 0;
}
}
 
-- 
2.27.0



Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 05:10:03PM +0200, Jürgen Groß wrote:
> On 23.07.20 15:59, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 03:22:49PM +0200, David Hildenbrand wrote:
> > > On 23.07.20 14:23, Roger Pau Monné wrote:
> > > > On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> > > > > On 23.07.20 10:45, Roger Pau Monne wrote:
> > > > > > Add an extra option to add_memory_resource that overrides the memory
> > > > > > hotplug online behavior in order to force onlining of memory from
> > > > > > add_memory_resource unconditionally.
> > > > > > 
> > > > > > This is required for the Xen balloon driver, that must run the
> > > > > > online page callback in order to correctly process the newly added
> > > > > > memory region, note this is an unpopulated region that is used by 
> > > > > > Linux
> > > > > > to either hotplug RAM or to map foreign pages from other domains, 
> > > > > > and
> > > > > > hence memory hotplug when running on Xen can be used even without 
> > > > > > the
> > > > > > user explicitly requesting it, as part of the normal operations of 
> > > > > > the
> > > > > > OS when attempting to map memory from a different domain.
> > > > > > 
> > > > > > Setting a different default value of memhp_default_online_type when
> > > > > > attaching the balloon driver is not a robust solution, as the user 
> > > > > > (or
> > > > > > distro init scripts) could still change it and thus break the Xen
> > > > > > balloon driver.
> > > > > 
> > > > > I think we discussed this a couple of times before (even triggered by 
> > > > > my
> > > > > request), and this is responsibility of user space to configure. 
> > > > > Usually
> > > > > distros have udev rules to online memory automatically. Especially, 
> > > > > user
> > > > > space should eb able to configure *how* to online memory.
> > > > 
> > > > Note (as per the commit message) that in the specific case I'm
> > > > referring to the memory hotplugged by the Xen balloon driver will be
> > > > an unpopulated range to be used internally by certain Xen subsystems,
> > > > like the xen-blkback or the privcmd drivers. The addition of such
> > > > blocks of (unpopulated) memory can happen without the user explicitly
> > > > requesting it, and hence not even aware such hotplug process is taking
> > > > place. To be clear: no actual RAM will be added to the system.
> > > 
> > > Okay, but there is also the case where XEN will actually hotplug memory
> > > using this same handler IIRC (at least I've read papers about it). Both
> > > are using the same handler, correct?
> > 
> > Yes, it's used by this dual purpose, which I have to admit I don't
> > like that much either.
> > 
> > One set of pages should be clearly used for RAM memory hotplug, and
> > the other to map foreign pages that are not related to memory hotplug,
> > it's just that we happen to need a physical region with backing struct
> > pages.
> > 
> > > > 
> > > > > It's the admin/distro responsibility to configure this properly. In 
> > > > > case
> > > > > this doesn't happen (or as you say, users change it), bad luck.
> > > > > 
> > > > > E.g., virtio-mem takes care to not add more memory in case it is not
> > > > > getting onlined. I remember hyper-v has similar code to at least wait 
> > > > > a
> > > > > bit for memory to get onlined.
> > > > 
> > > > I don't think VirtIO or Hyper-V use the hotplug system in the same way
> > > > as Xen, as said this is done to add unpopulated memory regions that
> > > > will be used to map foreign memory (from other domains) by Xen drivers
> > > > on the system.
> > > 
> > > Indeed, if the memory is never exposed to the buddy (and all you need is
> > > struct pages +  a kernel virtual mapping), I wonder if
> > > memremap/ZONE_DEVICE is what you want?
> > 
> > I'm certainly not familiar with the Linux memory subsystem, but if
> > that gets us a backing struct page and a kernel mapping then I would
> > say yes.
> > 
> > > Then you won't have user-visible
>

Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 03:22:49PM +0200, David Hildenbrand wrote:
> On 23.07.20 14:23, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> >> On 23.07.20 10:45, Roger Pau Monne wrote:
> >>> Add an extra option to add_memory_resource that overrides the memory
> >>> hotplug online behavior in order to force onlining of memory from
> >>> add_memory_resource unconditionally.
> >>>
> >>> This is required for the Xen balloon driver, that must run the
> >>> online page callback in order to correctly process the newly added
> >>> memory region, note this is an unpopulated region that is used by Linux
> >>> to either hotplug RAM or to map foreign pages from other domains, and
> >>> hence memory hotplug when running on Xen can be used even without the
> >>> user explicitly requesting it, as part of the normal operations of the
> >>> OS when attempting to map memory from a different domain.
> >>>
> >>> Setting a different default value of memhp_default_online_type when
> >>> attaching the balloon driver is not a robust solution, as the user (or
> >>> distro init scripts) could still change it and thus break the Xen
> >>> balloon driver.
> >>
> >> I think we discussed this a couple of times before (even triggered by my
> >> request), and this is responsibility of user space to configure. Usually
> >> distros have udev rules to online memory automatically. Especially, user
> >> space should eb able to configure *how* to online memory.
> > 
> > Note (as per the commit message) that in the specific case I'm
> > referring to the memory hotplugged by the Xen balloon driver will be
> > an unpopulated range to be used internally by certain Xen subsystems,
> > like the xen-blkback or the privcmd drivers. The addition of such
> > blocks of (unpopulated) memory can happen without the user explicitly
> > requesting it, and hence not even aware such hotplug process is taking
> > place. To be clear: no actual RAM will be added to the system.
> 
> Okay, but there is also the case where XEN will actually hotplug memory
> using this same handler IIRC (at least I've read papers about it). Both
> are using the same handler, correct?

Yes, it's used for this dual purpose, which I have to admit I don't
like that much either.

One set of pages should be clearly used for RAM memory hotplug, and
the other to map foreign pages that are not related to memory hotplug,
it's just that we happen to need a physical region with backing struct
pages.

> > 
> >> It's the admin/distro responsibility to configure this properly. In case
> >> this doesn't happen (or as you say, users change it), bad luck.
> >>
> >> E.g., virtio-mem takes care to not add more memory in case it is not
> >> getting onlined. I remember hyper-v has similar code to at least wait a
> >> bit for memory to get onlined.
> > 
> > I don't think VirtIO or Hyper-V use the hotplug system in the same way
> > as Xen, as said this is done to add unpopulated memory regions that
> > will be used to map foreign memory (from other domains) by Xen drivers
> > on the system.
> 
> Indeed, if the memory is never exposed to the buddy (and all you need is
> struct pages +  a kernel virtual mapping), I wonder if
> memremap/ZONE_DEVICE is what you want?

I'm certainly not familiar with the Linux memory subsystem, but if
that gets us a backing struct page and a kernel mapping then I would
say yes.

> Then you won't have user-visible
> memory blocks created with unclear online semantics, partially involving
> the buddy.

Seems like a fine solution.
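
A minimal sketch of what I understand that to mean (a dev_pagemap over a
freshly allocated chunk of unused physical address space; error handling
omitted):

    struct dev_pagemap *pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
    void *vaddr;

    pgmap->type = MEMORY_DEVICE_DEVDAX;     /* or a new generic type */
    pgmap->res.name = "Xen scratch";
    pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
    allocate_resource(&iomem_resource, &pgmap->res, size, 0, -1,
                      PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL);

    /* creates struct pages and a kernel virtual mapping for the range */
    vaddr = memremap_pages(pgmap, memory_add_physaddr_to_nid(pgmap->res.start));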

Juergen: would you be OK to use a separate page-list for
alloc_xenballooned_pages on HVM/PVH using the logic described by
David?

I guess I would leave PV as-is, since it already has this reserved
region to map foreign pages.

Thanks, Roger.


Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 03:20:55PM +0200, Jürgen Groß wrote:
> On 23.07.20 15:08, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 02:28:13PM +0200, Jürgen Groß wrote:
> > > On 23.07.20 14:23, Roger Pau Monné wrote:
> > > > On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> > > > > On 23.07.20 10:45, Roger Pau Monne wrote:
> > > > > > Add an extra option to add_memory_resource that overrides the memory
> > > > > > hotplug online behavior in order to force onlining of memory from
> > > > > > add_memory_resource unconditionally.
> > > > > > 
> > > > > > This is required for the Xen balloon driver, that must run the
> > > > > > online page callback in order to correctly process the newly added
> > > > > > memory region, note this is an unpopulated region that is used by 
> > > > > > Linux
> > > > > > to either hotplug RAM or to map foreign pages from other domains, 
> > > > > > and
> > > > > > hence memory hotplug when running on Xen can be used even without 
> > > > > > the
> > > > > > user explicitly requesting it, as part of the normal operations of 
> > > > > > the
> > > > > > OS when attempting to map memory from a different domain.
> > > > > > 
> > > > > > Setting a different default value of memhp_default_online_type when
> > > > > > attaching the balloon driver is not a robust solution, as the user 
> > > > > > (or
> > > > > > distro init scripts) could still change it and thus break the Xen
> > > > > > balloon driver.
> > > > > 
> > > > > I think we discussed this a couple of times before (even triggered by 
> > > > > my
> > > > > request), and this is responsibility of user space to configure. 
> > > > > Usually
> > > > > distros have udev rules to online memory automatically. Especially, 
> > > > > user
> > > > > space should eb able to configure *how* to online memory.
> > > > 
> > > > Note (as per the commit message) that in the specific case I'm
> > > > referring to the memory hotplugged by the Xen balloon driver will be
> > > > an unpopulated range to be used internally by certain Xen subsystems,
> > > > like the xen-blkback or the privcmd drivers. The addition of such
> > > > blocks of (unpopulated) memory can happen without the user explicitly
> > > > requesting it, and hence not even aware such hotplug process is taking
> > > > place. To be clear: no actual RAM will be added to the system.
> > > > 
> > > > Failure to online such blocks using the Xen specific online handler
> > > > (which does not handle back the memory to the allocator in any way)
> > > > will result in the system getting stuck and malfunctioning.
> > > > 
> > > > > It's the admin/distro responsibility to configure this properly. In 
> > > > > case
> > > > > this doesn't happen (or as you say, users change it), bad luck.
> > > > > 
> > > > > E.g., virtio-mem takes care to not add more memory in case it is not
> > > > > getting onlined. I remember hyper-v has similar code to at least wait 
> > > > > a
> > > > > bit for memory to get onlined.
> > > > 
> > > > I don't think VirtIO or Hyper-V use the hotplug system in the same way
> > > > as Xen, as said this is done to add unpopulated memory regions that
> > > > will be used to map foreign memory (from other domains) by Xen drivers
> > > > on the system.
> > > > 
> > > > Maybe this should somehow use a different mechanism to hotplug such
> > > > empty memory blocks? I don't mind doing this differently, but I would
> > > > need some pointers. Allowing user-space to change a (seemingly
> > > > unrelated) parameter and as a result produce failures on Xen drivers
> > > > is not an acceptable solution IMO.
> > > 
> > > Maybe we can use the same approach as Xen PV-domains: pre-allocate a
> > > region in the memory map to be used for mapping foreign pages. For the
> > > kernel it will look like pre-ballooned memory, so it will create struct
> > > page for the region (which is what we are after), but it won't give the
> > > memory to the allocator.
> > 
> > IMO u

Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 03:14:31PM +0200, David Hildenbrand wrote:
> On 23.07.20 15:08, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 02:28:13PM +0200, Jürgen Groß wrote:
> >> On 23.07.20 14:23, Roger Pau Monné wrote:
> >>> On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> >>>> On 23.07.20 10:45, Roger Pau Monne wrote:
> >>>>> Add an extra option to add_memory_resource that overrides the memory
> >>>>> hotplug online behavior in order to force onlining of memory from
> >>>>> add_memory_resource unconditionally.
> >>>>>
> >>>>> This is required for the Xen balloon driver, that must run the
> >>>>> online page callback in order to correctly process the newly added
> >>>>> memory region, note this is an unpopulated region that is used by Linux
> >>>>> to either hotplug RAM or to map foreign pages from other domains, and
> >>>>> hence memory hotplug when running on Xen can be used even without the
> >>>>> user explicitly requesting it, as part of the normal operations of the
> >>>>> OS when attempting to map memory from a different domain.
> >>>>>
> >>>>> Setting a different default value of memhp_default_online_type when
> >>>>> attaching the balloon driver is not a robust solution, as the user (or
> >>>>> distro init scripts) could still change it and thus break the Xen
> >>>>> balloon driver.
> >>>>
> >>>> I think we discussed this a couple of times before (even triggered by my
> >>>> request), and this is responsibility of user space to configure. Usually
> >>>> distros have udev rules to online memory automatically. Especially, user
> >>>> space should eb able to configure *how* to online memory.
> >>>
> >>> Note (as per the commit message) that in the specific case I'm
> >>> referring to the memory hotplugged by the Xen balloon driver will be
> >>> an unpopulated range to be used internally by certain Xen subsystems,
> >>> like the xen-blkback or the privcmd drivers. The addition of such
> >>> blocks of (unpopulated) memory can happen without the user explicitly
> >>> requesting it, and hence not even aware such hotplug process is taking
> >>> place. To be clear: no actual RAM will be added to the system.
> >>>
> >>> Failure to online such blocks using the Xen specific online handler
> >>> (which does not handle back the memory to the allocator in any way)
> >>> will result in the system getting stuck and malfunctioning.
> >>>
> >>>> It's the admin/distro responsibility to configure this properly. In case
> >>>> this doesn't happen (or as you say, users change it), bad luck.
> >>>>
> >>>> E.g., virtio-mem takes care to not add more memory in case it is not
> >>>> getting onlined. I remember hyper-v has similar code to at least wait a
> >>>> bit for memory to get onlined.
> >>>
> >>> I don't think VirtIO or Hyper-V use the hotplug system in the same way
> >>> as Xen, as said this is done to add unpopulated memory regions that
> >>> will be used to map foreign memory (from other domains) by Xen drivers
> >>> on the system.
> >>>
> >>> Maybe this should somehow use a different mechanism to hotplug such
> >>> empty memory blocks? I don't mind doing this differently, but I would
> >>> need some pointers. Allowing user-space to change a (seemingly
> >>> unrelated) parameter and as a result produce failures on Xen drivers
> >>> is not an acceptable solution IMO.
> >>
> >> Maybe we can use the same approach as Xen PV-domains: pre-allocate a
> >> region in the memory map to be used for mapping foreign pages. For the
> >> kernel it will look like pre-ballooned memory, so it will create struct
> >> page for the region (which is what we are after), but it won't give the
> >> memory to the allocator.
> > 
> > IMO using something similar to memory hotplug would give us more
> > flexibility, and TBH the logic is already there in the balloon driver.
> > It seems quite wasteful to allocate such region(s) beforehand for all
> > domains, even when most of them won't end up using foreign mappings at
> > all.
> 
> I do wonder why these issues you describe start to pop up now, literally
> years after this stuff has been imp

Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 02:28:13PM +0200, Jürgen Groß wrote:
> On 23.07.20 14:23, Roger Pau Monné wrote:
> > On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> > > On 23.07.20 10:45, Roger Pau Monne wrote:
> > > > Add an extra option to add_memory_resource that overrides the memory
> > > > hotplug online behavior in order to force onlining of memory from
> > > > add_memory_resource unconditionally.
> > > > 
> > > > This is required for the Xen balloon driver, that must run the
> > > > online page callback in order to correctly process the newly added
> > > > memory region, note this is an unpopulated region that is used by Linux
> > > > to either hotplug RAM or to map foreign pages from other domains, and
> > > > hence memory hotplug when running on Xen can be used even without the
> > > > user explicitly requesting it, as part of the normal operations of the
> > > > OS when attempting to map memory from a different domain.
> > > > 
> > > > Setting a different default value of memhp_default_online_type when
> > > > attaching the balloon driver is not a robust solution, as the user (or
> > > > distro init scripts) could still change it and thus break the Xen
> > > > balloon driver.
> > > 
> > > I think we discussed this a couple of times before (even triggered by my
> > > request), and this is responsibility of user space to configure. Usually
> > > distros have udev rules to online memory automatically. Especially, user
> > > space should eb able to configure *how* to online memory.
> > 
> > Note (as per the commit message) that in the specific case I'm
> > referring to the memory hotplugged by the Xen balloon driver will be
> > an unpopulated range to be used internally by certain Xen subsystems,
> > like the xen-blkback or the privcmd drivers. The addition of such
> > blocks of (unpopulated) memory can happen without the user explicitly
> > requesting it, and hence not even aware such hotplug process is taking
> > place. To be clear: no actual RAM will be added to the system.
> > 
> > Failure to online such blocks using the Xen specific online handler
> > (which does not handle back the memory to the allocator in any way)
> > will result in the system getting stuck and malfunctioning.
> > 
> > > It's the admin/distro responsibility to configure this properly. In case
> > > this doesn't happen (or as you say, users change it), bad luck.
> > > 
> > > E.g., virtio-mem takes care to not add more memory in case it is not
> > > getting onlined. I remember hyper-v has similar code to at least wait a
> > > bit for memory to get onlined.
> > 
> > I don't think VirtIO or Hyper-V use the hotplug system in the same way
> > as Xen, as said this is done to add unpopulated memory regions that
> > will be used to map foreign memory (from other domains) by Xen drivers
> > on the system.
> > 
> > Maybe this should somehow use a different mechanism to hotplug such
> > empty memory blocks? I don't mind doing this differently, but I would
> > need some pointers. Allowing user-space to change a (seemingly
> > unrelated) parameter and as a result produce failures on Xen drivers
> > is not an acceptable solution IMO.
> 
> Maybe we can use the same approach as Xen PV-domains: pre-allocate a
> region in the memory map to be used for mapping foreign pages. For the
> kernel it will look like pre-ballooned memory, so it will create struct
> page for the region (which is what we are after), but it won't give the
> memory to the allocator.

IMO using something similar to memory hotplug would give us more
flexibility, and TBH the logic is already there in the balloon driver.
It seems quite wasteful to allocate such region(s) beforehand for all
domains, even when most of them won't end up using foreign mappings at
all.

Anyway, I'm going to take a look at how to do that, I guess it's going
to involve playing with the memory map and reserving some space.

I suggest we should remove the Xen balloon hotplug logic, as it's not
working properly and we don't have a plan to fix it.

Thanks, Roger.


Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monné
On Thu, Jul 23, 2020 at 01:37:03PM +0200, David Hildenbrand wrote:
> On 23.07.20 10:45, Roger Pau Monne wrote:
> > Add an extra option to add_memory_resource that overrides the memory
> > hotplug online behavior in order to force onlining of memory from
> > add_memory_resource unconditionally.
> > 
> > This is required for the Xen balloon driver, that must run the
> > online page callback in order to correctly process the newly added
> > memory region, note this is an unpopulated region that is used by Linux
> > to either hotplug RAM or to map foreign pages from other domains, and
> > hence memory hotplug when running on Xen can be used even without the
> > user explicitly requesting it, as part of the normal operations of the
> > OS when attempting to map memory from a different domain.
> > 
> > Setting a different default value of memhp_default_online_type when
> > attaching the balloon driver is not a robust solution, as the user (or
> > distro init scripts) could still change it and thus break the Xen
> > balloon driver.
> 
> I think we discussed this a couple of times before (even triggered by my
> request), and this is responsibility of user space to configure. Usually
> distros have udev rules to online memory automatically. Especially, user
> space should eb able to configure *how* to online memory.

Note (as per the commit message) that in the specific case I'm
referring to, the memory hotplugged by the Xen balloon driver will be
an unpopulated range to be used internally by certain Xen subsystems,
like the xen-blkback or the privcmd drivers. The addition of such
blocks of (unpopulated) memory can happen without the user explicitly
requesting it, and hence not even aware such hotplug process is taking
place. To be clear: no actual RAM will be added to the system.

Failure to online such blocks using the Xen specific online handler
(which does not hand the memory back to the allocator in any way)
will result in the system getting stuck and malfunctioning.
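
For context, that handler is hooked in via set_online_page_callback() and,
simplified from drivers/xen/balloon.c, just queues the pages on the balloon
list instead of releasing them to the page allocator:

    static void xen_online_page(struct page *page, unsigned int order)
    {
            unsigned long i, size = 1UL << order;
            unsigned long start_pfn = page_to_pfn(page);

            mutex_lock(&balloon_mutex);
            for (i = 0; i < size; i++)
                    balloon_append(pfn_to_page(start_pfn + i));
            mutex_unlock(&balloon_mutex);
    }

    /* registered at init time: */
    set_online_page_callback(&xen_online_page);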

> It's the admin/distro responsibility to configure this properly. In case
> this doesn't happen (or as you say, users change it), bad luck.
> 
> E.g., virtio-mem takes care to not add more memory in case it is not
> getting onlined. I remember hyper-v has similar code to at least wait a
> bit for memory to get onlined.

I don't think VirtIO or Hyper-V use the hotplug system in the same way
as Xen, as said this is done to add unpopulated memory regions that
will be used to map foreign memory (from other domains) by Xen drivers
on the system.

Maybe this should somehow use a different mechanism to hotplug such
empty memory blocks? I don't mind doing this differently, but I would
need some pointers. Allowing user-space to change a (seemingly
unrelated) parameter and as a result produce failures on Xen drivers
is not an acceptable solution IMO.

Thanks, Roger.


[PATCH 1/3] xen/balloon: fix accounting in alloc_xenballooned_pages error path

2020-07-23 Thread Roger Pau Monne
target_unpopulated is incremented with nr_pages at the start of the
function, but the call to free_xenballooned_pages will only subtract
pgno number of pages, and thus the rest need to be subtracted before
returning or else accounting will be skewed.

Signed-off-by: Roger Pau Monné 
Cc: sta...@vger.kernel.org
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/balloon.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 77c57568e5d7..3cb10ed32557 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -630,6 +630,12 @@ int alloc_xenballooned_pages(int nr_pages, struct page 
**pages)
  out_undo:
mutex_unlock(&balloon_mutex);
free_xenballooned_pages(pgno, pages);
+   /*
+* NB: free_xenballooned_pages will only subtract pgno pages, but since
+* target_unpopulated is incremented with nr_pages at the start we need
+* to remove the remaining ones also, or accounting will be screwed.
+*/
+   balloon_stats.target_unpopulated -= nr_pages - pgno;
return ret;
 }
 EXPORT_SYMBOL(alloc_xenballooned_pages);
-- 
2.27.0



[PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-23 Thread Roger Pau Monne
Add an extra option to add_memory_resource that overrides the memory
hotplug online behavior in order to force onlining of memory from
add_memory_resource unconditionally.

This is required for the Xen balloon driver, which must run the
online page callback in order to correctly process the newly added
memory region. Note this is an unpopulated region that is used by Linux
either to hotplug RAM or to map foreign pages from other domains, and
hence memory hotplug when running on Xen can be used even without the
user explicitly requesting it, as part of the normal operations of the
OS when attempting to map memory from a different domain.

Setting a different default value of memhp_default_online_type when
attaching the balloon driver is not a robust solution, as the user (or
distro init scripts) could still change it and thus break the Xen
balloon driver.

Signed-off-by: Roger Pau Monné 
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: xen-de...@lists.xenproject.org
Cc: linux...@kvack.org
---
 drivers/xen/balloon.c  |  2 +-
 include/linux/memory_hotplug.h |  3 ++-
 mm/memory_hotplug.c| 16 ++--
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 292413b27575..fe0e0c76834b 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -346,7 +346,7 @@ static enum bp_state reserve_additional_memory(void)
mutex_unlock(&balloon_mutex);
/* add_memory_resource() requires the device_hotplug lock */
lock_device_hotplug();
-   rc = add_memory_resource(nid, resource);
+   rc = add_memory_resource(nid, resource, true);
unlock_device_hotplug();
mutex_lock(&balloon_mutex);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 375515803cd8..1793619fe4a6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -342,7 +342,8 @@ extern void clear_zone_contiguous(struct zone *zone);
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int __add_memory(int nid, u64 start, u64 size);
 extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource);
+extern int add_memory_resource(int nid, struct resource *resource,
+  bool force_online);
 extern int add_memory_driver_managed(int nid, u64 start, u64 size,
 const char *resource_name);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da374cd3d45b..2491588d3f86 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1002,7 +1002,10 @@ static int check_hotplug_memory_range(u64 start, u64 size)
 
 static int online_memory_block(struct memory_block *mem, void *arg)
 {
-   mem->online_type = memhp_default_online_type;
+   bool force_online = arg;
+
+   mem->online_type = force_online ? MMOP_ONLINE
+   : memhp_default_online_type;
	return device_online(&mem->dev);
 }
 
@@ -1012,7 +1015,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int __ref add_memory_resource(int nid, struct resource *res)
+int __ref add_memory_resource(int nid, struct resource *res, bool force_online)
 {
struct mhp_params params = { .pgprot = PAGE_KERNEL };
u64 start, size;
@@ -1076,8 +1079,9 @@ int __ref add_memory_resource(int nid, struct resource *res)
mem_hotplug_done();
 
/* online pages if requested */
-   if (memhp_default_online_type != MMOP_OFFLINE)
-   walk_memory_blocks(start, size, NULL, online_memory_block);
+   if (memhp_default_online_type != MMOP_OFFLINE || force_online)
+   walk_memory_blocks(start, size, (void *)force_online,
+  online_memory_block);
 
return ret;
 error:
@@ -1100,7 +1104,7 @@ int __ref __add_memory(int nid, u64 start, u64 size)
if (IS_ERR(res))
return PTR_ERR(res);
 
-   ret = add_memory_resource(nid, res);
+   ret = add_memory_resource(nid, res, false);
if (ret < 0)
release_memory_resource(res);
return ret;
@@ -1158,7 +1162,7 @@ int add_memory_driver_managed(int nid, u64 start, u64 size,
goto out_unlock;
}
 
-   rc = add_memory_resource(nid, res);
+   rc = add_memory_resource(nid, res, false);
if (rc < 0)
release_memory_resource(res);
 
-- 
2.27.0



[PATCH 2/3] xen/balloon: make the balloon wait interruptible

2020-07-23 Thread Roger Pau Monne
Make the wait in add_ballooned_pages interruptible so that the waiting
process can be killed; otherwise processes can hang indefinitely
waiting for balloon pages.

Signed-off-by: Roger Pau Monné 
Cc: sta...@vger.kernel.org
---
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
---
 drivers/xen/balloon.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 3cb10ed32557..292413b27575 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -568,11 +568,13 @@ static int add_ballooned_pages(int nr_pages)
if (xen_hotplug_unpopulated) {
st = reserve_additional_memory();
if (st != BP_ECANCELED) {
+   int rc;
+
	mutex_unlock(&balloon_mutex);
-   wait_event(balloon_wq,
+   rc = wait_event_interruptible(balloon_wq,
			   !list_empty(&ballooned_pages));
	mutex_lock(&balloon_mutex);
-   return 0;
+   return rc ? -ENOMEM : 0;
}
}
 
-- 
2.27.0



[PATCH 0/3] xen/balloon: fixes for memory hotplug

2020-07-23 Thread Roger Pau Monne
Hello,

The following series contains some fixes to make Xen balloon memory
hotplug work properly. The first two patches are bugfixes that IMO
should be backported to stable branches; the last patch might be more
controversial (to backport) since it includes a small change to the
generic memory hotplug interface.

Thanks, Roger.

Roger Pau Monne (3):
  xen/balloon: fix accounting in alloc_xenballooned_pages error path
  xen/balloon: make the balloon wait interruptible
  memory: introduce an option to force onlining of hotplug memory

 drivers/xen/balloon.c          | 14 +++---
 include/linux/memory_hotplug.h |  3 ++-
 mm/memory_hotplug.c            | 16 ++--
 3 files changed, 23 insertions(+), 10 deletions(-)

-- 
2.27.0



Re: [PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-22 Thread Roger Pau Monné
On Tue, Jul 21, 2020 at 07:55:09PM +, Anchal Agarwal wrote:
> On Tue, Jul 21, 2020 at 10:30:18AM +0200, Roger Pau Monné wrote:
> > 
> > Marek: I'm adding you in case you could be able to give this a try and
> > make sure it doesn't break suspend for dom0.
> > 
> > On Tue, Jul 21, 2020 at 12:17:36AM +, Anchal Agarwal wrote:
> > > On Mon, Jul 20, 2020 at 11:37:05AM +0200, Roger Pau Monné wrote:
> > > >
> > > > On Sat, Jul 18, 2020 at 09:47:04PM -0400, Boris Ostrovsky wrote:
> > > > > (Roger, question for you at the very end)
> > > > >
> > > > > On 7/17/20 3:10 PM, Anchal Agarwal wrote:
> > > > > > On Wed, Jul 15, 2020 at 05:18:08PM -0400, Boris Ostrovsky wrote:
> > > > > >>
> > > > > >> On 7/15/20 4:49 PM, Anchal Agarwal wrote:
> > > > > >>> On Mon, Jul 13, 2020 at 11:52:01AM -0400, Boris Ostrovsky wrote:
> > > > > >>>>
> > > > > >>>> On 7/2/20 2:21 PM, Anchal Agarwal wrote:
> > > > > >>>> And PVH dom0.
> > > > > >>> That's another good use case to make it work with however, I still
> > > > > >>> think that should be tested/worked upon separately as the feature 
> > > > > >>> itself
> > > > > >>> (PVH Dom0) is very new.
> > > > > >>
> > > > > >> Same question here --- will this break PVH dom0?
> > > > > >>
> > > > > > I haven't tested it as a part of this series. Is that a blocker 
> > > > > > here?
> > > > >
> > > > >
> > > > > I suspect dom0 will not do well now as far as hibernation goes, in 
> > > > > which
> > > > > case you are not breaking anything.
> > > > >
> > > > >
> > > > > Roger?
> > > >
> > > > I sadly don't have any box ATM that supports hibernation where I
> > > > could test it. We have hibernation support for PV dom0, so while I
> > > > haven't done anything specific to support or test hibernation on PVH
> > > > dom0 I would at least aim to not make this any worse, and hence the
> > > > check should at least also fail for a PVH dom0?
> > > >
> > > > if (!xen_hvm_domain() || xen_initial_domain())
> > > > return -ENODEV;
> > > >
> > > > Ie: none of this should be applied to a PVH dom0, as it doesn't have
> > > > PV devices and hence should follow the bare metal device suspend.
> > > >
> > > So from what I understand you meant for any guest running on pvh dom0 
> > > should not
> > > hibernate if hibernation is triggered from within the guest or should 
> > > they?
> > 
> > Er no to both I think. What I meant is that a PVH dom0 should be able
> > to properly suspend, and we should make sure this work doesn't make
> > this any harder (or breaks it if it's currently working).
> > 
> > Or at least that's how I understood the question raised by Boris.
> > 
> > You are adding code to the generic suspend path that's also used by dom0
> > in order to perform bare metal suspension. This is fine now for a PV
> > dom0 because the code is gated on xen_hvm_domain, but you should also
> > take into account that a PVH dom0 is considered a HVM domain, and
> > hence will get the notifier registered.
> >
> Ok that makes sense now. This is good to be safe, but my patch series is only 
> to
> support domU hibernation, so I am not sure if this will affect pvh dom0.
> However, since I do not have a good way of testing sure I will add the check.
> 
> Moreover, in Patch-0004, I do register suspend/resume syscore_ops 
> specifically for domU
> hibernation only if its xen_hvm_domain.

So if the hooks are only registered for domU, do you still need this
xen_hvm_domain check here?

I have to admit I'm not familiar with Linux PM suspend.

> I don't see any reason that should not
> be registered for domU's running on pvh dom0.

To be clear: it should be registered for all HVM domUs, regardless of
whether they are running on a PV or a PVH dom0. My intention was never
to suggest otherwise. It should be enabled for all HVM domUs, but
shouldn't be enabled for HVM dom0.
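
Something along these lines is what I have in mind (a rough sketch
only, with made-up function and hook names rather than the ones from
your series, just to show where the check would live):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/syscore_ops.h>
#include <xen/xen.h>

/* Placeholder hooks, standing in for the series' suspend/resume work. */
static int xen_hvm_syscore_suspend(void)
{
	/* Freeze PV devices, save state, ... */
	return 0;
}

static void xen_hvm_syscore_resume(void)
{
	/* Restore PV devices, ... */
}

static struct syscore_ops xen_hvm_syscore_ops = {
	.suspend = xen_hvm_syscore_suspend,
	.resume  = xen_hvm_syscore_resume,
};

static int __init xen_setup_hvm_hibernation(void)
{
	/* HVM/PVH domUs only: bail out for PV domains and any kind of dom0. */
	if (!xen_hvm_domain() || xen_initial_domain())
		return -ENODEV;

	register_syscore_ops(&xen_hvm_syscore_ops);
	return 0;
}
subsys_initcall(xen_setup_hvm_hibernation);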

> Those suspend/resume callbacks will
> only be invoked in case hibernation and will be skipped if generic suspend 
> path
> is in progress. Do you see any issue with that?

No, I think it's fine.

Roger.


Re: [PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-21 Thread Roger Pau Monné
Marek: I'm adding you in case you could be able to give this a try and
make sure it doesn't break suspend for dom0.

On Tue, Jul 21, 2020 at 12:17:36AM +, Anchal Agarwal wrote:
> On Mon, Jul 20, 2020 at 11:37:05AM +0200, Roger Pau Monné wrote:
> > 
> > On Sat, Jul 18, 2020 at 09:47:04PM -0400, Boris Ostrovsky wrote:
> > > (Roger, question for you at the very end)
> > >
> > > On 7/17/20 3:10 PM, Anchal Agarwal wrote:
> > > > On Wed, Jul 15, 2020 at 05:18:08PM -0400, Boris Ostrovsky wrote:
> > > >>
> > > >> On 7/15/20 4:49 PM, Anchal Agarwal wrote:
> > > >>> On Mon, Jul 13, 2020 at 11:52:01AM -0400, Boris Ostrovsky wrote:
> > > >>>>
> > > >>>> On 7/2/20 2:21 PM, Anchal Agarwal wrote:
> > > >>>> And PVH dom0.
> > > >>> That's another good use case to make it work with however, I still
> > > >>> think that should be tested/worked upon separately as the feature 
> > > >>> itself
> > > >>> (PVH Dom0) is very new.
> > > >>
> > > >> Same question here --- will this break PVH dom0?
> > > >>
> > > > I haven't tested it as a part of this series. Is that a blocker here?
> > >
> > >
> > > I suspect dom0 will not do well now as far as hibernation goes, in which
> > > case you are not breaking anything.
> > >
> > >
> > > Roger?
> > 
> > I sadly don't have any box ATM that supports hibernation where I
> > could test it. We have hibernation support for PV dom0, so while I
> > haven't done anything specific to support or test hibernation on PVH
> > dom0 I would at least aim to not make this any worse, and hence the
> > check should at least also fail for a PVH dom0?
> > 
> > if (!xen_hvm_domain() || xen_initial_domain())
> > return -ENODEV;
> > 
> > Ie: none of this should be applied to a PVH dom0, as it doesn't have
> > PV devices and hence should follow the bare metal device suspend.
> >
> So from what I understand you meant for any guest running on pvh dom0 should 
> not 
> hibernate if hibernation is triggered from within the guest or should they?

Er no to both I think. What I meant is that a PVH dom0 should be able
to properly suspend, and we should make sure this work doesn't make
this any harder (or breaks it if it's currently working).

Or at least that's how I understood the question raised by Boris.

You are adding code to the generic suspend path that's also used by dom0
in order to perform bare metal suspension. This is fine now for a PV
dom0 because the code is gated on xen_hvm_domain, but you should also
take into account that a PVH dom0 is considered a HVM domain, and
hence will get the notifier registered.

> > Also I would contact the QubesOS guys, they rely heavily on the
> > suspend feature for dom0, and that's something not currently tested by
> > osstest so any breakages there go unnoticed.
> > 
> Was this for me or Boris? If its the former then I have no idea how to?

I've now added Marek.

Roger.


Re: [PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-20 Thread Roger Pau Monné
On Sat, Jul 18, 2020 at 09:47:04PM -0400, Boris Ostrovsky wrote:
> (Roger, question for you at the very end)
> 
> On 7/17/20 3:10 PM, Anchal Agarwal wrote:
> > On Wed, Jul 15, 2020 at 05:18:08PM -0400, Boris Ostrovsky wrote:
> >>
> >> On 7/15/20 4:49 PM, Anchal Agarwal wrote:
> >>> On Mon, Jul 13, 2020 at 11:52:01AM -0400, Boris Ostrovsky wrote:
> 
>  On 7/2/20 2:21 PM, Anchal Agarwal wrote:
>  And PVH dom0.
> >>> That's another good use case to make it work with however, I still
> >>> think that should be tested/worked upon separately as the feature itself
> >>> (PVH Dom0) is very new.
> >>
> >> Same question here --- will this break PVH dom0?
> >>
> > I haven't tested it as a part of this series. Is that a blocker here?
> 
> 
> I suspect dom0 will not do well now as far as hibernation goes, in which
> case you are not breaking anything.
> 
> 
> Roger?

I sadly don't have any box ATM that supports hibernation where I
could test it. We have hibernation support for PV dom0, so while I
haven't done anything specific to support or test hibernation on PVH
dom0 I would at least aim to not make this any worse, and hence the
check should at least also fail for a PVH dom0?

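/* Bail out for PV domains and for any kind of dom0 (PV or PVH). */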
if (!xen_hvm_domain() || xen_initial_domain())
return -ENODEV;

Ie: none of this should be applied to a PVH dom0, as it doesn't have
PV devices and hence should follow the bare metal device suspend.

Also I would contact the QubesOS guys, they rely heavily on the
suspend feature for dom0, and that's something not currently tested by
osstest so any breakages there go unnoticed.

Thanks, Roger.


Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-30 Thread Roger Pau Monné
On Mon, Jun 29, 2020 at 07:20:35PM +, Anchal Agarwal wrote:
> On Fri, Jun 26, 2020 at 11:12:39AM +0200, Roger Pau Monné wrote:
> > So the frontend should do:
> > 
> > - Switch to Closed state (and cleanup everything required).
> > - Wait for backend to switch to Closed state (must be done
> >   asynchronously, handled in blkback_changed).
> > - Switch frontend to XenbusStateInitialising, that will in turn force
> >   the backend to switch to XenbusStateInitWait.
> > - After that it should just follow the normal connection procedure.
> > 
> > I think the part that's missing is the frontend doing the state change
> > to XenbusStateInitialising when the backend switches to the Closed
> > state.
> > 
> > > I was of the view we may just want to mark frontend closed which should do
> > > the job of freeing resources and then following the same flow as
> > > blkfront_restore. That does not seems to work correctly 100% of the time.
> > 
> > I think the missing part is that you must wait for the backend to
> > switch to the Closed state, or else the switch to
> > XenbusStateInitialising won't be picked up correctly by the backend
> > (because it's still doing it's cleanup).
> > 
> > Using blkfront_restore might be an option, but you need to assert the
> > backend is in the initial state before using that path.
> >
> Yes, I agree and I make sure that XenbusStateInitialising only triggers
> on frontend once backend is disconnected. msleep in a loop not that graceful 
> but
> works.
> Frontend only switches to XenbusStateInitialising once it sees backend
> as Closed. The issue here is and may require more debugging is:
> 1. Hibernate instance->Closing failed, artificially created situation by not
> marking frontend Closed in the first place during freezing.
> 2. System comes back up fine restored to 'backend connected'.

I'm not sure I'm following what is happening here, what should happen
IMO is that the backend will eventually reach the Closed state? Ie:
the frontend has initiated the disconnection from the backend by
setting the Closing state, and the backend will have to eventually
reach the Closed state.

At that point the frontend can initiate a reconnection by switching to
the Initialising state.
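
Roughly something like this (an untested sketch just to illustrate the
sequence; the polling loop is only for illustration, the real driver
should react to the backend state change from blkback_changed instead):

#include <linux/delay.h>
#include <xen/xenbus.h>

/* Sketch: tear down, wait for the backend to reach Closed, then
 * restart the normal connection handshake. */
static int blkfront_reconnect_sketch(struct xenbus_device *dev)
{
	int err;

	/* Ask the backend to disconnect. */
	err = xenbus_switch_state(dev, XenbusStateClosing);
	if (err)
		return err;

	/* Wait for the backend to finish its cleanup. */
	while (xenbus_read_driver_state(dev->otherend) != XenbusStateClosed)
		msleep(100);

	/* Restart the normal connection procedure; the backend should
	 * move to InitWait and the handshake proceeds as usual. */
	return xenbus_switch_state(dev, XenbusStateInitialising);
}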

> 3. Re-run (1) again without reboot
> 4. (4) fails to recover basically freezing does not fail at all which is weird
>because it should timeout as it passes through same path. It hits a BUG in
>talk_to_blkback() and instance crashes.

It's hard to tell exactly. I guess you would have to figure what makes
the frontend not get stuck at the same place as the first attempt.

Roger.


Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-26 Thread Roger Pau Monné
On Thu, Jun 25, 2020 at 06:36:59PM +, Anchal Agarwal wrote:
> On Tue, Jun 23, 2020 at 10:19:03AM +0200, Roger Pau Monné wrote:
> > 
> > On Tue, Jun 23, 2020 at 12:43:14AM +, Anchal Agarwal wrote:
> > > On Mon, Jun 22, 2020 at 10:38:46AM +0200, Roger Pau Monné wrote:
> > > >
> > > > On Fri, Jun 19, 2020 at 11:43:12PM +, Anchal Agarwal wrote:
> > > > > On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monné wrote:
> > > > > >
> > > > > > On Tue, Jun 16, 2020 at 09:49:25PM +, Anchal Agarwal wrote:
> > > > > > > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > > > > > > On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
> > > > > > > > > > + xenbus_dev_error(dev, err, "Freezing 
> > > > > > > > > timed out;"
> > > > > > > > > > +  "the device may become 
> > > > > > > > > inconsistent state");
> > > > > > > > >
> > > > > > > > > Leaving the device in this state is quite bad, as it's in 
> > > > > > > > > a closed
> > > > > > > > > state and with the queues frozen. You should make an 
> > > > > > > > > attempt to
> > > > > > > > > restore things to a working state.
> > > > > > > > >
> > > > > > > > > You mean if backend closed after timeout? Is there a way to 
> > > > > > > > > know that? I understand it's not good to
> > > > > > > > > leave it in this state however, I am still trying to find if 
> > > > > > > > > there is a good way to know if backend is still connected 
> > > > > > > > > after timeout.
> > > > > > > > > Hence the message " the device may become inconsistent 
> > > > > > > > > state".  I didn't see a timeout not even once on my end so 
> > > > > > > > > that's why
> > > > > > > > > I may be looking for an alternate perspective here. may be 
> > > > > > > > > need to thaw everything back intentionally is one thing I 
> > > > > > > > > could think of.
> > > > > > > >
> > > > > > > > You can manually force this state, and then check that it will 
> > > > > > > > behave
> > > > > > > > correctly. I would expect that on a failure to disconnect from 
> > > > > > > > the
> > > > > > > > backend you should switch the frontend to the 'Init' state in 
> > > > > > > > order to
> > > > > > > > try to reconnect to the backend when possible.
> > > > > > > >
> > > > > > > From what I understand forcing manually is, failing the freeze 
> > > > > > > without
> > > > > > > disconnect and try to revive the connection by unfreezing the
> > > > > > > queues->r

Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]

2020-06-23 Thread Roger Pau Monné
On Tue, Jun 23, 2020 at 12:43:14AM +, Anchal Agarwal wrote:
> On Mon, Jun 22, 2020 at 10:38:46AM +0200, Roger Pau Monné wrote:
> > 
> > On Fri, Jun 19, 2020 at 11:43:12PM +, Anchal Agarwal wrote:
> > > On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monné wrote:
> > > >
> > > > On Tue, Jun 16, 2020 at 09:49:25PM +, Anchal Agarwal wrote:
> > > > > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > > > > On Wed, Jun 03, 2020 at 11:33:52PM +, Agarwal, Anchal wrote:
> > > > > > > > + xenbus_dev_error(dev, err, "Freezing timed 
> > > > > > > out;"
> > > > > > > > +  "the device may become 
> > > > > > > inconsistent state");
> > > > > > >
> > > > > > > Leaving the device in this state is quite bad, as it's in a 
> > > > > > > closed
> > > > > > > state and with the queues frozen. You should make an attempt 
> > > > > > > to
> > > > > > > restore things to a working state.
> > > > > > >
> > > > > > > You mean if backend closed after timeout? Is there a way to know 
> > > > > > > that? I understand it's not good to
> > > > > > > leave it in this state however, I am still trying to find if 
> > > > > > > there is a good way to know if backend is still connected after 
> > > > > > > timeout.
> > > > > > > Hence the message " the device may become inconsistent state".  I 
> > > > > > > didn't see a timeout not even once on my end so that's why
> > > > > > > I may be looking for an alternate perspective here. may be need 
> > > > > > > to thaw everything back intentionally is one thing I could think 
> > > > > > > of.
> > > > > >
> > > > > > You can manually force this state, and then check that it will 
> > > > > > behave
> > > > > > correctly. I would expect that on a failure to disconnect from the
> > > > > > backend you should switch the frontend to the 'Init' state in order 
> > > > > > to
> > > > > > try to reconnect to the backend when possible.
> > > > > >
> > > > > From what I understand forcing manually is, failing the freeze without
> > > > > disconnect and try to revive the connection by unfreezing the
> > > > > queues->reconnecting to backend [which never got diconnected]. May be 
> > > > > even
> > > > > tearing down things manually because I am not sure what state will 
> > > > > frontend
> > > > > see if backend fails to to disconnect at any point in time. I assumed 
> > > > > connected.
> > > > > Then again if its "CONNECTED" I may not need to tear down everything 
> > > > > and start
> > > > > from Initialising state because that may not work.
> > > > >
> > > > > So I am not so sure about backend's state so much, lets say if  
> > > > > xen_blkif_disconnect fail,
> > > > > I don't see it getting handled in the backend then what will be 
> > > > > backend's state?
> > > > > Will it still switch xenbus state to 'Closed'? If not what will 
> > > > > frontend see,
> > > > > if it tries to read bac
