Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode

2021-06-03 Thread Anchal Agarwal
On Thu, Jun 03, 2021 at 04:11:46PM -0400, Boris Ostrovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On 6/2/21 3:37 PM, Anchal Agarwal wrote:
> > On Tue, Jun 01, 2021 at 10:18:36AM -0400, Boris Ostrovsky wrote:
> >>
> > The resume won't fail because in the image the xen_vcpu and xen_vcpu_info 
> > are
> > same. These are the same values that got in there during saving of the
> > hibernation image. So whatever xen_vcpu got as a value during boot time 
> > registration on resume is
> > essentially lost once the jump into the saved kernel image happens. 
> > Interesting
> > part is if KASLR is not enabled boot time vcpup mfn is same as in the image.
> 
> 
> Do you start the your guest right after you've hibernated it? What happens if 
> you create (and keep running) a few other guests in-between? mfn would likely 
> be different then I'd think.
> 
>
Yes, I just run it in loops on a single guest and I am able to see the issue in
20-40 iterations sometime may be sooner. Yeah, you could be right and this could
definitely happen more often depending what's happening on dom0 side.
> > Once you enable KASLR this value changes sometimes and whenever that happens
> > resume gets stuck. Does that make sense?
> >
> > No it does not resume successfully if hypercall fails because I was trying 
> > to
> > explicitly reset vcpu and invoke hypercall.
> > I am just wondering why does restore logic fails to work here or probably I 
> > am
> > missing a critical piece here.
> 
> 
> If you are not using KASLR then xen_vcpu_info is at the same address every 
> time you boot. So whatever you registered before hibernating stays the same 
> when you boot second time and register again, and so successful comparison in 
> xen_vcpu_setup() works. (Mostly by chance.)
>
That's what I thought so too.
> 
> But if KASLR is on then this comparison not failing should cause xen_vcpu 
> pointer in the loaded image to become bogus because xen_vcpu is now 
> registered for a different xen_vcpu_info address during boot.
> 
The reason for that I think is once you jump into the image that information is
getting lost. But there is  some residue somewhere that's causing the resume to
fail. I haven't been able to pinpoint the exact field value that may be causing
that issue.
Correct me if I am wrong here, but even if hypothetically I put a hack to tell 
the kernel
somehow re-register vcpu it won't pass because there is no hypercall to
unregister it in first place? Can the resumed kernel use the new values in that
case [Now this is me just throwing wild guesses!!]

> 
> >>> Another line of thought is something what kexec does to come around this 
> >>> problem
> >>> is to abuse soft_reset and issue it during syscore_resume or may be 
> >>> before the image get loaded.
> >>> I haven't experimented with that yet as I am assuming there has to be a 
> >>> way to re-register vcpus during resume.
> >>
> >> Right, that sounds like it should work.
> >>
> > You mean soft reset or re-register vcpu?
> 
> 
> Doing something along the lines of a soft reset. It should allow you to 
> re-register. Not sure how you can use it without Xen changes though.
> 
No not without xen changes. It won't work. I will have xen changes in place to
test that on our infrastructure. 

--
Anchal
> 
> 
> -boris
> 



Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode

2021-06-02 Thread Anchal Agarwal
On Tue, Jun 01, 2021 at 10:18:36AM -0400, Boris Ostrovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On 5/28/21 5:50 PM, Anchal Agarwal wrote:
> 
> > That only fails during boot but not after the control jumps into the image. 
> > The
> > non boot cpus are brought offline(freeze_secondary_cpus) and then online 
> > via cpu hotplug path. In that case xen_vcpu_setup doesn't invokes the 
> > hypercall again.
> 
> 
> OK, that makes sense --- by that time VCPUs have already been registered. 
> What I don't understand though is why resume doesn't fail every time --- 
> xen_vcpu and xen_vcpu_info should be different practically always, shouldn't 
> they? Do you observe successful resumes when the hypercall fails?
> 
> 
The resume won't fail because in the image the xen_vcpu and xen_vcpu_info are
same. These are the same values that got in there during saving of the
hibernation image. So whatever xen_vcpu got as a value during boot time 
registration on resume is
essentially lost once the jump into the saved kernel image happens. Interesting
part is if KASLR is not enabled boot time vcpup mfn is same as in the image.
Once you enable KASLR this value changes sometimes and whenever that happens
resume gets stuck. Does that make sense?

No it does not resume successfully if hypercall fails because I was trying to
explicitly reset vcpu and invoke hypercall.
I am just wondering why does restore logic fails to work here or probably I am
missing a critical piece here.
> >
> > Another line of thought is something what kexec does to come around this 
> > problem
> > is to abuse soft_reset and issue it during syscore_resume or may be before 
> > the image get loaded.
> > I haven't experimented with that yet as I am assuming there has to be a way 
> > to re-register vcpus during resume.
> 
> 
> Right, that sounds like it should work.
> 
You mean soft reset or re-register vcpu?

-Anchal
> 
> -boris
> 
> 



Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode

2021-05-28 Thread Anchal Agarwal
On Wed, May 26, 2021 at 02:29:53PM -0400, Boris Ostrovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On 5/26/21 12:40 AM, Anchal Agarwal wrote:
> > On Tue, May 25, 2021 at 06:23:35PM -0400, Boris Ostrovsky wrote:
> >> CAUTION: This email originated from outside of the organization. Do not 
> >> click links or open attachments unless you can confirm the sender and know 
> >> the content is safe.
> >>
> >>
> >>
> >> On 5/21/21 1:26 AM, Anchal Agarwal wrote:
> >>>>> What I meant there wrt VCPU info was that VCPU info is not unregistered 
> >>>>> during hibernation,
> >>>>> so Xen still remembers the old physical addresses for the VCPU 
> >>>>> information, created by the
> >>>>> booting kernel. But since the hibernation kernel may have different 
> >>>>> physical
> >>>>> addresses for VCPU info and if mismatch happens, it may cause issues 
> >>>>> with resume.
> >>>>> During hibernation, the VCPU info register hypercall is not invoked 
> >>>>> again.
> >>>> I still don't think that's the cause but it's certainly worth having a 
> >>>> look.
> >>>>
> >>> Hi Boris,
> >>> Apologies for picking this up after last year.
> >>> I did some dive deep on the above statement and that is indeed the case 
> >>> that's happening.
> >>> I did some debugging around KASLR and hibernation using reboot mode.
> >>> I observed in my debug prints that whenever vcpu_info* address for 
> >>> secondary vcpu assigned
> >>> in xen_vcpu_setup at boot is different than what is in the image, resume 
> >>> gets stuck for that vcpu
> >>> in bringup_cpu(). That means we have different addresses for 
> >>> _cpu(xen_vcpu_info, cpu) at boot and after
> >>> control jumps into the image.
> >>>
> >>> I failed to get any prints after it got stuck in bringup_cpu() and
> >>> I do not have an option to send a sysrq signal to the guest or rather get 
> >>> a kdump.
> >>
> >> xenctx and xen-hvmctx might be helpful.
> >>
> >>
> >>> This change is not observed in every hibernate-resume cycle. I am not 
> >>> sure if this is a bug or an
> >>> expected behavior.
> >>> Also, I am contemplating the idea that it may be a bug in xen code 
> >>> getting triggered only when
> >>> KASLR is enabled but I do not have substantial data to prove that.
> >>> Is this a coincidence that this always happens for 1st vcpu?
> >>> Moreover, since hypervisor is not aware that guest is hibernated and it 
> >>> looks like a regular shutdown to dom0 during reboot mode,
> >>> will re-registering vcpu_info for secondary vcpu's even plausible?
> >>
> >> I think I am missing how this is supposed to work (maybe we've talked 
> >> about this but it's been many months since then). You hibernate the guest 
> >> and it writes the state to swap. The guest is then shut down? And what's 
> >> next? How do you wake it up?
> >>
> >>
> >> -boris
> >>
> > To resume a guest, guest boots up as the fresh guest and then 
> > software_resume()
> > is called which if finds a stored hibernation image, quiesces the devices 
> > and loads
> > the memory contents from the image. The control then transfers to the 
> > targeted kernel.
> > This further disables non boot cpus,sycore_suspend/resume callbacks are 
> > invoked which sets up
> > the shared_info, pvclock, grant tables etc. Since the vcpu_info pointer for 
> > each
> > non-boot cpu is already registered, the hypercall does not happen again when
> > bringing up the non boot cpus. This leads to inconsistencies as pointed
> > out earlier when KASLR is enabled.
> 
> 
> I'd think the 'if' condition in the code fragment below should always fail 
> since hypervisor is creating new guest, resulting in the hypercall. Just like 
> in the case of save/restore.
>
That only fails during boot but not after the control jumps into the image. The
non boot cpus are brought offline(freeze_secondary_cpus) and then online via 
cpu hotplug path. In that case xen_vcpu_setup doesn't invokes the hypercall 
again.
> 
> Do you call xen_vcpu_info_reset() on resume? That will re-initialize 
> p

Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode

2021-05-25 Thread Anchal Agarwal
On Tue, May 25, 2021 at 06:23:35PM -0400, Boris Ostrovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On 5/21/21 1:26 AM, Anchal Agarwal wrote:
> >>> What I meant there wrt VCPU info was that VCPU info is not unregistered 
> >>> during hibernation,
> >>> so Xen still remembers the old physical addresses for the VCPU 
> >>> information, created by the
> >>> booting kernel. But since the hibernation kernel may have different 
> >>> physical
> >>> addresses for VCPU info and if mismatch happens, it may cause issues with 
> >>> resume.
> >>> During hibernation, the VCPU info register hypercall is not invoked again.
> >>
> >> I still don't think that's the cause but it's certainly worth having a 
> >> look.
> >>
> > Hi Boris,
> > Apologies for picking this up after last year.
> > I did some dive deep on the above statement and that is indeed the case 
> > that's happening.
> > I did some debugging around KASLR and hibernation using reboot mode.
> > I observed in my debug prints that whenever vcpu_info* address for 
> > secondary vcpu assigned
> > in xen_vcpu_setup at boot is different than what is in the image, resume 
> > gets stuck for that vcpu
> > in bringup_cpu(). That means we have different addresses for 
> > _cpu(xen_vcpu_info, cpu) at boot and after
> > control jumps into the image.
> >
> > I failed to get any prints after it got stuck in bringup_cpu() and
> > I do not have an option to send a sysrq signal to the guest or rather get a 
> > kdump.
> 
> 
> xenctx and xen-hvmctx might be helpful.
> 
> 
> > This change is not observed in every hibernate-resume cycle. I am not sure 
> > if this is a bug or an
> > expected behavior.
> > Also, I am contemplating the idea that it may be a bug in xen code getting 
> > triggered only when
> > KASLR is enabled but I do not have substantial data to prove that.
> > Is this a coincidence that this always happens for 1st vcpu?
> > Moreover, since hypervisor is not aware that guest is hibernated and it 
> > looks like a regular shutdown to dom0 during reboot mode,
> > will re-registering vcpu_info for secondary vcpu's even plausible?
> 
> 
> I think I am missing how this is supposed to work (maybe we've talked about 
> this but it's been many months since then). You hibernate the guest and it 
> writes the state to swap. The guest is then shut down? And what's next? How 
> do you wake it up?
> 
> 
> -boris
> 
To resume a guest, guest boots up as the fresh guest and then software_resume()
is called which if finds a stored hibernation image, quiesces the devices and 
loads 
the memory contents from the image. The control then transfers to the targeted 
kernel.
This further disables non boot cpus,sycore_suspend/resume callbacks are invoked 
which sets up
the shared_info, pvclock, grant tables etc. Since the vcpu_info pointer for each
non-boot cpu is already registered, the hypercall does not happen again when
bringing up the non boot cpus. This leads to inconsistencies as pointed
out earlier when KASLR is enabled.

Thanks,
Anchal
> 
> 
> >  I could definitely use some advice to debug this further.
> >
> >
> > Some printk's from my debugging:
> >
> > At Boot:
> >
> > xen_vcpu_setup: xen_have_vcpu_info_placement=1 cpu=1, 
> > vcpup=0x9e548fa560e0, info.mfn=3996246 info.offset=224,
> >
> > Image Loads:
> > It ends up in the condition:
> >  xen_vcpu_setup()
> >  {
> >  ...
> >  if (xen_hvm_domain()) {
> > if (per_cpu(xen_vcpu, cpu) == _cpu(xen_vcpu_info, cpu))
> > return 0;
> >  }
> >  ...
> >  }
> >
> > xen_vcpu_setup: checking mfn on resume cpu=1, info.mfn=3934806 
> > info.offset=224, _cpu(xen_vcpu_info, cpu)=0x9d7240a560e0
> >
> > This is tested on c4.2xlarge [8vcpu 15GB mem] instance with 5.10 kernel 
> > running
> > in the guest.
> >
> > Thanks,
> > Anchal.
> >> -boris
> >>
> >>



Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode

2021-05-20 Thread Anchal Agarwal
On Thu, Oct 01, 2020 at 08:43:58AM -0400, boris.ostrov...@oracle.com wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> >>> Also, wrt KASLR stuff, that issue is still seen sometimes but I 
> >>> haven't had
> >>> bandwidth to dive deep into the issue and fix it.
>  So what's the plan there? You first mentioned this issue early this year 
>  and judged by your response it is not clear whether you will ever spend 
>  time looking at it.
> 
> >>> I do want to fix it and did do some debugging earlier this year just 
> >>> haven't
> >>> gotten back to it. Also, wanted to understand if the issue is a blocker 
> >>> to this
> >>> series?
> >>
> >> Integrating code with known bugs is less than ideal.
> >>
> > So for this series to be accepted, KASLR needs to be fixed along with other
> > comments of course?
> 
> 
> Yes, please.
> 
> 
> 
> >>> I had some theories when debugging around this like if the random base 
> >>> address picked by kaslr for the
> >>> resuming kernel mismatches the suspended kernel and just jogging my 
> >>> memory, I didn't find that as the case.
> >>> Another hunch was if physical address of registered vcpu info at boot is 
> >>> different from what suspended kernel
> >>> has and that can cause CPU's to get stuck when coming online.
> >>
> >> I'd think if this were the case you'd have 100% failure rate. And we are 
> >> also re-registering vcpu info on xen restore and I am not aware of any 
> >> failures due to KASLR.
> >>
> > What I meant there wrt VCPU info was that VCPU info is not unregistered 
> > during hibernation,
> > so Xen still remembers the old physical addresses for the VCPU information, 
> > created by the
> > booting kernel. But since the hibernation kernel may have different physical
> > addresses for VCPU info and if mismatch happens, it may cause issues with 
> > resume.
> > During hibernation, the VCPU info register hypercall is not invoked again.
> 
> 
> I still don't think that's the cause but it's certainly worth having a look.
> 
Hi Boris,
Apologies for picking this up after last year. 
I did some dive deep on the above statement and that is indeed the case that's 
happening. 
I did some debugging around KASLR and hibernation using reboot mode.
I observed in my debug prints that whenever vcpu_info* address for secondary 
vcpu assigned 
in xen_vcpu_setup at boot is different than what is in the image, resume gets 
stuck for that vcpu
in bringup_cpu(). That means we have different addresses for 
_cpu(xen_vcpu_info, cpu) at boot and after
control jumps into the image. 

I failed to get any prints after it got stuck in bringup_cpu() and
I do not have an option to send a sysrq signal to the guest or rather get a 
kdump.
This change is not observed in every hibernate-resume cycle. I am not sure if 
this is a bug or an 
expected behavior. 
Also, I am contemplating the idea that it may be a bug in xen code getting 
triggered only when
KASLR is enabled but I do not have substantial data to prove that.
Is this a coincidence that this always happens for 1st vcpu?
Moreover, since hypervisor is not aware that guest is hibernated and it looks 
like a regular shutdown to dom0 during reboot mode,
will re-registering vcpu_info for secondary vcpu's even plausible? I could 
definitely use some advice to debug this further.

 
Some printk's from my debugging:

At Boot:

xen_vcpu_setup: xen_have_vcpu_info_placement=1 cpu=1, vcpup=0x9e548fa560e0, 
info.mfn=3996246 info.offset=224,

Image Loads:
It ends up in the condition:
 xen_vcpu_setup()
 {
 ...
 if (xen_hvm_domain()) {
if (per_cpu(xen_vcpu, cpu) == _cpu(xen_vcpu_info, cpu))
return 0; 
 }
 ...
 }

xen_vcpu_setup: checking mfn on resume cpu=1, info.mfn=3934806 info.offset=224, 
_cpu(xen_vcpu_info, cpu)=0x9d7240a560e0

This is tested on c4.2xlarge [8vcpu 15GB mem] instance with 5.10 kernel running
in the guest.

Thanks,
Anchal.
> 
> -boris
> 
> 



Re: [PATCH v3 01/11] xen/manage: keep track of the on-going suspend mode

2020-09-25 Thread Anchal Agarwal
On Tue, Sep 22, 2020 at 11:17:36PM +, Anchal Agarwal wrote:
> On Tue, Sep 22, 2020 at 12:18:05PM -0400, boris.ostrov...@oracle.com wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > 
> > 
> > 
> > On 9/21/20 5:54 PM, Anchal Agarwal wrote:
> > > Thanks for the above suggestion. You are right I didn't find a way to 
> > > declare
> > > a global state either. I just broke the above check in 2 so that once we 
> > > have
> > > support for ARM we should be able to remove aarch64 condition easily. Let 
> > > me
> > > know if I am missing nay corner cases with this one.
> > >
> > > static int xen_pm_notifier(struct notifier_block *notifier,
> > >   unsigned long pm_event, void *unused)
> > > {
> > > int ret = NOTIFY_OK;
> > > if (!xen_hvm_domain() || xen_initial_domain())
> > >   ret = NOTIFY_BAD;
> > > if(IS_ENABLED(CONFIG_ARM64) && (pm_event == PM_SUSPEND_PREPARE || 
> > > pm_event == HIBERNATION_PREPARE))
> > >   ret = NOTIFY_BAD;
> > >
> > > return ret;
> > > }
> > 
> > 
> > 
> > This will allow PM suspend to proceed on x86.
> Right!! Missed it.
> Also, wrt KASLR stuff, that issue is still seen sometimes but I haven't had
> bandwidth to dive deep into the issue and fix it. I seem to have lost your 
> email
> in my inbox hence covering the question here.
> > 
> >
Can I add your Reviewed-by or Signed-off-by to it?
> > -boris
> > 
>
-Anchal



Re: [PATCH v3 02/11] xenbus: add freeze/thaw/restore callbacks support

2020-09-15 Thread Anchal Agarwal
On Sun, Sep 13, 2020 at 12:11:47PM -0400, boris.ostrov...@oracle.com wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On 8/21/20 6:26 PM, Anchal Agarwal wrote:
> > From: Munehisa Kamata 
> >
> > Since commit b3e96c0c7562 ("xen: use freeze/restore/thaw PM events for
> > suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and
> > PMSG_RESTORE events for Xen suspend. However, they're actually assigned
> > to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume()
> > respectively, and only suspend and resume callbacks are supported at
> > driver level. To support PM suspend and PM hibernation, modify the bus
> > level PM callbacks to invoke not only device driver's suspend/resume but
> > also freeze/thaw/restore.
> >
> > Note that we'll use freeze/restore callbacks even for PM suspend whereas
> > suspend/resume callbacks are normally used in the case, becausae the
> > existing xenbus device drivers already have suspend/resume callbacks
> > specifically designed for Xen suspend.
> 
> 
> Something is wrong with this sentence. Or with my brain --- I can't
> quite parse this.
> 
The message is trying to say that that freeze/thaw/restore callbacks will be
used for both PM SUSPEND and PM HIBERNATION. Since, we are only focussing on PM
hibernation, I will remove all wordings of PM suspend from this message to avoid
confusion. I left it there in case someone wants to pick it up in future knowing
framework is already present.
> 
> And please be consistent with "PM suspend" vs. "PM hibernation".
>
I should remove PM suspend from everywhere since the mode is not tested
for.
> 
> >  So we can allow the device
> > drivers to keep the existing callbacks wihtout modification.
> >
> 
> 
> > @@ -599,16 +600,33 @@ int xenbus_dev_suspend(struct device *dev)
> >   struct xenbus_driver *drv;
> >   struct xenbus_device *xdev
> >   = container_of(dev, struct xenbus_device, dev);
> > + bool xen_suspend = is_xen_suspend();
> >
> >   DPRINTK("%s", xdev->nodename);
> >
> >   if (dev->driver == NULL)
> >   return 0;
> >   drv = to_xenbus_driver(dev->driver);
> > - if (drv->suspend)
> > - err = drv->suspend(xdev);
> > - if (err)
> > - dev_warn(dev, "suspend failed: %i\n", err);
> > + if (xen_suspend) {
> > + if (drv->suspend)
> > + err = drv->suspend(xdev);
> > + } else {
> > + if (drv->freeze) {
> 
> 
> 'else if' (to avoid extra indent level).  In xenbus_dev_resume() too.
> 
> 
> > + err = drv->freeze(xdev);
> > + if (!err) {
> > + free_otherend_watch(xdev);
> > + free_otherend_details(xdev);
> > + return 0;
> > + }
> > + }
> > + }
> > +
> > + if (err) {
> > + dev_warn(>dev,
> 
> 
> Is there a reason why you replaced dev with xdev->dev (here and elsewhere)?
> 
> 
Nope, they should be same. We can use dev here too. I should probably just use
dev.
> >  "%s %s failed: %d\n", xen_suspend ?
> > + "suspend" : "freeze", xdev->nodename, err);
> > + return err;
> > + }
> > +
> >
> 
> > @@ -653,8 +683,44 @@ EXPORT_SYMBOL_GPL(xenbus_dev_resume);
> >
> >  int xenbus_dev_cancel(struct device *dev)
> >  {
> > - /* Do nothing */
> > - DPRINTK("cancel");
> > + int err;
> > + struct xenbus_driver *drv;
> > + struct xenbus_device *xendev = to_xenbus_device(dev);
> 
> 
> xdev for consistency please.
> 
Yes this I left unchanged, it should be consistent with xdev.
> 
> > + bool xen_suspend = is_xen_suspend();
> 
> 
> No need for this, you use it only once anyway.
> 
> 
> -boris
>
Thanks,
Anchal
> 



Re: [PATCH v3 00/11] Fix PM hibernation in Xen guests

2020-09-11 Thread Anchal Agarwal
On Fri, Aug 28, 2020 at 06:39:45PM +, Anchal Agarwal wrote:
> On Fri, Aug 28, 2020 at 08:29:24PM +0200, Rafael J. Wysocki wrote:
> > CAUTION: This email originated from outside of the organization. Do not 
> > click links or open attachments unless you can confirm the sender and know 
> > the content is safe.
> > 
> > 
> > 
> > On Fri, Aug 28, 2020 at 8:26 PM Anchal Agarwal  wrote:
> > >
> > > On Fri, Aug 21, 2020 at 10:22:43PM +, Anchal Agarwal wrote:
> > > > Hello,
> > > > This series fixes PM hibernation for hvm guests running on xen 
> > > > hypervisor.
> > > > The running guest could now be hibernated and resumed successfully at a
> > > > later time. The fixes for PM hibernation are added to block and
> > > > network device drivers i.e xen-blkfront and xen-netfront. Any other 
> > > > driver
> > > > that needs to add S4 support if not already, can follow same method of
> > > > introducing freeze/thaw/restore callbacks.
> > > > The patches had been tested against upstream kernel and xen4.11. Large
> > > > scale testing is also done on Xen based Amazon EC2 instances. All this 
> > > > testing
> > > > involved running memory exhausting workload in the background.
> > > >
> > > > Doing guest hibernation does not involve any support from hypervisor and
> > > > this way guest has complete control over its state. Infrastructure
> > > > restrictions for saving up guest state can be overcome by guest 
> > > > initiated
> > > > hibernation.
> > > >
> > > > These patches were send out as RFC before and all the feedback had been
> > > > incorporated in the patches. The last v1 & v2 could be found here:
> > > >
> > > > [v1]: https://lkml.org/lkml/2020/5/19/1312
> > > > [v2]: https://lkml.org/lkml/2020/7/2/995
> > > > All comments and feedback from v2 had been incorporated in v3 series.
> > > >
> > > > Known issues:
> > > > 1.KASLR causes intermittent hibernation failures. VM fails to resumes 
> > > > and
> > > > has to be restarted. I will investigate this issue separately and 
> > > > shouldn't
> > > > be a blocker for this patch series.
> > > > 2. During hibernation, I observed sometimes that freezing of tasks 
> > > > fails due
> > > > to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may 
> > > > be 1
> > > > out of 200 runs and hibernation is aborted in this case. Re-trying 
> > > > hibernation
> > > > may work. Also, this is a known issue with hibernation and some
> > > > filesystems like XFS has been discussed by the community for years with 
> > > > not an
> > > > effectve resolution at this point.
> > > >
> > > > Testing How to:
> > > > ---
> > > > 1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 
> > > > +upstream
> > > > xen-4.11]
> > > > 2. Bring up a HVM guest w/t kernel compiled with hibernation patches
> > > > [I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem 
> > > > images].
> > > > 3. Create a swap file size=RAM size
> > > > 4. Update grub parameters and reboot
> > > > 5. Trigger pm-hibernation from within the VM
> > > >
> > > > Example:
> > > > Set up a file-backed swap space. Swap file size>=Total memory on the 
> > > > system
> > > > sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
> > > > sudo chmod 600 /swap
> > > > sudo mkswap /swap
> > > > sudo swapon /swap
> > > >
> > > > Update resume device/resume offset in grub if using swap file:
> > > > resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1
> > > >
> > > > Execute:
> > > > 
> > > > sudo pm-hibernate
> > > > OR
> > > > echo disk > /sys/power/state && echo reboot > /sys/power/disk
> > > >
> > > > Compute resume offset code:
> > > > "
> > > > #!/usr/bin/env python
> > > > import sys
> > > > import array
> > > > import fcntl
> > > >
> > > > #swap file
> > > > f = open(sys.argv[1], 'r')
> > > > buf = array.array('L', [0])
> > > >
> >

Re: [PATCH v3 05/11] genirq: Shutdown irq chips in suspend/resume during hibernation

2020-08-24 Thread Anchal Agarwal
On Sat, Aug 22, 2020 at 02:36:37AM +0200, Thomas Gleixner wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On Fri, Aug 21 2020 at 22:27, Thomas Gleixner wrote:
> > Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
> > it the core interrupt suspend/resume paths.
> >
> > Changelog:
> > v1->v2: Corrected the author's name to tglx@
> 
> Can you please move that Changelog part below the --- seperator next
> time because that's really not part of the final commit messaage and the
> maintainer has then to strip it off manually
> 
Ack.
> > Signed-off-by: Anchal Agarwal 
> > Signed-off-by: Thomas Gleixner 
> 
> These SOB lines are just wrongly ordered as they suggest:
> 
>  Anchal has authored the patch and Thomas transported it
> 
> which is clearly not the case. So the right order is:
> 
I must admit I wasn't aware of that. Will fix.
> Signed-off-by: Thomas Gleixner 
> Signed-off-by: Anchal Agarwal 
> 
> And that needs another tweak at the top of the change log. The first
> line in the mail body wants to be:
> 
> From: Thomas Gleixner 
Yes I accidentally missed that in this patch.
Others have that line on all the patches and even v2 for this patch
has. Will fix.
> 
> followed by an empty new line before the actual changelog text
> starts. That way the attribution of the patch when applying it will be
> correct.
> 
> Documentation/process/ is there for a reason and following the few
> simple rules to get that straight is not rocket science.
> 
> Thanks,
> 
> tglx
> 
> 
Anchal



Re: [PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-22 Thread Anchal Agarwal
On Tue, Jul 21, 2020 at 05:18:34PM -0700, Stefano Stabellini wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On Tue, 21 Jul 2020, Boris Ostrovsky wrote:
> > >> +static int xen_setup_pm_notifier(void)
> > >> +{
> > >> + if (!xen_hvm_domain())
> > >> + return -ENODEV;
> > >>
> > >> I forgot --- what did we decide about non-x86 (i.e. ARM)?
> > > It would be great to support that however, its  out of
> > > scope for this patch set.
> > > I’ll be happy to discuss it separately.
> > 
> >  I wasn't implying that this *should* work on ARM but rather whether 
> >  this
> >  will break ARM somehow (because xen_hvm_domain() is true there).
> > 
> > 
> > >>> Ok makes sense. TBH, I haven't tested this part of code on ARM and the 
> > >>> series
> > >>> was only support x86 guests hibernation.
> > >>> Moreover, this notifier is there to distinguish between 2 PM
> > >>> events PM SUSPEND and PM hibernation. Now since we only care about PM
> > >>> HIBERNATION I may just remove this code and rely on "SHUTDOWN_SUSPEND" 
> > >>> state.
> > >>> However, I may have to fix other patches in the series where this check 
> > >>> may
> > >>> appear and cater it only for x86 right?
> > >>
> > >>
> > >> I don't know what would happen if ARM guest tries to handle hibernation
> > >> callbacks. The only ones that you are introducing are in block and net
> > >> fronts and that's arch-independent.
> > >>
> > >>
> > >> You do add a bunch of x86-specific code though (syscore ops), would
> > >> something similar be needed for ARM?
> > >>
> > >>
> > > I don't expect this to work out of the box on ARM. To start with something
> > > similar will be needed for ARM too.
> > > We may still want to keep the driver code as-is.
> > >
> > > I understand the concern here wrt ARM, however, currently the support is 
> > > only
> > > proposed for x86 guests here and similar work could be carried out for 
> > > ARM.
> > > Also, if regular hibernation works correctly on arm, then all is needed 
> > > is to
> > > fix Xen side of things.
> > >
> > > I am not sure what could be done to achieve any assurances on arm side as 
> > > far as
> > > this series is concerned.
> 
> Just to clarify: new features don't need to work on ARM or cause any
> addition efforts to you to make them work on ARM. The patch series only
> needs not to break existing code paths (on ARM and any other platforms).
> It should also not make it overly difficult to implement the ARM side of
> things (if there is one) at some point in the future.
> 
> FYI drivers/xen/manage.c is compiled and working on ARM today, however
> Xen suspend/resume is not supported. I don't know for sure if
> guest-initiated hibernation works because I have not tested it.
> 
> 
> 
> > If you are not sure what the effects are (or sure that it won't work) on
> > ARM then I'd add IS_ENABLED(CONFIG_X86) check, i.e.
> >
> >
> > if (!IS_ENABLED(CONFIG_X86) || !xen_hvm_domain())
> >   return -ENODEV;
> 
> That is a good principle to have and thanks for suggesting it. However,
> in this specific case there is nothing in this patch that doesn't work
> on ARM. From an ARM perspective I think we should enable it and
> _pm_notifier_block should be registered.
> 
This question is for Boris, I think you we decided to get rid of the notifier
in V3 as all we need  to check is SHUTDOWN_SUSPEND state which sounds plausible
to me. So this check may go away. It may still be needed for sycore_ops
callbacks registration.
> Given that all guests are HVM guests on ARM, it should work fine as is.
> 
> 
> I gave a quick look at the rest of the series and everything looks fine
> to me from an ARM perspective. I cannot imaging that the new freeze,
> thaw, and restore callbacks for net and block are going to cause any
> trouble on ARM. The two main x86-specific functions are
> xen_syscore_suspend/resume and they look trivial to implement on ARM (in
> the sense that they are likely going to look exactly the same.)
> 
Yes but for now since things are not tested I will put this
!IS_ENABLED(CONFIG_X86) on syscore_ops calls registration part just to be safe
and not break anything.
> 
> One question for Anchal: what's going to happen if you trigger a
> hibernation, you have the new callbacks, but you are missing
> xen_syscore_suspend/resume?
> 
> Is it any worse than not having the new freeze, thaw and restore
> callbacks at all and try to do a hibernation?
If callbacks are not there, I don't expect hibernation to work correctly.
These callbacks takes care of xen primitives like shared_info_page,
grant table, sched clock, runstate time which are important to save the correct
state of the guest and bring it back up. Other patches in the series, adds all
the logic to these syscore callbacks. Freeze/thaw/restore are just there 

Re: [PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-21 Thread Anchal Agarwal
On Tue, Jul 21, 2020 at 10:30:18AM +0200, Roger Pau Monné wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> Marek: I'm adding you in case you could be able to give this a try and
> make sure it doesn't break suspend for dom0.
> 
> On Tue, Jul 21, 2020 at 12:17:36AM +, Anchal Agarwal wrote:
> > On Mon, Jul 20, 2020 at 11:37:05AM +0200, Roger Pau Monné wrote:
> > > CAUTION: This email originated from outside of the organization. Do not 
> > > click links or open attachments unless you can confirm the sender and 
> > > know the content is safe.
> > >
> > >
> > >
> > > On Sat, Jul 18, 2020 at 09:47:04PM -0400, Boris Ostrovsky wrote:
> > > > (Roger, question for you at the very end)
> > > >
> > > > On 7/17/20 3:10 PM, Anchal Agarwal wrote:
> > > > > On Wed, Jul 15, 2020 at 05:18:08PM -0400, Boris Ostrovsky wrote:
> > > > >> CAUTION: This email originated from outside of the organization. Do 
> > > > >> not click links or open attachments unless you can confirm the 
> > > > >> sender and know the content is safe.
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 7/15/20 4:49 PM, Anchal Agarwal wrote:
> > > > >>> On Mon, Jul 13, 2020 at 11:52:01AM -0400, Boris Ostrovsky wrote:
> > > > >>>> CAUTION: This email originated from outside of the organization. 
> > > > >>>> Do not click links or open attachments unless you can confirm the 
> > > > >>>> sender and know the content is safe.
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> On 7/2/20 2:21 PM, Anchal Agarwal wrote:
> > > > >>>> And PVH dom0.
> > > > >>> That's another good use case to make it work with however, I still
> > > > >>> think that should be tested/worked upon separately as the feature 
> > > > >>> itself
> > > > >>> (PVH Dom0) is very new.
> > > > >>
> > > > >> Same question here --- will this break PVH dom0?
> > > > >>
> > > > > I haven't tested it as a part of this series. Is that a blocker here?
> > > >
> > > >
> > > > I suspect dom0 will not do well now as far as hibernation goes, in which
> > > > case you are not breaking anything.
> > > >
> > > >
> > > > Roger?
> > >
> > > I sadly don't have any box ATM that supports hibernation where I
> > > could test it. We have hibernation support for PV dom0, so while I
> > > haven't done anything specific to support or test hibernation on PVH
> > > dom0 I would at least aim to not make this any worse, and hence the
> > > check should at least also fail for a PVH dom0?
> > >
> > > if (!xen_hvm_domain() || xen_initial_domain())
> > > return -ENODEV;
> > >
> > > Ie: none of this should be applied to a PVH dom0, as it doesn't have
> > > PV devices and hence should follow the bare metal device suspend.
> > >
> > So from what I understand you meant for any guest running on pvh dom0 
> > should not
> > hibernate if hibernation is triggered from within the guest or should they?
> 
> Er no to both I think. What I meant is that a PVH dom0 should be able
> to properly suspend, and we should make sure this work doesn't make
> this any harder (or breaks it if it's currently working).
> 
> Or at least that's how I understood the question raised by Boris.
> 
> You are adding code to the generic suspend path that's also used by dom0
> in order to perform bare metal suspension. This is fine now for a PV
> dom0 because the code is gated on xen_hvm_domain, but you should also
> take into account that a PVH dom0 is considered a HVM domain, and
> hence will get the notifier registered.
>
Ok that makes sense now. This is good to be safe, but my patch series is only to
support domU hibernation, so I am not sure if this will affect pvh dom0.
However, since I do not have a good way of testing sure I will add the check.

Moreover, in Patch-0004, I do register suspend/resume syscore_ops specifically 
for domU
hibernation only if its xen_hvm_domain. I don't see any reason that should not
be registered for domU's running on pvh dom0. Those suspend/resume callbacks 
will
only be invoked in case hibernation and will be skipped if generic suspend path
is in progress. Do you see any issue with that?

> > > Also I would contact the QubesOS guys, they rely heavily on the
> > > suspend feature for dom0, and that's something not currently tested by
> > > osstest so any breakages there go unnoticed.
> > >
> > Was this for me or Boris? If its the former then I have no idea how to?
> 
> I've now added Marek.
> 
> Roger.
Anchal



[PATCH v2 02/11] xenbus: add freeze/thaw/restore callbacks support

2020-07-02 Thread Anchal Agarwal
From: Munehisa Kamata 

Since commit b3e96c0c7562 ("xen: use freeze/restore/thaw PM events for
suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and
PMSG_RESTORE events for Xen suspend. However, they're actually assigned
to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume()
respectively, and only suspend and resume callbacks are supported at
driver level. To support PM suspend and PM hibernation, modify the bus
level PM callbacks to invoke not only device driver's suspend/resume but
also freeze/thaw/restore.

Note that we'll use freeze/restore callbacks even for PM suspend whereas
suspend/resume callbacks are normally used in the case, becausae the
existing xenbus device drivers already have suspend/resume callbacks
specifically designed for Xen suspend. So we can allow the device
drivers to keep the existing callbacks wihtout modification.

[Anchal Agarwal: Changelog]:
RFC v1->v2: Refactored the callbacks code
v1->v2: Use dev_warn instead of pr_warn, naming/initialization
conventions
Signed-off-by: Agarwal Anchal 
Signed-off-by: Munehisa Kamata 
---
 drivers/xen/xenbus/xenbus_probe.c | 96 ++-
 include/xen/xenbus.h  |  3 +
 2 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c 
b/drivers/xen/xenbus/xenbus_probe.c
index 38725d97d909..715919aacd28 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -599,16 +600,33 @@ int xenbus_dev_suspend(struct device *dev)
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
+   bool xen_suspend = xen_is_xen_suspend();
 
DPRINTK("%s", xdev->nodename);
 
if (dev->driver == NULL)
return 0;
drv = to_xenbus_driver(dev->driver);
-   if (drv->suspend)
-   err = drv->suspend(xdev);
-   if (err)
-   dev_warn(dev, "suspend failed: %i\n", err);
+   if (xen_suspend) {
+   if (drv->suspend)
+   err = drv->suspend(xdev);
+   } else {
+   if (drv->freeze) {
+   err = drv->freeze(xdev);
+   if (!err) {
+   free_otherend_watch(xdev);
+   free_otherend_details(xdev);
+   return 0;
+   }
+   }
+   }
+
+   if (err) {
+   dev_warn(>dev, "%s %s failed: %d\n", xen_suspend ?
+   "suspend" : "freeze", xdev->nodename, err);
+   return err;
+   }
+
return 0;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_suspend);
@@ -619,6 +637,7 @@ int xenbus_dev_resume(struct device *dev)
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
+   bool xen_suspend = xen_is_xen_suspend();
 
DPRINTK("%s", xdev->nodename);
 
@@ -627,23 +646,34 @@ int xenbus_dev_resume(struct device *dev)
drv = to_xenbus_driver(dev->driver);
err = talk_to_otherend(xdev);
if (err) {
-   dev_warn(dev, "resume (talk_to_otherend) failed: %i\n", err);
+   dev_warn(>dev, "%s (talk_to_otherend) %s failed: %d\n",
+   xen_suspend ? "resume" : "restore",
+   xdev->nodename, err);
return err;
}
 
-   xdev->state = XenbusStateInitialising;
+   if (xen_suspend) {
+   xdev->state = XenbusStateInitialising;
+   if (drv->resume)
+   err = drv->resume(xdev);
+   } else {
+   if (drv->restore)
+   err = drv->restore(xdev);
+   }
 
-   if (drv->resume) {
-   err = drv->resume(xdev);
-   if (err) {
-   dev_warn(dev, "resume failed: %i\n", err);
-   return err;
-   }
+   if (err) {
+   dev_warn(>dev, "%s %s failed: %d\n",
+   xen_suspend ? "resume" : "restore",
+   xdev->nodename, err);
+   return err;
}
 
err = watch_otherend(xdev);
if (err) {
-   dev_warn(dev, "resume (watch_otherend) failed: %d\n", err);
+   dev_warn(>dev, "%s (watch_otherend) %s failed: %d.\n",
+   xen_suspend ? "resume" : "restore",
+   xdev->nodename, err);
+
  

[PATCH v2 10/11] xen: Update sched clock offset to avoid system instability in hibernation

2020-07-02 Thread Anchal Agarwal
Save/restore xen_sched_clock_offset in syscore suspend/resume during PM
hibernation. Commit '867cefb4cb1012: ("xen: Fix x86 sched_clock() interface
for xen")' fixes xen guest time handling during migration. A similar issue
is seen during PM hibernation when system runs CPU intensive workload.
Post resume pvclock resets the value to 0 however, xen sched_clock_offset
is never updated. System instability is seen during resume from hibernation
when system is under heavy CPU load. Since xen_sched_clock_offset is not
updated, system does not see the monotonic clock value and the scheduler
would then think that heavy CPU hog tasks need more time in CPU, causing
the system to freeze

Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/suspend.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 10cd14326472..4d8b1d2390b9 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -95,6 +95,7 @@ static int xen_syscore_suspend(void)
 
gnttab_suspend();
xen_manage_runstate_time(-1);
+   xen_save_sched_clock_offset();
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, );
@@ -110,6 +111,12 @@ static void xen_syscore_resume(void)
xen_hvm_map_shared_info();
 
pvclock_resume();
+   /*
+* Restore xen_sched_clock_offset during resume to maintain
+* monotonic clock value
+*/
+   xen_restore_sched_clock_offset();
+
xen_manage_runstate_time(0);
gnttab_resume();
 }
-- 
2.20.1




[PATCH v2 09/11] xen: Introduce wrapper for save/restore sched clock offset

2020-07-02 Thread Anchal Agarwal
Introduce wrappers for save/restore xen_sched_clock_offset to be
used by PM hibernation code to avoid system instability during resume.

Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/time.c| 15 +--
 arch/x86/xen/xen-ops.h |  2 ++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index c8897aad13cd..676950eb0cb5 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -386,12 +386,23 @@ static const struct pv_time_ops xen_time_ops __initconst 
= {
 static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
 static u64 xen_clock_value_saved;
 
+/*This is needed to maintain a monotonic clock value during PM hibernation */
+void xen_save_sched_clock_offset(void)
+{
+   xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+}
+
+void xen_restore_sched_clock_offset(void)
+{
+   xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+}
+
 void xen_save_time_memory_area(void)
 {
struct vcpu_register_time_memory_area t;
int ret;
 
-   xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+   xen_save_sched_clock_offset();
 
if (!xen_clock)
return;
@@ -434,7 +445,7 @@ void xen_restore_time_memory_area(void)
 out:
/* Need pvclock_resume() before using xen_clocksource_read(). */
pvclock_resume();
-   xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+   xen_restore_sched_clock_offset();
 }
 
 static void xen_setup_vsyscall_time_info(void)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 41e9e9120f2d..f4b78b19493b 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -70,6 +70,8 @@ void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
 void xen_init_time_ops(void);
 void xen_hvm_init_time_ops(void);
+void xen_save_sched_clock_offset(void);
+void xen_restore_sched_clock_offset(void);
 
 irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
 
-- 
2.20.1




[PATCH v2 07/11] xen-netfront: add callbacks for PM suspend and hibernation

2020-07-02 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. The freeze handler simply disconnects the frotnend from the
backend and frees resources associated with queues after disabling the
net_device from the system. The restore handler just changes the
frontend state and let the xenbus handler to re-allocate the resources
and re-connect to the backend. This can be performed transparently to
the rest of the system. The handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks
for Xen suspend without modification. Freezing netfront devices is
normally expected to finish within a few hundred milliseconds, but it
can rarely take more than 5 seconds and hit the hard coded timeout,
it would depend on backend state which may be congested and/or have
complex configuration. While it's rare case, longer default timeout
seems a bit more reasonable here to avoid hitting the timeout.
Also, make it configurable via module parameter so that we can cover
broader setups than what we know currently.

[Anchal Agarwal: Changelog]:
RFCv1->RFCv2: Variable name fix and checkpatch.pl fixes]

Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/net/xen-netfront.c | 98 +-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 482c6c8b0fb7..65edcdd6e05f 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -56,6 +57,12 @@
 #include 
 #include 
 
+enum netif_freeze_state {
+   NETIF_FREEZE_STATE_UNFROZEN,
+   NETIF_FREEZE_STATE_FREEZING,
+   NETIF_FREEZE_STATE_FROZEN,
+};
+
 /* Module parameters */
 #define MAX_QUEUES_DEFAULT 8
 static unsigned int xennet_max_queues;
@@ -63,6 +70,12 @@ module_param_named(max_queues, xennet_max_queues, uint, 
0644);
 MODULE_PARM_DESC(max_queues,
 "Maximum number of queues per virtual interface");
 
+static unsigned int netfront_freeze_timeout_secs = 10;
+module_param_named(freeze_timeout_secs,
+  netfront_freeze_timeout_secs, uint, 0644);
+MODULE_PARM_DESC(freeze_timeout_secs,
+"timeout when freezing netfront device in seconds");
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -160,6 +173,10 @@ struct netfront_info {
struct netfront_stats __percpu *tx_stats;
 
atomic_t rx_gso_checksum_fixup;
+
+   int freeze_state;
+
+   struct completion wait_backend_disconnected;
 };
 
 struct netfront_rx_info {
@@ -721,6 +738,21 @@ static int xennet_close(struct net_device *dev)
return 0;
 }
 
+static int xennet_disable_interrupts(struct net_device *dev)
+{
+   struct netfront_info *np = netdev_priv(dev);
+   unsigned int num_queues = dev->real_num_tx_queues;
+   unsigned int queue_index;
+   struct netfront_queue *queue;
+
+   for (queue_index = 0; queue_index < num_queues; ++queue_index) {
+   queue = >queues[queue_index];
+   disable_irq(queue->tx_irq);
+   disable_irq(queue->rx_irq);
+   }
+   return 0;
+}
+
 static void xennet_move_rx_slot(struct netfront_queue *queue, struct sk_buff 
*skb,
grant_ref_t ref)
 {
@@ -1301,6 +1333,8 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
 
np->queues = NULL;
 
+   init_completion(>wait_backend_disconnected);
+
err = -ENOMEM;
np->rx_stats = netdev_alloc_pcpu_stats(struct netfront_stats);
if (np->rx_stats == NULL)
@@ -1794,6 +1828,50 @@ static int xennet_create_queues(struct netfront_info 
*info,
return 0;
 }
 
+static int netfront_freeze(struct xenbus_device *dev)
+{
+   struct netfront_info *info = dev_get_drvdata(>dev);
+   unsigned long timeout = netfront_freeze_timeout_secs * HZ;
+   int err = 0;
+
+   xennet_disable_interrupts(info->netdev);
+
+   netif_device_detach(info->netdev);
+
+   info->freeze_state = NETIF_FREEZE_STATE_FREEZING;
+
+   /* Kick the backend to disconnect */
+   xenbus_switch_state(dev, XenbusStateClosing);
+
+   /* We don't want to move forward before the frontend is diconnected
+* from the backend cleanly.
+*/
+   timeout = wait_for_completion_timeout(>wait_backend_disconnected,
+ timeout);
+   if (!timeout) {
+   err = -EBUSY;
+   xenbus_dev_error(dev, err, "Freezing timed out;"
+"the device may become inconsistent state");
+   return err;
+   }
+
+   /* Tear down queues */
+   xennet_disconnect_backend(info);
+   xennet_destroy_queues(info);
+
+   i

[PATCH v2 06/11] xen-blkfront: add callbacks for PM suspend and hibernation

2020-07-02 Thread Anchal Agarwal
From: Munehisa Kamata 

S4 power transisiton states are much different than xen
suspend/resume. Former is visible to the guest and frontend drivers should
be aware of the state transistions and should be able to take appropriate
actions when needed. In transition to S4 we need to make sure that at least
all the in-flight blkif requests get completed, since they probably contain
bits of the guest's memory image and that's not going to get saved any
other way. Hence, re-issuing of in-flight requests as in case of xen resume
will not work here. This is in contrast to xen-suspend where we need to
freeze with as little processing as possible to avoid dirtying RAM late in
the migration cycle and we know that in-flight data can wait.

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
events, need to implement these xenbus_driver callbacks. The freeze handler
stops block-layer queue and disconnect the frontend from the backend while
freeing ring_info and associated resources. Before disconnecting from the
backend, we need to prevent any new IO from being queued and wait for
existing IO to complete. Freeze/unfreeze of the queues will guarantee
that there are no requests in use on the shared ring. However, for sanity
we should check state of the ring before disconnecting to make sure that
there are no outstanding requests to be processed on the ring.
The restore handler re-allocates ring_info, unquiesces and unfreezes the
queue and re-connect to the backend, so that rest of the kernel can
continue to use the block device transparently.

Note:For older backends,if a backend doesn't have commit'12ea729645ace'
xen/blkback: unmap all persistent grants when frontend gets disconnected,
the frontend may see massive amount of grant table warning when freeing
resources.
[   36.852659] deferring g.e. 0xf9 (pfn 0x)
[   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!

In this case, persistent grants would need to be disabled.

[Anchal Agarwal: Changelog]:
RFC v1->v2: Removed timeout per request before disconnect during
blkfront freeze.
Added queue freeze/quiesce to the blkfront_freeze
Code cleanup
RFC v2->v3: None
RFC v3->v1: Code cleanup, Refractoring
v1->v2: * remove err variable in blkfront_freeze
* BugFix: error handling if rings are still busy
  after queue freeze/quiesce and returnign driver to
  connected state
* add TODO if blkback fails to disconnect on freeze
* Code formatting

Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/block/xen-blkfront.c | 122 +--
 1 file changed, 118 insertions(+), 4 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3b889ea950c2..9e3ed1b9f509 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -48,6 +48,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -80,6 +82,8 @@ enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
BLKIF_STATE_SUSPENDED,
+   BLKIF_STATE_FREEZING,
+   BLKIF_STATE_FROZEN,
 };
 
 struct grant {
@@ -219,6 +223,7 @@ struct blkfront_info
struct list_head requests;
struct bio_list bio_list;
struct list_head info_list;
+   struct completion wait_backend_disconnected;
 };
 
 static unsigned int nr_minors;
@@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
blkif_set_queue_limits(info);
+   init_completion(>wait_backend_disconnected);
 
return 0;
 }
@@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
unsigned int i;
struct blkfront_ring_info *rinfo;
 
+   if (info->connected == BLKIF_STATE_FREEZING)
+   goto free_rings;
/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
if (info->rq)
blk_mq_stop_hw_queues(info->rq);
 
+free_rings:
for_each_rinfo(info, rinfo, i)
blkif_free_ring(rinfo);
 
@@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->dev_info;
 
-   if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
+   if (unlikely(info->connected != BLKIF_STATE_CONNECTED &&
+   info->connected != BLKIF_

[PATCH v2 08/11] x86/xen: save and restore steal clock during PM hibernation

2020-07-02 Thread Anchal Agarwal
Save/restore steal times in syscore suspend/resume during PM
hibernation. Commit '5e25f5db6abb9: ("xen/time: do not
decrease steal time after live migration on xen")' fixes xen
guest steal time handling during migration. A similar issue is seen
during PM hibernation.
Currently, steal time accounting code in scheduler expects steal clock
callback to provide monotonically increasing value. If the accounting
code receives a smaller value than previous one, it uses a negative
value to calculate steal time and results in incorrectly updated idle
and steal time accounting. This breaks userspace tools which read
/proc/stat.

top - 08:05:35 up  2:12,  3 users,  load average: 0.00, 0.07, 0.23
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,30100.0%id,  0.0%wa,  0.0%hi, 
0.0%si,-1253874204672.0%st

This can actually happen when a Xen PVHVM guest gets restored from
hibernation, because such a restored guest is just a fresh domain from
Xen perspective and the time information in runstate info starts over
from scratch.

Changelog:
v1->v2: Removed patches that introduced new function calls for saving/restoring
sched clock offset and using existing ones that are used during LM

Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/suspend.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index e8c924e93fc5..10cd14326472 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -94,10 +94,9 @@ static int xen_syscore_suspend(void)
int ret;
 
gnttab_suspend();
-
+   xen_manage_runstate_time(-1);
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
-
ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, );
if (!ret)
HYPERVISOR_shared_info = _dummy_shared_info;
@@ -111,7 +110,7 @@ static void xen_syscore_resume(void)
xen_hvm_map_shared_info();
 
pvclock_resume();
-
+   xen_manage_runstate_time(0);
gnttab_resume();
 }
 
-- 
2.20.1




[PATCH v2 00/11] Fix PM hibernation in Xen guests

2020-07-02 Thread Anchal Agarwal
Hello,
This series fixes PM hibernation for hvm guests running on xen hypervisor.
The running guest could now be hibernated and resumed successfully at a
later time. The fixes for PM hibernation are added to block and
network device drivers i.e xen-blkfront and xen-netfront. Any other driver
that needs to add S4 support if not already, can follow same method of
introducing freeze/thaw/restore callbacks.
The patches had been tested against upstream kernel and xen4.11. Large
scale testing is also done on Xen based Amazon EC2 instances. All this testing
involved running memory exhausting workload in the background.

Doing guest hibernation does not involve any support from hypervisor and
this way guest has complete control over its state. Infrastructure
restrictions for saving up guest state can be overcome by guest initiated
hibernation.

These patches were send out as RFC before and all the feedback had been
incorporated in the patches. The last v1 could be found here:

[v1]: https://lkml.org/lkml/2020/5/19/1312
All comments and feedback from v1 had been incorporated in v2 series.
Any comments/suggestions are welcome

Known issues:
1.KASLR causes intermittent hibernation failures. VM fails to resumes and
has to be restarted. I will investigate this issue separately and shouldn't
be a blocker for this patch series.
2. During hibernation, I observed sometimes that freezing of tasks fails due
to busy XFS workqueuei[xfs-cil/xfs-sync]. This is also intermittent may be 1
out of 200 runs and hibernation is aborted in this case. Re-trying hibernation
may work. Also, this is a known issue with hibernation and some
filesystems like XFS has been discussed by the community for years with not an
effectve resolution at this point.

Testing How to:
---
1. Setup xen hypervisor on a physical machine[ I used Ubuntu 16.04 +upstream
xen-4.11]
2. Bring up a HVM guest w/t kernel compiled with hibernation patches
[I used ubuntu18.04 netboot bionic images and also Amazon Linux on-prem images].
3. Create a swap file size=RAM size
4. Update grub parameters and reboot
5. Trigger pm-hibernation from within the VM

Example:
Set up a file-backed swap space. Swap file size>=Total memory on the system
sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
sudo chmod 600 /swap
sudo mkswap /swap
sudo swapon /swap

Update resume device/resume offset in grub if using swap file:
resume=/dev/xvda1 resume_offset=200704 no_console_suspend=1

Execute:

sudo pm-hibernate
OR
echo disk > /sys/power/state && echo reboot > /sys/power/disk

Compute resume offset code:
"
#!/usr/bin/env python
import sys
import array
import fcntl

#swap file
f = open(sys.argv[1], 'r')
buf = array.array('L', [0])

#FIBMAP
ret = fcntl.ioctl(f.fileno(), 0x01, buf)
print buf[0]
"


Aleksei Besogonov (1):
  PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

Anchal Agarwal (4):
  x86/xen: Introduce new function to map HYPERVISOR_shared_info on
Resume
  x86/xen: save and restore steal clock during PM hibernation
  xen: Introduce wrapper for save/restore sched clock offset
  xen: Update sched clock offset to avoid system instability in
hibernation

Munehisa Kamata (5):
  xen/manage: keep track of the on-going suspend mode
  xenbus: add freeze/thaw/restore callbacks support
  x86/xen: add system core suspend and resume callbacks
  xen-blkfront: add callbacks for PM suspend and hibernation
  xen-netfront: add callbacks for PM suspend and hibernation

Thomas Gleixner (1):
  genirq: Shutdown irq chips in suspend/resume during hibernation

 arch/x86/xen/enlighten_hvm.c  |   7 ++
 arch/x86/xen/suspend.c|  53 +
 arch/x86/xen/time.c   |  15 +++-
 arch/x86/xen/xen-ops.h|   3 +
 drivers/block/xen-blkfront.c  | 122 +-
 drivers/net/xen-netfront.c|  98 +++-
 drivers/xen/events/events_base.c  |   1 +
 drivers/xen/manage.c  |  60 +++
 drivers/xen/xenbus/xenbus_probe.c |  96 +++
 include/linux/irq.h   |   2 +
 include/xen/xen-ops.h |   3 +
 include/xen/xenbus.h  |   3 +
 kernel/irq/chip.c |   2 +-
 kernel/irq/internals.h|   1 +
 kernel/irq/pm.c   |  31 +---
 kernel/power/user.c   |   6 +-
 16 files changed, 470 insertions(+), 33 deletions(-)

-- 
2.20.1




[PATCH v2 03/11] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume

2020-07-02 Thread Anchal Agarwal
Introduce a small function which re-uses shared page's PA allocated
during guest initialization time in reserve_shared_info() and not
allocate new page during resume flow.
It also  does the mapping of shared_info_page by calling
xen_hvm_init_shared_info() to use the function.

Changelog:
v1->v2: Remove extra check for shared_info_pfn to be NULL

Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/enlighten_hvm.c | 6 ++
 arch/x86/xen/xen-ops.h   | 1 +
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 3e89b0067ff0..d91099928746 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -28,6 +28,12 @@
 
 static unsigned long shared_info_pfn;
 
+void xen_hvm_map_shared_info(void)
+{
+   xen_hvm_init_shared_info();
+   HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
+}
+
 void xen_hvm_init_shared_info(void)
 {
struct xen_add_to_physmap xatp;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 53b224fd6177..41e9e9120f2d 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -54,6 +54,7 @@ void xen_enable_sysenter(void);
 void xen_enable_syscall(void);
 void xen_vcpu_restore(void);
 
+void xen_hvm_map_shared_info(void);
 void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
-- 
2.20.1




[PATCH v2 01/11] xen/manage: keep track of the on-going suspend mode

2020-07-02 Thread Anchal Agarwal
From: Munehisa Kamata 

Guest hibernation is different from xen suspend/resume/live migration.
Xen save/restore does not use pm_ops as is needed by guest hibernation.
Hibernation in guest follows ACPI path and is guest inititated , the
hibernation image is saved within guest as compared to later modes
which are xen toolstack assisted and image creation/storage is in
control of hypervisor/host machine.
To differentiate between Xen suspend and PM hibernation, keep track
of the on-going suspend mode by mainly using a new PM notifier.
Introduce simple functions which help to know the on-going suspend mode
so that other Xen-related code can behave differently according to the
current suspend mode.
Since Xen suspend doesn't have corresponding PM event, its main logic
is modfied to acquire pm_mutex and set the current mode.

Though, acquirng pm_mutex is still right thing to do, we may
see deadlock if PM hibernation is interrupted by Xen suspend.
PM hibernation depends on xenwatch thread to process xenbus state
transactions, but the thread will sleep to wait pm_mutex which is
already held by PM hibernation context in the scenario. Xen shutdown
code may need some changes to avoid the issue.

[Anchal Agarwal: Changelog]:
 RFC v1->v2: Code refactoring
 v1->v2: Remove unused functions for PM SUSPEND/PM hibernation

Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/xen/manage.c  | 60 +++
 include/xen/xen-ops.h |  1 +
 2 files changed, 61 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..69833fd6cfd1 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -40,6 +41,20 @@ enum shutdown_state {
 /* Ignore multiple shutdown requests. */
 static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
 
+enum suspend_modes {
+   NO_SUSPEND = 0,
+   XEN_SUSPEND,
+   PM_HIBERNATION,
+};
+
+/* Protected by pm_mutex */
+static enum suspend_modes suspend_mode = NO_SUSPEND;
+
+bool xen_is_xen_suspend(void)
+{
+   return suspend_mode == XEN_SUSPEND;
+}
+
 struct suspend_info {
int cancelled;
 };
@@ -99,6 +114,10 @@ static void do_suspend(void)
int err;
struct suspend_info si;
 
+   lock_system_sleep();
+
+   suspend_mode = XEN_SUSPEND;
+
shutting_down = SHUTDOWN_SUSPEND;
 
err = freeze_processes();
@@ -162,6 +181,10 @@ static void do_suspend(void)
thaw_processes();
 out:
shutting_down = SHUTDOWN_INVALID;
+
+   suspend_mode = NO_SUSPEND;
+
+   unlock_system_sleep();
 }
 #endif /* CONFIG_HIBERNATE_CALLBACKS */
 
@@ -387,3 +410,40 @@ int xen_setup_shutdown_event(void)
 EXPORT_SYMBOL_GPL(xen_setup_shutdown_event);
 
 subsys_initcall(xen_setup_shutdown_event);
+
+static int xen_pm_notifier(struct notifier_block *notifier,
+   unsigned long pm_event, void *unused)
+{
+   switch (pm_event) {
+   case PM_SUSPEND_PREPARE:
+   case PM_HIBERNATION_PREPARE:
+   case PM_RESTORE_PREPARE:
+   suspend_mode = PM_HIBERNATION;
+   break;
+   case PM_POST_SUSPEND:
+   case PM_POST_RESTORE:
+   case PM_POST_HIBERNATION:
+   /* Set back to the default */
+   suspend_mode = NO_SUSPEND;
+   break;
+   default:
+   pr_warn("Receive unknown PM event 0x%lx\n", pm_event);
+   return -EINVAL;
+   }
+
+   return 0;
+};
+
+static struct notifier_block xen_pm_notifier_block = {
+   .notifier_call = xen_pm_notifier
+};
+
+static int xen_setup_pm_notifier(void)
+{
+   if (!xen_hvm_domain())
+   return -ENODEV;
+
+   return register_pm_notifier(_pm_notifier_block);
+}
+
+subsys_initcall(xen_setup_pm_notifier);
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 39a5580f8feb..2521d6a306cd 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -40,6 +40,7 @@ u64 xen_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
+bool xen_is_xen_suspend(void);
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.20.1




Re: [PATCH 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume

2020-06-08 Thread Anchal Agarwal
On Fri, Jun 05, 2020 at 05:39:54PM -0400, Boris Ostrovsky wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On 6/4/20 7:03 PM, Anchal Agarwal wrote:
> > On Sat, May 30, 2020 at 07:02:01PM -0400, Boris Ostrovsky wrote:
> >> CAUTION: This email originated from outside of the organization. Do not 
> >> click links or open attachments unless you can confirm the sender and know 
> >> the content is safe.
> >>
> >>
> >>
> >> On 5/19/20 7:25 PM, Anchal Agarwal wrote:
> >>> Introduce a small function which re-uses shared page's PA allocated
> >>> during guest initialization time in reserve_shared_info() and not
> >>> allocate new page during resume flow.
> >>> It also  does the mapping of shared_info_page by calling
> >>> xen_hvm_init_shared_info() to use the function.
> >>>
> >>> Signed-off-by: Anchal Agarwal 
> >>> ---
> >>>  arch/x86/xen/enlighten_hvm.c | 7 +++
> >>>  arch/x86/xen/xen-ops.h   | 1 +
> >>>  2 files changed, 8 insertions(+)
> >>>
> >>> diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
> >>> index e138f7de52d2..75b1ec7a0fcd 100644
> >>> --- a/arch/x86/xen/enlighten_hvm.c
> >>> +++ b/arch/x86/xen/enlighten_hvm.c
> >>> @@ -27,6 +27,13 @@
> >>>
> >>>  static unsigned long shared_info_pfn;
> >>>
> >>> +void xen_hvm_map_shared_info(void)
> >>> +{
> >>> + xen_hvm_init_shared_info();
> >>> + if (shared_info_pfn)
> >>> + HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
> >>> +}
> >>> +
> >>
> >> AFAICT it is only called once so I don't see a need for new routine.
> >>
> >>
> > HYPERVISOR_shared_info can only be mapped in this scope without refactoring
> > much of the code.
> 
> 
> Refactoring what? All am suggesting is
>
shared_info_pfn does not seem to be in scope here, it's scope is limited
to enlighten_hvm.c. That's the reason I introduced a new function there.

> --- a/arch/x86/xen/suspend.c
> +++ b/arch/x86/xen/suspend.c
> @@ -124,7 +124,9 @@ static void xen_syscore_resume(void)
> return;
> 
> /* No need to setup vcpu_info as it's already moved off */
> -   xen_hvm_map_shared_info();
> +   xen_hvm_init_shared_info();
> +   if (shared_info_pfn)
> +   HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
> 
> pvclock_resume();
> 
> >> And is it possible for shared_info_pfn to be NULL in resume path (which
> >> is where this is called)?
> >>
> >>
> > I don't think it should be, still a sanity check but I don't think its 
> > needed there
> > because hibernation will fail in any case if thats the case.
> 
> 
> If shared_info_pfn is NULL you'd have problems long before hibernation
> started. We set it in xen_hvm_guest_init() and never touch again.
> 
> 
> In fact, I'd argue that it should be __ro_after_init.
> 
> 
I agree, and I should have mentioned that I will remove that check and its not
necessary as this gets mapped way early in the boot process.
> > However, HYPERVISOR_shared_info does needs to be re-mapped on resume as its 
> > been
> > marked to dummy address on suspend. Its also safe in case va changes.
> > Does the answer your question?
> 
> 
> I wasn't arguing whether HYPERVISOR_shared_info needs to be set, I was
> only saying that shared_info_pfn doesn't need to be tested.
> 
Got it. :)
> 
> -boris
> 
Thanks,
Anchal
> 



[PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-05-21 Thread Anchal Agarwal
From: Munehisa Kamata 

S4 power transisiton states are much different than xen
suspend/resume. Former is visible to the guest and frontend drivers should
be aware of the state transistions and should be able to take appropriate
actions when needed. In transition to S4 we need to make sure that at least
all the in-flight blkif requests get completed, since they probably contain
bits of the guest's memory image and that's not going to get saved any
other way. Hence, re-issuing of in-flight requests as in case of xen resume
will not work here. This is in contrast to xen-suspend where we need to
freeze with as little processing as possible to avoid dirtying RAM late in
the migration cycle and we know that in-flight data can wait.

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
events, need to implement these xenbus_driver callbacks. The freeze handler
stops block-layer queue and disconnect the frontend from the backend while
freeing ring_info and associated resources. Before disconnecting from the
backend, we need to prevent any new IO from being queued and wait for
existing IO to complete. Freeze/unfreeze of the queues will guarantee that
there are no requests in use on the shared ring. However, for sanity we
should check state of the ring before disconnecting to make sure that there
are no outstanding requests to be processed on the ring. The restore
handler re-allocates ring_info, unquiesces and unfreezes the queue
and re-connect to the backend, so that rest of the kernel can continue
to use the block device transparently.

Note:For older backends,if a backend doesn't have commit'12ea729645ace'
xen/blkback: unmap all persistent grants when frontend gets disconnected,
the frontend may see massive amount of grant table warning when freeing
resources.
[   36.852659] deferring g.e. 0xf9 (pfn 0x)
[   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!

In this case, persistent grants would need to be disabled.

[Anchal Changelog: Removed timeout/request during blkfront freeze.
Reworked the whole patch to work with blk-mq and incorporate upstream's
comments]

Fixes: Build errors reported by kbuild due to linebreak
Reported-by: kbuild test robot 

Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/block/xen-blkfront.c | 118 +--
 1 file changed, 112 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3b889ea950c2..34b0e51697b6 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -48,6 +48,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -80,6 +82,8 @@ enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
BLKIF_STATE_SUSPENDED,
+   BLKIF_STATE_FREEZING,
+   BLKIF_STATE_FROZEN
 };
 
 struct grant {
@@ -219,6 +223,7 @@ struct blkfront_info
struct list_head requests;
struct bio_list bio_list;
struct list_head info_list;
+   struct completion wait_backend_disconnected;
 };
 
 static unsigned int nr_minors;
@@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
blkif_set_queue_limits(info);
+   init_completion(>wait_backend_disconnected);
 
return 0;
 }
@@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, 
unsigned int *offset)
case XEN_SCSI_DISK5_MAJOR:
case XEN_SCSI_DISK6_MAJOR:
case XEN_SCSI_DISK7_MAJOR:
-   *offset = (*minor / PARTS_PER_DISK) + 
+   *offset = (*minor / PARTS_PER_DISK) +
((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
EMULATED_SD_DISK_NAME_OFFSET;
*minor = *minor +
@@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, 
unsigned int *offset)
case XEN_SCSI_DISK13_MAJOR:
case XEN_SCSI_DISK14_MAJOR:
case XEN_SCSI_DISK15_MAJOR:
-   *offset = (*minor / PARTS_PER_DISK) + 
+   *offset = (*minor / PARTS_PER_DISK) +
((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
EMULATED_SD_DISK_NAME_OFFSET;
*minor = *minor +
@@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
unsigned int i;
struct blkfront_ring_info *rinfo;
 
+   if (info->connected == BLKIF_STATE_FREEZING)
+   goto free_rings;
/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
BLKIF_

[PATCH 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

2020-05-19 Thread Anchal Agarwal
From: Aleksei Besogonov 

The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation
offset on a running kernel to enable hibernating to a swap file.
However, it doesn't actually update the swsusp_resume_block variable. As
a result, the hibernation fails at the last step (after all the data is
written out) in the validation of the swap signature in
mark_swapfiles().

Before this patch, the command line processing was the only place where
swsusp_resume_block was set.
[Changelog: Resolved patch conflict as code fragmented to
snapshot_set_swap_area]
Signed-off-by: Aleksei Besogonov 
Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
---
 kernel/power/user.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/power/user.c b/kernel/power/user.c
index 7959449765d9..1afa1f0a223e 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -235,8 +235,12 @@ static int snapshot_set_swap_area(struct snapshot_data 
*data,
return -EINVAL;
}
data->swap = swap_type_of(swdev, offset, NULL);
-   if (data->swap < 0)
+   if (data->swap < 0) {
return -ENODEV;
+   } else {
+   swsusp_resume_device = swdev;
+   swsusp_resume_block = offset;
+   }
return 0;
 }
 
-- 
2.24.1.AMZN




[PATCH 07/12] xen-netfront: add callbacks for PM suspend and hibernation

2020-05-19 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. The freeze handler simply disconnects the frotnend from the
backend and frees resources associated with queues after disabling the
net_device from the system. The restore handler just changes the
frontend state and let the xenbus handler to re-allocate the resources
and re-connect to the backend. This can be performed transparently to
the rest of the system. The handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks
for Xen suspend without modification. Freezing netfront devices is
normally expected to finish within a few hundred milliseconds, but it
can rarely take more than 5 seconds and hit the hard coded timeout,
it would depend on backend state which may be congested and/or have
complex configuration. While it's rare case, longer default timeout
seems a bit more reasonable here to avoid hitting the timeout.
Also, make it configurable via module parameter so that we can cover
broader setups than what we know currently.

[Anchal changelog: Variable name fix and checkpatch.pl fixes]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/net/xen-netfront.c | 98 +-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 482c6c8b0fb7..65edcdd6e05f 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -56,6 +57,12 @@
 #include 
 #include 
 
+enum netif_freeze_state {
+   NETIF_FREEZE_STATE_UNFROZEN,
+   NETIF_FREEZE_STATE_FREEZING,
+   NETIF_FREEZE_STATE_FROZEN,
+};
+
 /* Module parameters */
 #define MAX_QUEUES_DEFAULT 8
 static unsigned int xennet_max_queues;
@@ -63,6 +70,12 @@ module_param_named(max_queues, xennet_max_queues, uint, 
0644);
 MODULE_PARM_DESC(max_queues,
 "Maximum number of queues per virtual interface");
 
+static unsigned int netfront_freeze_timeout_secs = 10;
+module_param_named(freeze_timeout_secs,
+  netfront_freeze_timeout_secs, uint, 0644);
+MODULE_PARM_DESC(freeze_timeout_secs,
+"timeout when freezing netfront device in seconds");
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -160,6 +173,10 @@ struct netfront_info {
struct netfront_stats __percpu *tx_stats;
 
atomic_t rx_gso_checksum_fixup;
+
+   int freeze_state;
+
+   struct completion wait_backend_disconnected;
 };
 
 struct netfront_rx_info {
@@ -721,6 +738,21 @@ static int xennet_close(struct net_device *dev)
return 0;
 }
 
+static int xennet_disable_interrupts(struct net_device *dev)
+{
+   struct netfront_info *np = netdev_priv(dev);
+   unsigned int num_queues = dev->real_num_tx_queues;
+   unsigned int queue_index;
+   struct netfront_queue *queue;
+
+   for (queue_index = 0; queue_index < num_queues; ++queue_index) {
+   queue = >queues[queue_index];
+   disable_irq(queue->tx_irq);
+   disable_irq(queue->rx_irq);
+   }
+   return 0;
+}
+
 static void xennet_move_rx_slot(struct netfront_queue *queue, struct sk_buff 
*skb,
grant_ref_t ref)
 {
@@ -1301,6 +1333,8 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
 
np->queues = NULL;
 
+   init_completion(>wait_backend_disconnected);
+
err = -ENOMEM;
np->rx_stats = netdev_alloc_pcpu_stats(struct netfront_stats);
if (np->rx_stats == NULL)
@@ -1794,6 +1828,50 @@ static int xennet_create_queues(struct netfront_info 
*info,
return 0;
 }
 
+static int netfront_freeze(struct xenbus_device *dev)
+{
+   struct netfront_info *info = dev_get_drvdata(>dev);
+   unsigned long timeout = netfront_freeze_timeout_secs * HZ;
+   int err = 0;
+
+   xennet_disable_interrupts(info->netdev);
+
+   netif_device_detach(info->netdev);
+
+   info->freeze_state = NETIF_FREEZE_STATE_FREEZING;
+
+   /* Kick the backend to disconnect */
+   xenbus_switch_state(dev, XenbusStateClosing);
+
+   /* We don't want to move forward before the frontend is diconnected
+* from the backend cleanly.
+*/
+   timeout = wait_for_completion_timeout(>wait_backend_disconnected,
+ timeout);
+   if (!timeout) {
+   err = -EBUSY;
+   xenbus_dev_error(dev, err, "Freezing timed out;"
+"the device may become inconsistent state");
+   return err;
+   }
+
+   /* Tear down queues */
+   xennet_disconnect_backend(info);
+   xennet_destroy_queues(info);
+
+   info->freeze_state = NETI

[PATCH 05/12] genirq: Shutdown irq chips in suspend/resume during hibernation

2020-05-19 Thread Anchal Agarwal
Many legacy device drivers do not implement power management (PM)
functions which means that interrupts requested by these drivers stay
in active state when the kernel is hibernated.

This does not matter on bare metal and on most hypervisors because the
interrupt is restored on resume without any noticable side effects as
it stays connected to the same physical or virtual interrupt line.

The XEN interrupt mechanism is different as it maintains a mapping
between the Linux interrupt number and a XEN event channel. If the
interrupt stays active on hibernation this mapping is preserved but
there is unfortunately no guarantee that on resume the same event
channels are reassigned to these devices. This can result in event
channel conflicts which prevent the affected devices from being
restored correctly.

One way to solve this would be to add the necessary power management
functions to all affected legacy device drivers, but that's a
questionable effort which does not provide any benefits on non-XEN
environments.

The least intrusive and most efficient solution is to provide a
mechanism which allows the core interrupt code to tear down these
interrupts on hibernation and bring them back up again on resume. This
allows the XEN event channel mechanism to assign an arbitrary event
channel on resume without affecting the functionality of these
devices.

Fortunately all these device interrupts are handled by a dedicated XEN
interrupt chip so the chip can be marked that all interrupts connected
to it are handled this way. This is pretty much in line with the other
interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.

Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
it the core interrupt suspend/resume paths.

Signed-off-by: Anchal Agarwal 
Signed-off--by: Thomas Gleixner 
---
 drivers/xen/events/events_base.c |  1 +
 include/linux/irq.h  |  2 ++
 kernel/irq/chip.c|  2 +-
 kernel/irq/internals.h   |  1 +
 kernel/irq/pm.c  | 31 ++-
 5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 3a791c8485d0..decf65bd3451 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1613,6 +1613,7 @@ static struct irq_chip xen_pirq_chip __read_mostly = {
.irq_set_affinity   = set_affinity_irq,
 
.irq_retrigger  = retrigger_dynirq,
+   .flags  = IRQCHIP_SHUTDOWN_ON_SUSPEND,
 };
 
 static struct irq_chip xen_percpu_chip __read_mostly = {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 8d5bc2c237d7..94cb8c994d06 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -542,6 +542,7 @@ struct irq_chip {
  * IRQCHIP_EOI_THREADED:   Chip requires eoi() on unmask in threaded mode
  * IRQCHIP_SUPPORTS_LEVEL_MSI  Chip can provide two doorbells for Level MSIs
  * IRQCHIP_SUPPORTS_NMI:   Chip can deliver NMIs, only for root irqchips
+ * IRQCHIP_SHUTDOWN_ON_SUSPEND: Shutdown non wake irqs in the suspend path
  */
 enum {
IRQCHIP_SET_TYPE_MASKED = (1 <<  0),
@@ -553,6 +554,7 @@ enum {
IRQCHIP_EOI_THREADED= (1 <<  6),
IRQCHIP_SUPPORTS_LEVEL_MSI  = (1 <<  7),
IRQCHIP_SUPPORTS_NMI= (1 <<  8),
+   IRQCHIP_SHUTDOWN_ON_SUSPEND = (1 <<  9),
 };
 
 #include 
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 41e7e37a0928..fd59489ff14b 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -233,7 +233,7 @@ __irq_startup_managed(struct irq_desc *desc, struct cpumask 
*aff, bool force)
 }
 #endif
 
-static int __irq_startup(struct irq_desc *desc)
+int __irq_startup(struct irq_desc *desc)
 {
struct irq_data *d = irq_desc_get_irq_data(desc);
int ret = 0;
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 7db284b10ac9..b6fca5eacff7 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -80,6 +80,7 @@ extern void __enable_irq(struct irq_desc *desc);
 extern int irq_activate(struct irq_desc *desc);
 extern int irq_activate_and_startup(struct irq_desc *desc, bool resend);
 extern int irq_startup(struct irq_desc *desc, bool resend, bool force);
+extern int __irq_startup(struct irq_desc *desc);
 
 extern void irq_shutdown(struct irq_desc *desc);
 extern void irq_shutdown_and_deactivate(struct irq_desc *desc);
diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c
index 8f557fa1f4fe..dc48a25f1756 100644
--- a/kernel/irq/pm.c
+++ b/kernel/irq/pm.c
@@ -85,16 +85,25 @@ static bool suspend_device_irq(struct irq_desc *desc)
}
 
desc->istate |= IRQS_SUSPENDED;
-   __disable_irq(desc);
-
/*
-* Hardware which has no wakeup source configuration facility
-* requires that the non wakeup interrupts are masked at the
-* chip level. The chip implementation indicates that with
-

[PATCH 04/12] x86/xen: add system core suspend and resume callbacks

2020-05-19 Thread Anchal Agarwal
From: Munehisa Kamata 

Add Xen PVHVM specific system core callbacks for PM suspend and
hibernation support. The callbacks suspend and resume Xen
primitives,like shared_info, pvclock and grant table. Note that
Xen suspend can handle them in a different manner, but system
core callbacks are called from the context. So if the callbacks
are called from Xen suspend context, return immediately.

Signed-off-by: Agarwal Anchal 
Signed-off-by: Munehisa Kamata 
---
 arch/x86/xen/enlighten_hvm.c |  1 +
 arch/x86/xen/suspend.c   | 53 
 include/xen/xen-ops.h|  3 ++
 3 files changed, 57 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 75b1ec7a0fcd..138e71786e03 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -204,6 +204,7 @@ static void __init xen_hvm_guest_init(void)
if (xen_feature(XENFEAT_hvm_callback_vector))
xen_have_vector_callback = 1;
 
+   xen_setup_syscore_ops();
xen_hvm_smp_init();
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 1d83152c761b..784c4484100b 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -2,17 +2,22 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -82,3 +87,51 @@ void xen_arch_suspend(void)
 
on_each_cpu(xen_vcpu_notify_suspend, NULL, 1);
 }
+
+static int xen_syscore_suspend(void)
+{
+   struct xen_remove_from_physmap xrfp;
+   int ret;
+
+   /* Xen suspend does similar stuffs in its own logic */
+   if (xen_suspend_mode_is_xen_suspend())
+   return 0;
+
+   xrfp.domid = DOMID_SELF;
+   xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
+
+   ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, );
+   if (!ret)
+   HYPERVISOR_shared_info = _dummy_shared_info;
+
+   return ret;
+}
+
+static void xen_syscore_resume(void)
+{
+   /* Xen suspend does similar stuffs in its own logic */
+   if (xen_suspend_mode_is_xen_suspend())
+   return;
+
+   /* No need to setup vcpu_info as it's already moved off */
+   xen_hvm_map_shared_info();
+
+   pvclock_resume();
+
+   gnttab_resume();
+}
+
+/*
+ * These callbacks will be called with interrupts disabled and when having only
+ * one CPU online.
+ */
+static struct syscore_ops xen_hvm_syscore_ops = {
+   .suspend = xen_syscore_suspend,
+   .resume = xen_syscore_resume
+};
+
+void __init xen_setup_syscore_ops(void)
+{
+   if (xen_hvm_domain())
+   register_syscore_ops(_hvm_syscore_ops);
+}
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 4ffe031adfc7..89b1e88712d6 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -43,6 +43,9 @@ int xen_setup_shutdown_event(void);
 bool xen_suspend_mode_is_xen_suspend(void);
 bool xen_suspend_mode_is_pm_suspend(void);
 bool xen_suspend_mode_is_pm_hibernation(void);
+
+void xen_setup_syscore_ops(void);
+
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.24.1.AMZN




[PATCH 01/12] xen/manage: keep track of the on-going suspend mode

2020-05-19 Thread Anchal Agarwal
From: Munehisa Kamata 

Guest hibernation is different from xen suspend/resume/live migration.
Xen save/restore does not use pm_ops as is needed by guest hibernation.
Hibernation in guest follows ACPI path and is guest inititated , the
hibernation image is saved within guest as compared to later modes
which are xen toolstack assisted and image creation/storage is in
control of hypervisor/host machine.
To differentiate between Xen suspend and PM hibernation, keep track
of the on-going suspend mode by mainly using a new PM notifier.
Introduce simple functions which help to know the on-going suspend mode
so that other Xen-related code can behave differently according to the
current suspend mode.
Since Xen suspend doesn't have corresponding PM event, its main logic
is modfied to acquire pm_mutex and set the current mode.

Though, acquirng pm_mutex is still right thing to do, we may
see deadlock if PM hibernation is interrupted by Xen suspend.
PM hibernation depends on xenwatch thread to process xenbus state
transactions, but the thread will sleep to wait pm_mutex which is
already held by PM hibernation context in the scenario. Xen shutdown
code may need some changes to avoid the issue.

[Anchal Changelog: Code refactoring]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/xen/manage.c  | 73 +++
 include/xen/xen-ops.h |  3 ++
 2 files changed, 76 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..0b30ab522b77 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -40,6 +41,31 @@ enum shutdown_state {
 /* Ignore multiple shutdown requests. */
 static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
 
+enum suspend_modes {
+   NO_SUSPEND = 0,
+   XEN_SUSPEND,
+   PM_SUSPEND,
+   PM_HIBERNATION,
+};
+
+/* Protected by pm_mutex */
+static enum suspend_modes suspend_mode = NO_SUSPEND;
+
+bool xen_suspend_mode_is_xen_suspend(void)
+{
+   return suspend_mode == XEN_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_suspend(void)
+{
+   return suspend_mode == PM_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_hibernation(void)
+{
+   return suspend_mode == PM_HIBERNATION;
+}
+
 struct suspend_info {
int cancelled;
 };
@@ -99,6 +125,10 @@ static void do_suspend(void)
int err;
struct suspend_info si;
 
+   lock_system_sleep();
+
+   suspend_mode = XEN_SUSPEND;
+
shutting_down = SHUTDOWN_SUSPEND;
 
err = freeze_processes();
@@ -162,6 +192,10 @@ static void do_suspend(void)
thaw_processes();
 out:
shutting_down = SHUTDOWN_INVALID;
+
+   suspend_mode = NO_SUSPEND;
+
+   unlock_system_sleep();
 }
 #endif /* CONFIG_HIBERNATE_CALLBACKS */
 
@@ -387,3 +421,42 @@ int xen_setup_shutdown_event(void)
 EXPORT_SYMBOL_GPL(xen_setup_shutdown_event);
 
 subsys_initcall(xen_setup_shutdown_event);
+
+static int xen_pm_notifier(struct notifier_block *notifier,
+  unsigned long pm_event, void *unused)
+{
+   switch (pm_event) {
+   case PM_SUSPEND_PREPARE:
+   suspend_mode = PM_SUSPEND;
+   break;
+   case PM_HIBERNATION_PREPARE:
+   case PM_RESTORE_PREPARE:
+   suspend_mode = PM_HIBERNATION;
+   break;
+   case PM_POST_SUSPEND:
+   case PM_POST_RESTORE:
+   case PM_POST_HIBERNATION:
+   /* Set back to the default */
+   suspend_mode = NO_SUSPEND;
+   break;
+   default:
+   pr_warn("Receive unknown PM event 0x%lx\n", pm_event);
+   return -EINVAL;
+   }
+
+   return 0;
+};
+
+static struct notifier_block xen_pm_notifier_block = {
+   .notifier_call = xen_pm_notifier
+};
+
+static int xen_setup_pm_notifier(void)
+{
+   if (!xen_hvm_domain())
+   return -ENODEV;
+
+   return register_pm_notifier(_pm_notifier_block);
+}
+
+subsys_initcall(xen_setup_pm_notifier);
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 095be1d66f31..4ffe031adfc7 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -40,6 +40,9 @@ u64 xen_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
+bool xen_suspend_mode_is_xen_suspend(void);
+bool xen_suspend_mode_is_pm_suspend(void);
+bool xen_suspend_mode_is_pm_hibernation(void);
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.24.1.AMZN




Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-03-13 Thread Anchal Agarwal
On Thu, Mar 12, 2020 at 10:04:35AM +0100, Roger Pau Monné wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> On Wed, Mar 11, 2020 at 10:25:15PM +, Agarwal, Anchal wrote:
> > Hi Roger,
> > I am trying to understand your comments on indirect descriptors specially 
> > without polluting the mailing list hence emailing you personally.
> 
> IMO it's better to send to the mailing list. The issues or questions
> you have about indirect descriptors can be helpful to others in the
> future. If there's no confidential information please send to the
> list next time.
> 
> Feel free to forward this reply to the list also.
>
Sure no problem at all.
> > Hope that's ok by you.  Please see my response inline.
> >
> > On Fri, Mar 06, 2020 at 06:40:33PM +, Anchal Agarwal wrote:
> > > On Fri, Feb 21, 2020 at 03:24:45PM +0100, Roger Pau Monné wrote:
> > > > On Fri, Feb 14, 2020 at 11:25:34PM +, Anchal Agarwal wrote:
> > > > >   blkfront_gather_backend_features(info);
> > > > >   /* Reset limits changed by blk_mq_update_nr_hw_queues(). */
> > > > >   blkif_set_queue_limits(info);
> > > > > @@ -2046,6 +2063,9 @@ static int blkif_recover(struct 
> > blkfront_info *info)
> > > > >   kick_pending_request_queues(rinfo);
> > > > >   }
> > > > >
> > > > > + if (frozen)
> > > > > + return 0;
> > > >
> > > > I have to admit my memory is fuzzy here, but don't you need to
> > > > re-queue requests in case the backend has different limits of 
> > indirect
> > > > descriptors per request for example?
> > > >
> > > > Or do we expect that the frontend is always going to be resumed on 
> > the
> > > > same backend, and thus features won't change?
> > > >
> > > So to understand your question better here, AFAIU the  maximum number 
> > of indirect
> > > grefs is fixed by the backend, but the frontend can issue requests 
> > with any
> > > number of indirect segments as long as it's less than the number 
> > provided by
> > > the backend. So by your question you mean this max number of 
> > MAX_INDIRECT_SEGMENTS
> > > 256 on backend can change ?
> >
> > Yes, number of indirect descriptors supported by the backend can
> > change, because you moved to a different backend, or because the
> > maximum supported by the backend has changed. It's also possible to
> > resume on a backend that has no indirect descriptors support at all.
> >
> > AFAIU, the code for requeuing the requests is only for xen suspend/resume. 
> > These request in the queue are
> > same that gets added to queuelist in blkfront_resume. Also, even if 
> > indirect descriptors change on resume,
> > they just need to be broadcasted to frontend and which means we could just 
> > mean that a request can process
> > more data.
> 
> Or less data. You could legitimately migrate from a host that has
> indirect descriptors to one without, in which case requests would need
> to be smaller to fit the ring slots.
> 
> > We do setup indirect descriptors on front end on blkif_recover before 
> > returning and queue limits are
> > setup accordingly.
> > Am I missing anything here?
> 
> Calling blkif_recover should take care of it AFAICT. As it resets the
> queue limits according to the data announced on xenstore.
> 
> I think I got confused, using blkif_recover should be fine, sorry.
> 
Ok. Thanks for confirming. I will fixup other suggestions in the patch and send
out a v4.
> >
> > > > > @@ -2625,6 +2671,62 @@ static void blkif_release(struct gendisk 
> > *disk, fmode_t mode)
> > > > >   mutex_unlock(_mutex);
> > > > >  }
> > > > >
> > > > > +static int blkfront_freeze(struct xenbus_device *dev)
> > > > > +{
> > > > > + unsigned int i;
> > > > > + struct blkfront_info *info = dev_get_drvdata(>dev);
> > > > > + struct blkfront_ring_info *rinfo;
> > > > > + /* This would be reasonable timeout as used in 
> > xenbus_dev_shutdown() */
> > > > > + unsigned int timeout = 5 * HZ;
> > > > > + int err = 0;
> > > > &g

Re: [Xen-devel] [EXTERNAL][RFC PATCH v3 07/12] genirq: Shutdown irq chips in suspend/resume during hibernation

2020-03-09 Thread Anchal Agarwal
On Sat, Mar 07, 2020 at 12:03:52AM +0100, Thomas Gleixner wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you can confirm the sender and know the 
> content is safe.
> 
> 
> 
> Anchal Agarwal  writes:
> 
> > There are no pm handlers for the legacy devices, so during tear down
> > stale event channel <> IRQ mapping may still remain in the image and
> > resume may fail. To avoid adding much code by implementing handlers for
> > legacy devices, add a new irq_chip flag IRQCHIP_SHUTDOWN_ON_SUSPEND which
> > when enabled on an irq-chip e.g xen-pirq, it will let core suspend/resume
> > irq code to shutdown and restart the active irqs. PM suspend/hibernation
> > code will rely on this.
> > Without this, in PM hibernation, information about the event channel
> > remains in hibernation image, but there is no guarantee that the same
> > event channel numbers are assigned to the devices when restoring the
> > system. This may cause conflict like the following and prevent some
> > devices from being restored correctly.
> 
> The above is just an agglomeration of words and acronyms and some of
> these sentences do not even make sense. Anyone who is not aware of event
> channels and whatever XENisms you talk about will be entirely
> confused. Changelogs really need to be understandable for mere mortals
> and there is no space restriction so acronyms can be written out.
> 
I don't understand what does not makes sense here. Of course the one you
described is more elaborate and explanatory and I agree I just wrote a short 
one from perspective of PM hibernation related to Xen domU. 
All I explained was why teardown is needed, what is the solution and 
what will happen if we do not clear those mappings. 
> Something like this:
> 
>   Many legacy device drivers do not implement power management (PM)
>   functions which means that interrupts requested by these drivers stay
>   in active state when the kernel is hibernated.
> 
>   This does not matter on bare metal and on most hypervisors because the
>   interrupt is restored on resume without any noticable side effects as
>   it stays connected to the same physical or virtual interrupt line.
> 
>   The XEN interrupt mechanism is different as it maintains a mapping
>   between the Linux interrupt number and a XEN event channel. If the
>   interrupt stays active on hibernation this mapping is preserved but
>   there is unfortunately no guarantee that on resume the same event
>   channels are reassigned to these devices. This can result in event
>   channel conflicts which prevent the affected devices from being
>   restored correctly.
> 
>   One way to solve this would be to add the necessary power management
>   functions to all affected legacy device drivers, but that's a
>   questionable effort which does not provide any benefits on non-XEN
>   environments.
> 
>   The least intrusive and most efficient solution is to provide a
>   mechanism which allows the core interrupt code to tear down these
>   interrupts on hibernation and bring them back up again on resume. This
>   allows the XEN event channel mechanism to assign an arbitrary event
>   channel on resume without affecting the functionality of these
>   devices.
> 
>   Fortunately all these device interrupts are handled by a dedicated XEN
>   interrupt chip so the chip can be marked that all interrupts connected
>   to it are handled this way. This is pretty much in line with the other
>   interrupt chip specific quirks, e.g. IRQCHIP_MASK_ON_SUSPEND.
> 
>   Add a new quirk flag IRQCHIP_SHUTDOWN_ON_SUSPEND and add support for
>   it the core interrupt suspend/resume paths.
> 
> Hmm?
> 
Sure.
> > Signed-off-by: Anchal Agarwal 
> > Suggested-by: Thomas Gleixner 
> 
> Not that I care much, but now that I've written both the patch and the
> changelog you might change that attribution slightly. For completeness
> sake:
> 
Why not. That's mandated now :)
>  Signed-off-by: Thomas Gleixner 
> 
> Thanks,
> 
> tglx
Thanks,
Anchal

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-03-06 Thread Anchal Agarwal
On Fri, Feb 21, 2020 at 03:24:45PM +0100, Roger Pau Monné wrote:
> On Fri, Feb 14, 2020 at 11:25:34PM +0000, Anchal Agarwal wrote:
> > From: Munehisa Kamata  > 
> > Add freeze, thaw and restore callbacks for PM suspend and hibernation
> > support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
> > events, need to implement these xenbus_driver callbacks.
> > The freeze handler stops a block-layer queue and disconnect the
> > frontend from the backend while freeing ring_info and associated resources.
> > The restore handler re-allocates ring_info and re-connect to the
> > backend, so the rest of the kernel can continue to use the block device
> > transparently. Also, the handlers are used for both PM suspend and
> > hibernation so that we can keep the existing suspend/resume callbacks for
> > Xen suspend without modification. Before disconnecting from backend,
> > we need to prevent any new IO from being queued and wait for existing
> > IO to complete. Freeze/unfreeze of the queues will guarantee that there
> > are no requests in use on the shared ring.
> > 
> > Note:For older backends,if a backend doesn't have commit'12ea729645ace'
> > xen/blkback: unmap all persistent grants when frontend gets disconnected,
> > the frontend may see massive amount of grant table warning when freeing
> > resources.
> > [   36.852659] deferring g.e. 0xf9 (pfn 0x)
> > [   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!
> > 
> > In this case, persistent grants would need to be disabled.
> > 
> > [Anchal Changelog: Removed timeout/request during blkfront freeze.
> > Fixed major part of the code to work with blk-mq]
> > Signed-off-by: Anchal Agarwal 
> > Signed-off-by: Munehisa Kamata 
> > ---
> >  drivers/block/xen-blkfront.c | 119 ---
> >  1 file changed, 112 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > index 478120233750..d715ed3cb69a 100644
> > --- a/drivers/block/xen-blkfront.c
> > +++ b/drivers/block/xen-blkfront.c
> > @@ -47,6 +47,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -79,6 +81,8 @@ enum blkif_state {
> > BLKIF_STATE_DISCONNECTED,
> > BLKIF_STATE_CONNECTED,
> > BLKIF_STATE_SUSPENDED,
> > +   BLKIF_STATE_FREEZING,
> > +   BLKIF_STATE_FROZEN
> >  };
> >  
> >  struct grant {
> > @@ -220,6 +224,7 @@ struct blkfront_info
> > struct list_head requests;
> > struct bio_list bio_list;
> > struct list_head info_list;
> > +   struct completion wait_backend_disconnected;
> >  };
> >  
> >  static unsigned int nr_minors;
> > @@ -261,6 +266,7 @@ static DEFINE_SPINLOCK(minor_lock);
> >  static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
> >  static void blkfront_gather_backend_features(struct blkfront_info *info);
> >  static int negotiate_mq(struct blkfront_info *info);
> > +static void __blkif_free(struct blkfront_info *info);
> 
> I'm not particularly found of adding underscore prefixes to functions,
> I would rather use a more descriptive name if possible.
> blkif_free_{queues/rings} maybe?
>
Apologies for delayed response as I was OOTO.
Appreciate your feedback. Will fix
> >  
> >  static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
> >  {
> > @@ -995,6 +1001,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, 
> > u16 sector_size,
> > info->sector_size = sector_size;
> > info->physical_sector_size = physical_sector_size;
> > blkif_set_queue_limits(info);
> > +   init_completion(>wait_backend_disconnected);
> >  
> > return 0;
> >  }
> > @@ -1218,6 +1225,8 @@ static void xlvbd_release_gendisk(struct 
> > blkfront_info *info)
> >  /* Already hold rinfo->ring_lock. */
> >  static inline void kick_pending_request_queues_locked(struct 
> > blkfront_ring_info *rinfo)
> >  {
> > +   if (unlikely(rinfo->dev_info->connected == BLKIF_STATE_FREEZING))
> > +   return;
> 
> Do you really need this check here?
> 
> The queue will be frozen and quiesced in blkfront_freeze when the state
> is set to BLKIF_STATE_FREEZING, and then the call to
> blk_mq_start_stopped_hw_queues is just a noop as long as the queue is
> quiesced (see blk_mq_run_hw_queue).
> 
You are right. Will fix it. May have skipped this part of the patch when fixing
blkfront_freeze.
&

Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-02-20 Thread Anchal Agarwal
On Thu, Feb 20, 2020 at 10:01:52AM -0700, Durrant, Paul wrote:
> > -Original Message-
> > From: Roger Pau Monné 
> > Sent: 20 February 2020 16:49
> > To: Durrant, Paul 
> > Cc: Agarwal, Anchal ; Valentin, Eduardo
> > ; len.br...@intel.com; pet...@infradead.org;
> > b...@kernel.crashing.org; x...@kernel.org; linux...@kvack.org;
> > pa...@ucw.cz; h...@zytor.com; t...@linutronix.de; sstabell...@kernel.org;
> > fllin...@amaozn.com; Kamata, Munehisa ;
> > mi...@redhat.com; xen-devel@lists.xenproject.org; Singh, Balbir
> > ; ax...@kernel.dk; konrad.w...@oracle.com;
> > b...@alien8.de; boris.ostrov...@oracle.com; jgr...@suse.com;
> > net...@vger.kernel.org; linux...@vger.kernel.org; r...@rjwysocki.net;
> > linux-ker...@vger.kernel.org; vkuzn...@redhat.com; da...@davemloft.net;
> > Woodhouse, David 
> > Subject: Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks
> > for PM suspend and hibernation
> > 
> > On Thu, Feb 20, 2020 at 04:23:13PM +, Durrant, Paul wrote:
> > > > -Original Message-
> > > > From: Roger Pau Monné 
> > > > Sent: 20 February 2020 15:45
> > > > To: Durrant, Paul 
> > > > Cc: Agarwal, Anchal ; Valentin, Eduardo
> > > > ; len.br...@intel.com; pet...@infradead.org;
> > > > b...@kernel.crashing.org; x...@kernel.org; linux...@kvack.org;
> > > > pa...@ucw.cz; h...@zytor.com; t...@linutronix.de;
> > sstabell...@kernel.org;
> > > > fllin...@amaozn.com; Kamata, Munehisa ;
> > > > mi...@redhat.com; xen-devel@lists.xenproject.org; Singh, Balbir
> > > > ; ax...@kernel.dk; konrad.w...@oracle.com;
> > > > b...@alien8.de; boris.ostrov...@oracle.com; jgr...@suse.com;
> > > > net...@vger.kernel.org; linux...@vger.kernel.org; r...@rjwysocki.net;
> > > > linux-ker...@vger.kernel.org; vkuzn...@redhat.com;
> > da...@davemloft.net;
> > > > Woodhouse, David 
> > > > Subject: Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add
> > callbacks
> > > > for PM suspend and hibernation
> > > >
> > > > On Thu, Feb 20, 2020 at 08:54:36AM +, Durrant, Paul wrote:
> > > > > > -Original Message-
> > > > > > From: Xen-devel  On Behalf
> > Of
> > > > > > Roger Pau Monné
> > > > > > Sent: 20 February 2020 08:39
> > > > > > To: Agarwal, Anchal 
> > > > > > Cc: Valentin, Eduardo ; len.br...@intel.com;
> > > > > > pet...@infradead.org; b...@kernel.crashing.org; x...@kernel.org;
> > linux-
> > > > > > m...@kvack.org; pa...@ucw.cz; h...@zytor.com; t...@linutronix.de;
> > > > > > sstabell...@kernel.org; fllin...@amaozn.com; Kamata, Munehisa
> > > > > > ; mi...@redhat.com; xen-
> > > > de...@lists.xenproject.org;
> > > > > > Singh, Balbir ; ax...@kernel.dk;
> > > > > > konrad.w...@oracle.com; b...@alien8.de; boris.ostrov...@oracle.com;
> > > > > > jgr...@suse.com; net...@vger.kernel.org; linux...@vger.kernel.org;
> > > > > > r...@rjwysocki.net; linux-ker...@vger.kernel.org;
> > vkuzn...@redhat.com;
> > > > > > da...@davemloft.net; Woodhouse, David 
> > > > > > Subject: Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add
> > > > callbacks
> > > > > > for PM suspend and hibernation
> > > > > >
> > > > > > Thanks for this work, please see below.
> > > > > >
> > > > > > On Wed, Feb 19, 2020 at 06:04:24PM +, Anchal Agarwal wrote:
> > > > > > > On Tue, Feb 18, 2020 at 10:16:11AM +0100, Roger Pau Monné wrote:
> > > > > > > > On Mon, Feb 17, 2020 at 11:05:53PM +, Anchal Agarwal
> > wrote:
> > > > > > > > > On Mon, Feb 17, 2020 at 11:05:09AM +0100, Roger Pau Monné
> > wrote:
> > > > > > > > > > On Fri, Feb 14, 2020 at 11:25:34PM +, Anchal Agarwal
> > > > wrote:
> > > > > > > > > Quiescing the queue seemed a better option here as we want
> > to
> > > > make
> > > > > > sure ongoing
> > > > > > > > > requests dispatches are totally drained.
> > > > > > > > > I should accept that some of these notion is borrowed from
> > how
> > > > nvme
> > > > > > freeze/unfreeze
> > > > > > > > > is done although its not apple to apple

Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-02-19 Thread Anchal Agarwal
On Tue, Feb 18, 2020 at 10:16:11AM +0100, Roger Pau Monné wrote:
> On Mon, Feb 17, 2020 at 11:05:53PM +0000, Anchal Agarwal wrote:
> > On Mon, Feb 17, 2020 at 11:05:09AM +0100, Roger Pau Monné wrote:
> > > On Fri, Feb 14, 2020 at 11:25:34PM +0000, Anchal Agarwal wrote:
> > > > From: Munehisa Kamata  > > > 
> > > > Add freeze, thaw and restore callbacks for PM suspend and hibernation
> > > > support. All frontend drivers that needs to use 
> > > > PM_HIBERNATION/PM_SUSPEND
> > > > events, need to implement these xenbus_driver callbacks.
> > > > The freeze handler stops a block-layer queue and disconnect the
> > > > frontend from the backend while freeing ring_info and associated 
> > > > resources.
> > > > The restore handler re-allocates ring_info and re-connect to the
> > > > backend, so the rest of the kernel can continue to use the block device
> > > > transparently. Also, the handlers are used for both PM suspend and
> > > > hibernation so that we can keep the existing suspend/resume callbacks 
> > > > for
> > > > Xen suspend without modification. Before disconnecting from backend,
> > > > we need to prevent any new IO from being queued and wait for existing
> > > > IO to complete.
> > > 
> > > This is different from Xen (xenstore) initiated suspension, as in that
> > > case Linux doesn't flush the rings or disconnects from the backend.
> > Yes, AFAIK in xen initiated suspension backend takes care of it. 
> 
> No, in Xen initiated suspension backend doesn't take care of flushing
> the rings, the frontend has a shadow copy of the ring contents and it
> re-issues the requests on resume.
> 
Yes, I meant suspension in general where both xenstore and backend knows
system is going under suspension and not flushing of rings. That happens
in frontend when backend indicates that state is closing and so on.
I may have written it in wrong context.
> > > > +static int blkfront_freeze(struct xenbus_device *dev)
> > > > +{
> > > > +   unsigned int i;
> > > > +   struct blkfront_info *info = dev_get_drvdata(>dev);
> > > > +   struct blkfront_ring_info *rinfo;
> > > > +   /* This would be reasonable timeout as used in 
> > > > xenbus_dev_shutdown() */
> > > > +   unsigned int timeout = 5 * HZ;
> > > > +   int err = 0;
> > > > +
> > > > +   info->connected = BLKIF_STATE_FREEZING;
> > > > +
> > > > +   blk_mq_freeze_queue(info->rq);
> > > > +   blk_mq_quiesce_queue(info->rq);
> > > > +
> > > > +   for (i = 0; i < info->nr_rings; i++) {
> > > > +   rinfo = >rinfo[i];
> > > > +
> > > > +   gnttab_cancel_free_callback(>callback);
> > > > +   flush_work(>work);
> > > > +   }
> > > > +
> > > > +   /* Kick the backend to disconnect */
> > > > +   xenbus_switch_state(dev, XenbusStateClosing);
> > > 
> > > Are you sure this is safe?
> > > 
> > In my testing running multiple fio jobs, other test scenarios running
> > a memory loader works fine. I did not came across a scenario that would
> > have failed resume due to blkfront issues unless you can sugest some?
> 
> AFAICT you don't wait for the in-flight requests to be finished, and
> just rely on blkback to finish processing those. I'm not sure all
> blkback implementations out there can guarantee that.
> 
> The approach used by Xen initiated suspension is to re-issue the
> in-flight requests when resuming. I have to admit I don't think this
> is the best approach, but I would like to keep both the Xen and the PM
> initiated suspension using the same logic, and hence I would request
> that you try to re-use the existing resume logic (blkfront_resume).
> 
> > > I don't think you wait for all requests pending on the ring to be
> > > finished by the backend, and hence you might loose requests as the
> > > ones on the ring would not be re-issued by blkfront_restore AFAICT.
> > > 
> > AFAIU, blk_mq_freeze_queue/blk_mq_quiesce_queue should take care of no used
> > request on the shared ring. Also, we I want to pause the queue and flush all
> > the pending requests in the shared ring before disconnecting from backend.
> 
> Oh, so blk_mq_freeze_queue does wait for in-flight requests to be
> finished. I guess it's fine then.
> 
Ok.
> > Quiescing the queue seemed a better optio

Re: [Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-02-17 Thread Anchal Agarwal
On Mon, Feb 17, 2020 at 11:05:09AM +0100, Roger Pau Monné wrote:
> On Fri, Feb 14, 2020 at 11:25:34PM +0000, Anchal Agarwal wrote:
> > From: Munehisa Kamata  > 
> > Add freeze, thaw and restore callbacks for PM suspend and hibernation
> > support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
> > events, need to implement these xenbus_driver callbacks.
> > The freeze handler stops a block-layer queue and disconnect the
> > frontend from the backend while freeing ring_info and associated resources.
> > The restore handler re-allocates ring_info and re-connect to the
> > backend, so the rest of the kernel can continue to use the block device
> > transparently. Also, the handlers are used for both PM suspend and
> > hibernation so that we can keep the existing suspend/resume callbacks for
> > Xen suspend without modification. Before disconnecting from backend,
> > we need to prevent any new IO from being queued and wait for existing
> > IO to complete.
> 
> This is different from Xen (xenstore) initiated suspension, as in that
> case Linux doesn't flush the rings or disconnects from the backend.
Yes, AFAIK in xen initiated suspension backend takes care of it. 
> 
> This is done so that in case suspensions fails the recovery doesn't
> need to reconnect the PV devices, and in order to speed up suspension
> time (ie: waiting for all queues to be flushed can take time as Linux
> supports multiqueue, multipage rings and indirect descriptors), and
> the backend could be contended if there's a lot of IO pressure from
> guests.
> 
> Linux already keeps a shadow of the ring contents, so in-flight
> requests can be re-issued after the frontend has reconnected during
> resume.
> 
> > Freeze/unfreeze of the queues will guarantee that there
> > are no requests in use on the shared ring.
> > 
> > Note:For older backends,if a backend doesn't have commit'12ea729645ace'
> > xen/blkback: unmap all persistent grants when frontend gets disconnected,
> > the frontend may see massive amount of grant table warning when freeing
> > resources.
> > [   36.852659] deferring g.e. 0xf9 (pfn 0x)
> > [   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!
> > 
> > In this case, persistent grants would need to be disabled.
> > 
> > [Anchal Changelog: Removed timeout/request during blkfront freeze.
> > Fixed major part of the code to work with blk-mq]
> > Signed-off-by: Anchal Agarwal 
> > Signed-off-by: Munehisa Kamata 
> > ---
> >  drivers/block/xen-blkfront.c | 119 ---
> >  1 file changed, 112 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > index 478120233750..d715ed3cb69a 100644
> > --- a/drivers/block/xen-blkfront.c
> > +++ b/drivers/block/xen-blkfront.c
> > @@ -47,6 +47,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -79,6 +81,8 @@ enum blkif_state {
> > BLKIF_STATE_DISCONNECTED,
> > BLKIF_STATE_CONNECTED,
> > BLKIF_STATE_SUSPENDED,
> > +   BLKIF_STATE_FREEZING,
> > +   BLKIF_STATE_FROZEN
> >  };
> >  
> >  struct grant {
> > @@ -220,6 +224,7 @@ struct blkfront_info
> > struct list_head requests;
> > struct bio_list bio_list;
> > struct list_head info_list;
> > +   struct completion wait_backend_disconnected;
> >  };
> >  
> >  static unsigned int nr_minors;
> > @@ -261,6 +266,7 @@ static DEFINE_SPINLOCK(minor_lock);
> >  static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
> >  static void blkfront_gather_backend_features(struct blkfront_info *info);
> >  static int negotiate_mq(struct blkfront_info *info);
> > +static void __blkif_free(struct blkfront_info *info);
> >  
> >  static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
> >  {
> > @@ -995,6 +1001,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, 
> > u16 sector_size,
> > info->sector_size = sector_size;
> > info->physical_sector_size = physical_sector_size;
> > blkif_set_queue_limits(info);
> > +   init_completion(>wait_backend_disconnected);
> >  
> > return 0;
> >  }
> > @@ -1218,6 +1225,8 @@ static void xlvbd_release_gendisk(struct 
> > blkfront_info *info)
> >  /* Already hold rinfo->ring_lock. */
> >  static inline void kick_pending_request_queues_locked(struct 
> > blkfront_ring_info *rinfo)
> >  {
> >

[Xen-devel] [RFC PATCH v3 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

2020-02-14 Thread Anchal Agarwal
From: Aleksei Besogonov 

The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation
offset on a running kernel to enable hibernating to a swap file.
However, it doesn't actually update the swsusp_resume_block variable. As
a result, the hibernation fails at the last step (after all the data is
written out) in the validation of the swap signature in
mark_swapfiles().

Before this patch, the command line processing was the only place where
swsusp_resume_block was set.

Signed-off-by: Aleksei Besogonov 
Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
---
 kernel/power/user.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/power/user.c b/kernel/power/user.c
index 77438954cc2b..d396e313cb7b 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -374,8 +374,12 @@ static long snapshot_ioctl(struct file *filp, unsigned int 
cmd,
if (swdev) {
offset = swap_area.offset;
data->swap = swap_type_of(swdev, offset, NULL);
-   if (data->swap < 0)
+   if (data->swap < 0) {
error = -ENODEV;
+   } else {
+   swsusp_resume_device = swdev;
+   swsusp_resume_block = offset;
+   }
} else {
data->swap = -1;
error = -EINVAL;
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 10/12] xen: Introduce wrapper for save/restore sched clock offset

2020-02-14 Thread Anchal Agarwal
Introduce wrappers for save/restore xen_sched_clock_offset to be
used by PM hibernation code to avoid system instability during resume.

Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/time.c| 15 +--
 arch/x86/xen/xen-ops.h |  2 ++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 8cf632dda605..eeb6d3d2eaab 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -379,12 +379,23 @@ static const struct pv_time_ops xen_time_ops __initconst 
= {
 static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
 static u64 xen_clock_value_saved;
 
+/*This is needed to maintain a monotonic clock value during PM hibernation */
+void xen_save_sched_clock_offset(void)
+{
+   xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+}
+
+void xen_restore_sched_clock_offset(void)
+{
+   xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+}
+
 void xen_save_time_memory_area(void)
 {
struct vcpu_register_time_memory_area t;
int ret;
 
-   xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+   xen_save_sched_clock_offset();
 
if (!xen_clock)
return;
@@ -426,7 +437,7 @@ void xen_restore_time_memory_area(void)
 out:
/* Need pvclock_resume() before using xen_clocksource_read(). */
pvclock_resume();
-   xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+   xen_restore_sched_clock_offset();
 }
 
 static void xen_setup_vsyscall_time_info(void)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index d84c357994bd..9f49124df033 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -72,6 +72,8 @@ void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
 void xen_init_time_ops(void);
 void xen_hvm_init_time_ops(void);
+void xen_save_sched_clock_offset(void);
+void xen_restore_sched_clock_offset(void);
 
 irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
 
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 11/12] xen: Update sched clock offset to avoid system instability in hibernation

2020-02-14 Thread Anchal Agarwal
Save/restore xen_sched_clock_offset in syscore suspend/resume during PM
hibernation. Commit '867cefb4cb1012: ("xen: Fix x86 sched_clock() interface
for xen")' fixes xen guest time handling during migration. A similar issue
is seen during PM hibernation when system runs CPU intensive workload.
Post resume pvclock resets the value to 0 however, xen sched_clock_offset
is never updated. System instability is seen during resume from hibernation
when system is under heavy CPU load. Since xen_sched_clock_offset is not
updated, system does not see the monotonic clock value and the scheduler
would then think that heavy CPU hog tasks need more time in CPU, causing
the system to freeze

Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/suspend.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index dae0f74f5390..7e5275944810 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -105,6 +105,8 @@ static int xen_syscore_suspend(void)
xen_save_steal_clock(cpu);
}
 
+   xen_save_sched_clock_offset();
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
@@ -126,6 +128,12 @@ static void xen_syscore_resume(void)
 
pvclock_resume();
 
+   /*
+* Restore xen_sched_clock_offset during resume to maintain
+* monotonic clock value
+*/
+   xen_restore_sched_clock_offset();
+
/* Nonboot CPUs will be resumed when they're brought up */
xen_restore_steal_clock(smp_processor_id());
 
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 09/12] x86/xen: save and restore steal clock

2020-02-14 Thread Anchal Agarwal
From: Munehisa Kamata 

Save steal clock values of all present CPUs in the system core ops
suspend callbacks. Also, restore a boot CPU's steal clock in the system
core resume callback. For non-boot CPUs, restore after they're brought
up, because runstate info for non-boot CPUs are not active until then.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/suspend.c | 13 -
 arch/x86/xen/time.c|  3 +++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 784c4484100b..dae0f74f5390 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -91,12 +91,20 @@ void xen_arch_suspend(void)
 static int xen_syscore_suspend(void)
 {
struct xen_remove_from_physmap xrfp;
-   int ret;
+   int cpu, ret;
 
/* Xen suspend does similar stuffs in its own logic */
if (xen_suspend_mode_is_xen_suspend())
return 0;
 
+   for_each_present_cpu(cpu) {
+   /*
+* Nonboot CPUs are already offline, but the last copy of
+* runstate info is still accessible.
+*/
+   xen_save_steal_clock(cpu);
+   }
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
@@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
 
pvclock_resume();
 
+   /* Nonboot CPUs will be resumed when they're brought up */
+   xen_restore_steal_clock(smp_processor_id());
+
gnttab_resume();
 }
 
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index befbdd8b17f0..8cf632dda605 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -537,6 +537,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
 {
int cpu = smp_processor_id();
xen_setup_runstate_info(cpu);
+   if (cpu)
+   xen_restore_steal_clock(cpu);
+
/*
 * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
 * doing it xen_hvm_cpu_notify (which gets called by smp_init during
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 07/12] genirq: Shutdown irq chips in suspend/resume during hibernation

2020-02-14 Thread Anchal Agarwal
There are no pm handlers for the legacy devices, so during tear down
stale event channel <> IRQ mapping may still remain in the image and
resume may fail. To avoid adding much code by implementing handlers for
legacy devices, add a new irq_chip flag IRQCHIP_SHUTDOWN_ON_SUSPEND which
when enabled on an irq-chip e.g xen-pirq, it will let core suspend/resume
irq code to shutdown and restart the active irqs. PM suspend/hibernation
code will rely on this.
Without this, in PM hibernation, information about the event channel
remains in hibernation image, but there is no guarantee that the same
event channel numbers are assigned to the devices when restoring the
system. This may cause conflict like the following and prevent some
devices from being restored correctly.

Signed-off-by: Anchal Agarwal 
Suggested-by: Thomas Gleixner 
---
 drivers/xen/events/events_base.c |  1 +
 include/linux/irq.h  |  2 ++
 kernel/irq/chip.c|  2 +-
 kernel/irq/internals.h   |  1 +
 kernel/irq/pm.c  | 31 ++-
 5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 6c8843968a52..e44f27b45bef 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1620,6 +1620,7 @@ static struct irq_chip xen_pirq_chip __read_mostly = {
.irq_set_affinity   = set_affinity_irq,
 
.irq_retrigger  = retrigger_dynirq,
+   .flags  = IRQCHIP_SHUTDOWN_ON_SUSPEND,
 };
 
 static struct irq_chip xen_percpu_chip __read_mostly = {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index fb301cf29148..2873a579fd9d 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -511,6 +511,7 @@ struct irq_chip {
  * IRQCHIP_EOI_THREADED:   Chip requires eoi() on unmask in threaded mode
  * IRQCHIP_SUPPORTS_LEVEL_MSI  Chip can provide two doorbells for Level MSIs
  * IRQCHIP_SUPPORTS_NMI:   Chip can deliver NMIs, only for root irqchips
+ * IRQCHIP_SHUTDOWN_ON_SUSPEND: Shutdown non wake irqs in the suspend path
  */
 enum {
IRQCHIP_SET_TYPE_MASKED = (1 <<  0),
@@ -522,6 +523,7 @@ enum {
IRQCHIP_EOI_THREADED= (1 <<  6),
IRQCHIP_SUPPORTS_LEVEL_MSI  = (1 <<  7),
IRQCHIP_SUPPORTS_NMI= (1 <<  8),
+   IRQCHIP_SHUTDOWN_ON_SUSPEND = (1 <<  9),
 };
 
 #include 
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index b76703b2c0af..a1e8df5193ba 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -233,7 +233,7 @@ __irq_startup_managed(struct irq_desc *desc, struct cpumask 
*aff, bool force)
 }
 #endif
 
-static int __irq_startup(struct irq_desc *desc)
+int __irq_startup(struct irq_desc *desc)
 {
struct irq_data *d = irq_desc_get_irq_data(desc);
int ret = 0;
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 3924fbe829d4..11c7c55bda63 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -80,6 +80,7 @@ extern void __enable_irq(struct irq_desc *desc);
 extern int irq_activate(struct irq_desc *desc);
 extern int irq_activate_and_startup(struct irq_desc *desc, bool resend);
 extern int irq_startup(struct irq_desc *desc, bool resend, bool force);
+extern int __irq_startup(struct irq_desc *desc);
 
 extern void irq_shutdown(struct irq_desc *desc);
 extern void irq_shutdown_and_deactivate(struct irq_desc *desc);
diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c
index 8f557fa1f4fe..dc48a25f1756 100644
--- a/kernel/irq/pm.c
+++ b/kernel/irq/pm.c
@@ -85,16 +85,25 @@ static bool suspend_device_irq(struct irq_desc *desc)
}
 
desc->istate |= IRQS_SUSPENDED;
-   __disable_irq(desc);
-
/*
-* Hardware which has no wakeup source configuration facility
-* requires that the non wakeup interrupts are masked at the
-* chip level. The chip implementation indicates that with
-* IRQCHIP_MASK_ON_SUSPEND.
+* Some irq chips (e.g. XEN PIRQ) require a full shutdown on suspend
+* as some of the legacy drivers(e.g. floppy) do nothing during the
+* suspend path
 */
-   if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
-   mask_irq(desc);
+   if (irq_desc_get_chip(desc)->flags & IRQCHIP_SHUTDOWN_ON_SUSPEND) {
+   irq_shutdown(desc);
+   } else {
+   __disable_irq(desc);
+
+  /*
+   * Hardware which has no wakeup source configuration facility
+   * requires that the non wakeup interrupts are masked at the
+   * chip level. The chip implementation indicates that with
+   * IRQCHIP_MASK_ON_SUSPEND.
+   */
+   if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
+   mask_irq(desc);
+   }
return true;
 }
 
@@ -1

[Xen-devel] [RFC PATCH v3 08/12] xen/time: introduce xen_{save, restore}_steal_clock

2020-02-14 Thread Anchal Agarwal
From: Munehisa Kamata 

Currently, steal time accounting code in scheduler expects steal clock
callback to provide monotonically increasing value. If the accounting
code receives a smaller value than previous one, it uses a negative
value to calculate steal time and results in incorrectly updated idle
and steal time accounting. This breaks userspace tools which read
/proc/stat.

top - 08:05:35 up  2:12,  3 users,  load average: 0.00, 0.07, 0.23
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,30100.0%id,  0.0%wa,  0.0%hi, 
0.0%si,-1253874204672.0%st

This can actually happen when a Xen PVHVM guest gets restored from
hibernation, because such a restored guest is just a fresh domain from
Xen perspective and the time information in runstate info starts over
from scratch.

This patch introduces xen_save_steal_clock() which saves current values
in runstate info into per-cpu variables. Its couterpart,
xen_restore_steal_clock(), sets offset if it found the current values in
runstate info are smaller than previous ones. xen_steal_clock() is also
modified to use the offset to ensure that scheduler only sees
monotonically increasing number.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
---
 drivers/xen/time.c| 29 -
 include/xen/xen-ops.h |  2 ++
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 0968859c29d0..3560222cc0dd 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -23,6 +23,9 @@ static DEFINE_PER_CPU(struct vcpu_runstate_info, 
xen_runstate);
 
 static DEFINE_PER_CPU(u64[4], old_runstate_time);
 
+static DEFINE_PER_CPU(u64, xen_prev_steal_clock);
+static DEFINE_PER_CPU(u64, xen_steal_clock_offset);
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -149,7 +152,7 @@ bool xen_vcpu_stolen(int vcpu)
return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
 }
 
-u64 xen_steal_clock(int cpu)
+static u64 __xen_steal_clock(int cpu)
 {
struct vcpu_runstate_info state;
 
@@ -157,6 +160,30 @@ u64 xen_steal_clock(int cpu)
return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
 }
 
+u64 xen_steal_clock(int cpu)
+{
+   return __xen_steal_clock(cpu) + per_cpu(xen_steal_clock_offset, cpu);
+}
+
+void xen_save_steal_clock(int cpu)
+{
+   per_cpu(xen_prev_steal_clock, cpu) = xen_steal_clock(cpu);
+}
+
+void xen_restore_steal_clock(int cpu)
+{
+   u64 steal_clock = __xen_steal_clock(cpu);
+
+   if (per_cpu(xen_prev_steal_clock, cpu) > steal_clock) {
+   /* Need to update the offset */
+   per_cpu(xen_steal_clock_offset, cpu) =
+   per_cpu(xen_prev_steal_clock, cpu) - steal_clock;
+   } else {
+   /* Avoid unnecessary steal clock warp */
+   per_cpu(xen_steal_clock_offset, cpu) = 0;
+   }
+}
+
 void xen_setup_runstate_info(int cpu)
 {
struct vcpu_register_runstate_memory_area area;
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 3b3992b5b0c2..12b3f4474a05 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -37,6 +37,8 @@ void xen_time_setup_guest(void);
 void xen_manage_runstate_time(int action);
 void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
 u64 xen_steal_clock(int cpu);
+void xen_save_steal_clock(int cpu);
+void xen_restore_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-02-14 Thread Anchal Agarwal
From: Munehisa Kamata 
Signed-off-by: Munehisa Kamata 
---
 drivers/block/xen-blkfront.c | 119 ---
 1 file changed, 112 insertions(+), 7 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 478120233750..d715ed3cb69a 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -47,6 +47,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -79,6 +81,8 @@ enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
BLKIF_STATE_SUSPENDED,
+   BLKIF_STATE_FREEZING,
+   BLKIF_STATE_FROZEN
 };
 
 struct grant {
@@ -220,6 +224,7 @@ struct blkfront_info
struct list_head requests;
struct bio_list bio_list;
struct list_head info_list;
+   struct completion wait_backend_disconnected;
 };
 
 static unsigned int nr_minors;
@@ -261,6 +266,7 @@ static DEFINE_SPINLOCK(minor_lock);
 static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
 static void blkfront_gather_backend_features(struct blkfront_info *info);
 static int negotiate_mq(struct blkfront_info *info);
+static void __blkif_free(struct blkfront_info *info);
 
 static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
@@ -995,6 +1001,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
blkif_set_queue_limits(info);
+   init_completion(>wait_backend_disconnected);
 
return 0;
 }
@@ -1218,6 +1225,8 @@ static void xlvbd_release_gendisk(struct blkfront_info 
*info)
 /* Already hold rinfo->ring_lock. */
 static inline void kick_pending_request_queues_locked(struct 
blkfront_ring_info *rinfo)
 {
+   if (unlikely(rinfo->dev_info->connected == BLKIF_STATE_FREEZING))
+   return;
if (!RING_FULL(>ring))
blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
 }
@@ -1341,8 +1350,6 @@ static void blkif_free_ring(struct blkfront_ring_info 
*rinfo)
 
 static void blkif_free(struct blkfront_info *info, int suspend)
 {
-   unsigned int i;
-
/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1350,6 +1357,13 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
if (info->rq)
blk_mq_stop_hw_queues(info->rq);
 
+   __blkif_free(info);
+}
+
+static void __blkif_free(struct blkfront_info *info)
+{
+   unsigned int i;
+
for (i = 0; i < info->nr_rings; i++)
blkif_free_ring(>rinfo[i]);
 
@@ -1553,8 +1567,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->dev_info;
 
-   if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
-   return IRQ_HANDLED;
+   if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) {
+   if (info->connected != BLKIF_STATE_FREEZING)
+   return IRQ_HANDLED;
+   }
 
spin_lock_irqsave(>ring_lock, flags);
  again:
@@ -2020,6 +2036,7 @@ static int blkif_recover(struct blkfront_info *info)
struct bio *bio;
unsigned int segs;
 
+   bool frozen = info->connected == BLKIF_STATE_FROZEN;
blkfront_gather_backend_features(info);
/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
blkif_set_queue_limits(info);
@@ -2046,6 +2063,9 @@ static int blkif_recover(struct blkfront_info *info)
kick_pending_request_queues(rinfo);
}
 
+   if (frozen)
+   return 0;
+
list_for_each_entry_safe(req, n, >requests, queuelist) {
/* Requeue pending requests (flush or discard) */
list_del_init(>queuelist);
@@ -2359,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
 
return;
case BLKIF_STATE_SUSPENDED:
+   case BLKIF_STATE_FROZEN:
/*
 * If we are recovering from suspension, we need to wait
 * for the backend to announce it's features before
@@ -2476,12 +2497,37 @@ static void blkback_changed(struct xenbus_device *dev,
break;
 
case XenbusStateClosed:
-   if (dev->state == XenbusStateClosed)
+   if (dev->state == XenbusStateClosed) {
+   if (info->connected == BLKIF_STATE_FREEZING) {
+   __blkif_free(info);
+   info->connected = BLKIF_STATE_FROZEN;
+   complete(>wait_backend_disconnected);
+   break;
+   }
+
break;
+   }
+
+   /*
+  

[Xen-devel] [RFC PATCH v3 05/12] xen-netfront: add callbacks for PM suspend and hibernation support

2020-02-14 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. The freeze handler simply disconnects the frotnend from the
backend and frees resources associated with queues after disabling the
net_device from the system. The restore handler just changes the
frontend state and let the xenbus handler to re-allocate the resources
and re-connect to the backend. This can be performed transparently to
the rest of the system. The handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks
for Xen suspend without modification. Freezing netfront devices is
normally expected to finish within a few hundred milliseconds, but it
can rarely take more than 5 seconds and hit the hard coded timeout,
it would depend on backend state which may be congested and/or have
complex configuration. While it's rare case, longer default timeout
seems a bit more reasonable here to avoid hitting the timeout.
Also, make it configurable via module parameter so that we can cover
broader setups than what we know currently.

[Anchal changelog: Variable name fix and checkpatch.pl fixes]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/net/xen-netfront.c | 98 +-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 482c6c8b0fb7..65edcdd6e05f 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -56,6 +57,12 @@
 #include 
 #include 
 
+enum netif_freeze_state {
+   NETIF_FREEZE_STATE_UNFROZEN,
+   NETIF_FREEZE_STATE_FREEZING,
+   NETIF_FREEZE_STATE_FROZEN,
+};
+
 /* Module parameters */
 #define MAX_QUEUES_DEFAULT 8
 static unsigned int xennet_max_queues;
@@ -63,6 +70,12 @@ module_param_named(max_queues, xennet_max_queues, uint, 
0644);
 MODULE_PARM_DESC(max_queues,
 "Maximum number of queues per virtual interface");
 
+static unsigned int netfront_freeze_timeout_secs = 10;
+module_param_named(freeze_timeout_secs,
+  netfront_freeze_timeout_secs, uint, 0644);
+MODULE_PARM_DESC(freeze_timeout_secs,
+"timeout when freezing netfront device in seconds");
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -160,6 +173,10 @@ struct netfront_info {
struct netfront_stats __percpu *tx_stats;
 
atomic_t rx_gso_checksum_fixup;
+
+   int freeze_state;
+
+   struct completion wait_backend_disconnected;
 };
 
 struct netfront_rx_info {
@@ -721,6 +738,21 @@ static int xennet_close(struct net_device *dev)
return 0;
 }
 
+static int xennet_disable_interrupts(struct net_device *dev)
+{
+   struct netfront_info *np = netdev_priv(dev);
+   unsigned int num_queues = dev->real_num_tx_queues;
+   unsigned int queue_index;
+   struct netfront_queue *queue;
+
+   for (queue_index = 0; queue_index < num_queues; ++queue_index) {
+   queue = >queues[queue_index];
+   disable_irq(queue->tx_irq);
+   disable_irq(queue->rx_irq);
+   }
+   return 0;
+}
+
 static void xennet_move_rx_slot(struct netfront_queue *queue, struct sk_buff 
*skb,
grant_ref_t ref)
 {
@@ -1301,6 +1333,8 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
 
np->queues = NULL;
 
+   init_completion(>wait_backend_disconnected);
+
err = -ENOMEM;
np->rx_stats = netdev_alloc_pcpu_stats(struct netfront_stats);
if (np->rx_stats == NULL)
@@ -1794,6 +1828,50 @@ static int xennet_create_queues(struct netfront_info 
*info,
return 0;
 }
 
+static int netfront_freeze(struct xenbus_device *dev)
+{
+   struct netfront_info *info = dev_get_drvdata(>dev);
+   unsigned long timeout = netfront_freeze_timeout_secs * HZ;
+   int err = 0;
+
+   xennet_disable_interrupts(info->netdev);
+
+   netif_device_detach(info->netdev);
+
+   info->freeze_state = NETIF_FREEZE_STATE_FREEZING;
+
+   /* Kick the backend to disconnect */
+   xenbus_switch_state(dev, XenbusStateClosing);
+
+   /* We don't want to move forward before the frontend is diconnected
+* from the backend cleanly.
+*/
+   timeout = wait_for_completion_timeout(>wait_backend_disconnected,
+ timeout);
+   if (!timeout) {
+   err = -EBUSY;
+   xenbus_dev_error(dev, err, "Freezing timed out;"
+"the device may become inconsistent state");
+   return err;
+   }
+
+   /* Tear down queues */
+   xennet_disconnect_backend(info);
+   xennet_destroy_queues(info);
+
+   info->freeze_state = NETI

[Xen-devel] [RFC PATCH v3 04/12] x86/xen: add system core suspend and resume callbacks

2020-02-14 Thread Anchal Agarwal
From: Munehisa Kamata 

Add Xen PVHVM specific system core callbacks for PM suspend and
hibernation support. The callbacks suspend and resume Xen
primitives,like shared_info, pvclock and grant table. Note that
Xen suspend can handle them in a different manner, but system
core callbacks are called from the context. So if the callbacks
are called from Xen suspend context, return immediately.

Signed-off-by: Agarwal Anchal 
Signed-off-by: Munehisa Kamata 
---
 arch/x86/xen/enlighten_hvm.c |  1 +
 arch/x86/xen/suspend.c   | 53 
 include/xen/xen-ops.h|  3 ++
 3 files changed, 57 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 75b1ec7a0fcd..138e71786e03 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -204,6 +204,7 @@ static void __init xen_hvm_guest_init(void)
if (xen_feature(XENFEAT_hvm_callback_vector))
xen_have_vector_callback = 1;
 
+   xen_setup_syscore_ops();
xen_hvm_smp_init();
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 1d83152c761b..784c4484100b 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -2,17 +2,22 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -82,3 +87,51 @@ void xen_arch_suspend(void)
 
on_each_cpu(xen_vcpu_notify_suspend, NULL, 1);
 }
+
+static int xen_syscore_suspend(void)
+{
+   struct xen_remove_from_physmap xrfp;
+   int ret;
+
+   /* Xen suspend does similar stuffs in its own logic */
+   if (xen_suspend_mode_is_xen_suspend())
+   return 0;
+
+   xrfp.domid = DOMID_SELF;
+   xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
+
+   ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, );
+   if (!ret)
+   HYPERVISOR_shared_info = _dummy_shared_info;
+
+   return ret;
+}
+
+static void xen_syscore_resume(void)
+{
+   /* Xen suspend does similar stuffs in its own logic */
+   if (xen_suspend_mode_is_xen_suspend())
+   return;
+
+   /* No need to setup vcpu_info as it's already moved off */
+   xen_hvm_map_shared_info();
+
+   pvclock_resume();
+
+   gnttab_resume();
+}
+
+/*
+ * These callbacks will be called with interrupts disabled and when having only
+ * one CPU online.
+ */
+static struct syscore_ops xen_hvm_syscore_ops = {
+   .suspend = xen_syscore_suspend,
+   .resume = xen_syscore_resume
+};
+
+void __init xen_setup_syscore_ops(void)
+{
+   if (xen_hvm_domain())
+   register_syscore_ops(_hvm_syscore_ops);
+}
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 6c36e161dfd1..3b3992b5b0c2 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -43,6 +43,9 @@ int xen_setup_shutdown_event(void);
 bool xen_suspend_mode_is_xen_suspend(void);
 bool xen_suspend_mode_is_pm_suspend(void);
 bool xen_suspend_mode_is_pm_hibernation(void);
+
+void xen_setup_syscore_ops(void);
+
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume

2020-02-14 Thread Anchal Agarwal
Introduce a small function which re-uses shared page's PA allocated
during guest initialization time in reserve_shared_info() and not
allocate new page during resume flow.
It also  does the mapping of shared_info_page by calling
xen_hvm_init_shared_info() to use the function.

Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/enlighten_hvm.c | 7 +++
 arch/x86/xen/xen-ops.h   | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e138f7de52d2..75b1ec7a0fcd 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -27,6 +27,13 @@
 
 static unsigned long shared_info_pfn;
 
+void xen_hvm_map_shared_info(void)
+{
+   xen_hvm_init_shared_info();
+   if (shared_info_pfn)
+   HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
+}
+
 void xen_hvm_init_shared_info(void)
 {
struct xen_add_to_physmap xatp;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 45a441c33d6d..d84c357994bd 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -56,6 +56,7 @@ void xen_enable_syscall(void);
 void xen_vcpu_restore(void);
 
 void xen_callback_vector(void);
+void xen_hvm_map_shared_info(void);
 void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 02/12] xenbus: add freeze/thaw/restore callbacks support

2020-02-14 Thread Anchal Agarwal
From: Munehisa Kamata 

Since commit b3e96c0c7562 ("xen: use freeze/restore/thaw PM events for
suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and
PMSG_RESTORE events for Xen suspend. However, they're actually assigned
to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume()
respectively, and only suspend and resume callbacks are supported at
driver level. To support PM suspend and PM hibernation, modify the bus
level PM callbacks to invoke not only device driver's suspend/resume but
also freeze/thaw/restore.

Note that we'll use freeze/restore callbacks even for PM suspend whereas
suspend/resume callbacks are normally used in the case, becausae the
existing xenbus device drivers already have suspend/resume callbacks
specifically designed for Xen suspend. So we can allow the device
drivers to keep the existing callbacks wihtout modification.

[Anchal Changelog: Refactored the callbacks code]
Signed-off-by: Agarwal Anchal 
Signed-off-by: Munehisa Kamata 
---
 drivers/xen/xenbus/xenbus_probe.c | 99 +--
 include/xen/xenbus.h  |  3 +
 2 files changed, 84 insertions(+), 18 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c 
b/drivers/xen/xenbus/xenbus_probe.c
index 5b471889d723..0fa868c2 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -597,27 +598,44 @@ int xenbus_dev_suspend(struct device *dev)
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
-
+   bool xen_suspend = xen_suspend_mode_is_xen_suspend();
DPRINTK("%s", xdev->nodename);
 
if (dev->driver == NULL)
return 0;
drv = to_xenbus_driver(dev->driver);
-   if (drv->suspend)
-   err = drv->suspend(xdev);
-   if (err)
-   pr_warn("suspend %s failed: %i\n", dev_name(dev), err);
+
+   if (xen_suspend) {
+   if (drv->suspend)
+   err = drv->suspend(xdev);
+   } else {
+   if (drv->freeze) {
+   err = drv->freeze(xdev);
+   if (!err) {
+   free_otherend_watch(xdev);
+   free_otherend_details(xdev);
+   return 0;
+   }
+   }
+   }
+
+   if (err) {
+   pr_warn("%s %s failed: %i\n", xen_suspend ?
+   "suspend" : "freeze", dev_name(dev), err);
+   return err;
+   }
+
return 0;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_suspend);
 
 int xenbus_dev_resume(struct device *dev)
 {
-   int err;
+   int err = 0;
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
-
+   bool xen_suspend = xen_suspend_mode_is_xen_suspend();
DPRINTK("%s", xdev->nodename);
 
if (dev->driver == NULL)
@@ -625,24 +643,32 @@ int xenbus_dev_resume(struct device *dev)
drv = to_xenbus_driver(dev->driver);
err = talk_to_otherend(xdev);
if (err) {
-   pr_warn("resume (talk_to_otherend) %s failed: %i\n",
+   pr_warn("%s (talk_to_otherend) %s failed: %i\n",
+   xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
 
-   xdev->state = XenbusStateInitialising;
+   if (xen_suspend) {
+   xdev->state = XenbusStateInitialising;
+   if (drv->resume)
+   err = drv->resume(xdev);
+   } else {
+   if (drv->restore)
+   err = drv->restore(xdev);
+   }
 
-   if (drv->resume) {
-   err = drv->resume(xdev);
-   if (err) {
-   pr_warn("resume %s failed: %i\n", dev_name(dev), err);
-   return err;
-   }
+   if (err) {
+   pr_warn("%s %s failed: %i\n",
+   xen_suspend ? "resume" : "restore",
+   dev_name(dev), err);
+   return err;
}
 
err = watch_otherend(xdev);
if (err) {
-   pr_warn("resume (watch_otherend) %s failed: %d.\n",
+   pr_warn("%s (watch_otherend) %s failed: %d.\n",
+   xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
@@ -653,8 +679,45 @@ EXPORT_SYMBOL_GPL(xenbus_dev_resume);
 
 int xenbus_dev_cancel(struct device *dev)
 {
-   /* Do nothing */
-   DPRINTK("cancel");
+   int err = 0;
+   struct xenbus_driver *drv;
+   struct xenbus_device *xdev
+   = container_of(dev, struct xenbus_device, dev);
+   bool xen_suspend = 

[Xen-devel] [RFC PATCH v3 01/12] xen/manage: keep track of the on-going suspend mode

2020-02-14 Thread Anchal Agarwal
From: Munehisa Kamata 

Guest hibernation is different from xen suspend/resume/live migration.
Xen save/restore does not use pm_ops as is needed by guest hibernation.
Hibernation in guest follows ACPI path and is guest inititated , the
hibernation image is saved within guest as compared to later modes
which are xen toolstack assisted and image creation/storage is in
control of hypervisor/host machine.
To differentiate between Xen suspend and PM hibernation, keep track
of the on-going suspend mode by mainly using a new PM notifier.
Introduce simple functions which help to know the on-going suspend mode
so that other Xen-related code can behave differently according to the
current suspend mode.
Since Xen suspend doesn't have corresponding PM event, its main logic
is modfied to acquire pm_mutex and set the current mode.

Though, acquirng pm_mutex is still right thing to do, we may
see deadlock if PM hibernation is interrupted by Xen suspend.
PM hibernation depends on xenwatch thread to process xenbus state
transactions, but the thread will sleep to wait pm_mutex which is
already held by PM hibernation context in the scenario. Xen shutdown
code may need some changes to avoid the issue.

[Anchal Changelog: Merged patch xen/manage: introduce helper function
to know the on-going suspend mode into this one for better readability]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/xen/manage.c  | 73 +++
 include/xen/xen-ops.h |  3 ++
 2 files changed, 76 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..0b30ab522b77 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -40,6 +41,31 @@ enum shutdown_state {
 /* Ignore multiple shutdown requests. */
 static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
 
+enum suspend_modes {
+   NO_SUSPEND = 0,
+   XEN_SUSPEND,
+   PM_SUSPEND,
+   PM_HIBERNATION,
+};
+
+/* Protected by pm_mutex */
+static enum suspend_modes suspend_mode = NO_SUSPEND;
+
+bool xen_suspend_mode_is_xen_suspend(void)
+{
+   return suspend_mode == XEN_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_suspend(void)
+{
+   return suspend_mode == PM_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_hibernation(void)
+{
+   return suspend_mode == PM_HIBERNATION;
+}
+
 struct suspend_info {
int cancelled;
 };
@@ -99,6 +125,10 @@ static void do_suspend(void)
int err;
struct suspend_info si;
 
+   lock_system_sleep();
+
+   suspend_mode = XEN_SUSPEND;
+
shutting_down = SHUTDOWN_SUSPEND;
 
err = freeze_processes();
@@ -162,6 +192,10 @@ static void do_suspend(void)
thaw_processes();
 out:
shutting_down = SHUTDOWN_INVALID;
+
+   suspend_mode = NO_SUSPEND;
+
+   unlock_system_sleep();
 }
 #endif /* CONFIG_HIBERNATE_CALLBACKS */
 
@@ -387,3 +421,42 @@ int xen_setup_shutdown_event(void)
 EXPORT_SYMBOL_GPL(xen_setup_shutdown_event);
 
 subsys_initcall(xen_setup_shutdown_event);
+
+static int xen_pm_notifier(struct notifier_block *notifier,
+  unsigned long pm_event, void *unused)
+{
+   switch (pm_event) {
+   case PM_SUSPEND_PREPARE:
+   suspend_mode = PM_SUSPEND;
+   break;
+   case PM_HIBERNATION_PREPARE:
+   case PM_RESTORE_PREPARE:
+   suspend_mode = PM_HIBERNATION;
+   break;
+   case PM_POST_SUSPEND:
+   case PM_POST_RESTORE:
+   case PM_POST_HIBERNATION:
+   /* Set back to the default */
+   suspend_mode = NO_SUSPEND;
+   break;
+   default:
+   pr_warn("Receive unknown PM event 0x%lx\n", pm_event);
+   return -EINVAL;
+   }
+
+   return 0;
+};
+
+static struct notifier_block xen_pm_notifier_block = {
+   .notifier_call = xen_pm_notifier
+};
+
+static int xen_setup_pm_notifier(void)
+{
+   if (!xen_hvm_domain())
+   return -ENODEV;
+
+   return register_pm_notifier(_pm_notifier_block);
+}
+
+subsys_initcall(xen_setup_pm_notifier);
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index d89969aa9942..6c36e161dfd1 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -40,6 +40,9 @@ u64 xen_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
+bool xen_suspend_mode_is_xen_suspend(void);
+bool xen_suspend_mode_is_pm_suspend(void);
+bool xen_suspend_mode_is_pm_hibernation(void);
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC RESEND PATCH v3 00/12] Enable PM hibernation on guest VMs

2020-02-14 Thread Anchal Agarwal
Resending this in a more threaded format.
  
Hello,
I am sending out a v3 version of series of patches that implements guest
PM hibernation.
These guests are running on xen hypervisor. The patches had been tested
against mainstream kernel. EC2 instance hibernation feature is provided
to the AWS EC2 customers. PM hibernation uses swap space carved out within
the guest[or can be a separate partition], where hibernation image is
stored and restored from.

Doing guest hibernation does not involve any support from hypervisor and
this way guest has complete control over its state. Infrastructure
restrictions for saving up guest state can be overcome by guest initiated
hibernation.

This series includes some improvements over RFC series sent last year:
https://lists.xenproject.org/archives/html/xen-devel/2018-06/msg00823.html

Changelog v3:
1. Feedback from V2
2. Introduced 2 new patches for xen sched clock offset fix
3. Fixed pirq shutdown/restore in generic irq subsystem
4. Split save/restore steal clock patches into 2 for better readability

Changelog v2:
1. Removed timeout/request present on the ring in xen-blkfront during blkfront 
freeze
2. Fixed restoring of PIRQs which was apparently working for 4.9 kernels but 
not for
newer kernel. [Legacy irqs were no longer restored after hibernation introduced 
with
this commit "020db9d3c1dc0"]
3. Merged couple of related patches to make the code more coherent and readable
4. Code refactoring
5. Sched clock fix when hibernating guest is under heavy CPU load
Note: Under very rare circumstances we see resume failures with KASLR enabled 
only
on xen instances.  We are roughly seeing 3% failures [>1000 runs] when testing 
with
various instance sizes and some workload running on each instance. I am 
currently
investigating the issue as to confirm if its a xen issue or kernel issue.
However, it should not hold back anyone from reviewing/accepting these patches.

Testing done:
All testing is done for multiple hibernation cycle for 5.4 kernel on EC2.

Testing How to:
---
Example:
Set up a file-backed swap space. Swap file size>=Total memory on the system
sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
sudo chmod 600 /swap
sudo mkswap /swap
sudo swapon /swap

Update resume device/resume offset in grub if using swap file:
resume=/dev/xvda1 resume_offset=200704

Execute:

sudo pm-hibernate
OR
echo disk > /sys/power/state && echo reboot > /sys/power/disk

Compute resume offset code:
"
#!/usr/bin/env python
import sys
import array
import fcntl

#swap file
f = open(sys.argv[1], 'r')
buf = array.array('L', [0])

#FIBMAP
ret = fcntl.ioctl(f.fileno(), 0x01, buf)
print buf[0]
"

Aleksei Besogonov (1):
  PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

Anchal Agarwal (4):
  x86/xen: Introduce new function to map HYPERVISOR_shared_info on
Resume
  genirq: Shutdown irq chips in suspend/resume during hibernation
  xen: Introduce wrapper for save/restore sched clock offset
  xen: Update sched clock offset to avoid system instability in
hibernation

Munehisa Kamata (7):
  xen/manage: keep track of the on-going suspend mode
  xenbus: add freeze/thaw/restore callbacks support
  x86/xen: add system core suspend and resume callbacks
  xen-netfront: add callbacks for PM suspend and hibernation support
  xen-blkfront: add callbacks for PM suspend and hibernation
  xen/time: introduce xen_{save,restore}_steal_clock
  x86/xen: save and restore steal clock

 arch/x86/xen/enlighten_hvm.c  |   8 ++
 arch/x86/xen/suspend.c|  72 ++
 arch/x86/xen/time.c   |  18 -
 arch/x86/xen/xen-ops.h|   3 +
 drivers/block/xen-blkfront.c  | 119 --
 drivers/net/xen-netfront.c|  98 +++-
 drivers/xen/events/events_base.c  |   1 +
 drivers/xen/manage.c  |  73 ++
 drivers/xen/time.c|  29 +++-
 drivers/xen/xenbus/xenbus_probe.c |  99 -
 include/linux/irq.h   |   2 +
 include/xen/xen-ops.h |   8 ++
 include/xen/xenbus.h  |   3 +
 kernel/irq/chip.c |   2 +-
 kernel/irq/internals.h|   1 +
 kernel/irq/pm.c   |  31 +---
 kernel/power/user.c   |   6 +-
 17 files changed, 533 insertions(+), 40 deletions(-)

-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 11/12] xen: Update sched clock offset to avoid system instability in hibernation

2020-02-12 Thread Anchal Agarwal
Save/restore xen_sched_clock_offset in syscore suspend/resume during PM
hibernation. Commit '867cefb4cb1012: ("xen: Fix x86 sched_clock() interface
for xen")' fixes xen guest time handling during migration. A similar issue
is seen during PM hibernation when system runs CPU intensive workload.
Post resume pvclock resets the value to 0 however, xen sched_clock_offset
is never updated. System instability is seen during resume from hibernation
when system is under heavy CPU load. Since xen_sched_clock_offset is not
updated, system does not see the monotonic clock value and the scheduler
would then think that heavy CPU hog tasks need more time in CPU, causing
the system to freeze

Signed-off-by: Anchal Agarwal 
---
Changes Since V2:
 * New patch to update sched clock offset during hibernation to avoid
   hungups during resume when running a CPU intensive workload
---
 arch/x86/xen/suspend.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index dae0f74f5390..7e5275944810 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -105,6 +105,8 @@ static int xen_syscore_suspend(void)
xen_save_steal_clock(cpu);
}
 
+   xen_save_sched_clock_offset();
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
@@ -126,6 +128,12 @@ static void xen_syscore_resume(void)
 
pvclock_resume();
 
+   /*
+* Restore xen_sched_clock_offset during resume to maintain
+* monotonic clock value
+*/
+   xen_restore_sched_clock_offset();
+
/* Nonboot CPUs will be resumed when they're brought up */
xen_restore_steal_clock(smp_processor_id());
 
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

2020-02-12 Thread Anchal Agarwal
From: Aleksei Besogonov 

The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation
offset on a running kernel to enable hibernating to a swap file.
However, it doesn't actually update the swsusp_resume_block variable. As
a result, the hibernation fails at the last step (after all the data is
written out) in the validation of the swap signature in
mark_swapfiles().

Before this patch, the command line processing was the only place where
swsusp_resume_block was set.

Signed-off-by: Aleksei Besogonov 
Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 

---
  Changes since V2: None
---
 kernel/power/user.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/power/user.c b/kernel/power/user.c
index 77438954cc2b..d396e313cb7b 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -374,8 +374,12 @@ static long snapshot_ioctl(struct file *filp, unsigned int 
cmd,
if (swdev) {
offset = swap_area.offset;
data->swap = swap_type_of(swdev, offset, NULL);
-   if (data->swap < 0)
+   if (data->swap < 0) {
error = -ENODEV;
+   } else {
+   swsusp_resume_device = swdev;
+   swsusp_resume_block = offset;
+   }
} else {
data->swap = -1;
error = -EINVAL;
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 08/12] xen/time: introduce xen_{save, restore}_steal_clock

2020-02-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Currently, steal time accounting code in scheduler expects steal clock
callback to provide monotonically increasing value. If the accounting
code receives a smaller value than previous one, it uses a negative
value to calculate steal time and results in incorrectly updated idle
and steal time accounting. This breaks userspace tools which read
/proc/stat.

top - 08:05:35 up  2:12,  3 users,  load average: 0.00, 0.07, 0.23
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,30100.0%id,  0.0%wa,  0.0%hi, 
0.0%si,-1253874204672.0%st

This can actually happen when a Xen PVHVM guest gets restored from
hibernation, because such a restored guest is just a fresh domain from
Xen perspective and the time information in runstate info starts over
from scratch.

This patch introduces xen_save_steal_clock() which saves current values
in runstate info into per-cpu variables. Its couterpart,
xen_restore_steal_clock(), sets offset if it found the current values in
runstate info are smaller than previous ones. xen_steal_clock() is also
modified to use the offset to ensure that scheduler only sees
monotonically increasing number.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 

---
Changes since V2:
* separated the previously merged patches
* In V2, introduction of save/restore steal clock and usage in
  hibernation code was merged in a single patch
---
 drivers/xen/time.c| 29 -
 include/xen/xen-ops.h |  2 ++
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 0968859c29d0..3560222cc0dd 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -23,6 +23,9 @@ static DEFINE_PER_CPU(struct vcpu_runstate_info, 
xen_runstate);
 
 static DEFINE_PER_CPU(u64[4], old_runstate_time);
 
+static DEFINE_PER_CPU(u64, xen_prev_steal_clock);
+static DEFINE_PER_CPU(u64, xen_steal_clock_offset);
+
 /* return an consistent snapshot of 64-bit time/counter value */
 static u64 get64(const u64 *p)
 {
@@ -149,7 +152,7 @@ bool xen_vcpu_stolen(int vcpu)
return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
 }
 
-u64 xen_steal_clock(int cpu)
+static u64 __xen_steal_clock(int cpu)
 {
struct vcpu_runstate_info state;
 
@@ -157,6 +160,30 @@ u64 xen_steal_clock(int cpu)
return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
 }
 
+u64 xen_steal_clock(int cpu)
+{
+   return __xen_steal_clock(cpu) + per_cpu(xen_steal_clock_offset, cpu);
+}
+
+void xen_save_steal_clock(int cpu)
+{
+   per_cpu(xen_prev_steal_clock, cpu) = xen_steal_clock(cpu);
+}
+
+void xen_restore_steal_clock(int cpu)
+{
+   u64 steal_clock = __xen_steal_clock(cpu);
+
+   if (per_cpu(xen_prev_steal_clock, cpu) > steal_clock) {
+   /* Need to update the offset */
+   per_cpu(xen_steal_clock_offset, cpu) =
+   per_cpu(xen_prev_steal_clock, cpu) - steal_clock;
+   } else {
+   /* Avoid unnecessary steal clock warp */
+   per_cpu(xen_steal_clock_offset, cpu) = 0;
+   }
+}
+
 void xen_setup_runstate_info(int cpu)
 {
struct vcpu_register_runstate_memory_area area;
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 3b3992b5b0c2..12b3f4474a05 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -37,6 +37,8 @@ void xen_time_setup_guest(void);
 void xen_manage_runstate_time(int action);
 void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
 u64 xen_steal_clock(int cpu);
+void xen_save_steal_clock(int cpu);
+void xen_restore_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 10/12] xen: Introduce wrapper for save/restore sched clock offset

2020-02-12 Thread Anchal Agarwal
Introduce wrappers for save/restore xen_sched_clock_offset to be
used by PM hibernation code to avoid system instability during resume.

Signed-off-by: Anchal Agarwal 

---
Changes since V2:
* Dropped marking tsc unstable during hibernation patch
* Fixed issue with xen_sched_clock_offset during suspend/resume
* On further interrogation and testing, the issue wasn't with tsc
being stable/unstable

---
 arch/x86/xen/time.c| 15 +--
 arch/x86/xen/xen-ops.h |  2 ++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index 8cf632dda605..eeb6d3d2eaab 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -379,12 +379,23 @@ static const struct pv_time_ops xen_time_ops __initconst 
= {
 static struct pvclock_vsyscall_time_info *xen_clock __read_mostly;
 static u64 xen_clock_value_saved;
 
+/*This is needed to maintain a monotonic clock value during PM hibernation */
+void xen_save_sched_clock_offset(void)
+{
+   xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+}
+
+void xen_restore_sched_clock_offset(void)
+{
+   xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+}
+
 void xen_save_time_memory_area(void)
 {
struct vcpu_register_time_memory_area t;
int ret;
 
-   xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
+   xen_save_sched_clock_offset();
 
if (!xen_clock)
return;
@@ -426,7 +437,7 @@ void xen_restore_time_memory_area(void)
 out:
/* Need pvclock_resume() before using xen_clocksource_read(). */
pvclock_resume();
-   xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
+   xen_restore_sched_clock_offset();
 }
 
 static void xen_setup_vsyscall_time_info(void)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index d84c357994bd..9f49124df033 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -72,6 +72,8 @@ void xen_save_time_memory_area(void);
 void xen_restore_time_memory_area(void);
 void xen_init_time_ops(void);
 void xen_hvm_init_time_ops(void);
+void xen_save_sched_clock_offset(void);
+void xen_restore_sched_clock_offset(void);
 
 irqreturn_t xen_debug_interrupt(int irq, void *dev_id);
 
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 07/12] genirq: Shutdown irq chips in suspend/resume during hibernation

2020-02-12 Thread Anchal Agarwal
There are no pm handlers for the legacy devices, so during tear down
stale event channel <> IRQ mapping may still remain in the image and
resume may fail. To avoid adding much code by implementing handlers for
legacy devices, add a new irq_chip flag IRQCHIP_SHUTDOWN_ON_SUSPEND which
when enabled on an irq-chip e.g xen-pirq, it will let core suspend/resume
irq code to shutdown and restart the active irqs. PM suspend/hibernation
code will rely on this.
Without this, in PM hibernation, information about the event channel
remains in hibernation image, but there is no guarantee that the same
event channel numbers are assigned to the devices when restoring the
system. This may cause conflict like the following and prevent some
devices from being restored correctly.

Signed-off-by: Anchal Agarwal 
Suggested-by: Thomas Gleixner 

---
Changes since V2:
* Its new  patch to fix shutdown/restore pirqs during hibernation
* Removed previous 2 patches to shutdown/restore pirqs in xen code
---
 drivers/xen/events/events_base.c |  1 +
 include/linux/irq.h  |  2 ++
 kernel/irq/chip.c|  2 +-
 kernel/irq/internals.h   |  1 +
 kernel/irq/pm.c  | 31 ++-
 5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 6c8843968a52..e44f27b45bef 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1620,6 +1620,7 @@ static struct irq_chip xen_pirq_chip __read_mostly = {
.irq_set_affinity   = set_affinity_irq,
 
.irq_retrigger  = retrigger_dynirq,
+   .flags  = IRQCHIP_SHUTDOWN_ON_SUSPEND,
 };
 
 static struct irq_chip xen_percpu_chip __read_mostly = {
diff --git a/include/linux/irq.h b/include/linux/irq.h
index fb301cf29148..2873a579fd9d 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -511,6 +511,7 @@ struct irq_chip {
  * IRQCHIP_EOI_THREADED:   Chip requires eoi() on unmask in threaded mode
  * IRQCHIP_SUPPORTS_LEVEL_MSI  Chip can provide two doorbells for Level MSIs
  * IRQCHIP_SUPPORTS_NMI:   Chip can deliver NMIs, only for root irqchips
+ * IRQCHIP_SHUTDOWN_ON_SUSPEND: Shutdown non wake irqs in the suspend path
  */
 enum {
IRQCHIP_SET_TYPE_MASKED = (1 <<  0),
@@ -522,6 +523,7 @@ enum {
IRQCHIP_EOI_THREADED= (1 <<  6),
IRQCHIP_SUPPORTS_LEVEL_MSI  = (1 <<  7),
IRQCHIP_SUPPORTS_NMI= (1 <<  8),
+   IRQCHIP_SHUTDOWN_ON_SUSPEND = (1 <<  9),
 };
 
 #include 
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index b76703b2c0af..a1e8df5193ba 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -233,7 +233,7 @@ __irq_startup_managed(struct irq_desc *desc, struct cpumask 
*aff, bool force)
 }
 #endif
 
-static int __irq_startup(struct irq_desc *desc)
+int __irq_startup(struct irq_desc *desc)
 {
struct irq_data *d = irq_desc_get_irq_data(desc);
int ret = 0;
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 3924fbe829d4..11c7c55bda63 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -80,6 +80,7 @@ extern void __enable_irq(struct irq_desc *desc);
 extern int irq_activate(struct irq_desc *desc);
 extern int irq_activate_and_startup(struct irq_desc *desc, bool resend);
 extern int irq_startup(struct irq_desc *desc, bool resend, bool force);
+extern int __irq_startup(struct irq_desc *desc);
 
 extern void irq_shutdown(struct irq_desc *desc);
 extern void irq_shutdown_and_deactivate(struct irq_desc *desc);
diff --git a/kernel/irq/pm.c b/kernel/irq/pm.c
index 8f557fa1f4fe..dc48a25f1756 100644
--- a/kernel/irq/pm.c
+++ b/kernel/irq/pm.c
@@ -85,16 +85,25 @@ static bool suspend_device_irq(struct irq_desc *desc)
}
 
desc->istate |= IRQS_SUSPENDED;
-   __disable_irq(desc);
-
/*
-* Hardware which has no wakeup source configuration facility
-* requires that the non wakeup interrupts are masked at the
-* chip level. The chip implementation indicates that with
-* IRQCHIP_MASK_ON_SUSPEND.
+* Some irq chips (e.g. XEN PIRQ) require a full shutdown on suspend
+* as some of the legacy drivers(e.g. floppy) do nothing during the
+* suspend path
 */
-   if (irq_desc_get_chip(desc)->flags & IRQCHIP_MASK_ON_SUSPEND)
-   mask_irq(desc);
+   if (irq_desc_get_chip(desc)->flags & IRQCHIP_SHUTDOWN_ON_SUSPEND) {
+   irq_shutdown(desc);
+   } else {
+   __disable_irq(desc);
+
+  /*
+   * Hardware which has no wakeup source configuration facility
+   * requires that the non wakeup interrupts are masked at the
+   * chip level. The chip implementation indicates that with
+   * IRQCHIP_MASK_ON_SUSPEND.
+   

[Xen-devel] [RFC PATCH v3 09/12] x86/xen: save and restore steal clock

2020-02-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Save steal clock values of all present CPUs in the system core ops
suspend callbacks. Also, restore a boot CPU's steal clock in the system
core resume callback. For non-boot CPUs, restore after they're brought
up, because runstate info for non-boot CPUs are not active until then.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 

---
Changes since V2:
* Separate patch to add save/restore call to suspend/resume code
---
 arch/x86/xen/suspend.c | 13 -
 arch/x86/xen/time.c|  3 +++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 784c4484100b..dae0f74f5390 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -91,12 +91,20 @@ void xen_arch_suspend(void)
 static int xen_syscore_suspend(void)
 {
struct xen_remove_from_physmap xrfp;
-   int ret;
+   int cpu, ret;
 
/* Xen suspend does similar stuffs in its own logic */
if (xen_suspend_mode_is_xen_suspend())
return 0;
 
+   for_each_present_cpu(cpu) {
+   /*
+* Nonboot CPUs are already offline, but the last copy of
+* runstate info is still accessible.
+*/
+   xen_save_steal_clock(cpu);
+   }
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
@@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
 
pvclock_resume();
 
+   /* Nonboot CPUs will be resumed when they're brought up */
+   xen_restore_steal_clock(smp_processor_id());
+
gnttab_resume();
 }
 
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index befbdd8b17f0..8cf632dda605 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -537,6 +537,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
 {
int cpu = smp_processor_id();
xen_setup_runstate_info(cpu);
+   if (cpu)
+   xen_restore_steal_clock(cpu);
+
/*
 * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
 * doing it xen_hvm_cpu_notify (which gets called by smp_init during
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2020-02-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
events, need to implement these xenbus_driver callbacks.
The freeze handler stops a block-layer queue and disconnect the
frontend from the backend while freeing ring_info and associated resources.
The restore handler re-allocates ring_info and re-connect to the
backend, so the rest of the kernel can continue to use the block device
transparently. Also, the handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks for
Xen suspend without modification. Before disconnecting from backend,
we need to prevent any new IO from being queued and wait for existing
IO to complete. Freeze/unfreeze of the queues will guarantee that there
are no requests in use on the shared ring.

Note:For older backends,if a backend doesn't have commit'12ea729645ace'
xen/blkback: unmap all persistent grants when frontend gets disconnected,
the frontend may see massive amount of grant table warning when freeing
resources.
[   36.852659] deferring g.e. 0xf9 (pfn 0x)
[   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!

In this case, persistent grants would need to be disabled.

[Anchal Changelog: Removed timeout/request during blkfront freeze.
Fixed major part of the code to work with blk-mq]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 

---
Changes since V2: None
---
 drivers/block/xen-blkfront.c | 119 ---
 1 file changed, 112 insertions(+), 7 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 478120233750..d715ed3cb69a 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -47,6 +47,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -79,6 +81,8 @@ enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
BLKIF_STATE_SUSPENDED,
+   BLKIF_STATE_FREEZING,
+   BLKIF_STATE_FROZEN
 };
 
 struct grant {
@@ -220,6 +224,7 @@ struct blkfront_info
struct list_head requests;
struct bio_list bio_list;
struct list_head info_list;
+   struct completion wait_backend_disconnected;
 };
 
 static unsigned int nr_minors;
@@ -261,6 +266,7 @@ static DEFINE_SPINLOCK(minor_lock);
 static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
 static void blkfront_gather_backend_features(struct blkfront_info *info);
 static int negotiate_mq(struct blkfront_info *info);
+static void __blkif_free(struct blkfront_info *info);
 
 static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
@@ -995,6 +1001,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
blkif_set_queue_limits(info);
+   init_completion(>wait_backend_disconnected);
 
return 0;
 }
@@ -1218,6 +1225,8 @@ static void xlvbd_release_gendisk(struct blkfront_info 
*info)
 /* Already hold rinfo->ring_lock. */
 static inline void kick_pending_request_queues_locked(struct 
blkfront_ring_info *rinfo)
 {
+   if (unlikely(rinfo->dev_info->connected == BLKIF_STATE_FREEZING))
+   return;
if (!RING_FULL(>ring))
blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
 }
@@ -1341,8 +1350,6 @@ static void blkif_free_ring(struct blkfront_ring_info 
*rinfo)
 
 static void blkif_free(struct blkfront_info *info, int suspend)
 {
-   unsigned int i;
-
/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1350,6 +1357,13 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
if (info->rq)
blk_mq_stop_hw_queues(info->rq);
 
+   __blkif_free(info);
+}
+
+static void __blkif_free(struct blkfront_info *info)
+{
+   unsigned int i;
+
for (i = 0; i < info->nr_rings; i++)
blkif_free_ring(>rinfo[i]);
 
@@ -1553,8 +1567,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->dev_info;
 
-   if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
-   return IRQ_HANDLED;
+   if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) {
+   if (info->connected != BLKIF_STATE_FREEZING)
+   return IRQ_HANDLED;
+   }
 
spin_lock_irqsave(>ring_lock, flags);
  again:
@@ -2020,6 +2036,7 @@ static int blkif_recover(struct blkfront_info *info)
struct bio *bio;
unsigned int segs;
 
+   bool frozen = info->connected 

[Xen-devel] [RFC PATCH v3 03/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume

2020-02-12 Thread Anchal Agarwal
Introduce a small function which re-uses shared page's PA allocated
during guest initialization time in reserve_shared_info() and not
allocate new page during resume flow.
It also  does the mapping of shared_info_page by calling
xen_hvm_init_shared_info() to use the function.

Signed-off-by: Anchal Agarwal 

---
Changes since V2: None
---
 arch/x86/xen/enlighten_hvm.c | 7 +++
 arch/x86/xen/xen-ops.h   | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e138f7de52d2..75b1ec7a0fcd 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -27,6 +27,13 @@
 
 static unsigned long shared_info_pfn;
 
+void xen_hvm_map_shared_info(void)
+{
+   xen_hvm_init_shared_info();
+   if (shared_info_pfn)
+   HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
+}
+
 void xen_hvm_init_shared_info(void)
 {
struct xen_add_to_physmap xatp;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 45a441c33d6d..d84c357994bd 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -56,6 +56,7 @@ void xen_enable_syscall(void);
 void xen_vcpu_restore(void);
 
 void xen_callback_vector(void);
+void xen_hvm_map_shared_info(void);
 void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 05/12] xen-netfront: add callbacks for PM suspend and hibernation support

2020-02-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. The freeze handler simply disconnects the frotnend from the
backend and frees resources associated with queues after disabling the
net_device from the system. The restore handler just changes the
frontend state and let the xenbus handler to re-allocate the resources
and re-connect to the backend. This can be performed transparently to
the rest of the system. The handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks
for Xen suspend without modification. Freezing netfront devices is
normally expected to finish within a few hundred milliseconds, but it
can rarely take more than 5 seconds and hit the hard coded timeout,
it would depend on backend state which may be congested and/or have
complex configuration. While it's rare case, longer default timeout
seems a bit more reasonable here to avoid hitting the timeout.
Also, make it configurable via module parameter so that we can cover
broader setups than what we know currently.

[Anchal changelog: Variable name fix and checkpatch.pl fixes]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 

---
Changes since V2: None
---
 drivers/net/xen-netfront.c | 98 +-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 482c6c8b0fb7..65edcdd6e05f 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -56,6 +57,12 @@
 #include 
 #include 
 
+enum netif_freeze_state {
+   NETIF_FREEZE_STATE_UNFROZEN,
+   NETIF_FREEZE_STATE_FREEZING,
+   NETIF_FREEZE_STATE_FROZEN,
+};
+
 /* Module parameters */
 #define MAX_QUEUES_DEFAULT 8
 static unsigned int xennet_max_queues;
@@ -63,6 +70,12 @@ module_param_named(max_queues, xennet_max_queues, uint, 
0644);
 MODULE_PARM_DESC(max_queues,
 "Maximum number of queues per virtual interface");
 
+static unsigned int netfront_freeze_timeout_secs = 10;
+module_param_named(freeze_timeout_secs,
+  netfront_freeze_timeout_secs, uint, 0644);
+MODULE_PARM_DESC(freeze_timeout_secs,
+"timeout when freezing netfront device in seconds");
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -160,6 +173,10 @@ struct netfront_info {
struct netfront_stats __percpu *tx_stats;
 
atomic_t rx_gso_checksum_fixup;
+
+   int freeze_state;
+
+   struct completion wait_backend_disconnected;
 };
 
 struct netfront_rx_info {
@@ -721,6 +738,21 @@ static int xennet_close(struct net_device *dev)
return 0;
 }
 
+static int xennet_disable_interrupts(struct net_device *dev)
+{
+   struct netfront_info *np = netdev_priv(dev);
+   unsigned int num_queues = dev->real_num_tx_queues;
+   unsigned int queue_index;
+   struct netfront_queue *queue;
+
+   for (queue_index = 0; queue_index < num_queues; ++queue_index) {
+   queue = >queues[queue_index];
+   disable_irq(queue->tx_irq);
+   disable_irq(queue->rx_irq);
+   }
+   return 0;
+}
+
 static void xennet_move_rx_slot(struct netfront_queue *queue, struct sk_buff 
*skb,
grant_ref_t ref)
 {
@@ -1301,6 +1333,8 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
 
np->queues = NULL;
 
+   init_completion(>wait_backend_disconnected);
+
err = -ENOMEM;
np->rx_stats = netdev_alloc_pcpu_stats(struct netfront_stats);
if (np->rx_stats == NULL)
@@ -1794,6 +1828,50 @@ static int xennet_create_queues(struct netfront_info 
*info,
return 0;
 }
 
+static int netfront_freeze(struct xenbus_device *dev)
+{
+   struct netfront_info *info = dev_get_drvdata(>dev);
+   unsigned long timeout = netfront_freeze_timeout_secs * HZ;
+   int err = 0;
+
+   xennet_disable_interrupts(info->netdev);
+
+   netif_device_detach(info->netdev);
+
+   info->freeze_state = NETIF_FREEZE_STATE_FREEZING;
+
+   /* Kick the backend to disconnect */
+   xenbus_switch_state(dev, XenbusStateClosing);
+
+   /* We don't want to move forward before the frontend is diconnected
+* from the backend cleanly.
+*/
+   timeout = wait_for_completion_timeout(>wait_backend_disconnected,
+ timeout);
+   if (!timeout) {
+   err = -EBUSY;
+   xenbus_dev_error(dev, err, "Freezing timed out;"
+"the device may become inconsistent state");
+   return err;
+   }
+
+   /* Tear down queues */
+   xennet_disconnect_backend(info);
+   xennet_destroy_queues(info);
+
+   info

[Xen-devel] [RFC PATCH v3 02/12] xenbus: add freeze/thaw/restore callbacks support

2020-02-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Since commit b3e96c0c7562 ("xen: use freeze/restore/thaw PM events for
suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and
PMSG_RESTORE events for Xen suspend. However, they're actually assigned
to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume()
respectively, and only suspend and resume callbacks are supported at
driver level. To support PM suspend and PM hibernation, modify the bus
level PM callbacks to invoke not only device driver's suspend/resume but
also freeze/thaw/restore.

Note that we'll use freeze/restore callbacks even for PM suspend whereas
suspend/resume callbacks are normally used in the case, becausae the
existing xenbus device drivers already have suspend/resume callbacks
specifically designed for Xen suspend. So we can allow the device
drivers to keep the existing callbacks wihtout modification.

[Anchal Changelog: Refactored the callbacks code]
Signed-off-by: Agarwal Anchal 
Signed-off-by: Munehisa Kamata 

---
Changes since V2: None
---
 drivers/xen/xenbus/xenbus_probe.c | 99 +--
 include/xen/xenbus.h  |  3 +
 2 files changed, 84 insertions(+), 18 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c 
b/drivers/xen/xenbus/xenbus_probe.c
index 5b471889d723..0fa868c2 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -597,27 +598,44 @@ int xenbus_dev_suspend(struct device *dev)
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
-
+   bool xen_suspend = xen_suspend_mode_is_xen_suspend();
DPRINTK("%s", xdev->nodename);
 
if (dev->driver == NULL)
return 0;
drv = to_xenbus_driver(dev->driver);
-   if (drv->suspend)
-   err = drv->suspend(xdev);
-   if (err)
-   pr_warn("suspend %s failed: %i\n", dev_name(dev), err);
+
+   if (xen_suspend) {
+   if (drv->suspend)
+   err = drv->suspend(xdev);
+   } else {
+   if (drv->freeze) {
+   err = drv->freeze(xdev);
+   if (!err) {
+   free_otherend_watch(xdev);
+   free_otherend_details(xdev);
+   return 0;
+   }
+   }
+   }
+
+   if (err) {
+   pr_warn("%s %s failed: %i\n", xen_suspend ?
+   "suspend" : "freeze", dev_name(dev), err);
+   return err;
+   }
+
return 0;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_suspend);
 
 int xenbus_dev_resume(struct device *dev)
 {
-   int err;
+   int err = 0;
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
-
+   bool xen_suspend = xen_suspend_mode_is_xen_suspend();
DPRINTK("%s", xdev->nodename);
 
if (dev->driver == NULL)
@@ -625,24 +643,32 @@ int xenbus_dev_resume(struct device *dev)
drv = to_xenbus_driver(dev->driver);
err = talk_to_otherend(xdev);
if (err) {
-   pr_warn("resume (talk_to_otherend) %s failed: %i\n",
+   pr_warn("%s (talk_to_otherend) %s failed: %i\n",
+   xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
 
-   xdev->state = XenbusStateInitialising;
+   if (xen_suspend) {
+   xdev->state = XenbusStateInitialising;
+   if (drv->resume)
+   err = drv->resume(xdev);
+   } else {
+   if (drv->restore)
+   err = drv->restore(xdev);
+   }
 
-   if (drv->resume) {
-   err = drv->resume(xdev);
-   if (err) {
-   pr_warn("resume %s failed: %i\n", dev_name(dev), err);
-   return err;
-   }
+   if (err) {
+   pr_warn("%s %s failed: %i\n",
+   xen_suspend ? "resume" : "restore",
+   dev_name(dev), err);
+   return err;
}
 
err = watch_otherend(xdev);
if (err) {
-   pr_warn("resume (watch_otherend) %s failed: %d.\n",
+   pr_warn("%s (watch_otherend) %s failed: %d.\n",
+   xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
@@ -653,8 +679,45 @@ EXPORT_SYMBOL_GPL(xenbus_dev_resume);
 
 int xenbus_dev_cancel(struct device *dev)
 {
-   /* Do nothing */
-   DPRINTK("cancel");
+   int err = 0;
+   struct xenbus_driver *drv;
+   struct xenbus_device *xdev
+   = container_of(dev, struct xenbus_device, dev);
+   bool xen_suspend = 

[Xen-devel] [RFC PATCH v3 01/12] xen/manage: keep track of the on-going suspend mode

2020-02-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Guest hibernation is different from xen suspend/resume/live migration.
Xen save/restore does not use pm_ops as is needed by guest hibernation.
Hibernation in guest follows ACPI path and is guest inititated , the
hibernation image is saved within guest as compared to later modes
which are xen toolstack assisted and image creation/storage is in
control of hypervisor/host machine.
To differentiate between Xen suspend and PM hibernation, keep track
of the on-going suspend mode by mainly using a new PM notifier.
Introduce simple functions which help to know the on-going suspend mode
so that other Xen-related code can behave differently according to the
current suspend mode.
Since Xen suspend doesn't have corresponding PM event, its main logic
is modfied to acquire pm_mutex and set the current mode.

Though, acquirng pm_mutex is still right thing to do, we may
see deadlock if PM hibernation is interrupted by Xen suspend.
PM hibernation depends on xenwatch thread to process xenbus state
transactions, but the thread will sleep to wait pm_mutex which is
already held by PM hibernation context in the scenario. Xen shutdown
code may need some changes to avoid the issue.

[Anchal Changelog: Merged patch xen/manage: introduce helper function
to know the on-going suspend mode into this one for better readability]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 

---

Changes since V2: None
---
 drivers/xen/manage.c  | 73 +++
 include/xen/xen-ops.h |  3 ++
 2 files changed, 76 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..0b30ab522b77 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -40,6 +41,31 @@ enum shutdown_state {
 /* Ignore multiple shutdown requests. */
 static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
 
+enum suspend_modes {
+   NO_SUSPEND = 0,
+   XEN_SUSPEND,
+   PM_SUSPEND,
+   PM_HIBERNATION,
+};
+
+/* Protected by pm_mutex */
+static enum suspend_modes suspend_mode = NO_SUSPEND;
+
+bool xen_suspend_mode_is_xen_suspend(void)
+{
+   return suspend_mode == XEN_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_suspend(void)
+{
+   return suspend_mode == PM_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_hibernation(void)
+{
+   return suspend_mode == PM_HIBERNATION;
+}
+
 struct suspend_info {
int cancelled;
 };
@@ -99,6 +125,10 @@ static void do_suspend(void)
int err;
struct suspend_info si;
 
+   lock_system_sleep();
+
+   suspend_mode = XEN_SUSPEND;
+
shutting_down = SHUTDOWN_SUSPEND;
 
err = freeze_processes();
@@ -162,6 +192,10 @@ static void do_suspend(void)
thaw_processes();
 out:
shutting_down = SHUTDOWN_INVALID;
+
+   suspend_mode = NO_SUSPEND;
+
+   unlock_system_sleep();
 }
 #endif /* CONFIG_HIBERNATE_CALLBACKS */
 
@@ -387,3 +421,42 @@ int xen_setup_shutdown_event(void)
 EXPORT_SYMBOL_GPL(xen_setup_shutdown_event);
 
 subsys_initcall(xen_setup_shutdown_event);
+
+static int xen_pm_notifier(struct notifier_block *notifier,
+  unsigned long pm_event, void *unused)
+{
+   switch (pm_event) {
+   case PM_SUSPEND_PREPARE:
+   suspend_mode = PM_SUSPEND;
+   break;
+   case PM_HIBERNATION_PREPARE:
+   case PM_RESTORE_PREPARE:
+   suspend_mode = PM_HIBERNATION;
+   break;
+   case PM_POST_SUSPEND:
+   case PM_POST_RESTORE:
+   case PM_POST_HIBERNATION:
+   /* Set back to the default */
+   suspend_mode = NO_SUSPEND;
+   break;
+   default:
+   pr_warn("Receive unknown PM event 0x%lx\n", pm_event);
+   return -EINVAL;
+   }
+
+   return 0;
+};
+
+static struct notifier_block xen_pm_notifier_block = {
+   .notifier_call = xen_pm_notifier
+};
+
+static int xen_setup_pm_notifier(void)
+{
+   if (!xen_hvm_domain())
+   return -ENODEV;
+
+   return register_pm_notifier(_pm_notifier_block);
+}
+
+subsys_initcall(xen_setup_pm_notifier);
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index d89969aa9942..6c36e161dfd1 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -40,6 +40,9 @@ u64 xen_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
+bool xen_suspend_mode_is_xen_suspend(void);
+bool xen_suspend_mode_is_pm_suspend(void);
+bool xen_suspend_mode_is_pm_hibernation(void);
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH v3 00/12] Enable PM hibernation on guest VMs

2020-02-12 Thread Anchal Agarwal
Hello,
I am sending out a v3 version of series of patches that implements guest
PM hibernation.
These guests are running on xen hypervisor. The patches had been tested
against mainstream kernel. EC2 instance hibernation feature is provided
to the AWS EC2 customers. PM hibernation uses swap space carved out within
the guest[or can be a separate partition], where hibernation image is
stored and restored from.

Doing guest hibernation does not involve any support from hypervisor and 
this way guest has complete control over its state. Infrastructure
restrictions for saving up guest state can be overcome by guest initiated
hibernation.

This series includes some improvements over RFC series sent last year:
https://lists.xenproject.org/archives/html/xen-devel/2018-06/msg00823.html

Changelog v3:
1. Feedback from V2
2. Introduced 2 new patches for xen sched clock offset fix
3. Fixed pirq shutdown/restore in generic irq subsystem
4. Split save/restore steal clock patches into 2 for better readability

Changelog v2:
1. Removed timeout/request present on the ring in xen-blkfront during blkfront 
freeze
2. Fixed restoring of PIRQs which was apparently working for 4.9 kernels but 
not for
newer kernel. [Legacy irqs were no longer restored after hibernation introduced 
with
this commit "020db9d3c1dc0"]
3. Merged couple of related patches to make the code more coherent and readable
4. Code refactoring
5. Sched clock fix when hibernating guest is under heavy CPU load
Note: Under very rare circumstances we see resume failures with KASLR enabled 
only
on xen instances.  We are roughly seeing 3% failures [>1000 runs] when testing 
with
various instance sizes and some workload running on each instance. I am 
currently
investigating the issue as to confirm if its a xen issue or kernel issue.
However, it should not hold back anyone from reviewing/accepting these patches.

Testing done:
All testing is done for multiple hibernation cycle for 5.4 kernel on EC2.

Testing How to:
---
Example:
Set up a file-backed swap space. Swap file size>=Total memory on the system
sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
sudo chmod 600 /swap
sudo mkswap /swap
sudo swapon /swap

Update resume device/resume offset in grub if using swap file:
resume=/dev/xvda1 resume_offset=200704

 Execute:

sudo pm-hibernate
OR
echo disk > /sys/power/state && echo reboot > /sys/power/disk

Compute resume offset code:
"
#!/usr/bin/env python
import sys
import array
import fcntl

#swap file
f = open(sys.argv[1], 'r')
buf = array.array('L', [0])

#FIBMAP
ret = fcntl.ioctl(f.fileno(), 0x01, buf)
print buf[0]
"

Aleksei Besogonov (1):
  PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

Anchal Agarwal (4):
  x86/xen: Introduce new function to map HYPERVISOR_shared_info on
Resume
  genirq: Shutdown irq chips in suspend/resume during hibernation
  xen: Introduce wrapper for save/restore sched clock offset
  xen: Update sched clock offset to avoid system instability in
hibernation

Munehisa Kamata (7):
  xen/manage: keep track of the on-going suspend mode
  xenbus: add freeze/thaw/restore callbacks support
  x86/xen: add system core suspend and resume callbacks
  xen-netfront: add callbacks for PM suspend and hibernation support
  xen-blkfront: add callbacks for PM suspend and hibernation
  xen/time: introduce xen_{save,restore}_steal_clock
  x86/xen: save and restore steal clock

 arch/x86/xen/enlighten_hvm.c  |   8 ++
 arch/x86/xen/suspend.c|  72 ++
 arch/x86/xen/time.c   |  18 -
 arch/x86/xen/xen-ops.h|   3 +
 drivers/block/xen-blkfront.c  | 119 --
 drivers/net/xen-netfront.c|  98 +++-
 drivers/xen/events/events_base.c  |   1 +
 drivers/xen/manage.c  |  73 ++
 drivers/xen/time.c|  29 +++-
 drivers/xen/xenbus/xenbus_probe.c |  99 -
 include/linux/irq.h   |   2 +
 include/xen/xen-ops.h |   8 ++
 include/xen/xenbus.h  |   3 +
 kernel/irq/chip.c |   2 +-
 kernel/irq/internals.h|   1 +
 kernel/irq/pm.c   |  31 +---
 kernel/power/user.c   |   6 +-
 17 files changed, 533 insertions(+), 40 deletions(-)

-- 
2.24.1.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC PATCH V2 11/11] x86: tsc: avoid system instability in hibernation

2020-01-22 Thread Anchal Agarwal
On Tue, Jan 14, 2020 at 07:29:52PM +, Anchal Agarwal wrote:
> On Tue, Jan 14, 2020 at 12:30:02AM +0100, Rafael J. Wysocki wrote:
> > On Mon, Jan 13, 2020 at 10:50 PM Rafael J. Wysocki  
> > wrote:
> > >
> > > On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra  
> > > wrote:
> > > >
> > > > On Mon, Jan 13, 2020 at 11:43:18AM +, Singh, Balbir wrote:
> > > > > For your original comment, just wanted to clarify the following:
> > > > >
> > > > > 1. After hibernation, the machine can be resumed on a different but 
> > > > > compatible
> > > > > host (these are VM images hibernated)
> > > > > 2. This means the clock between host1 and host2 can/will be different
> > > > >
> > > > > In your comments are you making the assumption that the host(s) 
> > > > > is/are the
> > > > > same? Just checking the assumptions being made and being on the same 
> > > > > page with
> > > > > them.
> > > >
> > > > I would expect this to be the same problem we have as regular suspend,
> > > > after power off the TSC will have been reset, so resume will have to
> > > > somehow bridge that gap. I've no idea if/how it does that.
> > >
> > > In general, this is done by timekeeping_resume() and the only special
> > > thing done for the TSC appears to be the tsc_verify_tsc_adjust(true)
> > > call in tsc_resume().
> > 
> > And I forgot about tsc_restore_sched_clock_state() that gets called
> > via restore_processor_state() on x86, before calling
> > timekeeping_resume().
> >
> In this case tsc_verify_tsc_adjust(true) this does nothing as
> feature bit X86_FEATURE_TSC_ADJUST is not available to guest. 
> I am no expert in this area, but could this be messing things up?
> 
> Thanks,
> Anchal
Gentle nudge on this. I will add more data here in case that helps.

1. Before this patch, tsc is stable but hibernation does not work
100% of the time. I agree if tsc is stable it should not be marked
unstable however, in this case if I run a cpu intensive workload
in the background and trigger reboot-hibernation loop I see a 
workqueue lockup. 

2. The lockup does not hose the system completely,
the reboot-hibernation carries out and system recovers. 
However, as mentioned in the commit message system does 
become unreachable for couple of seconds.

3. Xen suspend/resume seems to save/restore time_memory area in its
xen_arch_pre_suspend and xen_arch_post_suspend. The xen clock value
is saved. xen_sched_clock_offset is set at resume time to ensure a
monotonic clock value

4. Also, the instances do not have InvariantTSC exposed. Feature bit
X86_FEATURE_TSC_ADJUST is not available to guest and xen clocksource
is used by guests.

I am not sure if something needs to be fixed on hibernate path itself
or its very much ties to time handling on xen guest hibernation

Here is a part of log from last hibernation exit to next hibernation
entry. The loop was running for a while so boot to lockup log will be
huge. I am specifically including the timestamps.

...
01h 57m 15.627s(  16ms): [5.822701] OOM killer enabled.
01h 57m 15.627s(   0ms): [5.824981] Restarting tasks ... done.
01h 57m 15.627s(   0ms): [5.836397] PM: hibernation exit
01h 57m 17.636s(2009ms): [7.844471] PM: hibernation entry
01h 57m 52.725s(35089ms): [   42.934542] BUG: workqueue lockup - pool cpus=0
node=0 flags=0x0 nice=0 stuck for 37s!
01h 57m 52.730s(   5ms): [   42.941468] Showing busy workqueues and worker
pools:
01h 57m 52.734s(   4ms): [   42.945088] workqueue events: flags=0x0
01h 57m 52.737s(   3ms): [   42.948385]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=2/256
01h 57m 52.742s(   5ms): [   42.952838] pending: vmstat_shepherd,
check_corruption
01h 57m 52.746s(   4ms): [   42.956927] workqueue events_power_efficient:
flags=0x80
01h 57m 52.749s(   3ms): [   42.960731]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=4/256
01h 57m 52.754s(   5ms): [   42.964835] pending: neigh_periodic_work,
do_cache_clean [sunrpc], neigh_periodic_work, check_lifetime
01h 57m 52.781s(  27ms): [   42.971419] workqueue mm_percpu_wq: flags=0x8
01h 57m 52.781s(   0ms): [   42.974628]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=1/256
01h 57m 52.781s(   0ms): [   42.978901] pending: vmstat_update
01h 57m 52.781s(   0ms): [   42.981822] workqueue ipv6_addrconf: flags=0x40008
01h 57m 52.781s(   0ms): [   42.985524]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active=1/1
01h 57m 52.781s(   0ms): [   42.989670] pending: addrconf_verify_work [ipv6]
01h 57m 52.782s(   1ms): [   42.993282] workqueue xfs-conv/xvda1: flags=0xc
01h 57m 52.786s(   4ms): [   42.996708]   pwq 0: cpus=0 node=0 flags=0x0 nice=0
active

Re: [Xen-devel] [RFC PATCH V2 11/11] x86: tsc: avoid system instability in hibernation

2020-01-14 Thread Anchal Agarwal
On Tue, Jan 14, 2020 at 12:30:02AM +0100, Rafael J. Wysocki wrote:
> On Mon, Jan 13, 2020 at 10:50 PM Rafael J. Wysocki  wrote:
> >
> > On Mon, Jan 13, 2020 at 1:43 PM Peter Zijlstra  wrote:
> > >
> > > On Mon, Jan 13, 2020 at 11:43:18AM +, Singh, Balbir wrote:
> > > > For your original comment, just wanted to clarify the following:
> > > >
> > > > 1. After hibernation, the machine can be resumed on a different but 
> > > > compatible
> > > > host (these are VM images hibernated)
> > > > 2. This means the clock between host1 and host2 can/will be different
> > > >
> > > > In your comments are you making the assumption that the host(s) is/are 
> > > > the
> > > > same? Just checking the assumptions being made and being on the same 
> > > > page with
> > > > them.
> > >
> > > I would expect this to be the same problem we have as regular suspend,
> > > after power off the TSC will have been reset, so resume will have to
> > > somehow bridge that gap. I've no idea if/how it does that.
> >
> > In general, this is done by timekeeping_resume() and the only special
> > thing done for the TSC appears to be the tsc_verify_tsc_adjust(true)
> > call in tsc_resume().
> 
> And I forgot about tsc_restore_sched_clock_state() that gets called
> via restore_processor_state() on x86, before calling
> timekeeping_resume().
>
In this case tsc_verify_tsc_adjust(true) this does nothing as
feature bit X86_FEATURE_TSC_ADJUST is not available to guest. 
I am no expert in this area, but could this be messing things up?

Thanks,
Anchal

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC PATCH V2 09/11] xen: Clear IRQD_IRQ_STARTED flag during shutdown PIRQs

2020-01-10 Thread Anchal Agarwal
On Fri, Jan 10, 2020 at 08:13:16PM +0100, Thomas Gleixner wrote:
> Anchal,
> 
> Anchal Agarwal  writes:
> > On Thu, Jan 09, 2020 at 01:07:27PM +0100, Thomas Gleixner wrote:
> >> Anchal Agarwal  writes:
> >> So either you can handle it purely on the XEN side without touching any
> >> core state or you need to come up with some way of letting the core code
> >> know that it should invoke shutdown instead of disable.
> >> 
> >> Something like the completely untested patch below.
> >
> > Understandable. Really appreciate the patch suggestion below and i will 
> > test it
> > for sure and see if things can be fixed properly in irq core if thats the 
> > only
> > option. In the meanwhile, I tried to fix it on xen side unless it gives you 
> > the 
> > same feeling as above? MSI-x are just fine, just ioapic ones don't get any 
> > event
> > channel asssigned hence enable_dynirq does nothing. Those needs to be 
> > restarted.
> >
> > diff --git a/drivers/xen/events/events_base.c 
> > b/drivers/xen/events/events_base.c
> > index 1bb0b522d004..2ed152f35816 100644
> > --- a/drivers/xen/events/events_base.c
> > +++ b/drivers/xen/events/events_base.c
> > @@ -575,6 +575,11 @@ static void shutdown_pirq(struct irq_data *data)
> >
> > static void enable_pirq(struct irq_data *data)
> > {
> > +/*ioapic interrupts don't get event channel assigned
> >+ * after being explicitly shutdown during guest
> >+ * hibernation. They need to be restarted*/
> > +   if(!evtchn_from_irq(data->irq))
> > +   startup_pirq(data);
> > enable_dynirq(data);
> >  }
> 
> Interesting patch format :)
Apparently vim and me rushing through the email [did not format the patch]
were the culprit and I only caught it after sending an email
> 
> Doing the shutdown from syscore_ops and the startup conditionally in a
> totaly unrelated function is not really intuitive.
> 
I agree to the point that still the startup is not as synchronous 
to shutdown however, enable_pirq is still invoked during irq_startup
for xen specific code and I was trying to reuse the code path to fix 
within xen. Basically borrowing from what this commit [commit 020db9d3]
changed. Not sure if this could have broken under any other environment
though :(

But anyways I think the patch you suggested is much more clean and 
intuitive.

> So either you do it symmetrically in XEN via syscore_ops callbacks or
> you let the irq core code help you out with the patch I provided
> 
In my understanding, it may not be the right thing as syscore stuff runs
with one cpu online and disabled interrupts. Also I did try it in the past 
and failed horribly unless there is any smarter way of doing it.
It should correctly be done in suspend/resume devices as are other device 
interrupts.

I did test the patch you suggested and it works.
I haven't done large scale testing but it looks like it may just work fine.
I will send out an updated patch for shutdown/startup of pirq after I do some
more testing and will drop patches related to shutdown/startup of pirqs from 
the original series.

Thanks,

Anchal

> Thanks,
> 
> tglx

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC PATCH V2 01/11] xen/manage: keep track of the on-going suspend mode

2020-01-09 Thread Anchal Agarwal
On Thu, Jan 09, 2020 at 06:49:07PM -0500, Boris Ostrovsky wrote:
> 
> 
> On 1/9/20 6:46 PM, Boris Ostrovsky wrote:
> >
> >
> >On 1/7/20 6:37 PM, Anchal Agarwal wrote:
> >>+
> >>+static int xen_setup_pm_notifier(void)
> >>+{
> >>+    if (!xen_hvm_domain())
> >>+    return -ENODEV;
> >
> >ARM guests are also HVM domains. Is it OK for them to register the
> >notifier? The diffstat suggests that you are supporting ARM.
> 
> I obviously meant *not* supporting ARM, sorry.
> 
> -boris
> 
> >
> >-boris
> >

TBH, I have not yet experimented with these patches on
ARM guest yet but that will be the next step. The same 
code with changes as needed should be made to work for ARM.
Currently I am focussed on getting a sane set of 
patches into mainline for x86 guests.

Thanks,

Anchal

> >>+
> >>+    return register_pm_notifier(_pm_notifier_block);
> >>+}
> >>
> >
> 

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC PATCH V2 09/11] xen: Clear IRQD_IRQ_STARTED flag during shutdown PIRQs

2020-01-09 Thread Anchal Agarwal
On Thu, Jan 09, 2020 at 01:07:27PM +0100, Thomas Gleixner wrote:
> Anchal Agarwal  writes:
> > On Wed, Jan 08, 2020 at 04:23:25PM +0100, Thomas Gleixner wrote:
> >> Anchal Agarwal  writes:
> >> > +void irq_state_clr_started(struct irq_desc *desc)
> >> >  {
> >> >  irqd_clear(>irq_data, IRQD_IRQ_STARTED);
> >> >  }
> >> > +EXPORT_SYMBOL_GPL(irq_state_clr_started);
> >> 
> >> This is core internal state and not supposed to be fiddled with by
> >> drivers.
> >> 
> >> irq_chip has irq_suspend/resume/pm_shutdown callbacks for a reason.
> >>
> > I agree, as its mentioned in the previous patch {[RFC PATCH V2 08/11]} this 
> > is 
> > one way of explicitly shutting down legacy devices without introducing too 
> > much 
> > code for each of the legacy devices. . for eg. in case of floppy there 
> > is no suspend/freeze handler which should have done the needful.
> > .
> > Either we implement them for all the legacy devices that have them missing 
> > or
> > explicitly shutdown pirqs. I have choosen later for simplicity. I understand
> > that ideally we should enable/disable devices interrupts in suspend/resume 
> > devices but that requires adding code for doing that to few drivers[and I 
> > may
> > not know all of them either]
> >
> > Now I discovered during the flow in hibernation_platform_enter under resume 
> > devices that for such devices irq_startup is called which checks for 
> > IRQD_IRQ_STARTED flag and based on that it calls irq_enable or irq_startup.
> > They are only restarted if the flag is not set which is cleared during 
> > shutdown. 
> > shutdown_pirq does not do that. Only masking/unmasking of evtchn does not 
> > work 
> > as pirq needs to be restarted.
> > xen-pirq.enable_irq is called rather than stratup_pirq. On resume if these 
> > pirqs
> > are not restarted in this case ACPI SCI interrupts, I do not see receiving 
> > any interrupts under cat /proc/interrupts even though host keeps generating 
> > S4 ACPI events. 
> > Does that makes sense?
> 
> No. You still violate all abstraction boundaries. On one hand you use a
> XEN specific suspend function to shut down interrupts, but then you want
> the core code to reestablish them on resume. That's just bad hackery which
> abuses partial knowledge of core internals. The state flag is only one
> part of the core internal state and just clearing it does not make the
> rest consistent. It just works by chance and not by design and any
> change of the core code will break it in colourful ways.
> 
> So either you can handle it purely on the XEN side without touching any
> core state or you need to come up with some way of letting the core code
> know that it should invoke shutdown instead of disable.
> 
> Something like the completely untested patch below.
> 
> Thanks,
> 
>tglx
Understandable. Really appreciate the patch suggestion below and i will test it
for sure and see if things can be fixed properly in irq core if thats the only
option. In the meanwhile, I tried to fix it on xen side unless it gives you the 
same feeling as above? MSI-x are just fine, just ioapic ones don't get any event
channel asssigned hence enable_dynirq does nothing. Those needs to be restarted.

Thanks,
Anchal

<---

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 1bb0b522d004..2ed152f35816 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -575,6 +575,11 @@ static void shutdown_pirq(struct irq_data *data)

static void enable_pirq(struct irq_data *data)
{
+/*ioapic interrupts don't get event channel assigned
   + * after being explicitly shutdown during guest
   + * hibernation. They need to be restarted*/
+   if(!evtchn_from_irq(data->irq))
+   startup_pirq(data);
enable_dynirq(data);
 }

> 
> 8<
> 
> diff --git a/include/linux/irq.h b/include/linux/irq.h
> index 7853eb9301f2..50f2057bc339 100644
> --- a/include/linux/irq.h
> +++ b/include/linux/irq.h
> @@ -511,6 +511,7 @@ struct irq_chip {
>   * IRQCHIP_EOI_THREADED: Chip requires eoi() on unmask in threaded mode
>   * IRQCHIP_SUPPORTS_LEVEL_MSIChip can provide two doorbells for 
> Level MSIs
>   * IRQCHIP_SUPPORTS_NMI: Chip can deliver NMIs, only for root irqchips
> + * IRQCHIP_SHUTDOWN_ON_SUSPEND:  Shutdown non wake irqs in the suspend 
> path
>   */
>  enum {
>   IRQCHIP_SET_TYPE_MASKED = (1 <<  0),
> @@ -522,6 +523,7 @@ enum {
>   IRQCHIP_EOI_THREADED= (1 <

Re: [Xen-devel] [RFC PATCH V2 09/11] xen: Clear IRQD_IRQ_STARTED flag during shutdown PIRQs

2020-01-08 Thread Anchal Agarwal
On Wed, Jan 08, 2020 at 04:23:25PM +0100, Thomas Gleixner wrote:
> Anchal Agarwal  writes:
> 
> > shutdown_pirq is invoked during hibernation path and hence
> > PIRQs should be restarted during resume.
> > Before this commit'020db9d3c1dc0a' xen/events: Fix interrupt lost
> > during irq_disable and irq_enable startup_pirq was automatically
> > called during irq_enable however, after this commit pirq's did not
> > get explicitly started once resumed from hibernation.
> >
> > chip->irq_startup is called only if IRQD_IRQ_STARTED is unset during
> > irq_startup on resume. This flag gets cleared by free_irq->irq_shutdown
> > during suspend. free_irq() never gets explicitly called for ioapic-edge
> > and ioapic-level interrupts as respective drivers do nothing during
> > suspend/resume. So we shut them down explicitly in the first place in
> > syscore_suspend path to clear IRQ<>event channel mapping. shutdown_pirq
> > being called explicitly during suspend does not clear this flags, hence
> > .irq_enable is called in irq_startup during resume instead and pirq's
> > never start up.
> 
> What? 
> 
> > +void irq_state_clr_started(struct irq_desc *desc)
> >  {
> > irqd_clear(>irq_data, IRQD_IRQ_STARTED);
> >  }
> > +EXPORT_SYMBOL_GPL(irq_state_clr_started);
> 
> This is core internal state and not supposed to be fiddled with by
> drivers.
> 
> irq_chip has irq_suspend/resume/pm_shutdown callbacks for a reason.
>
I agree, as its mentioned in the previous patch {[RFC PATCH V2 08/11]} this is 
one way of explicitly shutting down legacy devices without introducing too much 
code for each of the legacy devices. . for eg. in case of floppy there 
is no suspend/freeze handler which should have done the needful.
.
Either we implement them for all the legacy devices that have them missing or
explicitly shutdown pirqs. I have choosen later for simplicity. I understand
that ideally we should enable/disable devices interrupts in suspend/resume 
devices but that requires adding code for doing that to few drivers[and I may
not know all of them either]

Now I discovered during the flow in hibernation_platform_enter under resume 
devices that for such devices irq_startup is called which checks for 
IRQD_IRQ_STARTED flag and based on that it calls irq_enable or irq_startup.
They are only restarted if the flag is not set which is cleared during 
shutdown. 
shutdown_pirq does not do that. Only masking/unmasking of evtchn does not work 
as pirq needs to be restarted.
xen-pirq.enable_irq is called rather than stratup_pirq. On resume if these pirqs
are not restarted in this case ACPI SCI interrupts, I do not see receiving 
any interrupts under cat /proc/interrupts even though host keeps generating 
S4 ACPI events. 
Does that makes sense?

Thanks,
Anchal
> Thanks,
> 
>tglx

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH V2 11/11] x86: tsc: avoid system instability in hibernation

2020-01-07 Thread Anchal Agarwal
From: Eduardo Valentin 

System instability are seen during resume from hibernation when system
is under heavy CPU load. This is due to the lack of update of sched
clock data, and the scheduler would then think that heavy CPU hog
tasks need more time in CPU, causing the system to freeze
during the unfreezing of tasks. For example, threaded irqs,
and kernel processes servicing network interface may be delayed
for several tens of seconds, causing the system to be unreachable.

Situation like this can be reported by using lockup detectors
such as workqueue lockup detectors:

[root@ip-172-31-67-114 ec2-user]# echo disk > /sys/power/state

Message from syslogd@ip-172-31-67-114 at May  7 18:23:21 ...
 kernel:BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 
57s!

Message from syslogd@ip-172-31-67-114 at May  7 18:23:21 ...
 kernel:BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 
57s!

Message from syslogd@ip-172-31-67-114 at May  7 18:23:21 ...
 kernel:BUG: workqueue lockup - pool cpus=3 node=0 flags=0x1 nice=0 stuck for 
57s!

Message from syslogd@ip-172-31-67-114 at May  7 18:29:06 ...
 kernel:BUG: workqueue lockup - pool cpus=3 node=0 flags=0x1 nice=0 stuck for 
403s!

The fix for this situation is to mark the sched clock as unstable
as early as possible in the resume path, leaving it unstable
for the duration of the resume process. This will force the
scheduler to attempt to align the sched clock across CPUs using
the delta with time of day, updating sched clock data. In a post
hibernation event, we can then mark the sched clock as stable
again, avoiding unnecessary syncs with time of day on systems
in which TSC is reliable.

Reviewed-by: Erik Quanstrom 
Reviewed-by: Frank van der Linden 
Reviewed-by: Balbir Singh 
Reviewed-by: Munehisa Kamata 
Tested-by: Anchal Agarwal 
Signed-off-by: Eduardo Valentin 
---
 arch/x86/kernel/tsc.c   | 29 +
 include/linux/sched/clock.h |  5 +
 kernel/sched/clock.c|  4 ++--
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 7e322e2daaf5..ae77b8bc4e46 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1534,3 +1535,31 @@ unsigned long calibrate_delay_is_known(void)
return 0;
 }
 #endif
+
+static int tsc_pm_notifier(struct notifier_block *notifier,
+   unsigned long pm_event, void *unused)
+{
+   switch (pm_event) {
+   case PM_HIBERNATION_PREPARE:
+   clear_sched_clock_stable();
+   break;
+   case PM_POST_HIBERNATION:
+   /* Set back to the default */
+   if (!check_tsc_unstable())
+   set_sched_clock_stable();
+   break;
+   }
+
+   return 0;
+};
+
+static struct notifier_block tsc_pm_notifier_block = {
+   .notifier_call = tsc_pm_notifier,
+};
+
+static int tsc_setup_pm_notifier(void)
+{
+   return register_pm_notifier(_pm_notifier_block);
+}
+
+subsys_initcall(tsc_setup_pm_notifier);
diff --git a/include/linux/sched/clock.h b/include/linux/sched/clock.h
index 867d588314e0..902654ac5f7e 100644
--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -32,6 +32,10 @@ static inline void clear_sched_clock_stable(void)
 {
 }
 
+static inline void set_sched_clock_stable(void)
+{
+}
+
 static inline void sched_clock_idle_sleep_event(void)
 {
 }
@@ -51,6 +55,7 @@ static inline u64 local_clock(void)
 }
 #else
 extern int sched_clock_stable(void);
+extern void set_sched_clock_stable(void);
 extern void clear_sched_clock_stable(void);
 
 /*
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 1152259a4ca0..374d40e5b1a2 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -116,7 +116,7 @@ static void __scd_stamp(struct sched_clock_data *scd)
scd->tick_raw = sched_clock();
 }
 
-static void __set_sched_clock_stable(void)
+void set_sched_clock_stable(void)
 {
struct sched_clock_data *scd;
 
@@ -236,7 +236,7 @@ static int __init sched_clock_init_late(void)
smp_mb(); /* matches {set,clear}_sched_clock_stable() */
 
if (__sched_clock_stable_early)
-   __set_sched_clock_stable();
+   set_sched_clock_stable();
 
return 0;
 }
-- 
2.15.3.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH V2 10/11] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

2020-01-07 Thread Anchal Agarwal
From: Aleksei Besogonov 

The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation
offset on a running kernel to enable hibernating to a swap file.
However, it doesn't actually update the swsusp_resume_block variable. As
a result, the hibernation fails at the last step (after all the data is
written out) in the validation of the swap signature in
mark_swapfiles().

Before this patch, the command line processing was the only place where
swsusp_resume_block was set.

Signed-off-by: Aleksei Besogonov 
Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
---
 kernel/power/user.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/power/user.c b/kernel/power/user.c
index 77438954cc2b..d396e313cb7b 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -374,8 +374,12 @@ static long snapshot_ioctl(struct file *filp, unsigned int 
cmd,
if (swdev) {
offset = swap_area.offset;
data->swap = swap_type_of(swdev, offset, NULL);
-   if (data->swap < 0)
+   if (data->swap < 0) {
error = -ENODEV;
+   } else {
+   swsusp_resume_device = swdev;
+   swsusp_resume_block = offset;
+   }
} else {
data->swap = -1;
error = -EINVAL;
-- 
2.15.3.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH V2 09/11] xen: Clear IRQD_IRQ_STARTED flag during shutdown PIRQs

2020-01-07 Thread Anchal Agarwal
shutdown_pirq is invoked during hibernation path and hence
PIRQs should be restarted during resume.
Before this commit'020db9d3c1dc0a' xen/events: Fix interrupt lost
during irq_disable and irq_enable startup_pirq was automatically
called during irq_enable however, after this commit pirq's did not
get explicitly started once resumed from hibernation.

chip->irq_startup is called only if IRQD_IRQ_STARTED is unset during
irq_startup on resume. This flag gets cleared by free_irq->irq_shutdown
during suspend. free_irq() never gets explicitly called for ioapic-edge
and ioapic-level interrupts as respective drivers do nothing during
suspend/resume. So we shut them down explicitly in the first place in
syscore_suspend path to clear IRQ<>event channel mapping. shutdown_pirq
being called explicitly during suspend does not clear this flags, hence
.irq_enable is called in irq_startup during resume instead and pirq's
never start up.

Signed-off-by: Anchal Agarwal 
---
 drivers/xen/events/events_base.c | 1 +
 include/linux/irq.h  | 1 +
 kernel/irq/chip.c| 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index b893536d8af4..aae7c4997b51 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1606,6 +1606,7 @@ void xen_shutdown_pirqs(void)
continue;
 
shutdown_pirq(irq_get_irq_data(info->irq));
+   irq_state_clr_started(irq_to_desc(info->irq));
}
 }
 
diff --git a/include/linux/irq.h b/include/linux/irq.h
index fb301cf29148..1e125cd22cf0 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -745,6 +745,7 @@ extern int irq_set_msi_desc(unsigned int irq, struct 
msi_desc *entry);
 extern int irq_set_msi_desc_off(unsigned int irq_base, unsigned int irq_offset,
struct msi_desc *entry);
 extern struct irq_data *irq_get_irq_data(unsigned int irq);
+extern void irq_state_clr_started(struct irq_desc *desc);
 
 static inline struct irq_chip *irq_get_chip(unsigned int irq)
 {
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index b76703b2c0af..3e8a36c673d6 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -173,10 +173,11 @@ static void irq_state_clr_masked(struct irq_desc *desc)
irqd_clear(>irq_data, IRQD_IRQ_MASKED);
 }
 
-static void irq_state_clr_started(struct irq_desc *desc)
+void irq_state_clr_started(struct irq_desc *desc)
 {
irqd_clear(>irq_data, IRQD_IRQ_STARTED);
 }
+EXPORT_SYMBOL_GPL(irq_state_clr_started);
 
 static void irq_state_set_started(struct irq_desc *desc)
 {
-- 
2.15.3.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH V2 08/11] x86/xen: close event channels for PIRQs in system core suspend callback

2020-01-07 Thread Anchal Agarwal
From: Munehisa Kamata 

There are no pm handlers for the legacy devices, so during tear down
stale event channel <> IRQ mapping may still remain in the image and resume
may fail. To avoid adding much code by implementing handlers for legacy
devices, add a simple helper function to "shutdown" active PIRQs, which
actually closes event channels but keeps related IRQ structures intact.
PM suspend/hibernation code will rely on this.
Close event channels allocated for devices which are backed by PIRQ and
still active when suspending the system core. Normally, the devices are
emulated legacy devices, e.g. PS/2 keyboard, floppy controller and etc.
Without this, in PM hibernation, information about the event channel
remains in hibernation image, but there is no guarantee that the same
event channel numbers are assigned to the devices when restoring the
system. This may cause conflict like the following and prevent some
devices from being restored correctly.

[  102.330821] [ cut here ]
[  102.333264] WARNING: CPU: 0 PID: 2324 at
drivers/xen/events/events_base.c:878 bind_evtchn_to_irq+0x88/0xf0
...
[  102.348057] Call Trace:
[  102.348057]  [] dump_stack+0x63/0x84
[  102.348057]  [] __warn+0xd1/0xf0
[  102.348057]  [] warn_slowpath_null+0x1d/0x20
[  102.348057]  [] bind_evtchn_to_irq+0x88/0xf0
[  102.348057]  [] ? blkif_copy_from_grant+0xb0/0xb0 
[xen_blkfront]
[  102.348057]  [] bind_evtchn_to_irqhandler+0x27/0x80
[  102.348057]  [] talk_to_blkback+0x425/0xcd0 [xen_blkfront]
[  102.348057]  [] ? __kmalloc+0x1ea/0x200
[  102.348057]  [] blkfront_restore+0x2d/0x60 [xen_blkfront]
[  102.348057]  [] xenbus_dev_restore+0x58/0x100
[  102.348057]  [] ?  xenbus_frontend_delayed_resume+0x20/0x20
[  102.348057]  [] xenbus_dev_cond_restore+0x1e/0x30
[  102.348057]  [] dpm_run_callback+0x4e/0x130
[  102.348057]  [] device_resume+0xe7/0x210
[  102.348057]  [] ? pm_dev_dbg+0x80/0x80
[  102.348057]  [] dpm_resume+0x114/0x2f0
[  102.348057]  [] hibernation_snapshot+0x15f/0x380
[  102.348057]  [] hibernate+0x183/0x290
[  102.348057]  [] state_store+0xcf/0xe0
[  102.348057]  [] kobj_attr_store+0xf/0x20
[  102.348057]  [] sysfs_kf_write+0x3a/0x50
[  102.348057]  [] kernfs_fop_write+0x10b/0x190
[  102.348057]  [] __vfs_write+0x28/0x120
[  102.348057]  [] ? rw_verify_area+0x49/0xb0
[  102.348057]  [] vfs_write+0xb2/0x1b0
[  102.348057]  [] SyS_write+0x46/0xa0
[  102.348057]  [] entry_SYSCALL_64_fastpath+0x1a/0xa9
[  102.423005] ---[ end trace b8d6718e22e2b107 ]---
[  102.425031] genirq: Flags mismatch irq 6.  (blkif) vs.  
(floppy)

Note that we don't explicitly re-allocate event channels for such
devices in the resume callback. Re-allocation will occur when PM core
re-enable IRQs for the devices at later point.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/suspend.c   |  2 ++
 drivers/xen/events/events_base.c | 12 
 include/xen/events.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index dae0f74f5390..affa63d4b6bd 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -105,6 +105,8 @@ static int xen_syscore_suspend(void)
xen_save_steal_clock(cpu);
}
 
+   xen_shutdown_pirqs();
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 569437c158ca..b893536d8af4 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1597,6 +1597,18 @@ void xen_irq_resume(void)
restore_pirqs();
 }
 
+void xen_shutdown_pirqs(void)
+{
+   struct irq_info *info;
+
+   list_for_each_entry(info, _irq_list_head, list) {
+   if (info->type != IRQT_PIRQ || !VALID_EVTCHN(info->evtchn))
+   continue;
+
+   shutdown_pirq(irq_get_irq_data(info->irq));
+   }
+}
+
 static struct irq_chip xen_dynamic_chip __read_mostly = {
.name   = "xen-dyn",
 
diff --git a/include/xen/events.h b/include/xen/events.h
index c0e6a0598397..39b2c4e4d2ef 100644
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -71,6 +71,7 @@ static inline void notify_remote_via_evtchn(int port)
 void notify_remote_via_irq(int irq);
 
 void xen_irq_resume(void);
+void xen_shutdown_pirqs(void);
 
 /* Clear an irq's pending state, in preparation for polling on it */
 void xen_clear_irq_pending(int irq);
-- 
2.15.3.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH V2 07/11] x86/xen: save and restore steal clock during hibernation

2020-01-07 Thread Anchal Agarwal
From: Munehisa Kamata 

Currently, steal time accounting code in scheduler expects steal clock
callback to provide monotonically increasing value. If the accounting
code receives a smaller value than previous one, it uses a negative
value to calculate steal time and results in incorrectly updated idle
and steal time accounting. This breaks userspace tools which read
/proc/stat.

top - 08:05:35 up  2:12,  3 users,  load average: 0.00, 0.07, 0.23
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,30100.0%id,  0.0%wa,  0.0%hi, 0.0%si,
-1253874204672.0%st

This can actually happen when a Xen PVHVM guest gets restored from
hibernation, because such a restored guest is just a fresh domain from
Xen perspective and the time information in runstate info starts over
from scratch.

Introduce xen_save_steal_clock() which saves current steal clock values
of all present CPUs in runstate info into per-cpu variables during system
core ops suspend callbacks. Its couterpart, xen_restore_steal_clock(),
restores a boot CPU's steal clock in the system core resume callback. It
sets offset if it found the current values in runstate info are smaller
than previous ones. xen_steal_clock() is also modified to use the offset
to ensure that scheduler only sees monotonically increasing number.

For non-boot CPUs, restore after they're brought up, because runstate
info for non-boot CPUs are not active until then.

[Anchal Changelog: Merged patch xen/time: introduce 
xen_{save,restore}_steal_clock
with this one for better code readability]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 arch/x86/xen/suspend.c | 13 -
 arch/x86/xen/time.c|  3 +++
 drivers/xen/time.c | 28 +++-
 include/xen/xen-ops.h  |  2 ++
 4 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 784c4484100b..dae0f74f5390 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -91,12 +91,20 @@ void xen_arch_suspend(void)
 static int xen_syscore_suspend(void)
 {
struct xen_remove_from_physmap xrfp;
-   int ret;
+   int cpu, ret;
 
/* Xen suspend does similar stuffs in its own logic */
if (xen_suspend_mode_is_xen_suspend())
return 0;
 
+   for_each_present_cpu(cpu) {
+   /*
+* Nonboot CPUs are already offline, but the last copy of
+* runstate info is still accessible.
+*/
+   xen_save_steal_clock(cpu);
+   }
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
@@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
 
pvclock_resume();
 
+   /* Nonboot CPUs will be resumed when they're brought up */
+   xen_restore_steal_clock(smp_processor_id());
+
gnttab_resume();
 }
 
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index befbdd8b17f0..8cf632dda605 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -537,6 +537,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
 {
int cpu = smp_processor_id();
xen_setup_runstate_info(cpu);
+   if (cpu)
+   xen_restore_steal_clock(cpu);
+
/*
 * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
 * doing it xen_hvm_cpu_notify (which gets called by smp_init during
diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 0968859c29d0..3713d716070c 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -20,6 +20,8 @@
 
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
+static DEFINE_PER_CPU(u64, xen_prev_steal_clock);
+static DEFINE_PER_CPU(u64, xen_steal_clock_offset);
 
 static DEFINE_PER_CPU(u64[4], old_runstate_time);
 
@@ -149,7 +151,7 @@ bool xen_vcpu_stolen(int vcpu)
return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
 }
 
-u64 xen_steal_clock(int cpu)
+static u64 __xen_steal_clock(int cpu)
 {
struct vcpu_runstate_info state;
 
@@ -157,6 +159,30 @@ u64 xen_steal_clock(int cpu)
return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
 }
 
+u64 xen_steal_clock(int cpu)
+{
+   return __xen_steal_clock(cpu) + per_cpu(xen_steal_clock_offset, cpu);
+}
+
+void xen_save_steal_clock(int cpu)
+{
+   per_cpu(xen_prev_steal_clock, cpu) = xen_steal_clock(cpu);
+}
+
+void xen_restore_steal_clock(int cpu)
+{
+   u64 steal_clock = __xen_steal_clock(cpu);
+
+   if (per_cpu(xen_prev_steal_clock, cpu) > steal_clock) {
+   /* Need to update the offset */
+   per_cpu(xen_steal_clock_offset, cpu) =
+   per_cpu(xen_prev_steal_clock, cpu) - steal_clock;
+   } else {
+   /* Avoid unnecessary steal clock warp */
+   per_cpu(xen_steal_clock_offset, cpu) = 0;
+   }
+}
+
 void xen_setup_runs

[Xen-devel] [RFC PATCH V2 06/11] xen-blkfront: add callbacks for PM suspend and hibernation

2020-01-07 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
events, need to implement these xenbus_driver callbacks.
The freeze handler stops a block-layer queue and disconnect the
frontend from the backend while freeing ring_info and associated resources.
The restore handler re-allocates ring_info and re-connect to the
backend, so the rest of the kernel can continue to use the block device
transparently. Also, the handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks for
Xen suspend without modification. Before disconnecting from backend,
we need to prevent any new IO from being queued and wait for existing
IO to complete. Freeze/unfreeze of the queues will guarantee that there
are no requests in use on the shared ring.

Note:For older backends,if a backend doesn't have commit'12ea729645ace'
xen/blkback: unmap all persistent grants when frontend gets disconnected,
the frontend may see massive amount of grant table warning when freeing
resources.
[   36.852659] deferring g.e. 0xf9 (pfn 0x)
[   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!

In this case, persistent grants would need to be disabled.

[Anchal Changelog: Removed timeout/request during blkfront freeze.
Fixed major part of the code to work with blk-mq]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/block/xen-blkfront.c | 119 ---
 1 file changed, 112 insertions(+), 7 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index a74d03913822..b1d38ca4600f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -47,6 +47,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -79,6 +81,8 @@ enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
BLKIF_STATE_SUSPENDED,
+   BLKIF_STATE_FREEZING,
+   BLKIF_STATE_FROZEN
 };
 
 struct grant {
@@ -220,6 +224,7 @@ struct blkfront_info
struct list_head requests;
struct bio_list bio_list;
struct list_head info_list;
+   struct completion wait_backend_disconnected;
 };
 
 static unsigned int nr_minors;
@@ -261,6 +266,7 @@ static DEFINE_SPINLOCK(minor_lock);
 static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
 static void blkfront_gather_backend_features(struct blkfront_info *info);
 static int negotiate_mq(struct blkfront_info *info);
+static void __blkif_free(struct blkfront_info *info);
 
 static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
@@ -995,6 +1001,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
blkif_set_queue_limits(info);
+   init_completion(>wait_backend_disconnected);
 
return 0;
 }
@@ -1218,6 +1225,8 @@ static void xlvbd_release_gendisk(struct blkfront_info 
*info)
 /* Already hold rinfo->ring_lock. */
 static inline void kick_pending_request_queues_locked(struct 
blkfront_ring_info *rinfo)
 {
+   if (unlikely(rinfo->dev_info->connected == BLKIF_STATE_FREEZING))
+   return;
if (!RING_FULL(>ring))
blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
 }
@@ -1341,8 +1350,6 @@ static void blkif_free_ring(struct blkfront_ring_info 
*rinfo)
 
 static void blkif_free(struct blkfront_info *info, int suspend)
 {
-   unsigned int i;
-
/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1350,6 +1357,13 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
if (info->rq)
blk_mq_stop_hw_queues(info->rq);
 
+   __blkif_free(info);
+}
+
+static void __blkif_free(struct blkfront_info *info)
+{
+   unsigned int i;
+
for (i = 0; i < info->nr_rings; i++)
blkif_free_ring(>rinfo[i]);
 
@@ -1553,8 +1567,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->dev_info;
 
-   if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
-   return IRQ_HANDLED;
+   if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) {
+   if (info->connected != BLKIF_STATE_FREEZING)
+   return IRQ_HANDLED;
+   }
 
spin_lock_irqsave(>ring_lock, flags);
  again:
@@ -2020,6 +2036,7 @@ static int blkif_recover(struct blkfront_info *info)
struct bio *bio;
unsigned int segs;
 
+   bool frozen = info->connected == BL

[Xen-devel] [RFC PATCH V2 05/11] xen-netfront: add callbacks for PM suspend and hibernation support

2020-01-07 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze, thaw and restore callbacks for PM suspend and hibernation
support. The freeze handler simply disconnects the frotnend from the
backend and frees resources associated with queues after disabling the
net_device from the system. The restore handler just changes the
frontend state and let the xenbus handler to re-allocate the resources
and re-connect to the backend. This can be performed transparently to
the rest of the system. The handlers are used for both PM suspend and
hibernation so that we can keep the existing suspend/resume callbacks
for Xen suspend without modification. Freezing netfront devices is
normally expected to finish within a few hundred milliseconds, but it
can rarely take more than 5 seconds and hit the hard coded timeout,
it would depend on backend state which may be congested and/or have
complex configuration. While it's rare case, longer default timeout
seems a bit more reasonable here to avoid hitting the timeout.
Also, make it configurable via module parameter so that we can cover
broader setups than what we know currently.

[Anchal changelog: Variable name fix and checkpatch.pl fixes]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/net/xen-netfront.c | 98 +-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 467fd0f0ffcd..aa7ef40378ca 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -56,6 +57,12 @@
 #include 
 #include 
 
+enum netif_freeze_state {
+   NETIF_FREEZE_STATE_UNFROZEN,
+   NETIF_FREEZE_STATE_FREEZING,
+   NETIF_FREEZE_STATE_FROZEN,
+};
+
 /* Module parameters */
 #define MAX_QUEUES_DEFAULT 8
 static unsigned int xennet_max_queues;
@@ -63,6 +70,12 @@ module_param_named(max_queues, xennet_max_queues, uint, 
0644);
 MODULE_PARM_DESC(max_queues,
 "Maximum number of queues per virtual interface");
 
+static unsigned int netfront_freeze_timeout_secs = 10;
+module_param_named(freeze_timeout_secs,
+  netfront_freeze_timeout_secs, uint, 0644);
+MODULE_PARM_DESC(freeze_timeout_secs,
+"timeout when freezing netfront device in seconds");
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -160,6 +173,10 @@ struct netfront_info {
struct netfront_stats __percpu *tx_stats;
 
atomic_t rx_gso_checksum_fixup;
+
+   int freeze_state;
+
+   struct completion wait_backend_disconnected;
 };
 
 struct netfront_rx_info {
@@ -721,6 +738,21 @@ static int xennet_close(struct net_device *dev)
return 0;
 }
 
+static int xennet_disable_interrupts(struct net_device *dev)
+{
+   struct netfront_info *np = netdev_priv(dev);
+   unsigned int num_queues = dev->real_num_tx_queues;
+   unsigned int queue_index;
+   struct netfront_queue *queue;
+
+   for (queue_index = 0; queue_index < num_queues; ++queue_index) {
+   queue = >queues[queue_index];
+   disable_irq(queue->tx_irq);
+   disable_irq(queue->rx_irq);
+   }
+   return 0;
+}
+
 static void xennet_move_rx_slot(struct netfront_queue *queue, struct sk_buff 
*skb,
grant_ref_t ref)
 {
@@ -1301,6 +1333,8 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
 
np->queues = NULL;
 
+   init_completion(>wait_backend_disconnected);
+
err = -ENOMEM;
np->rx_stats = netdev_alloc_pcpu_stats(struct netfront_stats);
if (np->rx_stats == NULL)
@@ -1794,6 +1828,50 @@ static int xennet_create_queues(struct netfront_info 
*info,
return 0;
 }
 
+static int netfront_freeze(struct xenbus_device *dev)
+{
+   struct netfront_info *info = dev_get_drvdata(>dev);
+   unsigned long timeout = netfront_freeze_timeout_secs * HZ;
+   int err = 0;
+
+   xennet_disable_interrupts(info->netdev);
+
+   netif_device_detach(info->netdev);
+
+   info->freeze_state = NETIF_FREEZE_STATE_FREEZING;
+
+   /* Kick the backend to disconnect */
+   xenbus_switch_state(dev, XenbusStateClosing);
+
+   /* We don't want to move forward before the frontend is diconnected
+* from the backend cleanly.
+*/
+   timeout = wait_for_completion_timeout(>wait_backend_disconnected,
+ timeout);
+   if (!timeout) {
+   err = -EBUSY;
+   xenbus_dev_error(dev, err, "Freezing timed out;"
+"the device may become inconsistent state");
+   return err;
+   }
+
+   /* Tear down queues */
+   xennet_disconnect_backend(info);
+   xennet_destroy_queues(info);
+
+   info->freeze_state

[Xen-devel] [RFC PATCH V2 04/11] x86/xen: add system core suspend and resume callbacks

2020-01-07 Thread Anchal Agarwal
From: Munehisa Kamata 

Add Xen PVHVM specific system core callbacks for PM suspend and
hibernation support. The callbacks suspend and resume Xen
primitives,like shared_info, pvclock and grant table. Note that
Xen suspend can handle them in a different manner, but system
core callbacks are called from the context. So if the callbacks
are called from Xen suspend context, return immediately.

Signed-off-by: Agarwal Anchal 
Signed-off-by: Munehisa Kamata 
---
 arch/x86/xen/enlighten_hvm.c |  1 +
 arch/x86/xen/suspend.c   | 53 
 include/xen/xen-ops.h|  3 +++
 3 files changed, 57 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 75b1ec7a0fcd..138e71786e03 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -204,6 +204,7 @@ static void __init xen_hvm_guest_init(void)
if (xen_feature(XENFEAT_hvm_callback_vector))
xen_have_vector_callback = 1;
 
+   xen_setup_syscore_ops();
xen_hvm_smp_init();
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 1d83152c761b..784c4484100b 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -2,17 +2,22 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -82,3 +87,51 @@ void xen_arch_suspend(void)
 
on_each_cpu(xen_vcpu_notify_suspend, NULL, 1);
 }
+
+static int xen_syscore_suspend(void)
+{
+   struct xen_remove_from_physmap xrfp;
+   int ret;
+
+   /* Xen suspend does similar stuffs in its own logic */
+   if (xen_suspend_mode_is_xen_suspend())
+   return 0;
+
+   xrfp.domid = DOMID_SELF;
+   xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
+
+   ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, );
+   if (!ret)
+   HYPERVISOR_shared_info = _dummy_shared_info;
+
+   return ret;
+}
+
+static void xen_syscore_resume(void)
+{
+   /* Xen suspend does similar stuffs in its own logic */
+   if (xen_suspend_mode_is_xen_suspend())
+   return;
+
+   /* No need to setup vcpu_info as it's already moved off */
+   xen_hvm_map_shared_info();
+
+   pvclock_resume();
+
+   gnttab_resume();
+}
+
+/*
+ * These callbacks will be called with interrupts disabled and when having only
+ * one CPU online.
+ */
+static struct syscore_ops xen_hvm_syscore_ops = {
+   .suspend = xen_syscore_suspend,
+   .resume = xen_syscore_resume
+};
+
+void __init xen_setup_syscore_ops(void)
+{
+   if (xen_hvm_domain())
+   register_syscore_ops(_hvm_syscore_ops);
+}
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 6c36e161dfd1..3b3992b5b0c2 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -43,6 +43,9 @@ int xen_setup_shutdown_event(void);
 bool xen_suspend_mode_is_xen_suspend(void);
 bool xen_suspend_mode_is_pm_suspend(void);
 bool xen_suspend_mode_is_pm_hibernation(void);
+
+void xen_setup_syscore_ops(void);
+
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.15.3.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH V2 03/11] x86/xen: Introduce new function to map

2020-01-07 Thread Anchal Agarwal
Introduce a small function which re-uses shared page's PA allocated
during guest initialization time in reserve_shared_info() and not
allocate new page during resume flow.
It also  does the mapping of shared_info_page by calling
xen_hvm_init_shared_info() to use the function.

Signed-off-by: Anchal Agarwal 
---
 arch/x86/xen/enlighten_hvm.c | 7 +++
 arch/x86/xen/xen-ops.h   | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index e138f7de52d2..75b1ec7a0fcd 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -27,6 +27,13 @@
 
 static unsigned long shared_info_pfn;
 
+void xen_hvm_map_shared_info(void)
+{
+   xen_hvm_init_shared_info();
+   if (shared_info_pfn)
+   HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
+}
+
 void xen_hvm_init_shared_info(void)
 {
struct xen_add_to_physmap xatp;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index 45a441c33d6d..d84c357994bd 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -56,6 +56,7 @@ void xen_enable_syscall(void);
 void xen_vcpu_restore(void);
 
 void xen_callback_vector(void);
+void xen_hvm_map_shared_info(void);
 void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
-- 
2.15.3.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH V2 02/11] xenbus: add freeze/thaw/restore callbacks support

2020-01-07 Thread Anchal Agarwal
From: Munehisa Kamata 

Since commit b3e96c0c7562 ("xen: use freeze/restore/thaw PM events for
suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and
PMSG_RESTORE events for Xen suspend. However, they're actually assigned
to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume()
respectively, and only suspend and resume callbacks are supported at
driver level. To support PM suspend and PM hibernation, modify the bus
level PM callbacks to invoke not only device driver's suspend/resume but
also freeze/thaw/restore.

Note that we'll use freeze/restore callbacks even for PM suspend whereas
suspend/resume callbacks are normally used in the case, becausae the
existing xenbus device drivers already have suspend/resume callbacks
specifically designed for Xen suspend. So we can allow the device
drivers to keep the existing callbacks wihtout modification.

[Anchal Changelog: Refactored the callbacks code]
Signed-off-by: Agarwal Anchal 
Signed-off-by: Munehisa Kamata 
---
 drivers/xen/xenbus/xenbus_probe.c | 99 ---
 include/xen/xenbus.h  |  3 ++
 2 files changed, 84 insertions(+), 18 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c 
b/drivers/xen/xenbus/xenbus_probe.c
index 5b471889d723..0fa868c2 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -597,27 +598,44 @@ int xenbus_dev_suspend(struct device *dev)
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
-
+   bool xen_suspend = xen_suspend_mode_is_xen_suspend();
DPRINTK("%s", xdev->nodename);
 
if (dev->driver == NULL)
return 0;
drv = to_xenbus_driver(dev->driver);
-   if (drv->suspend)
-   err = drv->suspend(xdev);
-   if (err)
-   pr_warn("suspend %s failed: %i\n", dev_name(dev), err);
+
+   if (xen_suspend) {
+   if (drv->suspend)
+   err = drv->suspend(xdev);
+   } else {
+   if (drv->freeze) {
+   err = drv->freeze(xdev);
+   if (!err) {
+   free_otherend_watch(xdev);
+   free_otherend_details(xdev);
+   return 0;
+   }
+   }
+   }
+
+   if (err) {
+   pr_warn("%s %s failed: %i\n", xen_suspend ?
+   "suspend" : "freeze", dev_name(dev), err);
+   return err;
+   }
+
return 0;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_suspend);
 
 int xenbus_dev_resume(struct device *dev)
 {
-   int err;
+   int err = 0;
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
-
+   bool xen_suspend = xen_suspend_mode_is_xen_suspend();
DPRINTK("%s", xdev->nodename);
 
if (dev->driver == NULL)
@@ -625,24 +643,32 @@ int xenbus_dev_resume(struct device *dev)
drv = to_xenbus_driver(dev->driver);
err = talk_to_otherend(xdev);
if (err) {
-   pr_warn("resume (talk_to_otherend) %s failed: %i\n",
+   pr_warn("%s (talk_to_otherend) %s failed: %i\n",
+   xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
 
-   xdev->state = XenbusStateInitialising;
+   if (xen_suspend) {
+   xdev->state = XenbusStateInitialising;
+   if (drv->resume)
+   err = drv->resume(xdev);
+   } else {
+   if (drv->restore)
+   err = drv->restore(xdev);
+   }
 
-   if (drv->resume) {
-   err = drv->resume(xdev);
-   if (err) {
-   pr_warn("resume %s failed: %i\n", dev_name(dev), err);
-   return err;
-   }
+   if (err) {
+   pr_warn("%s %s failed: %i\n",
+   xen_suspend ? "resume" : "restore",
+   dev_name(dev), err);
+   return err;
}
 
err = watch_otherend(xdev);
if (err) {
-   pr_warn("resume (watch_otherend) %s failed: %d.\n",
+   pr_warn("%s (watch_otherend) %s failed: %d.\n",
+   xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
@@ -653,8 +679,45 @@ EXPORT_SYMBOL_GPL(xenbus_dev_resume);
 
 int xenbus_dev_cancel(struct device *dev)
 {
-   /* Do nothing */
-   DPRINTK("cancel");
+   int err = 0;
+   struct xenbus_driver *drv;
+   struct xenbus_device *xdev
+   = container_of(dev, struct xenbus_device, dev);
+   bool xen_suspend = 

[Xen-devel] [RFC PATCH V2 01/11] xen/manage: keep track of the on-going suspend mode

2020-01-07 Thread Anchal Agarwal
From: Munehisa Kamata 

Guest hibernation is different from xen suspend/resume/live migration.
Xen save/restore does not use pm_ops as is needed by guest hibernation.
Hibernation in guest follows ACPI path and is guest inititated , the
hibernation image is saved within guest as compared to later modes
which are xen toolstack assisted and image creation/storage is in
control of hypervisor/host machine.
To differentiate between Xen suspend and PM hibernation, keep track
of the on-going suspend mode by mainly using a new PM notifier.
Introduce simple functions which help to know the on-going suspend mode
so that other Xen-related code can behave differently according to the
current suspend mode.
Since Xen suspend doesn't have corresponding PM event, its main logic
is modfied to acquire pm_mutex and set the current mode.

Though, acquirng pm_mutex is still right thing to do, we may
see deadlock if PM hibernation is interrupted by Xen suspend.
PM hibernation depends on xenwatch thread to process xenbus state
transactions, but the thread will sleep to wait pm_mutex which is
already held by PM hibernation context in the scenario. Xen shutdown
code may need some changes to avoid the issue.

[Anchal Changelog: Merged patch xen/manage: introduce helper function
to know the on-going suspend mode into this one for better readability]
Signed-off-by: Anchal Agarwal 
Signed-off-by: Munehisa Kamata 
---
 drivers/xen/manage.c  | 73 +++
 include/xen/xen-ops.h |  3 +++
 2 files changed, 76 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index cd046684e0d1..0b30ab522b77 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -40,6 +41,31 @@ enum shutdown_state {
 /* Ignore multiple shutdown requests. */
 static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
 
+enum suspend_modes {
+   NO_SUSPEND = 0,
+   XEN_SUSPEND,
+   PM_SUSPEND,
+   PM_HIBERNATION,
+};
+
+/* Protected by pm_mutex */
+static enum suspend_modes suspend_mode = NO_SUSPEND;
+
+bool xen_suspend_mode_is_xen_suspend(void)
+{
+   return suspend_mode == XEN_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_suspend(void)
+{
+   return suspend_mode == PM_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_hibernation(void)
+{
+   return suspend_mode == PM_HIBERNATION;
+}
+
 struct suspend_info {
int cancelled;
 };
@@ -99,6 +125,10 @@ static void do_suspend(void)
int err;
struct suspend_info si;
 
+   lock_system_sleep();
+
+   suspend_mode = XEN_SUSPEND;
+
shutting_down = SHUTDOWN_SUSPEND;
 
err = freeze_processes();
@@ -162,6 +192,10 @@ static void do_suspend(void)
thaw_processes();
 out:
shutting_down = SHUTDOWN_INVALID;
+
+   suspend_mode = NO_SUSPEND;
+
+   unlock_system_sleep();
 }
 #endif /* CONFIG_HIBERNATE_CALLBACKS */
 
@@ -387,3 +421,42 @@ int xen_setup_shutdown_event(void)
 EXPORT_SYMBOL_GPL(xen_setup_shutdown_event);
 
 subsys_initcall(xen_setup_shutdown_event);
+
+static int xen_pm_notifier(struct notifier_block *notifier,
+  unsigned long pm_event, void *unused)
+{
+   switch (pm_event) {
+   case PM_SUSPEND_PREPARE:
+   suspend_mode = PM_SUSPEND;
+   break;
+   case PM_HIBERNATION_PREPARE:
+   case PM_RESTORE_PREPARE:
+   suspend_mode = PM_HIBERNATION;
+   break;
+   case PM_POST_SUSPEND:
+   case PM_POST_RESTORE:
+   case PM_POST_HIBERNATION:
+   /* Set back to the default */
+   suspend_mode = NO_SUSPEND;
+   break;
+   default:
+   pr_warn("Receive unknown PM event 0x%lx\n", pm_event);
+   return -EINVAL;
+   }
+
+   return 0;
+};
+
+static struct notifier_block xen_pm_notifier_block = {
+   .notifier_call = xen_pm_notifier
+};
+
+static int xen_setup_pm_notifier(void)
+{
+   if (!xen_hvm_domain())
+   return -ENODEV;
+
+   return register_pm_notifier(_pm_notifier_block);
+}
+
+subsys_initcall(xen_setup_pm_notifier);
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index d89969aa9942..6c36e161dfd1 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -40,6 +40,9 @@ u64 xen_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
+bool xen_suspend_mode_is_xen_suspend(void);
+bool xen_suspend_mode_is_pm_suspend(void);
+bool xen_suspend_mode_is_pm_hibernation(void);
 extern unsigned long *xen_contiguous_bitmap;
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_ARM) || defined(CONFIG_ARM64)
-- 
2.15.3.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH V2 00/11] Enable PM hibernation on guest VMs

2020-01-07 Thread Anchal Agarwal
Hello,
I am sending out a V2 version of series of patches that implements guest 
PM hibernation.
These guests are running on xen hypervisor. The patches had been tested
against mainstream kernel. EC2 instance hibernation feature is provided 
to the AWS EC2 customers. PM hibernation uses swap space carved out within 
the guest[or can be a separate partition], where hibernation image is 
stored and restored from.

Why is guest hibenration needed:
Doing guest hibernation does not involve any support from hypervisor and this
way guest has complete control over its state. Infrastructure restrictions like
saving up guest state etc can be overcome by guest initiated hibernation.

This series includes some improvements over RFC series sent last year:
https://lists.xenproject.org/archives/html/xen-devel/2018-06/msg00823.html

Any comments or suggestions are welcome.

Changelog v2:
1. Removed timeout/request present on the ring in xen-blkfront during blkfront 
freeze
2. Fixed restoring of PIRQs which was apparently working for 4.9 kernels but 
not for
newer kernel. [Legacy irqs were no longer restored after hibernation introduced 
with
this commit "020db9d3c1dc0"]
3. Merged couple of related patches to make the code more coherent and readable
4. Code refactoring
5. Sched clock fix when hibernating guest is under heavy CPU load
Note: Under very rare circumstances we see resume failures with KASLR enabled 
only
on xen instances.  We are roughly seeing 3% failures [>1000 runs] when testing 
with
various instance sizes and some workload running on each instance. I am 
currently
investigating the issue as to confirm if its a xen issue or kernel issue.
However, it should not hold back anyone from reviewing/accepting these patches.

Testing done:
All the testing is done using amazon linux images w/t stock upstream kernel
installed. All testing is done for multiple hibernation cycle.

i. multiple loops[~100] of hibernation in disk mode  w/t 5.4 guest 
kernel + 4.11 xen
ii. Hibernation tested with memory stress tester running in background on 
smaller and
larger instance sizes on EC2.[>500 runs]
iii. Testing is also done on physical host machine[Ubuntu18.04/4.15 
kernel/stock xen-4.6]
running amazon linux 2 OS as guest VM with multiple queues.
iv. Ran dd to write a large file with bs=1k and hibernated multiple times

Testing How to:
---
Example:
Set up a file-backed swap space. Swap file size>=Total memory on the system
sudo dd if=/dev/zero of=/swap bs=$(( 1024 * 1024 )) count=4096 # 4096MiB
sudo chmod 600 /swap
sudo mkswap /swap
sudo swapon /swap

Update resume device/resume offset in grub if using swap file:
resume=/dev/xvda1 resume_offset=200704

Execute:

sudo pm-hibernate
OR
echo disk > /sys/power/state && echo reboot > /sys/power/disk

Compute resume offset code:
"
#!/usr/bin/env python
import sys
import array
import fcntl

#swap file
f = open(sys.argv[1], 'r')
buf = array.array('L', [0])

#FIBMAP
ret = fcntl.ioctl(f.fileno(), 0x01, buf)
print buf[0]
"

Aleksei Besogonov (1):
  PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

Anchal Agarwal (2):
  x86/xen: Introduce new function to map HYPERVISOR_shared_info on
Resume
  xen: Clear IRQD_IRQ_STARTED flag during shutdown PIRQs

Eduardo Valentin (1):
  x86: tsc: avoid system instability in hibernation

Munehisa Kamata (7):
  xen/manage: keep track of the on-going suspend mode
  xenbus: add freeze/thaw/restore callbacks support
  x86/xen: add system core suspend and resume callbacks
  xen-netfront: add callbacks for PM suspend and hibernation support
  xen-blkfront: add callbacks for PM suspend and hibernation
  x86/xen: save and restore steal clock during hibernation
  x86/xen: close event channels for PIRQs in system core suspend
callback

 arch/x86/kernel/tsc.c |  29 ++
 arch/x86/xen/enlighten_hvm.c  |   8 +++
 arch/x86/xen/suspend.c|  66 +
 arch/x86/xen/time.c   |   3 +
 arch/x86/xen/xen-ops.h|   1 +
 drivers/block/xen-blkfront.c  | 119 +++---
 drivers/net/xen-netfront.c|  98 ++-
 drivers/xen/events/events_base.c  |  13 +
 drivers/xen/manage.c  |  73 +++
 drivers/xen/time.c|  28 -
 drivers/xen/xenbus/xenbus_probe.c |  99 +--
 include/linux/irq.h   |   1 +
 include/linux/sched/clock.h   |   5 ++
 include/xen/events.h  |   1 +
 include/xen/xen-ops.h |   8 +++
 include/xen/xenbus.h  |   3 +
 kernel/irq/chip.c |   3 +-
 kernel/power/user.c   |   6 +-
 kernel/sched/clock.c  |   4 +-
 19 files changed, 537 insertions(+), 31 deletions(-)

-- 
2.15.3.AMZN


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
ht

Re: [Xen-devel] [PATCH] xen/netfront: Remove unneeded .resume callback

2019-03-28 Thread Anchal Agarwal
On Wed, Mar 27, 2019 at 08:40:20AM +0200, Oleksandr Andrushchenko wrote:
> On 3/25/19 7:30 PM, Anchal Agarwal wrote:
> >On Fri, Mar 22, 2019 at 10:44:33AM +, Oleksandr Andrushchenko wrote:
> >>On 3/20/19 5:50 AM, Munehisa Kamata wrote:
> >>>On 3/18/2019 3:02 AM, Oleksandr Andrushchenko wrote:
> >>>>+Amazon
> >>>>pls see inline
> >>>Hi Oleksandr,
> >>>
> >>>Let me add some comments as the original author of the series.
> >>Thank you for your work!
> >Hi Oleksandr,
> >>>>On 3/14/19 9:00 PM, Julien Grall wrote:
> >>>>>Hi,
> >>>>>
> >>>>>On 3/14/19 3:40 PM, Boris Ostrovsky wrote:
> >>>>>>On 3/14/19 11:10 AM, Oleksandr Andrushchenko wrote:
> >>>>>>>On 3/14/19 5:02 PM, Boris Ostrovsky wrote:
> >>>>>>>>On 3/14/19 10:52 AM, Oleksandr Andrushchenko wrote:
> >>>>>>>>>On 3/14/19 4:47 PM, Boris Ostrovsky wrote:
> >>>>>>>>>>On 3/14/19 9:17 AM, Oleksandr Andrushchenko wrote:
> >>>>>>>>>>>From: Oleksandr Andrushchenko 
> >>>>>>>>>>>
> >>>>>>>>>>>Currently on driver resume we remove all the network queues and
> >>>>>>>>>>>destroy shared Tx/Rx rings leaving the driver in its current state
> >>>>>>>>>>>and never signaling the backend of this frontend's state change.
> >>>>>>>>>>>This leads to the number of consequences:
> >>>>>>>>>>>- when frontend withdraws granted references to the rings etc. it
> >>>>>>>>>>>cannot
> >>>>>>>>>>>   be cleanly done as the backend still holds those (it 
> >>>>>>>>>>> was not
> >>>>>>>>>>>told to
> >>>>>>>>>>>   free the resources)
> >>>>>>>>>>>- it is not possible to resume driver operation as all the
> >>>>>>>>>>>communication
> >>>>>>>>>>>   means with the backned were destroyed by the frontend, 
> >>>>>>>>>>> thus
> >>>>>>>>>>>   making the frontend appear to the guest OS as 
> >>>>>>>>>>> functional, but
> >>>>>>>>>>>   not really.
> >>>>>>>>>>What do you mean? Are you saying that after resume you lose
> >>>>>>>>>>connectivity?
> >>>>>>>>>Exactly, if you take a look at the .resume callback as it is now
> >>>>>>>>>what it does it destroys the rings etc. and never notifies the 
> >>>>>>>>>backend
> >>>>>>>>>of that, e.g. it stays in, say, connected state with communication
> >>>>>>>>>channels destroyed. It never goes into any other Xen bus state, so
> >>>>>>>>>there is
> >>>>>>>>>no way its state machine can help recovering.
> >>>>>>>>My tree is about a month old so perhaps there is some sort of 
> >>>>>>>>regression
> >>>>>>>>but this certainly works for me. After resume netfront gets
> >>>>>>>>XenbusStateInitWait from backend which causes xennet_connect().
> >>>>>>>Ah, the difference can be of the way we get the guest enter
> >>>>>>>the suspend state. I am making my guest to suspend with:
> >>>>>>>echo mem > /sys/power/state
> >>>>>>>And then I use an interrupt to the guest (this is a test code)
> >>>>>>>to wake it up.
> >>>>>>>Could you please share your exact use-case when the guest enters 
> >>>>>>>suspend
> >>>>>>>and what you do to resume it?
> >>>>>>xl save / xl restore
> >>>>>>
> >>>>>>>I can see no way backend may want enter XenbusStateInitWait in my
> >>>>>>>use-case
> >>>>>>>as it simply doesn't know we want him to.
> >>>>>>Yours looks like ACPI path, I don't know how well it was tested TBH.
> >>>>>I remember a series from amazon [1] th

Re: [Xen-devel] [PATCH] xen/netfront: Remove unneeded .resume callback

2019-03-25 Thread Anchal Agarwal
On Fri, Mar 22, 2019 at 10:44:33AM +, Oleksandr Andrushchenko wrote:
> 
> On 3/20/19 5:50 AM, Munehisa Kamata wrote:
> > On 3/18/2019 3:02 AM, Oleksandr Andrushchenko wrote:
> >> +Amazon
> >> pls see inline
> > Hi Oleksandr,
> >
> > Let me add some comments as the original author of the series.
> Thank you for your work!
Hi Oleksandr,
> >
> >> On 3/14/19 9:00 PM, Julien Grall wrote:
> >>> Hi,
> >>>
> >>> On 3/14/19 3:40 PM, Boris Ostrovsky wrote:
>  On 3/14/19 11:10 AM, Oleksandr Andrushchenko wrote:
> > On 3/14/19 5:02 PM, Boris Ostrovsky wrote:
> >> On 3/14/19 10:52 AM, Oleksandr Andrushchenko wrote:
> >>> On 3/14/19 4:47 PM, Boris Ostrovsky wrote:
>  On 3/14/19 9:17 AM, Oleksandr Andrushchenko wrote:
> > From: Oleksandr Andrushchenko 
> >
> > Currently on driver resume we remove all the network queues and
> > destroy shared Tx/Rx rings leaving the driver in its current state
> > and never signaling the backend of this frontend's state change.
> > This leads to the number of consequences:
> > - when frontend withdraws granted references to the rings etc. it
> > cannot
> >   be cleanly done as the backend still holds those (it was 
> > not
> > told to
> >   free the resources)
> > - it is not possible to resume driver operation as all the
> > communication
> >   means with the backned were destroyed by the frontend, 
> > thus
> >   making the frontend appear to the guest OS as functional, 
> > but
> >   not really.
>  What do you mean? Are you saying that after resume you lose
>  connectivity?
> >>> Exactly, if you take a look at the .resume callback as it is now
> >>> what it does it destroys the rings etc. and never notifies the backend
> >>> of that, e.g. it stays in, say, connected state with communication
> >>> channels destroyed. It never goes into any other Xen bus state, so
> >>> there is
> >>> no way its state machine can help recovering.
> >> My tree is about a month old so perhaps there is some sort of 
> >> regression
> >> but this certainly works for me. After resume netfront gets
> >> XenbusStateInitWait from backend which causes xennet_connect().
> > Ah, the difference can be of the way we get the guest enter
> > the suspend state. I am making my guest to suspend with:
> > echo mem > /sys/power/state
> > And then I use an interrupt to the guest (this is a test code)
> > to wake it up.
> > Could you please share your exact use-case when the guest enters suspend
> > and what you do to resume it?
> 
>  xl save / xl restore
> 
> > I can see no way backend may want enter XenbusStateInitWait in my
> > use-case
> > as it simply doesn't know we want him to.
> 
>  Yours looks like ACPI path, I don't know how well it was tested TBH.
> >>> I remember a series from amazon [1] that plays around suspend and 
> >>> hibernation. The patch [2] leads me to think that guest triggered 
> >>> suspend/resume does not work properly. It looks like the series has never 
> >>> been fully reviewed. Not sure why...
> >> Julien, thanks a lot for bringing these patches to our attention which we 
> >> obviously missed.
> >>> Anyway, from my understanding this series may solve Oleksandr issue. 
> >>> However, this would only address the common code side. AFAIK Oleksandr is 
> >>> targeting Arm platform. If so, I think this would require more work than 
> >>> this series. Arm code still miss few bits properly suspend/resume arch 
> >>> specific code (see [2]).
> >>>
> >>> I have a branch on my git to track the series. However, they never have 
> >>> been resent after Ian Campbell left Citrix. I would be happy to review 
> >>> them if someone wants to pick them up and repost them.
> >>>
> >> First of all, let me make it clear that we are interested in hibernation 
> >> long term, so it would be
> >> desirable to re-use as much work form resume/suspend as we can. But, we 
> >> see it as a step by
> >> step work, e.g. first S2RAM and later on hibernation.
> >> Let me clarify the immediate use-case that we have, so it is easier to 
> >> understand what we want
> >> and what we don't at the moment. We are about to continue work started by 
> >> Mirela/Xilinx on
> >> Suspend-to-RAM for ARM [3] and we made number of assumptions:
> >> 1. We are talking about *system* suspend, e.g. the goal is to suspend all 
> >> the components
> >> of the system and Xen itself at once. Think about this as fast-boot and/or 
> >> energy saving
> >> feature if you will.
> >> 2. With suspend/resume there is no intention to migrate VMs to any other 
> >> host.
> >> 3. Most probably configuration of the back/front won't change between 
> >> suspend/resume.
> >> But long term we are also thinking for supporting 

Re: [Xen-devel] [PATCH] xen/netfront: Remove unneeded .resume callback

2019-03-21 Thread Anchal Agarwal
On Tue, Mar 19, 2019 at 08:50:05PM -0700, Munehisa Kamata wrote:
> On 3/18/2019 3:02 AM, Oleksandr Andrushchenko wrote:
> > +Amazon
> > pls see inline
> Hi Oleksandr,
> 
> Let me add some comments as the original author of the series.
> 
> > 
> > On 3/14/19 9:00 PM, Julien Grall wrote:
> >> Hi,
> >>
> >> On 3/14/19 3:40 PM, Boris Ostrovsky wrote:
> >>> On 3/14/19 11:10 AM, Oleksandr Andrushchenko wrote:
>  On 3/14/19 5:02 PM, Boris Ostrovsky wrote:
> > On 3/14/19 10:52 AM, Oleksandr Andrushchenko wrote:
> >> On 3/14/19 4:47 PM, Boris Ostrovsky wrote:
> >>> On 3/14/19 9:17 AM, Oleksandr Andrushchenko wrote:
>  From: Oleksandr Andrushchenko 
> 
>  Currently on driver resume we remove all the network queues and
>  destroy shared Tx/Rx rings leaving the driver in its current state
>  and never signaling the backend of this frontend's state change.
>  This leads to the number of consequences:
>  - when frontend withdraws granted references to the rings etc. it
>  cannot
>   be cleanly done as the backend still holds those (it was not
>  told to
>   free the resources)
>  - it is not possible to resume driver operation as all the
>  communication
>   means with the backned were destroyed by the frontend, thus
>   making the frontend appear to the guest OS as functional, 
>  but
>   not really.
> >>> What do you mean? Are you saying that after resume you lose
> >>> connectivity?
> >> Exactly, if you take a look at the .resume callback as it is now
> >> what it does it destroys the rings etc. and never notifies the backend
> >> of that, e.g. it stays in, say, connected state with communication
> >> channels destroyed. It never goes into any other Xen bus state, so
> >> there is
> >> no way its state machine can help recovering.
> >
> > My tree is about a month old so perhaps there is some sort of regression
> > but this certainly works for me. After resume netfront gets
> > XenbusStateInitWait from backend which causes xennet_connect().
>  Ah, the difference can be of the way we get the guest enter
>  the suspend state. I am making my guest to suspend with:
>  echo mem > /sys/power/state
>  And then I use an interrupt to the guest (this is a test code)
>  to wake it up.
>  Could you please share your exact use-case when the guest enters suspend
>  and what you do to resume it?
> >>>
> >>>
> >>> xl save / xl restore
> >>>
>  I can see no way backend may want enter XenbusStateInitWait in my
>  use-case
>  as it simply doesn't know we want him to.
> >>>
> >>>
> >>> Yours looks like ACPI path, I don't know how well it was tested TBH.
> >>
> >> I remember a series from amazon [1] that plays around suspend and 
> >> hibernation. The patch [2] leads me to think that guest triggered 
> >> suspend/resume does not work properly. It looks like the series has never 
> >> been fully reviewed. Not sure why...
> > Julien, thanks a lot for bringing these patches to our attention which we 
> > obviously missed.
> >>
> >> Anyway, from my understanding this series may solve Oleksandr issue. 
> >> However, this would only address the common code side. AFAIK Oleksandr is 
> >> targeting Arm platform. If so, I think this would require more work than 
> >> this series. Arm code still miss few bits properly suspend/resume arch 
> >> specific code (see [2]).
> >>
> >> I have a branch on my git to track the series. However, they never have 
> >> been resent after Ian Campbell left Citrix. I would be happy to review 
> >> them if someone wants to pick them up and repost them.
> >>
> > First of all, let me make it clear that we are interested in hibernation 
> > long term, so it would be
> > desirable to re-use as much work form resume/suspend as we can. But, we see 
> > it as a step by
> > step work, e.g. first S2RAM and later on hibernation.
> > Let me clarify the immediate use-case that we have, so it is easier to 
> > understand what we want
> > and what we don't at the moment. We are about to continue work started by 
> > Mirela/Xilinx on
> > Suspend-to-RAM for ARM [3] and we made number of assumptions:
> > 1. We are talking about *system* suspend, e.g. the goal is to suspend all 
> > the components
> > of the system and Xen itself at once. Think about this as fast-boot and/or 
> > energy saving
> > feature if you will.
> > 2. With suspend/resume there is no intention to migrate VMs to any other 
> > host.
> > 3. Most probably configuration of the back/front won't change between 
> > suspend/resume.
> > But long term we are also thinking for supporting suspend/resume in its 
> > broader meaning,
> > e.g. what is probably what you mean by suspend/resume.
> AFAIK .suspend and .resume callbacks in frontend drivers are
> specifically for xl save/restore case rather 

Re: [Xen-devel] [RFC PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2018-06-13 Thread Anchal Agarwal
Hi Roger,
To answer your question, due to the lack of mentioned commit
(commit 12ea729645ac ("xen/blkback: unmap all persistent grants when
frontend gets disconnected") in the older dom0 kernels(<3.2),resume from
hibernation can fail on guest side. In the absence of the commit,
Persistant Grants are not unmapped immediately when frontend is 
disconnected from backend and hence leave the block device in an 
inconsistent state. To avoid this unstability and work with larger set 
of kernel versions, this approach had been used. Once you don't have 
any pending req/resp it is safer for guest to resume from hibernation.

Thanks,
Anchal

On Wed, Jun 13, 2018 at 10:24:28AM +0200, Roger Pau Monn?? wrote:
> On Tue, Jun 12, 2018 at 08:56:13PM +, Anchal Agarwal wrote:
> > From: Munehisa Kamata 
> > 
> > Add freeze and restore callbacks for PM suspend and hibernation support.
> > The freeze handler stops a block-layer queue and disconnect the frontend
> > from the backend while freeing ring_info and associated resources. The
> > restore handler re-allocates ring_info and re-connect to the backedend,
> > so the rest of the kernel can continue to use the block device
> > transparently.Also, the handlers are used for both PM
> > suspend and hibernation so that we can keep the existing suspend/resume
> > callbacks for Xen suspend without modification.
> > If a backend doesn't have commit 12ea729645ac ("xen/blkback: unmap all
> > persistent grants when frontend gets disconnected"), the frontend may see
> > massive amount of grant table warning when freeing resources.
> > 
> >  [   36.852659] deferring g.e. 0xf9 (pfn 0x)
> >  [   36.855089] xen:grant_table: WARNING: g.e. 0x112 still in use!
> > 
> > In this case, persistent grants would need to be disabled.
> > 
> > Ensure no reqs/rsps in rings before disconnecting. When disconnecting
> > the frontend from the backend in blkfront_freeze(), there still may be
> > unconsumed requests or responses in the rings, especially when the
> > backend is backed by network-based device. If the frontend gets
> > disconnected with such reqs/rsps remaining there, it can cause
> > grant warnings and/or losing reqs/rsps by freeing pages afterward.
> 
> I'm not sure why having pending requests can cause grant warnings or
> lose of requests. If handled properly this shouldn't be an issue.
> Linux blkfront already does live migration (which also involves a
> reconnection of the frontend) with pending requests and that doesn't
> seem to be an issue.
> 
> > This can lead resumed kernel into unrecoverable state like unexpected
> > freeing of grant page and/or hung task due to the lost reqs or rsps.
> > Therefore we have to ensure that there is no unconsumed requests or
> > responses before disconnecting.
> 
> Given that we have multiqueue, plus multipage rings, I'm not sure
> waiting for the requests on the rings to complete is a good idea.
> 
> Why can't you just disconnect the frontend and requeue all the
> requests in flight? When the frontend connects on resume those
> requests will be queued again.
> 
> Thanks, Roger.
> 

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 07/12] xen-netfront: add callbacks for PM suspend and hibernation support

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze and restore callbacks for PM suspend and hibernation support.
The freeze handler simply disconnects the frotnend from the backend and
frees resources associated with queues after disabling the net_device
from the system. The restore handler just changes the frontend state and
let the xenbus handler to re-allocate the resources and re-connect to the
backend. This can be performed transparently to the rest of the system.
The handlers are used for both PM suspend and hibernation so that we can
keep the existing suspend/resume callbacks for Xen suspend without
modification. Freezing netfront devices is normally expected to finish within a 
few
hundred milliseconds, but it can rarely take more than 5 seconds and
hit the hard coded timeout, it would depend on backend state which may
be congested and/or have complex configuration. While it's rare case,
longer default timeout seems a bit more reasonable here to avoid hitting
the timeout. Also, make it configurable via module parameter so that we
can cover broader setups than what we know currently.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Eduardo Valentin 
Reviewed-by: Munehisa Kamata 
---
 drivers/net/xen-netfront.c | 97 +-
 1 file changed, 96 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 4dd0668..4ea9284 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -56,6 +57,12 @@
 #include 
 #include 
 
+enum netif_freeze_state {
+   NETIF_FREEZE_STATE_UNFROZEN,
+   NETIF_FREEZE_STATE_FREEZING,
+   NETIF_FREEZE_STATE_FROZEN,
+};
+
 /* Module parameters */
 #define MAX_QUEUES_DEFAULT 8
 static unsigned int xennet_max_queues;
@@ -63,6 +70,12 @@ module_param_named(max_queues, xennet_max_queues, uint, 
0644);
 MODULE_PARM_DESC(max_queues,
 "Maximum number of queues per virtual interface");
 
+static unsigned int netfront_freeze_timeout_secs = 10;
+module_param_named(freeze_timeout_secs,
+  netfront_freeze_timeout_secs, uint, 0644);
+MODULE_PARM_DESC(freeze_timeout_secs,
+"timeout when freezing netfront device in seconds");
+
 static const struct ethtool_ops xennet_ethtool_ops;
 
 struct netfront_cb {
@@ -160,6 +173,10 @@ struct netfront_info {
struct netfront_stats __percpu *tx_stats;
 
atomic_t rx_gso_checksum_fixup;
+
+   int freeze_state;
+
+   struct completion wait_backend_disconnected;
 };
 
 struct netfront_rx_info {
@@ -723,6 +740,21 @@ static int xennet_close(struct net_device *dev)
return 0;
 }
 
+static int xennet_disable_interrupts(struct net_device *dev)
+{
+   struct netfront_info *np = netdev_priv(dev);
+   unsigned int num_queues = dev->real_num_tx_queues;
+   unsigned int i;
+   struct netfront_queue *queue;
+
+   for (i = 0; i < num_queues; ++i) {
+   queue = >queues[i];
+   disable_irq(queue->tx_irq);
+   disable_irq(queue->rx_irq);
+   }
+   return 0;
+}
+
 static void xennet_move_rx_slot(struct netfront_queue *queue, struct sk_buff 
*skb,
grant_ref_t ref)
 {
@@ -1296,6 +1328,8 @@ static struct net_device *xennet_create_dev(struct 
xenbus_device *dev)
 
np->queues = NULL;
 
+   init_completion(>wait_backend_disconnected);
+
err = -ENOMEM;
np->rx_stats = netdev_alloc_pcpu_stats(struct netfront_stats);
if (np->rx_stats == NULL)
@@ -1782,6 +1816,50 @@ static int xennet_create_queues(struct netfront_info 
*info,
return 0;
 }
 
+static int netfront_freeze(struct xenbus_device *dev)
+{
+   struct netfront_info *info = dev_get_drvdata(>dev);
+   unsigned long timeout = netfront_freeze_timeout_secs * HZ;
+   int err = 0;
+
+   xennet_disable_interrupts(info->netdev);
+
+   netif_device_detach(info->netdev);
+
+   info->freeze_state = NETIF_FREEZE_STATE_FREEZING;
+
+   /* Kick the backend to disconnect */
+   xenbus_switch_state(dev, XenbusStateClosing);
+
+   /* We don't want to move forward before the frontend is diconnected
+* from the backend cleanly.
+*/
+   timeout = wait_for_completion_timeout(>wait_backend_disconnected,
+ timeout);
+   if (!timeout) {
+   err = -EBUSY;
+   xenbus_dev_error(dev, err, "Freezing timed out;"
+"the device may become inconsistent state");
+   return err;
+   }
+
+   /* Tear down queues */
+   xennet_disconnect_backend(info);
+   xennet_destroy_queues(info);
+
+   info->freeze_state = NETIF_FREEZE_STATE_FROZEN;
+
+   return err;
+}
+
+static

[Xen-devel] [RFC PATCH 10/12] xen/events: add xen_shutdown_pirqs helper function

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Add a simple helper function to "shutdown" active PIRQs, which actually
closes event channels but keeps related IRQ structures intact. PM
suspend/hibernation code will rely on this.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 drivers/xen/events/events_base.c | 12 
 include/xen/events.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 762378f..88137c8 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1581,6 +1581,18 @@ void xen_irq_resume(void)
restore_pirqs();
 }
 
+void xen_shutdown_pirqs(void)
+{
+   struct irq_info *info;
+
+   list_for_each_entry(info, _irq_list_head, list) {
+   if (info->type != IRQT_PIRQ || !VALID_EVTCHN(info->evtchn))
+   continue;
+
+   shutdown_pirq(irq_get_irq_data(info->irq));
+   }
+}
+
 static struct irq_chip xen_dynamic_chip __read_mostly = {
.name   = "xen-dyn",
 
diff --git a/include/xen/events.h b/include/xen/events.h
index c3e6bc6..e4d5ccb 100644
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -70,6 +70,7 @@ static inline void notify_remote_via_evtchn(int port)
 void notify_remote_via_irq(int irq);
 
 void xen_irq_resume(void);
+void xen_shutdown_pirqs(void);
 
 /* Clear an irq's pending state, in preparation for polling on it */
 void xen_clear_irq_pending(int irq);
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 11/12] x86/xen: close event channels for PIRQs in system core suspend callback

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Close event channels allocated for devices which are backed by PIRQ and
still active when suspending the system core. Normally, the devices are
emulated legacy devices, e.g. PS/2 keyboard, floppy controller and etc.

Without this, in PM hibernation, information about the event channel
remains in hibernation image, but there is no guarantee that the same
event channel numbers are assigned to the devices when restoring the
system. This may cause conflict like the following and prevent some
devices from being restored correctly.

[  102.330821] [ cut here ]
[  102.333264] WARNING: CPU: 0 PID: 2324 at
drivers/xen/events/events_base.c:878 bind_evtchn_to_irq+0x88/0xf0
...
[  102.348057] Call Trace:
[  102.348057]  [] dump_stack+0x63/0x84
[  102.348057]  [] __warn+0xd1/0xf0
[  102.348057]  [] warn_slowpath_null+0x1d/0x20
[  102.348057]  [] bind_evtchn_to_irq+0x88/0xf0
[  102.348057]  [] ? blkif_copy_from_grant+0xb0/0xb0 
[xen_blkfront]
[  102.348057]  [] bind_evtchn_to_irqhandler+0x27/0x80
[  102.348057]  [] talk_to_blkback+0x425/0xcd0 [xen_blkfront]
[  102.348057]  [] ? __kmalloc+0x1ea/0x200
[  102.348057]  [] blkfront_restore+0x2d/0x60 [xen_blkfront]
[  102.348057]  [] xenbus_dev_restore+0x58/0x100
[  102.348057]  [] ?  xenbus_frontend_delayed_resume+0x20/0x20
[  102.348057]  [] xenbus_dev_cond_restore+0x1e/0x30
[  102.348057]  [] dpm_run_callback+0x4e/0x130
[  102.348057]  [] device_resume+0xe7/0x210
[  102.348057]  [] ? pm_dev_dbg+0x80/0x80
[  102.348057]  [] dpm_resume+0x114/0x2f0
[  102.348057]  [] hibernation_snapshot+0x15f/0x380
[  102.348057]  [] hibernate+0x183/0x290
[  102.348057]  [] state_store+0xcf/0xe0
[  102.348057]  [] kobj_attr_store+0xf/0x20
[  102.348057]  [] sysfs_kf_write+0x3a/0x50
[  102.348057]  [] kernfs_fop_write+0x10b/0x190
[  102.348057]  [] __vfs_write+0x28/0x120
[  102.348057]  [] ? rw_verify_area+0x49/0xb0
[  102.348057]  [] vfs_write+0xb2/0x1b0
[  102.348057]  [] SyS_write+0x46/0xa0
[  102.348057]  [] entry_SYSCALL_64_fastpath+0x1a/0xa9
[  102.423005] ---[ end trace b8d6718e22e2b107 ]---
[  102.425031] genirq: Flags mismatch irq 6.  (blkif) vs.  
(floppy)

Note that we don't explicitly re-allocate event channels for such
devices in the resume callback. Re-allocation will occur when PM core
re-enable IRQs for the devices at later point.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 arch/x86/xen/suspend.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index dae0f74..affa63d 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -105,6 +105,8 @@ static int xen_syscore_suspend(void)
xen_save_steal_clock(cpu);
}
 
+   xen_shutdown_pirqs();
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 02/12] xen/manage: introduce helper function to know the on-going suspend mode

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Introduce simple functions which help to know the on-going suspend mode
so that other Xen-related code can behave differently according to the
current suspend mode.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Alakesh Haloi 
Reviewed-by: Sebastian Biemueller 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 drivers/xen/manage.c  | 15 +++
 include/xen/xen-ops.h |  4 
 2 files changed, 19 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 8f9ea87..326631d 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -50,6 +50,21 @@ enum suspend_modes {
 /* Protected by pm_mutex */
 static enum suspend_modes suspend_mode = NO_SUSPEND;
 
+bool xen_suspend_mode_is_xen_suspend(void)
+{
+   return suspend_mode == XEN_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_suspend(void)
+{
+   return suspend_mode == PM_SUSPEND;
+}
+
+bool xen_suspend_mode_is_pm_hibernation(void)
+{
+   return suspend_mode == PM_HIBERNATION;
+}
+
 struct suspend_info {
int cancelled;
 };
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index fd23e42..be78f6f 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -39,6 +39,10 @@ u64 xen_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
+bool xen_suspend_mode_is_xen_suspend(void);
+bool xen_suspend_mode_is_pm_suspend(void);
+bool xen_suspend_mode_is_pm_hibernation(void);
+
 extern unsigned long *xen_contiguous_bitmap;
 
 #ifdef CONFIG_XEN_PV
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 03/12] xenbus: add freeze/thaw/restore callbacks support

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Since commit b3e96c0c7562 ("xen: use freeze/restore/thaw PM events for
suspend/resume/chkpt"), xenbus uses PMSG_FREEZE, PMSG_THAW and
PMSG_RESTORE events for Xen suspend. However, they're actually assigned
to xenbus_dev_suspend(), xenbus_dev_cancel() and xenbus_dev_resume()
respectively, and only suspend and resume callbacks are supported at
driver level. To support PM suspend and PM hibernation, modify the bus
level PM callbacks to invoke not only device driver's suspend/resume but
also freeze/thaw/restore.

Note that we'll use freeze/restore callbacks even for PM suspend whereas
suspend/resume callbacks are normally used in the case, becausae the
existing xenbus device drivers already have suspend/resume callbacks
specifically designed for Xen suspend. So we can allow the device
drivers to keep the existing callbacks wihtout modification.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 drivers/xen/xenbus/xenbus_probe.c | 102 --
 include/xen/xenbus.h  |   3 ++
 2 files changed, 89 insertions(+), 16 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_probe.c 
b/drivers/xen/xenbus/xenbus_probe.c
index ec9eb4f..95b0a6d 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -588,26 +589,47 @@ int xenbus_dev_suspend(struct device *dev)
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
+   int (*cb)(struct xenbus_device *) = NULL;
+   bool xen_suspend = xen_suspend_mode_is_xen_suspend();
 
DPRINTK("%s", xdev->nodename);
 
if (dev->driver == NULL)
return 0;
drv = to_xenbus_driver(dev->driver);
-   if (drv->suspend)
-   err = drv->suspend(xdev);
-   if (err)
-   pr_warn("suspend %s failed: %i\n", dev_name(dev), err);
+
+   if (xen_suspend)
+   cb = drv->suspend;
+   else
+   cb = drv->freeze;
+
+   if (cb)
+   err = cb(xdev);
+
+   if (err) {
+   pr_warn("%s %s failed: %i\n", xen_suspend ?
+   "suspend" : "freeze", dev_name(dev), err);
+   return err;
+   }
+
+   if (!xen_suspend) {
+   /* Forget otherend since this can become stale after restore */
+   free_otherend_watch(xdev);
+   free_otherend_details(xdev);
+   }
+
return 0;
 }
 EXPORT_SYMBOL_GPL(xenbus_dev_suspend);
 
 int xenbus_dev_resume(struct device *dev)
 {
-   int err;
+   int err = 0;
struct xenbus_driver *drv;
struct xenbus_device *xdev
= container_of(dev, struct xenbus_device, dev);
+   int (*cb)(struct xenbus_device *) = NULL;
+   bool xen_suspend = xen_suspend_mode_is_xen_suspend();
 
DPRINTK("%s", xdev->nodename);
 
@@ -616,24 +638,34 @@ int xenbus_dev_resume(struct device *dev)
drv = to_xenbus_driver(dev->driver);
err = talk_to_otherend(xdev);
if (err) {
-   pr_warn("resume (talk_to_otherend) %s failed: %i\n",
+   pr_warn("%s (talk_to_otherend) %s failed: %i\n",
+   xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
 
-   xdev->state = XenbusStateInitialising;
+   if (xen_suspend)
+   xdev->state = XenbusStateInitialising;
 
-   if (drv->resume) {
-   err = drv->resume(xdev);
-   if (err) {
-   pr_warn("resume %s failed: %i\n", dev_name(dev), err);
-   return err;
-   }
+   if (xen_suspend)
+   cb = drv->resume;
+   else
+   cb = drv->restore;
+
+   if (cb)
+   err = cb(xdev);
+
+   if (err) {
+   pr_warn("%s %s failed: %i\n",
+   xen_suspend ? "resume" : "restore",
+   dev_name(dev), err);
+   return err;
}
 
err = watch_otherend(xdev);
if (err) {
-   pr_warn("resume (watch_otherend) %s failed: %d.\n",
+   pr_warn("%s (watch_otherend) %s failed: %d.\n",
+   xen_suspend ? "resume" : "restore",
dev_name(dev), err);
return err;
}
@@ -644,8 +676,46 @@ EXPORT_SYMBOL_GPL(xenbus_dev_resume);
 
 int xenbus_dev_cancel(struct device *dev)
 {
-   /* Do nothing */
-   DPRINTK("cancel");
+   int err = 0;
+

[Xen-devel] [RFC PATCH 05/12] x86/xen: add system core suspend and resume callbacks

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Add Xen PVHVM specific system core callbacks for PM suspend and
hibernation support. The callbacks suspend and resume Xen primitives,
like shared_info, pvclock and grant table. Note that Xen suspend can
handle them in a different manner, but system core callbacks are called
from the context. So if the callbacks are called from Xen suspend
context, return immediately.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 arch/x86/xen/enlighten_hvm.c |  1 +
 arch/x86/xen/suspend.c   | 53 
 include/xen/xen-ops.h|  2 ++
 3 files changed, 56 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index d24ad16..4196a65 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -202,6 +202,7 @@ static void __init xen_hvm_guest_init(void)
if (xen_feature(XENFEAT_hvm_callback_vector))
xen_have_vector_callback = 1;
 
+   xen_setup_syscore_ops();
xen_hvm_smp_init();
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm));
xen_unplug_emulated_devices();
diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 1d83152..784c448 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -2,17 +2,22 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include "xen-ops.h"
 #include "mmu.h"
@@ -82,3 +87,51 @@ void xen_arch_suspend(void)
 
on_each_cpu(xen_vcpu_notify_suspend, NULL, 1);
 }
+
+static int xen_syscore_suspend(void)
+{
+   struct xen_remove_from_physmap xrfp;
+   int ret;
+
+   /* Xen suspend does similar stuffs in its own logic */
+   if (xen_suspend_mode_is_xen_suspend())
+   return 0;
+
+   xrfp.domid = DOMID_SELF;
+   xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
+
+   ret = HYPERVISOR_memory_op(XENMEM_remove_from_physmap, );
+   if (!ret)
+   HYPERVISOR_shared_info = _dummy_shared_info;
+
+   return ret;
+}
+
+static void xen_syscore_resume(void)
+{
+   /* Xen suspend does similar stuffs in its own logic */
+   if (xen_suspend_mode_is_xen_suspend())
+   return;
+
+   /* No need to setup vcpu_info as it's already moved off */
+   xen_hvm_map_shared_info();
+
+   pvclock_resume();
+
+   gnttab_resume();
+}
+
+/*
+ * These callbacks will be called with interrupts disabled and when having only
+ * one CPU online.
+ */
+static struct syscore_ops xen_hvm_syscore_ops = {
+   .suspend = xen_syscore_suspend,
+   .resume = xen_syscore_resume
+};
+
+void __init xen_setup_syscore_ops(void)
+{
+   if (xen_hvm_domain())
+   register_syscore_ops(_hvm_syscore_ops);
+}
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index be78f6f..65f25bd 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -43,6 +43,8 @@ bool xen_suspend_mode_is_xen_suspend(void);
 bool xen_suspend_mode_is_pm_suspend(void);
 bool xen_suspend_mode_is_pm_hibernation(void);
 
+void xen_setup_syscore_ops(void);
+
 extern unsigned long *xen_contiguous_bitmap;
 
 #ifdef CONFIG_XEN_PV
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 01/12] xen/manage: keep track of the on-going suspend mode

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

To differentiate between Xen suspend, PM suspend and PM hibernation,
keep track of the on-going suspend mode by mainly using a new PM
notifier. Since Xen suspend doesn't have corresponding PM event, its
main logic is modfied to acquire pm_mutex and set the current mode.

Note that we may see deadlock if PM suspend/hibernation is interrupted
by Xen suspend. PM suspend/hibernation depends on xenwatch thread to
process xenbus state transactions, but the thread will sleep to wait
pm_mutex which is already held by PM suspend/hibernation context in the
scenario. Though, acquirng pm_mutex is still right thing to do, and we
would need to modify Xen shutdown code to avoid the issue. This will be
fixed by a separate patch.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Sebastian Biemueller 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 drivers/xen/manage.c | 58 
 1 file changed, 58 insertions(+)

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 8835065..8f9ea87 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -39,6 +40,16 @@ enum shutdown_state {
 /* Ignore multiple shutdown requests. */
 static enum shutdown_state shutting_down = SHUTDOWN_INVALID;
 
+enum suspend_modes {
+   NO_SUSPEND = 0,
+   XEN_SUSPEND,
+   PM_SUSPEND,
+   PM_HIBERNATION,
+};
+
+/* Protected by pm_mutex */
+static enum suspend_modes suspend_mode = NO_SUSPEND;
+
 struct suspend_info {
int cancelled;
 };
@@ -98,6 +109,10 @@ static void do_suspend(void)
int err;
struct suspend_info si;
 
+   lock_system_sleep();
+
+   suspend_mode = XEN_SUSPEND;
+
shutting_down = SHUTDOWN_SUSPEND;
 
err = freeze_processes();
@@ -161,6 +176,10 @@ static void do_suspend(void)
thaw_processes();
 out:
shutting_down = SHUTDOWN_INVALID;
+
+   suspend_mode = NO_SUSPEND;
+
+   unlock_system_sleep();
 }
 #endif /* CONFIG_HIBERNATE_CALLBACKS */
 
@@ -372,3 +391,42 @@ int xen_setup_shutdown_event(void)
 EXPORT_SYMBOL_GPL(xen_setup_shutdown_event);
 
 subsys_initcall(xen_setup_shutdown_event);
+
+static int xen_pm_notifier(struct notifier_block *notifier,
+  unsigned long pm_event, void *unused)
+{
+   switch (pm_event) {
+   case PM_SUSPEND_PREPARE:
+   suspend_mode = PM_SUSPEND;
+   break;
+   case PM_HIBERNATION_PREPARE:
+   case PM_RESTORE_PREPARE:
+   suspend_mode = PM_HIBERNATION;
+   break;
+   case PM_POST_SUSPEND:
+   case PM_POST_RESTORE:
+   case PM_POST_HIBERNATION:
+   /* Set back to the default */
+   suspend_mode = NO_SUSPEND;
+   break;
+   default:
+   pr_warn("Receive unknown PM event 0x%lx\n", pm_event);
+   return -EINVAL;
+   }
+
+   return 0;
+};
+
+static struct notifier_block xen_pm_notifier_block = {
+   .notifier_call = xen_pm_notifier
+};
+
+static int xen_setup_pm_notifier(void)
+{
+   if (!xen_hvm_domain())
+   return -ENODEV;
+
+   return register_pm_notifier(_pm_notifier_block);
+}
+
+subsys_initcall(xen_setup_pm_notifier);
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 09/12] x86/xen: save and restore steal clock

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Save steal clock values of all present CPUs in the system core ops
suspend callbacks. Also, restore a boot CPU's steal clock in the system
core resume callback. For non-boot CPUs, restore after they're brought
up, because runstate info for non-boot CPUs are not active until then.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 arch/x86/xen/suspend.c | 13 -
 arch/x86/xen/time.c|  3 +++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/suspend.c b/arch/x86/xen/suspend.c
index 784c448..dae0f74 100644
--- a/arch/x86/xen/suspend.c
+++ b/arch/x86/xen/suspend.c
@@ -91,12 +91,20 @@ void xen_arch_suspend(void)
 static int xen_syscore_suspend(void)
 {
struct xen_remove_from_physmap xrfp;
-   int ret;
+   int cpu, ret;
 
/* Xen suspend does similar stuffs in its own logic */
if (xen_suspend_mode_is_xen_suspend())
return 0;
 
+   for_each_present_cpu(cpu) {
+   /*
+* Nonboot CPUs are already offline, but the last copy of
+* runstate info is still accessible.
+*/
+   xen_save_steal_clock(cpu);
+   }
+
xrfp.domid = DOMID_SELF;
xrfp.gpfn = __pa(HYPERVISOR_shared_info) >> PAGE_SHIFT;
 
@@ -118,6 +126,9 @@ static void xen_syscore_resume(void)
 
pvclock_resume();
 
+   /* Nonboot CPUs will be resumed when they're brought up */
+   xen_restore_steal_clock(smp_processor_id());
+
gnttab_resume();
 }
 
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index e0f1bcf..85f8534 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -523,6 +523,9 @@ static void xen_hvm_setup_cpu_clockevents(void)
 {
int cpu = smp_processor_id();
xen_setup_runstate_info(cpu);
+   if (cpu)
+   xen_restore_steal_clock(cpu);
+
/*
 * xen_setup_timer(cpu) - snprintf is bad in atomic context. Hence
 * doing it xen_hvm_cpu_notify (which gets called by smp_init during
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Add freeze and restore callbacks for PM suspend and hibernation support.
The freeze handler stops a block-layer queue and disconnect the frontend
from the backend while freeing ring_info and associated resources. The
restore handler re-allocates ring_info and re-connect to the backedend,
so the rest of the kernel can continue to use the block device
transparently.Also, the handlers are used for both PM
suspend and hibernation so that we can keep the existing suspend/resume
callbacks for Xen suspend without modification.
If a backend doesn't have commit 12ea729645ac ("xen/blkback: unmap all
persistent grants when frontend gets disconnected"), the frontend may see
massive amount of grant table warning when freeing resources.

 [   36.852659] deferring g.e. 0xf9 (pfn 0x)
 [   36.855089] xen:grant_table: WARNING: g.e. 0x112 still in use!

In this case, persistent grants would need to be disabled.

Ensure no reqs/rsps in rings before disconnecting. When disconnecting
the frontend from the backend in blkfront_freeze(), there still may be
unconsumed requests or responses in the rings, especially when the
backend is backed by network-based device. If the frontend gets
disconnected with such reqs/rsps remaining there, it can cause
grant warnings and/or losing reqs/rsps by freeing pages afterward.
This can lead resumed kernel into unrecoverable state like unexpected
freeing of grant page and/or hung task due to the lost reqs or rsps.
Therefore we have to ensure that there is no unconsumed requests or
responses before disconnecting.

Actually, the frontend just needs to wait for some amount of time so that
the backend can process the requests, put responses and notify the
frontend back. Timeout used here is based on some heuristic. If we somehow
hit the timeout, it would mean something serious happens in the backend,
the frontend will just return an error to PM core and PM
suspend/hibernation will be aborted. This may be something should be
fixed by the backend side, but a frontend side fix is probably
still worth doing to work with broader backends.

Signed-off-by: Anchal Agarwal 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 drivers/block/xen-blkfront.c | 158 +--
 1 file changed, 151 insertions(+), 7 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ae00a82f350b..a223864c2220 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -46,6 +46,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -78,6 +80,8 @@ enum blkif_state {
BLKIF_STATE_DISCONNECTED,
BLKIF_STATE_CONNECTED,
BLKIF_STATE_SUSPENDED,
+   BLKIF_STATE_FREEZING,
+   BLKIF_STATE_FROZEN
 };
 
 struct grant {
@@ -216,6 +220,7 @@ struct blkfront_info
/* Save uncomplete reqs and bios for migration. */
struct list_head requests;
struct bio_list bio_list;
+   struct completion wait_backend_disconnected;
 };
 
 static unsigned int nr_minors;
@@ -262,6 +267,16 @@ static DEFINE_SPINLOCK(minor_lock);
 static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
 static void blkfront_gather_backend_features(struct blkfront_info *info);
 static int negotiate_mq(struct blkfront_info *info);
+static void __blkif_free(struct blkfront_info *info);
+
+static inline bool blkfront_ring_is_busy(struct blkif_front_ring *ring)
+{
+   if (RING_SIZE(ring) > RING_FREE_REQUESTS(ring) ||
+   RING_HAS_UNCONSUMED_RESPONSES(ring))
+   return true;
+   else
+   return false;
+}
 
 static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
 {
@@ -996,6 +1011,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
blkif_set_queue_limits(info);
+   init_completion(>wait_backend_disconnected);
 
return 0;
 }
@@ -1219,6 +1235,8 @@ static void xlvbd_release_gendisk(struct blkfront_info 
*info)
 /* Already hold rinfo->ring_lock. */
 static inline void kick_pending_request_queues_locked(struct 
blkfront_ring_info *rinfo)
 {
+   if (unlikely(rinfo->dev_info->connected == BLKIF_STATE_FREEZING))
+   return;
if (!RING_FULL(>ring))
blk_mq_start_stopped_hw_queues(rinfo->dev_info->rq, true);
 }
@@ -1342,8 +1360,6 @@ static void blkif_free_ring(struct blkfront_ring_info 
*rinfo)
 
 static void blkif_free(struct blkfront_info *info, int suspend)
 {
-   unsigned int i;
-
/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
@@ -1351,6 +1367,13 @@ static void blkif_free(struct blkfront_info *info, int 
suspend)
if (info->rq)

[Xen-devel] [RFC PATCH 08/12] xen-time-introduce-xen_-save-restore-_steal_clock

2018-06-12 Thread Anchal Agarwal
From: Munehisa Kamata 

Currently, steal time accounting code in scheduler expects steal clock
callback to provide monotonically increasing value. If the accounting
code receives a smaller value than previous one, it uses a negative
value to calculate steal time and results in incorrectly updated idle
and steal time accounting. This breaks userspace tools which read
/proc/stat.

top - 08:05:35 up  2:12,  3 users,  load average: 0.00, 0.07, 0.23
Tasks:  80 total,   1 running,  79 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,30100.0%id,  0.0%wa,  0.0%hi, 
0.0%si,-1253874204672.0%st

This can actually happen when a Xen PVHVM guest gets restored from
hibernation, because such a restored guest is just a fresh domain from
Xen perspective and the time information in runstate info starts over
from scratch.

This patch introduces xen_save_steal_clock() which saves current values
in runstate info into per-cpu variables. Its couterpart,
xen_restore_steal_clock(), sets offset if it found the current values in
runstate info are smaller than previous ones. xen_steal_clock() is also
modified to use the offset to ensure that scheduler only sees
monotonically increasing number.

Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 drivers/xen/time.c| 28 +++-
 include/xen/xen-ops.h |  2 ++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/time.c b/drivers/xen/time.c
index 3e741cd..4756042 100644
--- a/drivers/xen/time.c
+++ b/drivers/xen/time.c
@@ -20,6 +20,8 @@
 
 /* runstate info updated by Xen */
 static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
+static DEFINE_PER_CPU(u64, xen_prev_steal_clock);
+static DEFINE_PER_CPU(u64, xen_steal_clock_offset);
 
 static DEFINE_PER_CPU(u64[4], old_runstate_time);
 
@@ -149,7 +151,7 @@ bool xen_vcpu_stolen(int vcpu)
return per_cpu(xen_runstate, vcpu).state == RUNSTATE_runnable;
 }
 
-u64 xen_steal_clock(int cpu)
+static u64 __xen_steal_clock(int cpu)
 {
struct vcpu_runstate_info state;
 
@@ -157,6 +159,30 @@ u64 xen_steal_clock(int cpu)
return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
 }
 
+u64 xen_steal_clock(int cpu)
+{
+   return __xen_steal_clock(cpu) + per_cpu(xen_steal_clock_offset, cpu);
+}
+
+void xen_save_steal_clock(int cpu)
+{
+   per_cpu(xen_prev_steal_clock, cpu) = xen_steal_clock(cpu);
+}
+
+void xen_restore_steal_clock(int cpu)
+{
+   u64 steal_clock = __xen_steal_clock(cpu);
+
+   if (per_cpu(xen_prev_steal_clock, cpu) > steal_clock) {
+   /* Need to update the offset */
+   per_cpu(xen_steal_clock_offset, cpu) =
+   per_cpu(xen_prev_steal_clock, cpu) - steal_clock;
+   } else {
+   /* Avoid unnecessary steal clock warp */
+   per_cpu(xen_steal_clock_offset, cpu) = 0;
+   }
+}
+
 void xen_setup_runstate_info(int cpu)
 {
struct vcpu_register_runstate_memory_area area;
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 65f25bd..10330f8 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -36,6 +36,8 @@ void xen_time_setup_guest(void);
 void xen_manage_runstate_time(int action);
 void xen_get_runstate_snapshot(struct vcpu_runstate_info *res);
 u64 xen_steal_clock(int cpu);
+void xen_save_steal_clock(int cpu);
+void xen_restore_steal_clock(int cpu);
 
 int xen_setup_shutdown_event(void);
 
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 04/12] x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume

2018-06-12 Thread Anchal Agarwal
Introduce a small function which re-uses shared page's PA allocated
during guest initialization time in reserve_shared_info() and not
allocate new page during resume flow.
It also  does the mapping of shared_info_page by calling
xen_hvm_init_shared_info() to use the function.

Signed-off-by: Anchal Agarwal 
Reviewed-by: Sebastian Biemueller 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
CR: https://cr.amazon.com/r/8273203/
---
 arch/x86/xen/enlighten_hvm.c | 7 +++
 arch/x86/xen/xen-ops.h   | 1 +
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/xen/enlighten_hvm.c b/arch/x86/xen/enlighten_hvm.c
index 754d5391d9fa..2c4bcf92a90a 100644
--- a/arch/x86/xen/enlighten_hvm.c
+++ b/arch/x86/xen/enlighten_hvm.c
@@ -24,6 +24,13 @@
 
 static unsigned long shared_info_pfn;
 
+void xen_hvm_map_shared_info(void)
+{
+   xen_hvm_init_shared_info();
+   if (shared_info_pfn)
+   HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));
+}
+
 void xen_hvm_init_shared_info(void)
 {
struct xen_add_to_physmap xatp;
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index f377e1820c6c..94c8a009ab35 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -58,6 +58,7 @@ void xen_enable_syscall(void);
 void xen_vcpu_restore(void);
 
 void xen_callback_vector(void);
+void xen_hvm_map_shared_info(void);
 void xen_hvm_init_shared_info(void);
 void xen_unplug_emulated_devices(void);
 
-- 
2.14.3


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 12/12] PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

2018-06-12 Thread Anchal Agarwal
From: Aleksei Besogonov 

The SNAPSHOT_SET_SWAP_AREA is supposed to be used to set the hibernation
offset on a running kernel to enable hibernating to a swap file.
However, it doesn't actually update the swsusp_resume_block variable. As
a result, the hibernation fails at the last step (after all the data is
written out) in the validation of the swap signature in
mark_swapfiles().

Before this patch, the command line processing was the only place where
swsusp_resume_block was set.

Signed-off-by: Aleksei Besogonov 
Signed-off-by: Munehisa Kamata 
Signed-off-by: Anchal Agarwal 
Reviewed-by: Munehisa Kamata 
Reviewed-by: Eduardo Valentin 
---
 kernel/power/user.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/power/user.c b/kernel/power/user.c
index abd2255..b522a42 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -379,8 +379,12 @@ static long snapshot_ioctl(struct file *filp, unsigned int 
cmd,
if (swdev) {
offset = swap_area.offset;
data->swap = swap_type_of(swdev, offset, NULL);
-   if (data->swap < 0)
+   if (data->swap < 0) {
error = -ENODEV;
+   } else {
+   swsusp_resume_device = swdev;
+   swsusp_resume_block = offset;
+   }
} else {
data->swap = -1;
error = -EINVAL;
-- 
2.7.4


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [RFC PATCH 00/12] Enable PM hibernation on guest VMs

2018-06-12 Thread Anchal Agarwal
Hello,
I am sending out a series of patches that implements guest
PM hibernation. These guests are running on xen hypervisor.
The patches had been tested against mainstream kernel and latest
xen version-4.11. EC2 instance hibernation feature is provided to
the AWS EC2 customers. PM hibernation uses swap space where
hibernation image is stored and restored from. I would like
the community to review and provide some feedback on the patch 
series and if they look good, merge them into 4.17 kernel.

Aleksei Besogonov (1):
  PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA

Anchal Agarwal (1):
  x86/xen: Introduce new function to map HYPERVISOR_shared_info on
Resume

Munehisa Kamata (10):
  xen/manage: keep track of the on-going suspend mode
  xen/manage: introduce helper function to know the on-going suspend
mode
  xenbus: add freeze/thaw/restore callbacks support
  x86/xen: add system core suspend and resume callbacks
  xen-blkfront: add callbacks for PM suspend and hibernation
  xen-netfront: add callbacks for PM suspend and hibernation support
  xen-time-introduce-xen_-save-restore-_steal_clock
  x86/xen: save and restore steal clock
  xen/events: add xen_shutdown_pirqs helper function
  x86/xen: close event channels for PIRQs in system core suspend
callback

 arch/x86/xen/enlighten_hvm.c  |   8 ++
 arch/x86/xen/suspend.c|  66 
 arch/x86/xen/time.c   |   3 +
 arch/x86/xen/xen-ops.h|   1 +
 drivers/block/xen-blkfront.c  | 158 --
 drivers/net/xen-netfront.c|  97 ++-
 drivers/xen/events/events_base.c  |  12 +++
 drivers/xen/manage.c  |  73 ++
 drivers/xen/time.c|  28 ++-
 drivers/xen/xenbus/xenbus_probe.c | 102 
 include/xen/events.h  |   1 +
 include/xen/xen-ops.h |   8 ++
 include/xen/xenbus.h  |   3 +
 kernel/power/user.c   |   6 +-
 14 files changed, 540 insertions(+), 26 deletions(-)

-- 
2.13.6


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] Revert xen: dont fiddle with event channel masking in suspend/resume

2018-05-15 Thread Anchal Agarwal

O Fri, Apr 20, 2018 at 07:43:31AM +0200, Juergen Gross wrote:
> On 20/04/18 01:04, Anchal Agarwal wrote:
> > 
> > Hello,
> > 
> > This patch reverts commit e91b2b1194335ca83d8a40fa4e0efd480bf2babe.
> > evtchn are supposed to be masked during resume by irq subsytem 
> > however, they are not. This causes special interrupts like PV 
> > spinlock to cause kernel BUG() as it expects the IRQ to be 
> > masked. This causes instances that are live migrated successfully 
> > to crash after few minutes.
> > 
> > Live Migration uses suspend resume and when xen_irq_resume is invoked, 
> > I saw event channels are not masked. Hence, I reverted this
> > commit to make LM work. Feelings? Recommendations? Things I missed?
> 
> The commit you are reverting was meant to repair suspend/resume handling
> for Xen. Instead of just reverting it the correct thing to do would be
> to find the reason why some event channels are not being masked and
> address that issue.
> 
> See https://lists.xen.org/archives/html/xen-devel/2017-07/msg00898.html
> 
> 
> Juergen
>

Hi Juergen,
The discussion you pointed out suggests to set a flag 
IRQCHIP_MASK_ON_SUSPEND(by tglx@) 
on device irq_chip for non-wakeup interrupts or even a generic flag as 
suggested by you 
on device irq_chip however, in this case I am experiencing issues with spinlock 
ipi handling 
and according to the xen code IPIs are not supposed to be masked during suspend 
resume. 
Hence, even if I set this flag on xen's per_cpu irq_chip, the flag 
IRQF_NO_SUSPEND is being 
set on binding ipi to irq handler in spinlock init code (xen_init_lock_cpu 
->bind_ipi_to_irqhandler)
and dummy handler assigned is throwing out BUG() if it's called at all. Once 
this flag is
set suspend_device_irq will not disable the irq and hence event channel is not 
masked.
Now on xen_irq_resume, during restore_cpu_ipis-> xen_irq_info_ipi_setup, in 
case of 
2 level ABI event channel handling (The issue is on xen 4.2), the port setup is 
a no-op 
however, while using fifo event channel it starts the setup with all event 
channels masked.
Hence if I revert the patch and mask everything in the beginning of 
xen_irq_resume, I don't see the issue.
To avoid this issue I can think of two things:
1. Revert the patch as mentioned before
2. Not to call BUG() in dummy_handler in spinlock code. Rather use 
xen_reschedule_interrupt and not 
the dummy_handler as was changed in commit d5de8841355a4. Moreover, it's not 
very much clear 
from the commit message why it was changed in the first place.

Any thoughts/suggestions?

Thanks,
Anchal
> > 
> > One such stack:
> >  [ cut here ]
> >  kernel BUG at arch/x86/xen/spinlock.c:75!
> >  CPU: 0 PID: 675 Comm: kauditd Not tainted 4.14.20-48.30.amzn2.x86_64 #1
> >  Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
> >  task: 880205eedac0 task.stack: c9e4c000
> >  RIP: 0010:dummy_handler+0x0/0x10
> >  RSP: 0018:880207203eb8 EFLAGS: 00010046
> >  RAX: 81027f10 RBX: 880206ce1b00 RCX: 0035
> >  RDX: 81a81560 RSI:  RDI: 0035
> >  RBP: 0035 R08: 880206800248 R09: 880206d03600
> >  R10:  R11: 0040 R12: 
> >  R13: 880207203f04 R14:  R15: 
> >  FS:  () GS:88020720() 
> >  knlGS:
> >  CS:  0010 DS:  ES:  CR0: 80050033
> >  CR2: 561cc8dbd2d0 CR3: 01e0a001 CR4: 001606f0
> >  Call Trace:
> >   
> >   __handle_irq_event_percpu+0x40/0x190
> >   handle_irq_event_percpu+0x30/0x70
> >   handle_percpu_irq+0x37/0x50
> >   generic_handle_irq+0x24/0x30
> >   evtchn_2l_handle_events+0x162/0x280
> >   __xen_evtchn_do_upcall+0x42/0x80
> >   xen_evtchn_do_upcall+0x27/0x40
> >   xen_hvm_callback_vector+0x98/0xa0
> >   
> >  RIP: 0010:finish_task_switch+0x7b/0x200
> >  RSP: 0018:c9e4fe08 EFLAGS: 0246 ORIG_RAX: ff0c
> >  RAX: 0001 RBX: 880205eedac0 RCX: 
> >  RDX:  RSI:  RDI: 8802072211c0
> >  RBP: c9e4fe30 R08: 00387606 R09: 
> >  R10:  R11: 0040 R12: 8802072211c0
> >  R13: 81e12480 R14: 880202e54c00 R15: 
> >   ? finish_task_switch+0x74/0x200
> >   __schedule+0x29c/0x8a0
> >   ? __wake_up_common_lock+0x89/0xc0
> >   ? kauditd_send_multicast_skb+0x90/0x90
> >   schedule+0x28/0x80
> >   kauditd_thread+0x177/0x220
> >   ? finish_wait+0x80/0x80
> >   ? audi

[Xen-devel] [PATCH] Revert xen: dont fiddle with event channel masking in suspend/resume

2018-04-19 Thread Anchal Agarwal

Hello,

This patch reverts commit e91b2b1194335ca83d8a40fa4e0efd480bf2babe.
evtchn are supposed to be masked during resume by irq subsytem 
however, they are not. This causes special interrupts like PV 
spinlock to cause kernel BUG() as it expects the IRQ to be 
masked. This causes instances that are live migrated successfully 
to crash after few minutes.

Live Migration uses suspend resume and when xen_irq_resume is invoked, 
I saw event channels are not masked. Hence, I reverted this
commit to make LM work. Feelings? Recommendations? Things I missed?

One such stack:
 [ cut here ]
 kernel BUG at arch/x86/xen/spinlock.c:75!
 CPU: 0 PID: 675 Comm: kauditd Not tainted 4.14.20-48.30.amzn2.x86_64 #1
 Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
 task: 880205eedac0 task.stack: c9e4c000
 RIP: 0010:dummy_handler+0x0/0x10
 RSP: 0018:880207203eb8 EFLAGS: 00010046
 RAX: 81027f10 RBX: 880206ce1b00 RCX: 0035
 RDX: 81a81560 RSI:  RDI: 0035
 RBP: 0035 R08: 880206800248 R09: 880206d03600
 R10:  R11: 0040 R12: 
 R13: 880207203f04 R14:  R15: 
 FS:  () GS:88020720() 
 knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 561cc8dbd2d0 CR3: 01e0a001 CR4: 001606f0
 Call Trace:
  
  __handle_irq_event_percpu+0x40/0x190
  handle_irq_event_percpu+0x30/0x70
  handle_percpu_irq+0x37/0x50
  generic_handle_irq+0x24/0x30
  evtchn_2l_handle_events+0x162/0x280
  __xen_evtchn_do_upcall+0x42/0x80
  xen_evtchn_do_upcall+0x27/0x40
  xen_hvm_callback_vector+0x98/0xa0
  
 RIP: 0010:finish_task_switch+0x7b/0x200
 RSP: 0018:c9e4fe08 EFLAGS: 0246 ORIG_RAX: ff0c
 RAX: 0001 RBX: 880205eedac0 RCX: 
 RDX:  RSI:  RDI: 8802072211c0
 RBP: c9e4fe30 R08: 00387606 R09: 
 R10:  R11: 0040 R12: 8802072211c0
 R13: 81e12480 R14: 880202e54c00 R15: 
  ? finish_task_switch+0x74/0x200
  __schedule+0x29c/0x8a0
  ? __wake_up_common_lock+0x89/0xc0
  ? kauditd_send_multicast_skb+0x90/0x90
  schedule+0x28/0x80
  kauditd_thread+0x177/0x220
  ? finish_wait+0x80/0x80
  ? auditd_reset+0x90/0x90
  kthread+0x11a/0x130
  ? kthread_create_on_node+0x70/0x70
  ? call_usermodehelper_exec_async+0x12a/0x160
  ret_from_fork+0x35/0x40
  RIP: dummy_handler+0x0/0x10 RSP: 880207203eb8

Signed-off-by: Anchal Agarwal <ancha...@amazon.com>
Signed-off-by: Eduardo Valentin <edu...@amazon.com>
Reviewed-by: Frank van der Linden <fllin...@amazon.com>
Reviewed-by: Alakesh Haloi <alake...@amazon.com>
Reviewed-by: Vallish Vaidyeshwara <vall...@amazon.com>

---
 drivers/xen/events/events_base.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index bc03f1a6ad1b..ae71cab207f7 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -343,6 +343,14 @@ static void bind_evtchn_to_cpu(unsigned int chn, unsigned 
int cpu)
info->cpu = cpu;
 }
 
+static void xen_evtchn_mask_all(void)
+{
+   unsigned int evtchn;
+
+   for (evtchn = 0; evtchn < xen_evtchn_nr_channels(); evtchn++)
+   mask_evtchn(evtchn);
+}
+
 /**
  * notify_remote_via_irq - send event to remote end of event channel via irq
  * @irq: irq of event channel to send event to
@@ -1565,6 +1573,7 @@ void xen_irq_resume(void)
struct irq_info *info;
 
/* New event-channel space is not 'live' yet. */
+   xen_evtchn_mask_all();
xen_evtchn_resume();
 
/* No IRQ <-> event-channel mappings. */
@@ -1682,7 +1691,6 @@ module_param(fifo_events, bool, 0);
 void __init xen_init_IRQ(void)
 {
int ret = -EINVAL;
-   unsigned int evtchn;
 
if (fifo_events)
ret = xen_evtchn_fifo_init();
@@ -1694,8 +1702,7 @@ void __init xen_init_IRQ(void)
BUG_ON(!evtchn_to_irq);
 
/* No event channels are 'live' right now. */
-   for (evtchn = 0; evtchn < xen_evtchn_nr_channels(); evtchn++)
-   mask_evtchn(evtchn);
+   xen_evtchn_mask_all();
 
pirq_needs_eoi = pirq_needs_eoi_flag;
 
-- 
2.14.3


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel