RE: [PATCH 1/1 v9] use crc32 instead of md5 for hibernation e820 integrity check

2021-04-19 Thread Dexuan Cui
> From: Chris von Recklinghausen 
> Sent: Friday, April 16, 2021 6:17 AM
>  ...
> Hibernation fails on a system in fips mode because md5 is used for the e820
> integrity check and is not available. Use crc32 instead.
> 
> The check is intended to detect whether the E820 memory map provided
> by the firmware after cold boot unexpectedly differs from the one that
> was in use when the hibernation image was created. In this case, the
> hibernation image cannot be restored, as it may cover memory regions
> that are no longer available to the OS.
> 
> A non-cryptographic checksum such as CRC-32 is sufficient to detect such
> inadvertent deviations.
> 
> Fixes: 62a03defeabd ("PM / hibernate: Verify the consistent of e820 memory map by md5 digest")
> 
> Signed-off-by: Chris von Recklinghausen 
> ---

Tested-by: Dexuan Cui 
Reviewed-by: Dexuan Cui 

Thanks Chris and all for the patch!
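For anyone backporting this: the change essentially swaps the crypto-API md5
digest for a plain CRC-32 of the e820 table. A minimal sketch of the idea,
assuming the kernel's crc32_le() helper (the merged patch may differ in detail):

#include <linux/crc32.h>

/* Sketch: checksum the firmware e820 table with crc32_le(), which is
 * always built in and has no FIPS restrictions, unlike the md5 shash.
 */
static u32 compute_e820_crc32(struct e820_table *table)
{
	int size = offsetof(struct e820_table, entries) +
		   sizeof(struct e820_entry) * table->nr_entries;

	return ~crc32_le(~0, (unsigned char const *)table, size);
}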


[PATCH v8 net-next 1/2] hv_netvsc: Make netvsc/VF binding check both MAC and serial number

2021-04-16 Thread Dexuan Cui
Currently the netvsc/VF binding logic only checks the PCI serial number.

The upcoming Microsoft Azure Network Adapter (MANA) supports multiple
net_device interfaces (each such interface is called a "vPort", and has
its unique MAC address) which are backed by the same VF PCI device, so
the binding logic should check both the MAC address and the PCI serial
number.

The change should not break any other existing VF drivers, because
Hyper-V NIC SR-IOV implementation requires the netvsc network
interface and the VF network interface have the same MAC address.

Co-developed-by: Haiyang Zhang 
Signed-off-by: Haiyang Zhang 
Co-developed-by: Shachar Raindel 
Signed-off-by: Shachar Raindel 
Signed-off-by: Dexuan Cui 
---
 drivers/net/hyperv/netvsc_drv.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 7349a70af083..f682a5572d84 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2297,6 +2297,7 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
 {
struct device *parent = vf_netdev->dev.parent;
struct net_device_context *ndev_ctx;
+   struct net_device *ndev;
struct pci_dev *pdev;
u32 serial;
 
@@ -2319,8 +2320,17 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
if (!ndev_ctx->vf_alloc)
continue;
 
-   if (ndev_ctx->vf_serial == serial)
-   return hv_get_drvdata(ndev_ctx->device_ctx);
+   if (ndev_ctx->vf_serial != serial)
+   continue;
+
+   ndev = hv_get_drvdata(ndev_ctx->device_ctx);
+   if (ndev->addr_len != vf_netdev->addr_len ||
+   memcmp(ndev->perm_addr, vf_netdev->perm_addr,
+  ndev->addr_len) != 0)
+   continue;
+
+   return ndev;
+
}
 
netdev_notice(vf_netdev,
-- 
2.25.1



[PATCH v8 net-next 0/2] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-16 Thread Dexuan Cui
The patchset adds the VF driver for Microsoft Azure Network Adapter (MANA),
and also changes the hv_netvsc driver's netvsc/VF binding logic to check
both the MAC address and the serial number (this is required by the MANA VF
driver).

v7 contains both the netvsc change and the VF driver. This version (v8)
posts them in 2 separate patches, as suggested by Stephen Hemminger.

Please refer to "[PATCH v8 net-next 2/2]" for the history of v1~v7.

Thanks,
Dexuan

Dexuan Cui (2):
  hv_netvsc: Make netvsc/VF binding check both MAC and serial number
  net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

 MAINTAINERS   |4 +-
 drivers/net/ethernet/Kconfig  |1 +
 drivers/net/ethernet/Makefile |1 +
 drivers/net/ethernet/microsoft/Kconfig|   29 +
 drivers/net/ethernet/microsoft/Makefile   |5 +
 drivers/net/ethernet/microsoft/mana/Makefile  |6 +
 drivers/net/ethernet/microsoft/mana/gdma.h|  673 ++
 .../net/ethernet/microsoft/mana/gdma_main.c   | 1415 
 .../net/ethernet/microsoft/mana/hw_channel.c  |  843 
 .../net/ethernet/microsoft/mana/hw_channel.h  |  190 ++
 drivers/net/ethernet/microsoft/mana/mana.h|  533 +
 drivers/net/ethernet/microsoft/mana/mana_en.c | 1895 +
 .../ethernet/microsoft/mana/mana_ethtool.c|  250 +++
 .../net/ethernet/microsoft/mana/shm_channel.c |  291 +++
 .../net/ethernet/microsoft/mana/shm_channel.h |   21 +
 drivers/net/hyperv/netvsc_drv.c   |   14 +-
 16 files changed, 6168 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/microsoft/Kconfig
 create mode 100644 drivers/net/ethernet/microsoft/Makefile
 create mode 100644 drivers/net/ethernet/microsoft/mana/Makefile
 create mode 100644 drivers/net/ethernet/microsoft/mana/gdma.h
 create mode 100644 drivers/net/ethernet/microsoft/mana/gdma_main.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/hw_channel.h
 create mode 100644 drivers/net/ethernet/microsoft/mana/mana.h
 create mode 100644 drivers/net/ethernet/microsoft/mana/mana_en.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/mana_ethtool.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.c
 create mode 100644 drivers/net/ethernet/microsoft/mana/shm_channel.h

-- 
2.25.1



RE: [PATCH v7 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-16 Thread Dexuan Cui
> From: Stephen Hemminger 
> Sent: Friday, April 16, 2021 11:09 AM
>  ...
> On Fri, 16 Apr 2021 17:58:45 +0000
> Dexuan Cui  wrote:
> 
> > > >
> > > > This probably should be a separate patch.
> > > > I think it is trying to address the case of VF discovery in 
> > > > Hyper-V/Azure
> where
> > > > the reported
> > > > VF from Hypervisor is bogus or confused.
> > >
> > > This is for the Multi vPorts feature of MANA driver, which allows one VF 
> > > to
> > > create multiple vPorts (NICs). They have the same PCI device and same VF
> > > serial number, but different MACs.
> > >
> > > So we put the change in one patch to avoid distro vendors missing this
> > > change when backporting the MANA driver.
> > >
> > > Thanks,
> > > - Haiyang
> >
> > The netvsc change should come together in the same patch with this VF
> > driver, otherwise the multi-vPorts functionality doesn't work properly.
> >
> > The netvsc change should not break any other existing VF drivers, because
> > Hyper-V NIC SR-IOV implementation requires that the NetVSC network
> > interface and the VF network interface have the same MAC address,
> > otherwise things won't work.
> >
> > Thanks,
> > Dexuan
> 
> Distro vendors should be able to handle a patch series.
> Don't see why this could not be two patch series.

Ok. Will split this into 2 patches (the first one is the netvsc change, and the
second is the Linux VF driver) and post v8 shortly.


RE: [PATCH v7 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-16 Thread Dexuan Cui
> From: Haiyang Zhang 
> Sent: Friday, April 16, 2021 10:11 AM
> > From: Stephen Hemminger 
> > > ...
> > > @@ -2319,8 +2320,17 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
> > >   if (!ndev_ctx->vf_alloc)
> > >   continue;
> > >
> > > - if (ndev_ctx->vf_serial == serial)
> > > - return hv_get_drvdata(ndev_ctx->device_ctx);
> > > + if (ndev_ctx->vf_serial != serial)
> > > + continue;
> > > +
> > > + ndev = hv_get_drvdata(ndev_ctx->device_ctx);
> > > + if (ndev->addr_len != vf_netdev->addr_len ||
> > > + memcmp(ndev->perm_addr, vf_netdev->perm_addr,
> > > +ndev->addr_len) != 0)
> > > + continue;
> > > +
> > > + return ndev;
> > > +
> > >   }
> > >
> > >   netdev_notice(vf_netdev,
> >
> >
> > This probably should be a separate patch.
> > I think it is trying to address the case of VF discovery in Hyper-V/Azure 
> > where
> > the reported
> > VF from Hypervisor is bogus or confused.
> 
> This is for the Multi vPorts feature of MANA driver, which allows one VF to
> create multiple vPorts (NICs). They have the same PCI device and same VF
> serial number, but different MACs.
> 
> So we put the change in one patch to avoid distro vendors missing this
> change when backporting the MANA driver.
> 
> Thanks,
> - Haiyang

The netvsc change should come together in the same patch with this VF
driver, otherwise the multi-vPorts functionality doesn't work properly.

The netvsc change should not break any other existing VF drivers, because
Hyper-V NIC SR-IOV implementation requires that the NetVSC network
interface and the VF network interface have the same MAC address,
otherwise things won't work.

Thanks,
Dexuan


RE: [PATCH v6 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-15 Thread Dexuan Cui
> From: Stephen Hemminger 
> Sent: Thursday, April 15, 2021 2:15 PM
> > ...
> > +   netif_carrier_off(ndev);
> > +
> > +   get_random_bytes(apc->hashkey, MANA_HASH_KEY_SIZE);
> 
> Current practice for network drivers is to use netdev_rss_key_fill() for this.

Will change to netdev_rss_key_fill(). Thanks!
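For reference, a minimal sketch of the suggested change (assuming the same
apc->hashkey buffer; the final code may differ):

	/* Seed the RSS hash key from the kernel's global RSS key
	 * instead of raw random bytes, as suggested.
	 */
	netdev_rss_key_fill(apc->hashkey, MANA_HASH_KEY_SIZE);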


RE: [PATCH v6 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-15 Thread Dexuan Cui
> From: Jakub Kicinski 
> Sent: Thursday, April 15, 2021 10:44 AM
>  ...
> On Wed, 14 Apr 2021 22:45:19 -0700 Dexuan Cui wrote:
> > +   buf = dma_alloc_coherent(gmi->dev, length, &dma_handle,
> > +GFP_KERNEL | __GFP_ZERO);
> 
> No need for GFP_ZERO, dma_alloc_coherent() zeroes the memory these days.

Yes, indeed. Will remove __GFP_ZERO.

> 
> > +static int mana_gd_register_irq(struct gdma_queue *queue,
> > +   const struct gdma_queue_spec *spec)
> > ...
> > +   struct gdma_irq_context *gic;
> > +
> > +   struct gdma_context *gc;
> 
> Why the empty line?

No good reason. Will remove this line. I'll check the whole patch
for similar issues.

> 
> > +   queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> > +   if (!queue)
> > +   return -ENOMEM;
> > +
> > +   gmi = &queue->mem_info;
> > +   err = mana_gd_alloc_memory(gc, spec->queue_size, gmi);
> > +   if (err)
> > +   return err;
> 
> Leaks the memory from 'queue'?

Sorry. This should be a bug I introduced when moving around some code.

> Same code in mana_gd_create_mana_eq(), ...wq_cq(), etc.

Will fix all of them, and check the code for similar issues.
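A minimal sketch of the intended fix for the quoted hunk, freeing 'queue' on
the error path (the actual v7 code may differ):

	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
	if (!queue)
		return -ENOMEM;

	gmi = &queue->mem_info;
	err = mana_gd_alloc_memory(gc, spec->queue_size, gmi);
	if (err) {
		/* Don't leak 'queue' when the DMA allocation fails */
		kfree(queue);
		return err;
	}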

> > +int mana_do_attach(struct net_device *ndev, enum mana_attach_caller
> caller)
> > +{
> > +   struct mana_port_context *apc = netdev_priv(ndev);
> > +   struct gdma_dev *gd = apc->ac->gdma_dev;
> > +   u32 max_txq, max_rxq, max_queues;
> > +   int port_idx = apc->port_idx;
> > +   u32 num_indirect_entries;
> > +   int err;
> > +
> > +   if (caller == MANA_OPEN)
> > +   goto start_open;
> > +
> > +   err = mana_init_port_context(apc);
> > +   if (err)
> > +   return err;
> > +
> > +   err = mana_query_vport_cfg(apc, port_idx, &max_txq, &max_rxq,
> > +  &num_indirect_entries);
> > +   if (err) {
> > +   netdev_err(ndev, "Failed to query info for vPort 0\n");
> > +   goto reset_apc;
> > +   }
> > +
> > +   max_queues = min_t(u32, max_txq, max_rxq);
> > +   if (apc->max_queues > max_queues)
> > +   apc->max_queues = max_queues;
> > +
> > +   if (apc->num_queues > apc->max_queues)
> > +   apc->num_queues = apc->max_queues;
> > +
> > +   memcpy(ndev->dev_addr, apc->mac_addr, ETH_ALEN);
> > +
> > +   if (caller == MANA_PROBE)
> > +   return 0;
> > +
> > +start_open:
> 
> Why keep this as a single function, there is no overlap between what's
> done for OPEN and PROBE, it seems.
> 
> Similarly detach should probably be split into clearly distinct parts.

Will improve the code. Thanks for the suggestion!

Thanks,
Dexuan


RE: [PATCH v5 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-13 Thread Dexuan Cui
> From: Jakub Kicinski 
> Sent: Tuesday, April 13, 2021 12:03 PM
> 
> On Mon, 12 Apr 2021 19:35:09 -0700 Dexuan Cui wrote:
> > +   apc->port_st_save = apc->port_is_up;
> > +   apc->port_is_up = false;
> > +   apc->start_remove = true;
> > +
> > +   /* Ensure port state updated before txq state */
> > +   smp_wmb();
> > +
> > +   netif_tx_disable(ndev);
> 
> In your napi poll method there is no barrier between port_is_up check
> and netif_tx_queue_stopped().

Thanks for spotting this! We'll make the below change:

--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -938,16 +938,19 @@ static void mana_poll_tx_cq(struct mana_cq *cq)
avail_space = mana_gd_wq_avail_space(gdma_wq);

/* Ensure tail updated before checking q stop */
smp_mb();

net_txq = txq->net_txq;
txq_stopped = netif_tx_queue_stopped(net_txq);

+   /* Ensure checking txq_stopped before apc->port_is_up. */
+   smp_rmb();
+
if (txq_stopped && apc->port_is_up && avail_space >= MAX_TX_WQE_SIZE) {
netif_tx_wake_queue(net_txq);
apc->eth_stats.wake_queue++;
}

if (atomic_sub_return(pkt_transmitted, &txq->pending_sends) < 0)
WARN_ON_ONCE(1);
 }

> > +   netif_carrier_off(ndev);
> > +
> > +   /* No packet can be transmitted now since apc->port_is_up is false.
> > +* There is still a tiny chance that mana_poll_tx_cq() can re-enable
> > +* a txq because it may not timely see apc->port_is_up being cleared
> > +* to false, but it doesn't matter since mana_start_xmit() drops any
> > +* new packets due to apc->port_is_up being false.
> > +*
> > +* Drain all the in-flight TX packets
> > +*/
> > +   for (i = 0; i < apc->num_queues; i++) {
> > +   txq = &apc->tx_qp[i].txq;
> > +
> > +   while (atomic_read(&txq->pending_sends) > 0)
> > +   usleep_range(1000, 2000);
> > +   }
> 
> > +   /* All cleanup actions should stay after rtnl_lock(), otherwise
> > +* other functions may access partially cleaned up data.
> > +*/
> > +   rtnl_lock();
> > +
> > +   mana_detach(ndev);
> > +
> > +   unregister_netdevice(ndev);
> > +
> > +   rtnl_unlock();
> 
> I find the resource management somewhat strange. Why is mana_attach()
> and mana_detach() called at probe/remove time, and not when the
> interface is brought up? Presumably when the user ifdowns the interface
> there is no point holding the resources? Your open/close methods are
> rather empty.

Thanks for the suggestion! Will move the functions to open/close().

> > +   if ((eq_addr & PAGE_MASK) != eq_addr)
> > +   return -EINVAL;
> > +
> > +   if ((cq_addr & PAGE_MASK) != cq_addr)
> > +   return -EINVAL;
> > +
> > +   if ((rq_addr & PAGE_MASK) != rq_addr)
> > +   return -EINVAL;
> > +
> > +   if ((sq_addr & PAGE_MASK) != sq_addr)
> > +   return -EINVAL;

Will change to PAGE_ALIGNED(). 
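i.e. roughly (a sketch; the final code may combine the checks differently):

	if (!PAGE_ALIGNED(eq_addr) || !PAGE_ALIGNED(cq_addr) ||
	    !PAGE_ALIGNED(rq_addr) || !PAGE_ALIGNED(sq_addr))
		return -EINVAL;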

Thanks,
Dexuan


RE: [PATCH v4 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-12 Thread Dexuan Cui
> From: Jakub Kicinski 
> Sent: Monday, April 12, 2021 11:21 AM
> ... 
> On Sun, 11 Apr 2021 19:34:55 -0700 Dexuan Cui wrote:
> > +   for (i = 0; i < ANA_INDIRECT_TABLE_SIZE; i++)
> > +   apc->indir_table[i] = i % apc->num_queues;
> 
> ethtool_rxfh_indir_default()

Will use ethtool_rxfh_indir_default().
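i.e. roughly (a sketch):

	/* Let the ethtool core compute the default round-robin spread */
	for (i = 0; i < ANA_INDIRECT_TABLE_SIZE; i++)
		apc->indir_table[i] = ethtool_rxfh_indir_default(i, apc->num_queues);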

> > +   err = mana_cfg_vport_steering(apc, rx, true, update_hash, update_tab);
> > +   return err;
> 
> return mana_...
> 
> please fix everywhere.

Will fix this one, and will review if there is any similar issue.

> > +   netif_set_real_num_tx_queues(ndev, apc->num_queues);
> > +
> > +   err = mana_add_rx_queues(apc, ndev);
> > +   if (err)
> > +   goto destroy_vport;
> > +
> > +   apc->rss_state = apc->num_queues > 1 ? TRI_STATE_TRUE :
> TRI_STATE_FALSE;
> > +
> > +   netif_set_real_num_rx_queues(ndev, apc->num_queues);
> 
> netif_set_real_num_.. can fail.

Will fix the error handling.
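A sketch of the intended error handling (reusing the destroy_vport label from
the quoted code; the final code may differ):

	err = netif_set_real_num_tx_queues(ndev, apc->num_queues);
	if (err)
		goto destroy_vport;

	err = netif_set_real_num_rx_queues(ndev, apc->num_queues);
	if (err)
		goto destroy_vport;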

> > +   rtnl_lock();
> > +
> > +   netdev_lockdep_set_classes(ndev);
> > +
> > +   ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM |
> NETIF_F_IPV6_CSUM;
> > +   ndev->hw_features |= NETIF_F_RXCSUM;
> > +   ndev->hw_features |= NETIF_F_TSO | NETIF_F_TSO6;
> > +   ndev->hw_features |= NETIF_F_RXHASH;
> > +   ndev->features = ndev->hw_features;
> > +   ndev->vlan_features = 0;
> > +
> > +   err = register_netdevice(ndev);
> > +   if (err) {
> > +   netdev_err(ndev, "Unable to register netdev.\n");
> > +   goto destroy_vport;
> > +   }
> > +
> > +   rtnl_unlock();
> > +
> > +   return 0;
> > +destroy_vport:
> > +   rtnl_unlock();
> 
> Why do you take rtnl_lock() explicitly around this code?

It looks like there is no good reason, and I guess we just copied
the code from netvsc_probe(), where the RTNL lock is indeed
explicitly needed.

Will change to directly use register_netdev(), which acquires and
releases the RTNL lock automatically.
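i.e. roughly (a sketch):

	/* register_netdev() takes and drops rtnl_lock internally */
	err = register_netdev(ndev);
	if (err) {
		netdev_err(ndev, "Unable to register netdev.\n");
		goto destroy_vport;
	}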

> > +static int mana_set_channels(struct net_device *ndev,
> > +struct ethtool_channels *channels)
> > +{
> > +   struct ana_port_context *apc = netdev_priv(ndev);
> > +   unsigned int new_count;
> > +   unsigned int old_count;
> > +   int err, err2;
> > +
> > +   new_count = channels->combined_count;
> > +   old_count = apc->num_queues;
> > +
> > +   if (new_count < 1 || new_count > apc->max_queues ||
> > +   channels->rx_count || channels->tx_count ||
> channels->other_count)
> 
> All these checks should be done by the core already.
> 
> > +   return -EINVAL;
> > +
> > +   if (new_count == old_count)
> > +   return 0;
> 
> And so is this one.

Will change the code to avoid unnecessary checking.

Thanks,
Dexuan


RE: [PATCH v4 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-12 Thread Dexuan Cui
> From: Haiyang Zhang 
> Sent: Monday, April 12, 2021 7:40 AM
> > From: Andrew Lunn 
> > Sent: Monday, April 12, 2021 8:32 AM
> > > ...
> > > + /* At most num_online_cpus() + 1 interrupts are used. */
> > > + msix_index = queue->eq.msix_index;
> > > + if (WARN_ON(msix_index > num_online_cpus()))
> > > + return;
> >
> > Do you handle hot{un}plug of CPUs?
> We don't have hot{un}plug of CPU feature yet.

Hyper-V doesn't support vCPU hot plug/unplug for VMs running on it,
but potentially the VM is able to offline and online its virtual CPUs,
though this is not a typical usage at all in production systems (to offline
a vCPU, the VM also needs to unbind all the para-virtualized devices'
vmbus channels from that vCPU, which is practically undoable in a VM
that has fewer than about 32 vCPUs, because typically every vCPU is
bound to one of the vmbus channels of the NetVSC device(s) and
StorVSC device(s), and can't be (easily) unbound).

So I think the driver does need to handle CPU online/offlining properly,
but I think we can do that in a future patch, because IMO that would
need more careful consideration. For now the WARN_ON() is good, as
it at least lets us know when something unexpected happens.

> > > +static void mana_hwc_init_event_handler(void *ctx, struct gdma_queue
> > *q_self,
> > > + struct gdma_event *event)
> > > +{
> > > + struct hw_channel_context *hwc = ctx;
> > > + struct gdma_dev *gd = hwc->gdma_dev;
> > > + union hwc_init_type_data type_data;
> > > + union hwc_init_eq_id_db eq_db;
> > > + u32 type, val;
> > > +
> > > + switch (event->type) {
> > > + case GDMA_EQE_HWC_INIT_EQ_ID_DB:
> > > + eq_db.as_uint32 = event->details[0];
> > > + hwc->cq->gdma_eq->id = eq_db.eq_id;
> > > + gd->doorbell = eq_db.doorbell;
> > > + break;
> > > +
> > > + case GDMA_EQE_HWC_INIT_DATA:
> > > +
> > > + type_data.as_uint32 = event->details[0];
> > > +
> > > + case GDMA_EQE_HWC_INIT_DONE:
> > > + complete(&hwc->hwc_init_eqe_comp);
> > > + break;
> >
> > ...
> >
> > > + default:
> > > + WARN_ON(1);
> > > + break;
> > > + }
> >
> > Are these events from the firmware? If you have newer firmware with an
> > older driver, are you going to spam the kernel log with WARN_ON dumps?
> For protocol version mismatch, the host and guest will either negotiate the
> highest common version, or fail to probe. So this kind of warnings are not
> expected.

I agree, but I think we'd better remove the WARN_ON(1), which was mainly
for debugging purposes, and was added just in case.

Thanks,
Dexuan


RE: [PATCH v4 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-12 Thread Dexuan Cui
> From: Leon Romanovsky 
> Sent: Monday, April 12, 2021 12:46 AM
> To: Dexuan Cui 
> > ...
> > +#define ANA_MAJOR_VERSION  0
> > +#define ANA_MINOR_VERSION  1
> > +#define ANA_MICRO_VERSION  1
> 
> Please don't introduce drier versions.

This is not the usual "driver version", though it's called  "drv version" :-)
As you can see, the driver does not use the macro MODULE_VERSION().

Here the "drv version" actually means the version of the VF-to-PF protocol,
with which the Azure Network Adapter ethernet NIC driver (i.e. the VF driver)
talks to the PF driver.  The protocol version determines the formats of the
messages that are sent from the VF driver to the PF driver, e.g. query the
MAC address, create Send/Receive queues, configure RSS, etc.

Currently the protocol version is 0.1.1. You may ask why it's called
"drv version" rather than "protocol version" -- it's because the PF driver
calls it that way, so I think here the VF driver may as well use the same
name. BTW, the "drv ver" info is passed to the PF driver in the below
function:

static int mana_query_client_cfg(struct ana_context *ac, u32 drv_major_ver,
 u32 drv_minor_ver, u32 drv_micro_ver,
 u16 *max_num_vports)
{
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct ana_query_client_cfg_resp resp = {};
struct ana_query_client_cfg_req req = {};
struct device *dev = gc->dev;
int err = 0;

mana_gd_init_req_hdr(&req.hdr, ANA_QUERY_CLIENT_CONFIG,
 sizeof(req), sizeof(resp));
req.drv_major_ver = drv_major_ver;
req.drv_minor_ver = drv_minor_ver;
req.drv_micro_ver = drv_micro_ver;

err = mana_send_request(ac, &req, sizeof(req), &resp, sizeof(resp));

Thanks,
Dexuan



RE: [PATCH v3 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-08 Thread Dexuan Cui
> From: Andrew Lunn 
> Sent: Thursday, April 8, 2021 5:30 PM
> To: Stephen Hemminger 
> ...
> > Linux kernel doesn't do namespaces in the code, so every new driver needs
> > to worry about global symbols clashing
> 
> This driver is called mana, yet the code uses ana. It would be good to
> resolve this inconsistency as well. Ideally, you want to prefix
> everything with ana_ or mana_, depending on what you choose, so we
> have a clean namespace.
> 
>  Andrew

Thanks for the suggestion! Let me think about this and work out a solution.


RE: [PATCH v3 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-08 Thread Dexuan Cui
> From: David Miller 
> Sent: Thursday, April 8, 2021 5:41 PM
> > ...
> > In the driver code, all the structs/unions marked by __packed are used to
> > talk with the hardware, so I think __packed is necessary here?
> 
> It actually isn't in many cases; check with and without the __packed directive
> and see if anything changes.

Will do.

> > Do you think if it's better if we remove all the __packed, and add
> > static_assert(sizeof(struct XXX) == YYY) instead? e.g.
> >
> > @@ -105,7 +105,8 @@ struct gdma_msg_hdr {
> > u16 msg_version;
> > u16 hwc_msg_id;
> > u32 msg_size;
> > -} __packed;
> > +};
> > +static_assert(sizeof(struct gdma_msg_hdr) == 16);
> 
> This won't make sure the structure member offsets are what you expect.
> 
> I think you'll have to go through the structures one-by-one by hand to
> figure out which ones really require the __packed attribute and which do not.

Got it. Let me see if I can remove all the __packed.
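One possible shape (a hypothetical sketch, not the final patch): assert both
the total size and the member offsets, which addresses the concern that a
size check alone doesn't pin down the layout:

#include <linux/build_bug.h>
#include <linux/stddef.h>

struct gdma_msg_hdr {
	u32 hdr_type;
	u32 msg_type;
	u16 msg_version;
	u16 hwc_msg_id;
	u32 msg_size;
};
static_assert(sizeof(struct gdma_msg_hdr) == 16);
static_assert(offsetof(struct gdma_msg_hdr, msg_version) == 8);
static_assert(offsetof(struct gdma_msg_hdr, hwc_msg_id) == 10);
static_assert(offsetof(struct gdma_msg_hdr, msg_size) == 12);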


RE: [PATCH v3 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-08 Thread Dexuan Cui
> From: David Miller 
> Sent: Thursday, April 8, 2021 4:46 PM
> ...
> > +struct gdma_msg_hdr {
> > +   u32 hdr_type;
> > +   u32 msg_type;
> > +   u16 msg_version;
> > +   u16 hwc_msg_id;
> > +   u32 msg_size;
> > +} __packed;
> > +
> > +struct gdma_dev_id {
> > +   union {
> > +   struct {
> > +   u16 type;
> > +   u16 instance;
> > +   };
> > +
> > +   u32 as_uint32;
> > +   };
> > +} __packed;
> 
> Please don't  use __packed unless absolutely necessary.  It generates
> suboptimal code (byte at a time
> accesses etc.) and for many of these you don't even need it.

In the driver code, all the structs/unions marked by __packed are used to
talk with the hardware, so I think __packed is necessary here?

Do you think if it's better if we remove all the __packed, and add
static_assert(sizeof(struct XXX) == YYY) instead? e.g.

@@ -105,7 +105,8 @@ struct gdma_msg_hdr {
u16 msg_version;
u16 hwc_msg_id;
u32 msg_size;
-} __packed;
+};
+static_assert(sizeof(struct gdma_msg_hdr) == 16);

 struct gdma_dev_id {
union {

Now I'm trying to figure out how other NIC drivers define structs/unions.

Thanks,
Dexuan


RE: [PATCH v2 net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-08 Thread Dexuan Cui
> From: Stephen Hemminger 
> Sent: Thursday, April 8, 2021 9:52 AM

Thanks all for your input! We'll make the below changes as suggested:

Microsoft Azure Network Device ==> Microsoft Network Devices
Drop the default m
validated ==> supported

We'll also fix some warnings reported by "kernel test robot".

Will post v3 later today.

Thanks,
Dexuan

diff --git a/drivers/net/ethernet/microsoft/Kconfig b/drivers/net/ethernet/microsoft/Kconfig
index 12ef6b581566..e1ac0a5d808d 100644
--- a/drivers/net/ethernet/microsoft/Kconfig
+++ b/drivers/net/ethernet/microsoft/Kconfig
@@ -3,26 +3,25 @@
 #
 
 config NET_VENDOR_MICROSOFT
-   bool "Microsoft Azure Network Device"
+   bool "Microsoft Network Devices"
default y
help
  If you have a network (Ethernet) device belonging to this class, say Y.
 
  Note that the answer to this question doesn't directly affect the
  kernel: saying N will just cause the configurator to skip the
- question about Microsoft Azure network device. If you say Y, you
- will be asked for your specific device in the following question.
+ question about Microsoft network devices. If you say Y, you will be
+ asked for your specific device in the following question.
 
 if NET_VENDOR_MICROSOFT
 
 config MICROSOFT_MANA
tristate "Microsoft Azure Network Adapter (MANA) support"
-   default m
depends on PCI_MSI && X86_64
select PCI_HYPERV
help
  This driver supports Microsoft Azure Network Adapter (MANA).
- So far, the driver is only validated on X86_64.
+ So far, the driver is only supported on X86_64.
 
  To compile this driver as a module, choose M here.
  The module will be called mana.


RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Dexuan Cui
> From: Leon Romanovsky 
> Sent: Wednesday, April 7, 2021 5:45 AM
> > >
> > > BTW, you don't need to write { 0 }, the {} is enough.
> >
> > Thanks for the suggestion! I'll use {0} in v2.
> 
> You missed the point, "{ 0 }" change to be "{}" without 0.

Got it. Will make the suggested change.

FWIW, {0} and { 0 } are still widely used, but it looks like
{} is indeed preferred:

$ grep "= {};" drivers/net/  -nr  | wc -l
829

$ grep "= {0};" drivers/net/  -nr  | wc -l
708

$ grep "= {};" kernel/  -nr  | wc -l
29

$ grep "= {0};" kernel/  -nr  | wc -l
4


RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Dexuan Cui
> From: Leon Romanovsky 
> Sent: Wednesday, April 7, 2021 1:10 AM
> 
> <...>
> 
> > +int gdma_verify_vf_version(struct pci_dev *pdev)
> > +{
> > +   struct gdma_context *gc = pci_get_drvdata(pdev);
> > +   struct gdma_verify_ver_req req = { 0 };
> > +   struct gdma_verify_ver_resp resp = { 0 };
> > +   int err;
> > +
> > +   gdma_init_req_hdr(&req.hdr, GDMA_VERIFY_VF_DRIVER_VERSION,
> > + sizeof(req), sizeof(resp));
> > +
> > +   req.protocol_ver_min = GDMA_PROTOCOL_FIRST;
> > +   req.protocol_ver_max = GDMA_PROTOCOL_LAST;
> > +
> > +   err = gdma_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
> > +   if (err || resp.hdr.status) {
> > +   pr_err("VfVerifyVersionOutput: %d, status=0x%x\n", err,
> > +  resp.hdr.status);
> > +   return -EPROTO;
> > +   }
> > +
> > +   return 0;
> > +}
> 
> <...>
> > +   err = gdma_verify_vf_version(pdev);
> > +   if (err)
> > +   goto remove_irq;
> 
> Will this VF driver be used in the guest VM? What will prevent from users to
> change it?
> I think that such version negotiation scheme is not allowed.

Yes, the VF driver is expected to run in a Linux VM that runs on Azure.

Currently gdma_verify_vf_version() just tells the PF driver that the VF driver
is only able to support GDMA_PROTOCOL_V1, and want to use
GDMA_PROTOCOL_V1's message formats to talk to the PF driver later.

enum {
GDMA_PROTOCOL_UNDEFINED = 0,
GDMA_PROTOCOL_V1 = 1,
GDMA_PROTOCOL_FIRST = GDMA_PROTOCOL_V1,
GDMA_PROTOCOL_LAST = GDMA_PROTOCOL_V1,
GDMA_PROTOCOL_VALUE_MAX
};

The PF driver is supposed to always support GDMA_PROTOCOL_V1, so I expect
here gdma_verify_vf_version() should succeed. If a user changes the Linux VF
driver and try to use a protocol version not supported by the PF driver, then
gdma_verify_vf_version() will fail; later, if the VF driver tries to talk to 
the PF
driver using an unsupported message format, the PF driver will return a failure.

Thanks,
-- Dexuan


RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Dexuan Cui
> From: Leon Romanovsky 
> Sent: Wednesday, April 7, 2021 1:15 AM
> > ...
> > int gdma_test_eq(struct gdma_context *gc, struct gdma_queue *eq)
> > {
> > struct gdma_generate_test_event_req req = { 0 };
> > struct gdma_general_resp resp = { 0 };
> 
> BTW, you don't need to write { 0 }, the {} is enough.
 
Thanks for the suggestion! I'll use {0} in v2. 

BTW, looks like both are widely used in the kernel. Maybe someone
should update scripts/checkpatch.pl to add a warning against { 0 } in
new code, if {0} is preferred. :-)

dexuan@localhost:~/linux$ grep "= { 0 };" kernel/ -nr | wc -l
4
dexuan@localhost:~/linux$  grep "= {0};" kernel/ -nr | wc -l
4

dexuan@localhost:~/linux$ grep "= { 0 };" Documentation/  -nr
Documentation/networking/page_pool.rst:117:struct page_pool_params pp_params = { 0 };

dexuan@localhost:~/linux$ grep "= { 0 };" drivers/ -nr | wc -l
1085
dexuan@localhost:~/linux$ grep "= {0};" drivers/ -nr | wc -l
1321

dexuan@localhost:~/linux$ grep "= { 0 };" drivers/net/ -nr | wc -l
408
dexuan@localhost:~/linux$  grep "= {0};" drivers/net/ -nr | wc -l
708


RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Dexuan Cui
> From: kernel test robot 
> Sent: Tuesday, April 6, 2021 6:31 PM
> ...
> Hi Dexuan, 
> I love your patch! Perhaps something to improve:
> 
> All warnings (new ones prefixed by >>):
> 
>drivers/pci/controller/pci-hyperv.c: In function 'hv_irq_unmask':
>drivers/pci/controller/pci-hyperv.c:1220:2: error: implicit declaration of
> function 'hv_set_msi_entry_from_desc'
> [-Werror=implicit-function-declaration]
> 1220 |  hv_set_msi_entry_from_desc(>int_entry.msi_entry,
> msi_desc);

This build error looks strange, because the patch doesn't even touch the driver
drivers/pci/controller/pci-hyperv.c.

I'll try to repro the build failure, and the other 2 failures in 2 separate
emails, and figure out what's happening.

PS, I tested building the patch in a fresh Ubuntu 20.04 VM and it was 
successful.

Thanks,
-- Dexuan


RE: [PATCH net-next] net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

2021-04-07 Thread Dexuan Cui
> From: Andrew Lunn 
> Sent: Tuesday, April 6, 2021 6:08 PM
> To: Dexuan Cui 
> 
> > +static int gdma_query_max_resources(struct pci_dev *pdev)
> > +{
> > +   struct gdma_context *gc = pci_get_drvdata(pdev);
> > +   struct gdma_general_req req = { 0 };
> > +   struct gdma_query_max_resources_resp resp = { 0 };
> > +   int err;
> 
> Network drivers need to use reverse christmas tree. I spotted a number
> of functions getting this wrong. Please review the whole driver.

Hi Andrew, thanks for the quick comments!

I think in general the patch follows the reverse christmas tree style.

For the function you pointed out, I didn't follow the reverse
christmas tree style because I wanted to keep the variable defines
more natural, i.e. first define 'req' and then 'resp'.

I can swap the 2 lines here, i.e. define 'resp' first, but this looks a little
strange to me, because in some other functions the 'req' should be
defined first, e.g. 

int gdma_test_eq(struct gdma_context *gc, struct gdma_queue *eq)
{
struct gdma_generate_test_event_req req = { 0 };
struct gdma_general_resp resp = { 0 };


And, some variable defines can not follow the reverse christmas tree
style due to dependency, e.g.
static void hwc_init_event_handler(void *ctx, struct gdma_queue *q_self,
   struct gdma_event *event) 
{
struct hw_channel_context *hwc = ctx;
struct gdma_dev *gd = hwc->gdma_dev;
struct gdma_context *gc = gdma_dev_to_context(gd);

I failed to find the reverse christmas tree rule in the Documentation/ 
folder. I knew the rule and I thought it was documented there,
but it looks like it's not. So my understanding is that in general we
should follow the style, but there can be exceptions due to
dependencies or logical grouping.

> > +   gdma_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
> > + sizeof(req), sizeof(resp));
> > +
> > +   err = gdma_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
> > +   if (err || resp.hdr.status) {
> > +   pr_err("%s, line %d: err=%d, err=0x%x\n", __func__, __LINE__,
> > +  err, resp.hdr.status);
> 
> I expect checkpatch complained about this. Don't use pr_err(), use
> dev_err(pdev->dev, ...  of netdev_err(ndev, ... You should always have
> access to dev or ndev, so please change all pr_ calls.

I did run scripts/checkpatch.pl and it reported no error/warning. :-)

But I think you're correct. I'll change the pr_* to dev_* or netdev_*.
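i.e. roughly (a sketch, assuming the pdev from the quoted function):

	dev_err(&pdev->dev, "%s, line %d: err=%d, status=0x%x\n",
		__func__, __LINE__, err, resp.hdr.status);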
 
> > +static unsigned int num_queues = ANA_DEFAULT_NUM_QUEUE;
> > +module_param(num_queues, uint, 0444);
> 
> No module parameters please.
> 
>Andrew

This parameter was mostly for debugging purposes. I can remove it.

BTW, I do remember some maintainers ask people to not add module
parameters unless that's absolutely necessary, but I don't remember if
the rule applies to only the net subsystem or to the whole drivers/
folder. It looks like the rule was also not documented in the
Documentation/ folder? Please let me know if such a rule is explicitly
defined somewhere.

Thanks,
-- Dexuan


RE: [PATCH v2 1/1] use crc32 instead of md5 for hibernation e820 integrity check

2021-04-01 Thread Dexuan Cui
> From: Chris von Recklinghausen 
> Sent: Thursday, April 1, 2021 9:42 AM
> To: a...@kernel.org; s...@redhat.com; raf...@kernel.org; Dexuan Cui
> ; linux...@vger.kernel.org;
> linux-cry...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH v2 1/1] use crc32 instead of md5 for hibernation e820 integrity check
> 
> Suspend fails on a system in fips mode because md5 is used for the e820
> integrity check and is not available. Use crc32 instead.
> 
> Fixes: 62a03defeabd ("PM / hibernate: Verify the consistent of e820 memory map by md5 digest")
> Signed-off-by: Chris von Recklinghausen 
> ---
>  arch/x86/power/hibernate.c | 35 +++
>  1 file changed, 19 insertions(+), 16 deletions(-)

Thanks Chris and all! This patch looks good to me.

Tested-by: Dexuan Cui 
Reviewed-by: Dexuan Cui 



RE: v5.12.0-rc5: the kernel panics if FIPS mode is on

2021-03-30 Thread Dexuan Cui
> From: Eric Biggers 
> Sent: Monday, March 29, 2021 6:26 PM
> ...
> It looks like your userspace is using tcrypt.ko to request that the kernel
> test "ofb(aes)", but your kernel doesn't have CONFIG_CRYPTO_OFB enabled so
> the test fails as expected.

Hi Eric,
Thanks for the explanation! Yes, that's it! 

Sorry for the false alarm! Actually the kernel is faultless here.

> Are you sure that anything changed on the kernel side
> besides the kconfig you are using? It looks like this was always the behavior
> when tcrypt.ko is used to test a non-existing algorithm.

After I rebuilt the kernel with the 3 options:
CONFIG_CRYPTO_OFB=y
CONFIG_CRYPTO_DEV_PADLOCK_AES=y
CONFIG_CRYPTO_ANSI_CPRNG=y

and generated the .hmac file:
sha512hmac /boot/vmlinuz-5.12.0-rc5+  > /boot/.vmlinuz-5.12.0-rc5+.hmac
 
now the kernel boots up successfully with fips=1. :-)

> Is your userspace code intentionally trying to test "ofb(aes)", or is it
> accidental?
> 
> - Eric

I'm not sure. This is a CentOS 8.3 VM, and I use the default configuration.
I have been trying to build & run a v5.12.0-rc5+ kernel with fips=1, and
now this is working for me, thanks to your explanation. Thanks again!

Thanks,
-- Dexuan


Fix hibernation in FIPS mode?

2021-03-29 Thread Dexuan Cui
Hi,
MD5 was marked non-compliant with FIPS in 2009:
a3bef3a31a19 ("crypto: testmgr - Skip algs not flagged fips_allowed in fips mode")
a1915d51e8e7 ("crypto: testmgr - Mark algs allowed in fips mode")

But hibernation_e820_save() is still using MD5, and fails in FIPS mode
due to the 2018 patch:
749fa17093ff ("PM / hibernate: Check the success of generating md5 digest 
before hibernation")

As a result, hibernation doesn't work when FIPS is on.

Do you think if hibernation_e820_save() should be changed to use a
FIPS-compliant algorithm like SHA-1?

PS, currently it looks like FIPS mode is broken in the mainline:
https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg49414.html

Thanks,
-- Dexuan



v5.12.0-rc5: the kernel panics if FIPS mode is on

2021-03-29 Thread Dexuan Cui
Hi all,
The v5.12.0-rc5 kernel (1e43c377a79f) panics with fips=1.

Please refer to the below panic call-trace. The kernel config file and
the full kernel messages are also attached.

Is this a known issue?

Thanks,
-- Dexuan

 Starting dracut pre-udev hook...
[7.260424] alg: self-tests for sha512-generic (sha512) passed
[7.265917] alg: self-tests for sha384-generic (sha384) passed
[7.272426] alg: self-tests for sha512-ssse3 (sha512) passed
[7.276500] alg: self-tests for sha384-ssse3 (sha384) passed
[7.281722] alg: self-tests for sha512-avx (sha512) passed
[7.286579] alg: self-tests for sha384-avx (sha384) passed
[7.291631] alg: self-tests for sha512-avx2 (sha512) passed
[7.296950] alg: self-tests for sha384-avx2 (sha384) passed
[7.321040] alg: self-tests for sha3-224-generic (sha3-224) passed
[7.330291] alg: self-tests for sha3-256-generic (sha3-256) passed
[7.335918] alg: self-tests for sha3-384-generic (sha3-384) passed
[7.341508] alg: self-tests for sha3-512-generic (sha3-512) passed
[7.381918] alg: self-tests for crc32c-intel (crc32c) passed
[7.396694] alg: self-tests for crct10dif-pclmul (crct10dif) passed
[7.453515] alg: self-tests for ghash-clmulni (ghash) passed
[7.469558] alg: self-tests for des3_ede-asm (des3_ede) passed
[7.475355] alg: self-tests for ecb-des3_ede-asm (ecb(des3_ede)) passed
[7.481361] alg: self-tests for cbc-des3_ede-asm (cbc(des3_ede)) passed
[7.488656] alg: self-tests for des3_ede-generic (des3_ede) passed
[7.304930] dracut-pre-udev[502]: modprobe: ERROR: could not insert 'padlock_aes': No such device
[7.579580] alg: No test for fips(ansi_cprng) (fips_ansi_cprng)
[7.606547] alg: self-tests for sha1 (sha1) passed
[7.610624] alg: self-tests for ecb(des3_ede) (ecb(des3_ede)) passed
[7.615746] alg: self-tests for cbc(des3_ede) (cbc(des3_ede)) passed
[7.638067] alg: self-tests for ctr(des3_ede-asm) (ctr(des3_ede)) passed
[7.644781] alg: self-tests for ctr(des3_ede) (ctr(des3_ede)) passed
[7.653810] alg: self-tests for sha256 (sha256) passed
[7.658945] alg: self-tests for ecb(aes) (ecb(aes)) passed
[7.663493] alg: self-tests for cbc(aes) (cbc(aes)) passed
[7.668421] alg: self-tests for xts(aes) (xts(aes)) passed
[7.672389] alg: self-tests for ctr(aes) (ctr(aes)) passed
[7.692973] alg: self-tests for rfc3686(ctr-aes-aesni) (rfc3686(ctr(aes))) passed
[7.699446] alg: self-tests for rfc3686(ctr(aes)) (rfc3686(ctr(aes))) passed
[7.730149] alg: skcipher: failed to allocate transform for ofb(aes): -2
[7.735959] Kernel panic - not syncing: alg: self-tests for ofb(aes) (ofb(aes)) failed in fips mode!
[7.736952] CPU: 13 PID: 560 Comm: modprobe Tainted: GW 5.12.0-rc5+ #3
[7.736952] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008  12/07/2018
[7.736952] Call Trace:
[7.736952]  dump_stack+0x64/0x7c
[7.736952]  panic+0xfb/0x2d7
[7.736952]  alg_test+0x42d/0x460
[7.736952]  ? __kernfs_new_node+0x175/0x1d0
[7.736952]  do_test+0x3248/0x57ea [tcrypt]
[7.736952]  do_test+0x1f2c/0x57ea [tcrypt]
[7.736952]  ? 0xc031d000
[7.736952]  tcrypt_mod_init+0x55/0x1000 [tcrypt]
[7.736952]  ? 0xc031d000
[7.736952]  do_one_initcall+0x44/0x1d0
[7.736952]  ? __cond_resched+0x15/0x30
[7.736952]  ? kmem_cache_alloc_trace+0x3d/0x410
[7.736952]  do_init_module+0x5a/0x230
[7.736952]  load_module+0x1a5b/0x1bc0
[7.736952]  ? __do_sys_finit_module+0xad/0x110
[7.736952]  __do_sys_finit_module+0xad/0x110
[7.736952]  do_syscall_64+0x33/0x40
[7.736952]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[7.736952] RIP: 0033:0x7ff2e760978d
[7.736952] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cb 56 2c 00 f7 d8 64 89 01 48
[7.736952] RSP: 002b:7ffd80204308 EFLAGS: 0246 ORIG_RAX: 0139
[7.736952] RAX: ffda RBX: 563dcfe8f030 RCX: 7ff2e760978d
[7.736952] RDX:  RSI: 563dcf41d7b6 RDI: 0003
[7.736952] RBP: 563dcf41d7b6 R08:  R09: 
[7.736952] R10: 0003 R11: 0246 R12: 
[7.736952] R13: 563dcfe934c0 R14: 0004 R15: 
[7.736952] Kernel Offset: 0x1080 from 0x8100 (relocation range: 0x8000-0xbfff)
[7.736952] ---[ end Kernel panic - not syncing: alg: self-tests for ofb(aes) (ofb(aes)) failed in fips mode! ]---


[Attachment: v5.12-rc5.kernel.config.txt.tar.gz (kernel config)]

[0.00] Linux version 5.12.0-rc5+ (root@decui-co83-fips) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #3 SMP Mon Mar 29 21:26:58 UTC 2021

RE: [PATCH net-next] hv_netvsc: Add a comment clarifying batching logic

2021-03-12 Thread Dexuan Cui
> From: LKML haiyangz  On Behalf Of Haiyang Zhang
> Sent: Friday, March 12, 2021 3:45 PM
> To: linux-hyp...@vger.kernel.org; net...@vger.kernel.org
> Cc: Haiyang Zhang ; KY Srinivasan
> ; Stephen Hemminger ;
> o...@aepfle.de; vkuznets ; da...@davemloft.net;
> linux-kernel@vger.kernel.org; Shachar Raindel 
> Subject: [PATCH net-next] hv_netvsc: Add a comment clarifying batching logic
> 
> From: Shachar Raindel 
> 
> The batching logic in netvsc_send is non-trivial, due to
> a combination of the Linux API and the underlying hypervisor
> interface. Add a comment explaining why the code is written this
> way.
> 
> Signed-off-by: Shachar Raindel 
> Signed-off-by: Haiyang Zhang 
> ---

Reviewed-by: Dexuan Cui 


RE: How can a userspace program tell if the system supports the ACPI S4 state (Suspend-to-Disk)?

2021-02-09 Thread Dexuan Cui
> From: James Morse 
> Sent: Tuesday, February 9, 2021 10:15 AM
> To: Dexuan Cui 
> Cc: Rafael J. Wysocki ; linux-a...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-hyp...@vger.kernel.org; Michael Kelley
> ; Leandro Pereira 
> Subject: Re: How can a userspace program tell if the system supports the ACPI
> S4 state (Suspend-to-Disk)?
> 
> Hi Dexuan,
> 
> On 05/02/2021 19:36, Dexuan Cui wrote:
> >> From: Rafael J. Wysocki 
> >> Sent: Friday, February 5, 2021 5:06 AM
> >> To: Dexuan Cui 
> >> Cc: linux-a...@vger.kernel.org; linux-kernel@vger.kernel.org;
> >> linux-hyp...@vger.kernel.org; Michael Kelley 
> >> Subject: Re: How can a userspace program tell if the system supports the ACPI
> >> S4 state (Suspend-to-Disk)?
> >>
> >> On Sat, Dec 12, 2020 at 2:22 AM Dexuan Cui  wrote:
> >>>
> >>> Hi all,
> >>> It looks like Linux can hibernate even if the system does not support the ACPI
> >>> S4 state, as long as the system can shut down, so "cat /sys/power/state"
> >>> always contains "disk", unless we specify the kernel parameter "nohibernate"
> >>> or we use LOCKDOWN_HIBERNATION.
> 
> >>> In some scenarios IMO it can still be useful if the userspace is able to
> >>> detect if the ACPI S4 state is supported or not, e.g. when a Linux guest
> >>> runs on Hyper-V, Hyper-V uses the virtual ACPI S4 state as an indicator
> >>> of the proper support of the tool stack on the host, i.e. the guest is
> >>> discouraged from trying hibernation if the state is not supported.
> 
> What goes wrong? This sounds like a funny way of signalling hypervisor policy.

Hi James,
Sorry if I have caused any confusion. My above description only applies to
x86 Linux VMs on Hyper-V.

For ARM64 Hyper-V, I suspect the mainline Linux kernel still can't run in a
VM on ARM64 Hyper-V, so AFAIK the VM hibernation hasn't been tested:
it may or may not work. This is on our To-Do list.

Linux VM on Hyper-V needs to know if hibernation is supported/enabled
for the VM, mainly due to 2 reasons:

1. In the VM, the hv_balloon driver should disable the balloon up/down
operations if hibernation is enabled for the VM, otherwise bad things can
happen, e.g. the hv_balloon driver allocates some pages and gives the
pages to the hypervisor; now if the VM is allowed to hibernate, later
when the VM resumes back, the VM loses the pages forever, because
Hyper-V doesn't save the info of the pages that were from the VM, so
Hyper-V thinks no pages need to be returned to the VM.

2. If hibernation is enabled for a VM, the VM has a Linux agent program
that automatically creates and sets up the swap file for hibernation. If
hibernation is not enabled for the VM, the agent should not try to create
and set up the swap file for hibernation.
 
> >>> I know we can check the S4 state by 'dmesg':
> >>>
> >>> # dmesg |grep ACPI: | grep support
> >>> [3.034134] ACPI: (supports S0 S4 S5)
> >>>
> >>> But this method is unreliable because the kernel msg buffer can be filled
> >>> and overwritten. Is there any better method? If not, do you think if the
> >>> below patch is appropriate? Thanks!
> >>
> >> Sorry for the delay.
> >>
> >> If ACPI S4 is supported, /sys/power/disk will list "platform" as one
> >> of the options (and it will be the default one then).  Otherwise,
> >> "platform" is not present in /sys/power/disk, because ACPI is the only
> >> user of hibernation_ops.
> 
> > This works on x86. Thanks a lot!
> >
> > BTW, does this also work on ARM64?
> 
> Not today. The S4/S5 stuff is part of 'ACPI_SYSTEM_POWER_STATES_SUPPORT',
> which arm64 doesn't enable as it has a firmware mechanism that covers this
> on both DT and ACPI systems. That code is what calls hibernation_set_ops()
> to enable ACPI's platform mode.

Thanks for the explanation!

> Regardless: hibernate works fine. What does your hypervisor do that causes
> problems? (I think all we expect from firmware is it doesn't randomise the
> placement of the ACPI tables as they aren't necessarily part of the
> hibernate image)
> 
> Thanks, 
> James

I have explained the problems above for Linux VM on ARM64 Hyper-V.

I suppose you mean hibernation works fine for ARM64 bare metal. 
For a Linux VM on ARM64 Hyper-V, I suspect some Hyper-V specific states may
have to be saved and restored for hibernation. This is on our To-Do list.

Thanks,
Dexuan



RE: How can a userspace program tell if the system supports the ACPI S4 state (Suspend-to-Disk)?

2021-02-05 Thread Dexuan Cui
> From: Rafael J. Wysocki 
> Sent: Friday, February 5, 2021 5:06 AM
> To: Dexuan Cui 
> Cc: linux-a...@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-hyp...@vger.kernel.org; Michael Kelley 
> Subject: Re: How can a userspace program tell if the system supports the ACPI
> S4 state (Suspend-to-Disk)?
> 
> On Sat, Dec 12, 2020 at 2:22 AM Dexuan Cui  wrote:
> >
> > Hi all,
> > It looks like Linux can hibernate even if the system does not support the 
> > ACPI
> > S4 state, as long as the system can shut down, so "cat /sys/power/state"
> > always contains "disk", unless we specify the kernel parameter "nohibernate"
> > or we use LOCKDOWN_HIBERNATION.
> >
> > In some scenarios IMO it can still be useful if the userspace is able to 
> > detect
> > if the ACPI S4 state is supported or not, e.g. when a Linux guest runs on
> > Hyper-V, Hyper-V uses the virtual ACPI S4 state as an indicator of the 
> > proper
> > support of the tool stack on the host, i.e. the guest is discouraged from
> > trying hibernation if the state is not supported.
> >
> > I know we can check the S4 state by 'dmesg':
> >
> > # dmesg |grep ACPI: | grep support
> > [3.034134] ACPI: (supports S0 S4 S5)
> >
> > But this method is unreliable because the kernel msg buffer can be filled
> > and overwritten. Is there any better method? If not, do you think if the
> > below patch is appropriate? Thanks!
> 
> Sorry for the delay.
> 
> If ACPI S4 is supported, /sys/power/disk will list "platform" as one
> of the options (and it will be the default one then).  Otherwise,
> "platform" is not present in /sys/power/disk, because ACPI is the only
> user of hibernation_ops.
> 
> HTH

This works on x86. Thanks a lot!
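For the archives, a minimal userspace sketch of the check Rafael describes
(hypothetical helper, not from this thread):

#include <stdio.h>
#include <string.h>

/* ACPI S4 is supported iff "platform" appears in /sys/power/disk */
static int acpi_s4_supported(void)
{
	char buf[256] = "";
	FILE *f = fopen("/sys/power/disk", "r");

	if (!f)
		return 0;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return strstr(buf, "platform") != NULL;
}

int main(void)
{
	printf("ACPI S4 %ssupported\n", acpi_s4_supported() ? "" : "not ");
	return 0;
}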

BTW, does this also work on ARM64?

Thanks,
-- Dexuan


RE: kdump always hangs in rcu_barrier() -> wait_for_completion()

2021-01-27 Thread Dexuan Cui
> From: Paul E. McKenney 
> Sent: Thursday, November 26, 2020 3:55 PM
> To: Dexuan Cui 
> Cc: boqun.f...@gmail.com; Ingo Molnar ;
> r...@vger.kernel.org; vkuznets ; Michael Kelley
> ; linux-kernel@vger.kernel.org
> Subject: Re: kdump always hangs in rcu_barrier() -> wait_for_completion()
> 
> On Thu, Nov 26, 2020 at 10:59:19PM +, Dexuan Cui wrote:
> > > From: Paul E. McKenney 
> > > Sent: Thursday, November 26, 2020 1:42 PM
> > >
> > > > > Another possibility is that rcu_state.gp_kthread is non-NULL, but that
> > > > > something else is preventing RCU grace periods from completing, but in
> > > >
> > > > It looks like somehow the scheduling is not working here: in rcu_barrier(),
> > > > if I replace the wait_for_completion() with
> > > > wait_for_completion_timeout(&rcu_state.barrier_completion, 30*HZ), the
> > > > issue persists.
> > >
> > > Have you tried using sysreq-t to see what the various tasks are doing?
> >
> > Will try it.
> >
> > BTW, this is a "Generation 2" VM on Hyper-V, meaning sysrq only starts to
> > work after the Hyper-V para-virtualized keyboard driver loads... So, at this
> > early point, sysrq is not working. :-( I'll have to hack the code and use a
> > virtual NMI interrupt to force the sysrq handler to be called.
> 
> Whatever works!
> 
> > > Having interrupts disabled on all CPUs would have the effect of disabling
> > > the RCU CPU stall warnings.
> > >   Thanx, Paul
> >
> > I'm sure the interrupts are not disabled. Here the VM only has 1 virtual 
> > CPU,
> > and when the hang issue happens the virtual serial console is still 
> > responding
> > when I press Enter (it prints a new line) or Ctrl+C (it prints ^C).
> >
> > Here the VM does not use the "legacy timers" (PIT, Local APIC timer, etc.) at all.
> > Instead, the VM uses the Hyper-V para-virtualized timers. It looks like the
> > Hyper-V timer never fires in the kdump kernel when the hang issue happens. I'm
> > looking into this... I suspect this hang issue may only be specific to Hyper-V.
> 
> Fair enough, given that timers not working can also suppress RCU CPU
> stall warnings.  ;-)
> 
>   Thanx, Paul

FYI: the issue has been fixed by this fix:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fff7b5e6ee63c5d20406a131b260c619cdd24fd1

Thanks,
-- Dexuan


[PATCH] x86/hyperv: Initialize clockevents after LAPIC is initialized

2021-01-16 Thread Dexuan Cui
With commit 4df4cb9e99f8, the Hyper-V direct-mode STIMER is actually
initialized before LAPIC is initialized: see

  apic_intr_mode_init()

x86_platform.apic_post_init()
  hyperv_init()
hv_stimer_alloc()

apic_bsp_setup()
  setup_local_APIC()

setup_local_APIC() temporarily disables LAPIC, initializes it and
re-enables it.  The direct-mode STIMER depends on LAPIC, and when it's
registered, it can be programmed immediately and the timer can fire
very soon:

  hv_stimer_init
clockevents_config_and_register
  clockevents_register_device
tick_check_new_device
  tick_setup_device
tick_setup_periodic(), tick_setup_oneshot()
  clockevents_program_event

When the timer fires in the hypervisor, if the LAPIC is in the
disabled state, new versions of Hyper-V ignore the event and don't inject
the timer interrupt into the VM, and hence the VM hangs when it boots.

Note: when the VM starts/reboots, the LAPIC is pre-enabled by the
firmware, so the window of LAPIC being temporarily disabled is pretty
small, and the issue can only happen once out of 100~200 reboots for
a 40-vCPU VM on one dev host, and on another host the issue doesn't
reproduce after 2000 reboots.

The issue is more noticeable for kdump/kexec, because the LAPIC is
disabled by the first kernel, and stays disabled until the kdump/kexec
kernel enables it. This is especially an issue to a Generation-2 VM
(for which Hyper-V doesn't emulate the PIT timer) when CONFIG_HZ=1000
(rather than CONFIG_HZ=250) is used.

Fix the issue by moving hv_stimer_alloc() to a later place where the
LAPIC timer is initialized.

Fixes: 4df4cb9e99f8 ("x86/hyperv: Initialize clockevents earlier in CPU onlining")
Signed-off-by: Dexuan Cui 
---
 arch/x86/hyperv/hv_init.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 4638a52d8eae..6375967a8244 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -315,6 +315,25 @@ static struct syscore_ops hv_syscore_ops = {
.resume = hv_resume,
 };
 
+static void (* __initdata old_setup_percpu_clockev)(void);
+
+static void __init hv_stimer_setup_percpu_clockev(void)
+{
+   /*
+* Ignore any errors in setting up stimer clockevents
+* as we can run with the LAPIC timer as a fallback.
+*/
+   (void)hv_stimer_alloc();
+
+   /*
+* Still register the LAPIC timer, because the direct-mode STIMER is
+* not supported by old versions of Hyper-V. This also allows users
+* to switch to LAPIC timer via /sys, if they want to.
+*/
+   if (old_setup_percpu_clockev)
+   old_setup_percpu_clockev();
+}
+
 /*
  * This function is to be invoked early in the boot sequence after the
  * hypervisor has been detected.
@@ -393,10 +412,14 @@ void __init hyperv_init(void)
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
 
/*
-* Ignore any errors in setting up stimer clockevents
-* as we can run with the LAPIC timer as a fallback.
+* hyperv_init() is called before LAPIC is initialized: see
+* apic_intr_mode_init() -> x86_platform.apic_post_init() and
+* apic_bsp_setup() -> setup_local_APIC(). The direct-mode STIMER
+* depends on LAPIC, so hv_stimer_alloc() should be called from
+* x86_init.timers.setup_percpu_clockev.
 */
-   (void)hv_stimer_alloc();
+   old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
+   x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
 
hv_apic_init();
 
-- 
2.19.1



RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow

2021-01-09 Thread Dexuan Cui
> From: Andy Shevchenko  
> Sent: Saturday, January 9, 2021 12:52 AM
>> 
>> Hi Rafael, Len, and all,
>> Can you please take a look at the v2 patch?
>> 
>> The Linux mainline has been broken for several weeks when it
>> runs as a guest on Hyper-V, so we'd like this to be fixed ASAP,
>> as more people are being affected
> 
> I would like to see a warning printed when the dupped
> string violates the spec.

Hi Andy,
Do you want a simple strlen() check like the below, or a full
check of the AAA#### or NNNN#### format?

Can we have the v2 (https://lkml.org/lkml/2021/1/8/53) merged 
first, and then we can add another patch for the format checking?

I'm trying to do one thing in one patch so the patch is small enough
for easy reviewing.

diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h
index cb229e24c563..e6a5d997241c 100644
--- a/drivers/acpi/internal.h
+++ b/drivers/acpi/internal.h
@@ -97,7 +97,7 @@ void acpi_scan_table_handler(u32 event, void *table, void *context);
 extern struct list_head acpi_bus_id_list;
 
 struct acpi_device_bus_id {
-   char bus_id[15];
+   const char *bus_id;
unsigned int instance_no;
struct list_head node;
 };
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index a1b226eb2ce2..3b9902e5d965 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -486,6 +486,7 @@ static void acpi_device_del(struct acpi_device *device)
acpi_device_bus_id->instance_no--;
else {
list_del(_device_bus_id->node);
+   kfree_const(acpi_device_bus_id->bus_id);
kfree(acpi_device_bus_id);
}
break;
@@ -674,7 +675,23 @@ int acpi_device_add(struct acpi_device *device,
}
if (!found) {
acpi_device_bus_id = new_bus_id;
-   strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device));
+   acpi_device_bus_id->bus_id =
+   kstrdup_const(acpi_device_hid(device), GFP_KERNEL);
+   if (!acpi_device_bus_id->bus_id) {
+   pr_err(PREFIX "Memory allocation error for bus id\n");
+   result = -ENOMEM;
+   goto err_free_new_bus_id;
+   }
+
+   /*
+    * ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID): if the
+    * ID is a string, it must be of the form AAA#### or NNNN####,
+    * i.e. 7 or 8 characters.
+*/
+   if (strlen(acpi_device_bus_id->bus_id) > 8)
+   pr_warn(PREFIX "too long HID name: %s\n",
+   acpi_device_bus_id->bus_id);
+
acpi_device_bus_id->instance_no = 0;
list_add_tail(&acpi_device_bus_id->node, &acpi_bus_id_list);
}
@@ -709,6 +726,10 @@ int acpi_device_add(struct acpi_device *device,
if (device->parent)
list_del(&device->node);
list_del(&device->wakeup_list);
+
+ err_free_new_bus_id:
+   if (!found)
+   kfree(new_bus_id);
mutex_unlock(&acpi_device_lock);
 
  err_detach:
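
For reference, a full-format check along the lines discussed above might
look like this (hypothetical helper, not part of the posted diff; per
ACPI 6.2 Section 6.1.5 a string _HID is "AAA####" -- 3 uppercase letters
plus 4 hex digits -- or "NNNN####" -- 4 uppercase-alphanumeric vendor
characters plus 4 hex digits):

#include <linux/ctype.h>
#include <linux/string.h>

/* Hypothetical validator for a string _HID: "AAA####" or "NNNN####". */
static bool acpi_hid_format_ok(const char *hid)
{
        size_t len = strlen(hid);
        size_t vendor = (len == 7) ? 3 : 4;
        size_t i;

        if (len != 7 && len != 8)
                return false;

        for (i = 0; i < vendor; i++) {
                /* 7-char form: uppercase letters; 8-char form: A-Z or 0-9 */
                if (len == 7 ? !isupper(hid[i])
                             : !(isupper(hid[i]) || isdigit(hid[i])))
                        return false;
        }

        for (; i < len; i++)    /* product ID: 4 hex digits */
                if (!isxdigit(hid[i]))
                        return false;

        return true;
}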





RE: [PATCH v2] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow

2021-01-08 Thread Dexuan Cui
> From: Dexuan Cui 
> Sent: Thursday, January 7, 2021 11:24 PM
> ...
> Linux VM on Hyper-V crashes with the latest mainline:
> ...
> 
> Changes in v2:
> strlcpy -> kstrdup_const. Thanks Rafael J. Wysocki!
> Change commit log accordingly.

Hi Rafael, Len, and all,
Can you please take a look at the v2 patch?

The Linux mainline has been broken for several weeks when it
runs as a guest on Hyper-V, so we'd like this to be fixed ASAP,
as more people are being affected, e.g.
https://bugzilla.kernel.org/show_bug.cgi?id=210449

Thanks,
-- Dexuan


[PATCH v2] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow

2021-01-07 Thread Dexuan Cui
Linux VM on Hyper-V crashes with the latest mainline:

[4.069624] detected buffer overflow in strcpy
[4.077733] kernel BUG at lib/string.c:1149!
..
[4.085819] RIP: 0010:fortify_panic+0xf/0x11
...
[4.085819] Call Trace:
[4.085819]  acpi_device_add.cold.15+0xf2/0xfb
[4.085819]  acpi_add_single_object+0x2a6/0x690
[4.085819]  acpi_bus_check_add+0xc6/0x280
[4.085819]  acpi_ns_walk_namespace+0xda/0x1aa
[4.085819]  acpi_walk_namespace+0x9a/0xc2
[4.085819]  acpi_bus_scan+0x78/0x90
[4.085819]  acpi_scan_init+0xfa/0x248
[4.085819]  acpi_init+0x2c1/0x321
[4.085819]  do_one_initcall+0x44/0x1d0
[4.085819]  kernel_init_freeable+0x1ab/0x1f4

This is because of the recent buffer overflow detection in the
commit 6a39e62abbaf ("lib: string.h: detect intra-object overflow in fortified string functions")

Here acpi_device_bus_id->bus_id can only hold 14 characters, while
acpi_device_hid(device) returns a 22-char string
"HYPER_V_GEN_COUNTER_V1".

Per ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID), if the ID is a
string, it must be of the form AAA#### or NNNN####, i.e. 7 or 8
characters.

The field bus_id in struct acpi_device_bus_id was originally defined as
char bus_id[9], and later was enlarged to char bus_id[15] in 2007 in the
commit bb0958544f3c ("ACPI: use more understandable bus_id for ACPI devices")

Fix the issue by changing the field bus_id to const char *, and use
kstrdup_const() to initialize it.

Signed-off-by: Dexuan Cui 
---

Changes in v2:
strlcpy -> kstrdup_const. Thanks Rafael J. Wysocki!
Change commit log accordingly.

 drivers/acpi/internal.h |  2 +-
 drivers/acpi/scan.c | 14 +-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h
index cb229e24c563..e6a5d997241c 100644
--- a/drivers/acpi/internal.h
+++ b/drivers/acpi/internal.h
@@ -97,7 +97,7 @@ void acpi_scan_table_handler(u32 event, void *table, void *context);
 extern struct list_head acpi_bus_id_list;
 
 struct acpi_device_bus_id {
-   char bus_id[15];
+   const char *bus_id;
unsigned int instance_no;
struct list_head node;
 };
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index a1b226eb2ce2..8d49d192d1c1 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -486,6 +486,7 @@ static void acpi_device_del(struct acpi_device *device)
acpi_device_bus_id->instance_no--;
else {
list_del(&acpi_device_bus_id->node);
+   kfree_const(acpi_device_bus_id->bus_id);
kfree(acpi_device_bus_id);
}
break;
@@ -674,7 +675,14 @@ int acpi_device_add(struct acpi_device *device,
}
if (!found) {
acpi_device_bus_id = new_bus_id;
-   strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device));
+   acpi_device_bus_id->bus_id =
+   kstrdup_const(acpi_device_hid(device), GFP_KERNEL);
+   if (!acpi_device_bus_id->bus_id) {
+   pr_err(PREFIX "Memory allocation error for bus id\n");
+   result = -ENOMEM;
+   goto err_free_new_bus_id;
+   }
+
acpi_device_bus_id->instance_no = 0;
list_add_tail(&acpi_device_bus_id->node, &acpi_bus_id_list);
}
@@ -709,6 +717,10 @@ int acpi_device_add(struct acpi_device *device,
if (device->parent)
list_del(&device->node);
list_del(&device->wakeup_list);
+
+ err_free_new_bus_id:
+   if (!found)
+   kfree(new_bus_id);
mutex_unlock(&acpi_device_lock);
 
  err_detach:
-- 
2.19.1
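
As background on the kstrdup_const()/kfree_const() pairing used above, a
minimal usage sketch: kstrdup_const() returns the source pointer itself
when it points into kernel .rodata (avoiding an allocation), so the
matching free must be kfree_const(), which frees only strings that were
actually duplicated.

#include <linux/slab.h>
#include <linux/string.h>

/* Sketch: duplicate-if-needed semantics of kstrdup_const(). */
static const char *save_id(const char *src)
{
        /* May return src itself if src lives in .rodata. */
        return kstrdup_const(src, GFP_KERNEL);
}

static void drop_id(const char *id)
{
        kfree_const(id);        /* no-op for .rodata strings */
}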



RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow

2021-01-07 Thread Dexuan Cui
> From: Rafael J. Wysocki 
> Sent: Thursday, January 7, 2021 5:48 AM
> > > > Linux VM on Hyper-V crashes with the latest mainline:
> > > > ...
> The root cause is a VM issue AFAICS, though.

Yes.
 
> It doesn't look like the right fix to me, though.
> 
> The problem appears to be that the string coming from _HID is too long
> (which is a spec violation). 

Yes. We have requested Hyper-V team to fix the buggy BIOS/firmware,
but we have to cope with the existing buggy Hyper-V hosts, at least
before the Hyper-V fix is deployed to the hosts, and some old versions
of the hosts may not get updated. :-(

> The patch truncates it to match the
> length of the target buffer, but that is not particularly useful.
> 
> It would be better to use something like kstrdup_const() to initialize
> acpi_device_bus_id->bus_id IMV.

Makes sense. I'll submit v2 shortly.

Thanks,
-- Dexuan


[PATCH v3] Drivers: hv: vmbus: Add /sys/bus/vmbus/hibernation

2021-01-06 Thread Dexuan Cui
When a Linux VM runs on Hyper-V, if the host toolstack doesn't support
hibernation for the VM (this happens on old Hyper-V hosts like Windows
Server 2016, or new Hyper-V hosts if the admin or user doesn't declare
the hibernation intent for the VM), the VM is discouraged from trying
hibernation (because the host doesn't guarantee that the VM's virtual
hardware configuration will remain exactly the same across hibernation),
i.e. the VM should not try to set up the swap partition/file for
hibernation, etc.

x86 Hyper-V uses the presence of the virtual ACPI S4 state as the
indication of the host toolstack support for a VM. Currently there is
no easy and reliable way for the userspace to detect the presence of
the state (see https://lkml.org/lkml/2020/12/11/1097).  Add
/sys/bus/vmbus/hibernation for this purpose.

Signed-off-by: Dexuan Cui 
---

This v3 is similar to v1, and the changes are:
  Updated the documentation changes.
  Updated the commit log.
  /sys/bus/vmbus/supported_features -> /sys/bus/vmbus/hibernation

The patch is targeted at the Hyper-V tree's hyperv-next branch.

 Documentation/ABI/stable/sysfs-bus-vmbus |  7 +++
 drivers/hv/vmbus_drv.c   | 18 ++
 2 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus
index c27b7b89477c..42599d9fa161 100644
--- a/Documentation/ABI/stable/sysfs-bus-vmbus
+++ b/Documentation/ABI/stable/sysfs-bus-vmbus
@@ -1,3 +1,10 @@
+What:  /sys/bus/vmbus/hibernation
+Date:  Jan 2021
+KernelVersion: 5.12
+Contact:   Dexuan Cui 
+Description:   Whether the host supports hibernation for the VM.
+Users: Daemon that sets up swap partition/file for hibernation.
+
 What:  /sys/bus/vmbus/devices//id
 Date:  Jul 2009
 KernelVersion: 2.6.31
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index d491fdcee61f..4c544473b1d9 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -678,6 +678,23 @@ static const struct attribute_group vmbus_dev_group = {
 };
 __ATTRIBUTE_GROUPS(vmbus_dev);
 
+/* Set up the attribute for /sys/bus/vmbus/hibernation */
+static ssize_t hibernation_show(struct bus_type *bus, char *buf)
+{
+   return sprintf(buf, "%d\n", !!hv_is_hibernation_supported());
+}
+
+static BUS_ATTR_RO(hibernation);
+
+static struct attribute *vmbus_bus_attrs[] = {
+   &bus_attr_hibernation.attr,
+   NULL,
+};
+static const struct attribute_group vmbus_bus_group = {
+   .attrs = vmbus_bus_attrs,
+};
+__ATTRIBUTE_GROUPS(vmbus_bus);
+
 /*
  * vmbus_uevent - add uevent for our device
  *
@@ -1024,6 +1041,7 @@ static struct bus_type  hv_bus = {
.uevent =   vmbus_uevent,
.dev_groups =   vmbus_dev_groups,
.drv_groups =   vmbus_drv_groups,
+   .bus_groups =   vmbus_bus_groups,
.pm =   &vmbus_pm,
 };
 
-- 
2.19.1
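
A userspace consumer of the new attribute might check it like this
(hypothetical daemon-side sketch, not part of the patch):

#include <stdio.h>

/* Returns 1 if the host supports hibernation for this VM, 0 otherwise
 * (including on kernels that predate the attribute).
 */
static int vm_hibernation_supported(void)
{
        FILE *f = fopen("/sys/bus/vmbus/hibernation", "r");
        int val = 0;

        if (!f)
                return 0;
        if (fscanf(f, "%d", &val) != 1)
                val = 0;
        fclose(f);
        return val == 1;
}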



RE: [PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features

2021-01-06 Thread Dexuan Cui
> From: Michael Kelley 
> Sent: Wednesday, January 6, 2021 9:38 AM
> From: Dexuan Cui 
> Sent: Tuesday, December 22, 2020 4:12 PM
> >
> > When a Linux VM runs on Hyper-V, if the host toolstack doesn't support
> > hibernation for the VM (this happens on old Hyper-V hosts like Windows
> > Server 2016, or new Hyper-V hosts if the admin or user doesn't declare
> > the hibernation intent for the VM), the VM is discouraged from trying
> > hibernation (because the host doesn't guarantee that the VM's virtual
> > hardware configuration will remain exactly the same across hibernation),
> > i.e. the VM should not try to set up the swap partition/file for
> > hibernation, etc.
> >
> > x86 Hyper-V uses the presence of the virtual ACPI S4 state as the
> > indication of the host toolstack support for a VM. Currently there is
> > no easy and reliable way for the userspace to detect the presence of
> > the state (see ...).  Add
> > /sys/bus/vmbus/supported_features for this purpose.
>
> I'm OK with surfacing the hibernation capability via an entry in
> /sys/bus/vmbus.  Correct me if I'm wrong, but I think the concept
> being surfaced is not "ACPI S4 state" precisely, but slightly more
> generally whether hibernation is supported for the VM.  While
> those two concepts may be 1:1 for the moment, there might be
> future configurations where "hibernation is supported" depends
> on other factors as well.

For x86, I believe the virtual ACPI S4 state exists only when the
admin/user declares the intent of "enable hibernation for the VM" via
some PowerShell/WMI command. On Azure, if a VM size is not suitable
for hibernation (e.g. an existing VM has an ephemeral local disk),
the toolstack on the host should not enable the ACPI S4 state for the
VM. That's why we implemented hv_is_hibernation_supported() for x86 by
checking the ACPI S4 state, and we have used the function
hv_is_hibernation_supported() in hv_utils and hv_balloon for quite a
while.

For ARM, IIRC there is no concept of ACPI S4 state, so currently
hv_is_hibernation_supported() is actually not implemented. Not sure
why hv_utils and hv_balloon can build successfully... :-) Probably
Boqun can help to take a look.

>
> The guidance for things in /sys is that they generally should
> be single valued (see Documentation/filesystems/sysfs.rst).  So my
> recommendation is to create a "hibernation" entry that has a value
> of 0 or 1.
>
> Michael

Got it. Then let's use /sys/bus/vmbus/hibernation.

Will post v3.

Thanks,
-- Dexuan



RE: [PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features

2021-01-06 Thread Dexuan Cui
> From: Wei Liu 
> Sent: Wednesday, January 6, 2021 8:24 AM
> To: Dexuan Cui 
> >
> > That said, I don't know if there is any hard rule in regard of
> > the timing here. If there is, then v5.12 is OK to me. :-)
> >
> 
> By the time you posted this (Dec 22), 5.11 was already more or less
> "frozen". Linus wanted -next patches to be merged before Christmas.

Got it. Thanks for the explanation!

> The way I see it this is a new sysfs interface so I think this is
> something new, which is for 5.12.

Ok. 

> Do you think this should be considered a bug fix?
> 
> Wei.

Then let's not consider it for v5.11. For now I think the userspace
tool/daemon can check 'dmesg' for the existence of ACPI S4 state
as a workaround. This is not ideal, but it should work reasonably
well, assuming the tool/daemon runs early enough, so the kernel
msg buffer is not yet filled up and overwritten. :-)

Thanks,
-- Dexuan


[PATCH v2] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features

2021-01-05 Thread Dexuan Cui
When a Linux VM runs on Hyper-V, if the host toolstack doesn't support
hibernation for the VM (this happens on old Hyper-V hosts like Windows
Server 2016, or new Hyper-V hosts if the admin or user doesn't declare
the hibernation intent for the VM), the VM is discouraged from trying
hibernation (because the host doesn't guarantee that the VM's virtual
hardware configuration will remain exactly the same across hibernation),
i.e. the VM should not try to set up the swap partition/file for
hibernation, etc.

x86 Hyper-V uses the presence of the virtual ACPI S4 state as the
indication of the host toolstack support for a VM. Currently there is
no easy and reliable way for the userspace to detect the presence of
the state (see https://lkml.org/lkml/2020/12/11/1097).  Add
/sys/bus/vmbus/supported_features for this purpose.

Signed-off-by: Dexuan Cui 
---

Change in v2:
Added a "Values:" section
Updated "Date:"

 Documentation/ABI/stable/sysfs-bus-vmbus |  9 +
 drivers/hv/vmbus_drv.c   | 20 
 2 files changed, 29 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus
index c27b7b89477c..c8d56389b7be 100644
--- a/Documentation/ABI/stable/sysfs-bus-vmbus
+++ b/Documentation/ABI/stable/sysfs-bus-vmbus
@@ -1,3 +1,12 @@
+What:  /sys/bus/vmbus/supported_features
+Date:  Jan 2021
+KernelVersion: 5.11
+Contact:   Dexuan Cui 
+Description:   Features specific to VMs running on Hyper-V
+Values:A list of strings.
+   hibernation: the host toolstack supports hibernation for the VM.
+Users: Daemon that sets up swap partition/file for hibernation
+
 What:  /sys/bus/vmbus/devices//id
 Date:  Jul 2009
 KernelVersion: 2.6.31
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index d491fdcee61f..958487a40a18 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -678,6 +678,25 @@ static const struct attribute_group vmbus_dev_group = {
 };
 __ATTRIBUTE_GROUPS(vmbus_dev);
 
+/* Set up bus attribute(s) for /sys/bus/vmbus/supported_features */
+static ssize_t supported_features_show(struct bus_type *bus, char *buf)
+{
+   bool hb = hv_is_hibernation_supported();
+
+   return sprintf(buf, "%s\n", hb ? "hibernation" : "");
+}
+
+static BUS_ATTR_RO(supported_features);
+
+static struct attribute *vmbus_bus_attrs[] = {
+   &bus_attr_supported_features.attr,
+   NULL,
+};
+static const struct attribute_group vmbus_bus_group = {
+   .attrs = vmbus_bus_attrs,
+};
+__ATTRIBUTE_GROUPS(vmbus_bus);
+
 /*
  * vmbus_uevent - add uevent for our device
  *
@@ -1024,6 +1043,7 @@ static struct bus_type  hv_bus = {
.uevent =   vmbus_uevent,
.dev_groups =   vmbus_dev_groups,
.drv_groups =   vmbus_drv_groups,
+   .bus_groups =   vmbus_bus_groups,
.pm =   &vmbus_pm,
 };
 
-- 
2.19.1



RE: [PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features

2021-01-05 Thread Dexuan Cui
> From: Wei Liu 
> Sent: Tuesday, January 5, 2021 4:58 AM
> > ...
> >  Documentation/ABI/stable/sysfs-bus-vmbus |  7 +++
> >  drivers/hv/vmbus_drv.c   | 20
> 
> >  2 files changed, 27 insertions(+)
> >
> > diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus
> b/Documentation/ABI/stable/sysfs-bus-vmbus
> > index c27b7b89477c..3ba765ae6695 100644
> > --- a/Documentation/ABI/stable/sysfs-bus-vmbus
> > +++ b/Documentation/ABI/stable/sysfs-bus-vmbus
> > @@ -1,3 +1,10 @@
> > +What:  /sys/bus/vmbus/supported_features
> > +Date:  Dec 2020
> > +KernelVersion: 5.11
> 
> Too late for 5.11 now.

The patch was posted 2 weeks ago, before 5.11-rc1. :-)

If possible, we still want this patch to be in 5.11, because:
1. The patch is small and IMO pretty safe.
2. The patch adds new code and doesn't really change any old code.
3. The patch adds a new sys file, which is needed by some
user space tool to auto-setup the configuration for hibernation.
We'd like to integrate the patch into the Linux distros ASAP and
as we know some distros don't accept a patch if it's not in the
mainline.

So, if the patch itself looks good, IMO it would be better to be
in v5.11 rather than in v5.12, which will need extra time of
2~3 months.

That said, I don't know if there is any hard rule in regard of
the timing here. If there is, then v5.12 is OK to me. :-)
 
> Given this is a list of strings, do you want to enumerate them in a
> Values section or the Description section?
> 
> Wei.

Currently there is only one possible string "hibernation".
Not sure if we would need to add more strings in future, but yes
it's a good idea to list the string in a "Values" section.

Will post v2 shortly.

Thanks,
-- Dexuan


RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow

2021-01-05 Thread Dexuan Cui
> From: Michael Kelley 
> Sent: Tuesday, December 22, 2020 5:56 AM
> From: Dexuan Cui 
> Sent: Thursday, December 17, 2020
> 8:08 PM
> >
> > Linux VM on Hyper-V crashes with the latest mainline:
> > ...
> > --- a/drivers/acpi/scan.c
> > +++ b/drivers/acpi/scan.c
> > @@ -674,7 +674,8 @@ int acpi_device_add(struct acpi_device *device,
> > }
> > if (!found) {
> > acpi_device_bus_id = new_bus_id;
> > -   strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device));
> > +   strlcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device),
> > +   sizeof(acpi_device_bus_id->bus_id));
> > acpi_device_bus_id->instance_no = 0;
> > list_add_tail(&acpi_device_bus_id->node, &acpi_bus_id_list);
> > }
> 
> Reviewed-by: Michael Kelley 

Hi, ACPI maintainers,
Would you please take a look at the small fix? Currently the mainline Linux
kernel, running in a VM on Hyper-V, has been broken for almost 3 weeks,
i.e. the VM always panics when it boots.

The patch has already had Michael's Reviewed-by.

BTW, the patch should have a stable tag:
Cc: 

Or, do you want the patch to go through the Hyper-V tree?
https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes

The small patch is unlikely to cause a merge conflict, and it only affects
Linux VMs on Hyper-V so far.

Thanks,
-- Dexuan


RE: How can a userspace program tell if the system supports the ACPI S4 state (Suspend-to-Disk)?

2020-12-23 Thread Dexuan Cui
> From: Pavel Machek 
> Sent: Monday, December 21, 2020 11:08 AM
> 
> On Sat 2020-12-12 01:20:30, Dexuan Cui wrote:
> > Hi all,
> > It looks like Linux can hibernate even if the system does not support the 
> > ACPI
> > S4 state, as long as the system can shut down, so "cat /sys/power/state"
> > always contains "disk", unless we specify the kernel parameter "nohibernate"
> > or we use LOCKDOWN_HIBERNATION.
> >
> > In some scenarios IMO it can still be useful if the userspace is able to 
> > detect
> > if the ACPI S4 state is supported or not, e.g. when a Linux guest runs on
> > Hyper-V, Hyper-V uses the virtual ACPI S4 state as an indicator of the 
> > proper
> > support of the tool stack on the host, i.e. the guest is discouraged from
> > trying hibernation if the state is not supported.
> 
> Umm. Does not sound like exactly strong reason to me.

Hi Pavel,
Thanks for the reply. I realized that it may be better for me to add the code
to Hyper-V specific driver: https://lkml.org/lkml/2020/12/22/719

Let's see how it goes that way. :-)

Thanks,
Dexuan


RE: [PATCH -tip V2 00/10] workqueue: break affinity initiatively

2020-12-23 Thread Dexuan Cui
> From: Dexuan Cui
> Sent: Wednesday, December 23, 2020 12:27 PM
> ...
> The warning only repros if there are more than 1 node, and it only prints once
> for the first vCPU of the second node (i.e. node #1).

A correction: if I configure the 32 vCPUs evenly into 4 nodes, I get the warning
once for node #1~#3, respectively.

Thanks,
-- Dexuan

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2376,9 +2376,14 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 * For kernel threads that do indeed end up on online &&
 * !active we want to ensure they are strict per-CPU threads.
 */
-   WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
+   WARN(cpumask_intersects(new_mask, cpu_online_mask) &&
!cpumask_intersects(new_mask, cpu_active_mask) &&
-   p->nr_cpus_allowed != 1);
+   p->nr_cpus_allowed != 1, "%*pbl, %*pbl, %*pbl, %d\n",
+   cpumask_pr_args(new_mask),
+   cpumask_pr_args(cpu_online_mask),
+   cpumask_pr_args(cpu_active_mask),
+   p->nr_cpus_allowed
+   );
}
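
(Aside: the %*pbl specifier together with cpumask_pr_args() prints a
cpumask as a ranged CPU list such as "0-7,16"; a minimal sketch of the
idiom, hypothetical context:)

#include <linux/cpumask.h>
#include <linux/printk.h>

/* Sketch: cpumask_pr_args() expands to the width/pointer pair %*pbl expects. */
static void dump_masks(void)
{
        pr_info("online: %*pbl active: %*pbl\n",
                cpumask_pr_args(cpu_online_mask),
                cpumask_pr_args(cpu_active_mask));
}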

[1.791611] smp: Bringing up secondary CPUs ...
[1.795225] x86: Booting SMP configuration:
[1.798964]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7
[1.807068]  node  #1, CPUs:#8
[1.094226] smpboot: CPU 8 Converting physical 0 to logical die 1
[1.895211] [ cut here ]
[1.899058] 8-15, 0-8, 0-7, 8
[1.899058] WARNING: CPU: 8 PID: 50 at kernel/sched/core.c:2386 
__set_cpus_allowed_ptr+0x1c7/0x1e0
[1.899058] CPU: 8 PID: 50 Comm: cpuhp/8 Not tainted 5.10.0+ #4
[1.899058] RIP: 0010:__set_cpus_allowed_ptr+0x1c7/0x1e0
[1.899058] Call Trace:
[1.899058]  worker_attach_to_pool+0x53/0xd0
[1.899058]  create_worker+0xf9/0x190
[1.899058]  alloc_unbound_pwq+0x3a5/0x3b0
[1.899058]  wq_update_unbound_numa+0x112/0x1c0
[1.899058]  workqueue_online_cpu+0x1d0/0x220
[1.899058]  cpuhp_invoke_callback+0x82/0x4a0
[1.899058]  cpuhp_thread_fun+0xb8/0x120
[1.899058]  smpboot_thread_fn+0x198/0x230
[1.899058]  kthread+0x13d/0x160
[1.899058]  ret_from_fork+0x22/0x30
[1.903058]   #9 #10 #11 #12 #13 #14 #15
[1.907092]  node  #2, CPUs:   #16
[1.094226] smpboot: CPU 16 Converting physical 0 to logical die 2
[1.995205] [ cut here ]
[1.999058] 16-23, 0-16, 0-15, 8
[1.999058] WARNING: CPU: 16 PID: 91 at kernel/sched/core.c:2386 
__set_cpus_allowed_ptr+0x1c7/0x1e0
[1.999058] CPU: 16 PID: 91 Comm: cpuhp/16 Tainted: GW 
5.10.0+ #4
[1.999058] RIP: 0010:__set_cpus_allowed_ptr+0x1c7/0x1e0
[1.999058] Call Trace:
[1.999058]  worker_attach_to_pool+0x53/0xd0
[1.999058]  create_worker+0xf9/0x190
[1.999058]  alloc_unbound_pwq+0x3a5/0x3b0
[1.999058]  wq_update_unbound_numa+0x112/0x1c0
[1.999058]  workqueue_online_cpu+0x1d0/0x220
[1.999058]  cpuhp_invoke_callback+0x82/0x4a0
[1.999058]  cpuhp_thread_fun+0xb8/0x120
[1.999058]  smpboot_thread_fn+0x198/0x230
[1.999058]  kthread+0x13d/0x160
[1.999058]  ret_from_fork+0x22/0x30
[2.003058]  #17 #18 #19 #20 #21 #22 #23
[2.007092]  node  #3, CPUs:   #24
[1.094226] smpboot: CPU 24 Converting physical 0 to logical die 3
[2.095220] [ cut here ]
[2.099058] 24-31, 0-24, 0-23, 8
[2.099058] WARNING: CPU: 24 PID: 132 at kernel/sched/core.c:2386 
__set_cpus_allowed_ptr+0x1c7/0x1e0
[2.099058] CPU: 24 PID: 132 Comm: cpuhp/24 Tainted: GW 
5.10.0+ #4
[2.099058] Call Trace:
[2.099058]  worker_attach_to_pool+0x53/0xd0
[2.099058]  create_worker+0xf9/0x190
[2.099058]  alloc_unbound_pwq+0x3a5/0x3b0
[2.099058]  wq_update_unbound_numa+0x112/0x1c0
[2.099058]  workqueue_online_cpu+0x1d0/0x220
[2.099058]  cpuhp_invoke_callback+0x82/0x4a0
[2.099058]  cpuhp_thread_fun+0xb8/0x120
[2.099058]  smpboot_thread_fn+0x198/0x230
[2.099058]  kthread+0x13d/0x160
[2.099058]  ret_from_fork+0x22/0x30
[2.103058]  #25 #26 #27 #28 #29 #30 #31
[2.108091] smp: Brought up 4 nodes, 32 CPUs
[2.115065] smpboot: Max logical packages: 4
[2.119067] smpboot: Total of 32 processors activated (146992.31 BogoMIPS)


RE: [PATCH -tip V2 00/10] workqueue: break affinity initiatively

2020-12-23 Thread Dexuan Cui
> From: Lai Jiangshan 
> Sent: Wednesday, December 23, 2020 7:02 AM
> >
> > Hi,
> > I tested this patchset on today's tip.git's master branch
> > (981316394e35 ("Merge branch 'locking/urgent'")).
> >
> > Every time the kernel boots with 32 CPUs (I'm running the Linux VM on
> > Hyper-V), I get the below warning.
> > (BTW, with 8 or 16 CPUs, I don't see the warning).
> > By printing the cpumasks with "%*pbl", I know the warning happens
> > because:
> > new_mask = 16-31
> > cpu_online_mask= 0-16
> > cpu_active_mask= 0-15
> > p->nr_cpus_allowed=16
> >
> 
> Hello, Dexuan
> 
> Could you omit patch4 of the patchset and test it again, please?
> ("workqueue: don't set the worker's cpumask when kthread_bind_mask()")
> 
> kthread_bind_mask() set the worker task to the pool's cpumask without
> any check. And set_cpus_allowed_ptr() finds that the task's cpumask
> is unchanged (already set by kthread_bind_mask()) and skips all the checks.
> 
> And I found that numa=fake=2U seems broken on cpumask_of_node() in my
> box.
> 
> Thanks,
> Lai

Looks like your analysis is correct: the warning can't repro if I configure all
the 32 vCPUs into 1 virtual NUMA node (and I don't see the message
"smpboot: CPU 16 Converting physical 0 to logical die 1"):

[1.495440] smp: Bringing up secondary CPUs ...
[1.499207] x86: Booting SMP configuration:
[1.503038]  node  #0, CPUs:#1  #2  #3  #4  #5  #6  #7 
#8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26
#27 #28 #29 #30 #31
[1.531930] smp: Brought up 1 node, 32 CPUs
[1.538779] smpboot: Max logical packages: 1
[1.539041] smpboot: Total of 32 processors activated (146859.90 BogoMIPS)

The warning only repros if there are more than 1 node, and it only prints once
for the first vCPU of the second node (i.e. node #1).

With more than 1 node, if I don't use patch4, the warning does not repro.

Thanks,
-- Dexuan


[PATCH] Drivers: hv: vmbus: Add /sys/bus/vmbus/supported_features

2020-12-22 Thread Dexuan Cui
When a Linux VM runs on Hyper-V, if the host toolstack doesn't support
hibernation for the VM (this happens on old Hyper-V hosts like Windows
Server 2016, or new Hyper-V hosts if the admin or user doesn't declare
the hibernation intent for the VM), the VM is discouraged from trying
hibernation (because the host doesn't guarantee that the VM's virtual
hardware configuration will remain exactly the same across hibernation),
i.e. the VM should not try to set up the swap partition/file for
hibernation, etc.

x86 Hyper-V uses the presence of the virtual ACPI S4 state as the
indication of the host toolstack support for a VM. Currently there is
no easy and reliable way for the userspace to detect the presence of
the state (see https://lkml.org/lkml/2020/12/11/1097).  Add
/sys/bus/vmbus/supported_features for this purpose.

Signed-off-by: Dexuan Cui 
---
 Documentation/ABI/stable/sysfs-bus-vmbus |  7 +++
 drivers/hv/vmbus_drv.c   | 20 
 2 files changed, 27 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-bus-vmbus b/Documentation/ABI/stable/sysfs-bus-vmbus
index c27b7b89477c..3ba765ae6695 100644
--- a/Documentation/ABI/stable/sysfs-bus-vmbus
+++ b/Documentation/ABI/stable/sysfs-bus-vmbus
@@ -1,3 +1,10 @@
+What:  /sys/bus/vmbus/supported_features
+Date:  Dec 2020
+KernelVersion: 5.11
+Contact:   Dexuan Cui 
+Description:   Features specific to VMs running on Hyper-V
+Users: Daemon that sets up swap partition/file for hibernation
+
 What:  /sys/bus/vmbus/devices//id
 Date:  Jul 2009
 KernelVersion: 2.6.31
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index d491fdcee61f..958487a40a18 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -678,6 +678,25 @@ static const struct attribute_group vmbus_dev_group = {
 };
 __ATTRIBUTE_GROUPS(vmbus_dev);
 
+/* Set up bus attribute(s) for /sys/bus/vmbus/supported_features */
+static ssize_t supported_features_show(struct bus_type *bus, char *buf)
+{
+   bool hb = hv_is_hibernation_supported();
+
+   return sprintf(buf, "%s\n", hb ? "hibernation" : "");
+}
+
+static BUS_ATTR_RO(supported_features);
+
+static struct attribute *vmbus_bus_attrs[] = {
+   &bus_attr_supported_features.attr,
+   NULL,
+};
+static const struct attribute_group vmbus_bus_group = {
+   .attrs = vmbus_bus_attrs,
+};
+__ATTRIBUTE_GROUPS(vmbus_bus);
+
 /*
  * vmbus_uevent - add uevent for our device
  *
@@ -1024,6 +1043,7 @@ static struct bus_type  hv_bus = {
.uevent =   vmbus_uevent,
.dev_groups =   vmbus_dev_groups,
.drv_groups =   vmbus_drv_groups,
+   .bus_groups =   vmbus_bus_groups,
.pm =   &vmbus_pm,
 };
 
-- 
2.19.1
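
A consumer of this string-list form might do the following (hypothetical
sketch; this form was later replaced by the single-value
/sys/bus/vmbus/hibernation attribute in v3 above):

#include <stdio.h>
#include <string.h>

static int vm_feature_listed(const char *feature)
{
        char line[128] = "";
        FILE *f = fopen("/sys/bus/vmbus/supported_features", "r");

        if (!f)
                return 0;
        if (!fgets(line, sizeof(line), f))
                line[0] = '\0';
        fclose(f);
        return strstr(line, feature) != NULL;
}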



RE: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!

2020-12-22 Thread Dexuan Cui
> From: Valentin Schneider 
> Sent: Tuesday, December 22, 2020 5:40 AM
> To: Dexuan Cui 
> Cc: mi...@redhat.com; pet...@infradead.org; juri.le...@redhat.com;
> vincent.guit...@linaro.org; dietmar.eggem...@arm.com;
> rost...@goodmis.org; bseg...@google.com; mgor...@suse.de;
> bris...@redhat.com; x...@kernel.org; linux...@vger.kernel.org;
> linux-kernel@vger.kernel.org; linux-hyp...@vger.kernel.org; Michael Kelley
> 
> Subject: Re: v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel
> BUG at kernel/sched/core.c:7596!
> 
> 
> Hi,
> 
> On 22/12/20 09:13, Dexuan Cui wrote:
> > Hi,
> > I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020)
> on Hyper-V.
> > When I test hibernation, the VM can easily hit the below BUG_ON during the
> resume
> > procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has
> 40 vCPUs.
> >
> > I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may
> be broken?
> >
> > In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and
> rq->nr_pinned==0:
> >
> > 7587 int sched_cpu_dying(unsigned int cpu)
> > 7588 {
> > 7589 struct rq *rq = cpu_rq(cpu);
> > 7590 struct rq_flags rf;
> > 7591
> > 7592 /* Handle pending wakeups and then migrate everything off
> */
> > 7593 sched_tick_stop(cpu);
> > 7594
> > 7595 rq_lock_irqsave(rq, &rf);
> > 7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> > 7597 rq_unlock_irqrestore(rq, &rf);
> > 7598
> > 7599 calc_load_migrate(rq);
> > 7600 update_max_interval();
> > 7601 nohz_balance_exit_idle(rq);
> > 7602 hrtick_clear(rq);
> > 7603 return 0;
> > 7604 }
> >
> > The last commit that touches the BUG_ON line is the commit
> > 3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug
> cooperative")
> > but the commit looks good to me.
> >
> > Any idea?
> >
> 
> I'd wager this extra task is a kworker; could you give this series a try?
> 
> 
> https://lore.kernel.org/lkml/20201218170919.2950-1-jiangshan...@gmail.com/

Thanks, Valentin! It looks like the patchset can fix the BUG_ON, though I see
a warning, which I reported here: https://lkml.org/lkml/2020/12/22/648

Thanks,
-- Dexuan


v5.10: sched_cpu_dying() hits BUG_ON during hibernation: kernel BUG at kernel/sched/core.c:7596!

2020-12-22 Thread Dexuan Cui
Hi,
I'm running a Linux VM with the recent mainline (48342fc07272, 12/20/2020) on 
Hyper-V.
When I test hibernation, the VM can easily hit the below BUG_ON during the 
resume
procedure (I estimate this can repro about 1/5 of the time). BTW, my VM has 40 
vCPUs.

I can't repro the BUG_ON with v5.9.0, so I suspect something in v5.10.0 may be 
broken?

In v5.10.0, when the BUG_ON happens, rq->nr_running==2, and rq->nr_pinned==0:

7587 int sched_cpu_dying(unsigned int cpu)
7588 {
7589 struct rq *rq = cpu_rq(cpu);
7590 struct rq_flags rf;
7591
7592 /* Handle pending wakeups and then migrate everything off */
7593 sched_tick_stop(cpu);
7594
7595 rq_lock_irqsave(rq, &rf);
7596 BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
7597 rq_unlock_irqrestore(rq, &rf);
7598
7599 calc_load_migrate(rq);
7600 update_max_interval();
7601 nohz_balance_exit_idle(rq);
7602 hrtick_clear(rq);
7603 return 0;
7604 }

The last commit that touches the BUG_ON line is the commit
3015ef4b98f5 ("sched/core: Make migrate disable and CPU hotplug cooperative")
but the commit looks good to me.

Any idea?

Thanks,
-- Dexuan

[5.383378] PM: Image signature found, resuming
[5.388027] PM: hibernation: resume from hibernation
[5.395794] Freezing user space processes ... (elapsed 0.001 seconds) done.
[5.397058] OOM killer disabled.
[5.397059] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
done.
[5.413013] PM: hibernation: Marking nosave pages: [mem 
0x-0x0fff]
[5.419331] PM: hibernation: Marking nosave pages: [mem 
0x0009f000-0x000f]
[5.425502] PM: hibernation: Marking nosave pages: [mem 
0xf7ff-0x]
[5.431706] PM: hibernation: Basic memory bitmaps created
[5.465205] PM: Using 3 thread(s) for decompression
[5.469590] PM: Loading and decompressing image data (505553 pages)...
[5.492790] PM: Image loading progress:   0%
[6.532641] PM: Image loading progress:  10%
[6.899683] PM: Image loading progress:  20%
[7.251672] PM: Image loading progress:  30%
[7.598808] PM: Image loading progress:  40%
[8.056153] PM: Image loading progress:  50%
[8.534077] PM: Image loading progress:  60%
[9.029886] PM: Image loading progress:  70%
[9.542015] PM: Image loading progress:  80%
[   10.025326] PM: Image loading progress:  90%
[   10.525804] PM: Image loading progress: 100%
[   10.530241] PM: Image loading done
[   10.533295] PM: hibernation: Read 2022212 kbytes in 5.05 seconds (400.43 
MB/s)
[   10.540827] PM: Image successfully loaded
[   10.599243] serial 00:04: disabled
[   10.619935] vmbus 242ff919-07db-4180-9c2e-b86cb68c8c55: parent VMBUS:01 
should not be sleeping
[   10.646542] Disabling non-boot CPUs ...
[   10.694380] smpboot: CPU 1 is now offline
[   10.729044] smpboot: CPU 2 is now offline
[   10.756819] smpboot: CPU 3 is now offline
[   10.776784] smpboot: CPU 4 is now offline
[   10.800866] smpboot: CPU 5 is now offline
[   10.828923] smpboot: CPU 6 is now offline
[   10.849013] smpboot: CPU 7 is now offline
[   10.876722] smpboot: CPU 8 is now offline
[   10.909426] smpboot: CPU 9 is now offline
[   10.929360] smpboot: CPU 10 is now offline
[   10.957059] smpboot: CPU 11 is now offline
[   10.976542] smpboot: CPU 12 is now offline
[   11.008470] smpboot: CPU 13 is now offline
[   11.036356] smpboot: CPU 14 is now offline
[   11.072396] smpboot: CPU 15 is now offline
[   11.100229] smpboot: CPU 16 is now offline
[   11.128638] smpboot: CPU 17 is now offline
[   11.148479] smpboot: CPU 18 is now offline
[   11.173537] smpboot: CPU 19 is now offline
[   11.205680] smpboot: CPU 20 is now offline
[   11.231799] smpboot: CPU 21 is now offline
[   11.259562] smpboot: CPU 22 is now offline
[   11.288672] smpboot: CPU 23 is now offline
[   11.319691] smpboot: CPU 24 is now offline
[   11.355523] smpboot: CPU 25 is now offline
[   11.399469] smpboot: CPU 26 is now offline
[   11.435438] smpboot: CPU 27 is now offline
[   11.471402] smpboot: CPU 28 is now offline
[   11.515488] smpboot: CPU 29 is now offline
[   11.551355] smpboot: CPU 30 is now offline
[   11.591326] smpboot: CPU 31 is now offline
[   11.624004] smpboot: CPU 32 is now offline
[   11.659326] smpboot: CPU 33 is now offline
[   11.687478] smpboot: CPU 34 is now offline
[   11.719243] smpboot: CPU 35 is now offline
[   11.747252] smpboot: CPU 36 is now offline
[   11.771224] smpboot: CPU 37 is now offline
[   11.795089] [ cut here ]
[   11.798052] kernel BUG at kernel/sched/core.c:7596!
[   11.798052] invalid opcode:  [#1] PREEMPT SMP PTI
[   11.798052] CPU: 38 PID: 202 Comm: migration/38 Tainted: GE 
5.10.0+ #6
[   11.798052] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
Machine, BIOS 090008  12/07/2018
[   11.798052] Stopper: multi_cpu_stop+0x0/0xf0 <- 0x0
[   11.798052] RIP: 0010:sched_cpu_dying+0xa3/0xc0
[   

[PATCH v2] x86/hyperv: Fix kexec panic/hang issues

2020-12-21 Thread Dexuan Cui
Currently the kexec kernel can panic or hang due to 2 causes:

1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
old VP Assist Pages when the kexec kernel runs. The same issue is fixed
for hibernation in commit 421f090c819d ("x86/hyperv: Suspend/resume the
VP assist page for hibernation"). Now fix it for kexec.

2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
can still try to access the hypercall page and cause panic. The workaround
"hv_hypercall_pg = NULL;" in hyperv_cleanup() is unreliabe. Move
hyperv_cleanup() to a better place.

Signed-off-by: Dexuan Cui 
---

Changes in v2:
Improved the commit log as Michael Kelley suggested.
No change to v1 otherwise.

 arch/x86/hyperv/hv_init.c   |  4 
 arch/x86/include/asm/mshyperv.h |  2 ++
 arch/x86/kernel/cpu/mshyperv.c  | 18 ++
 drivers/hv/vmbus_drv.c  |  2 --
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e04d90af4c27..4638a52d8eae 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,8 @@
 #include 
 #include 
 
+int hyperv_init_cpuhp;
+
 void *hv_hypercall_pg;
 EXPORT_SYMBOL_GPL(hv_hypercall_pg);
 
@@ -401,6 +404,7 @@ void __init hyperv_init(void)
 
register_syscore_ops(&hv_syscore_ops);
 
+   hyperv_init_cpuhp = cpuhp;
return;
 
 remove_cpuhp_state:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..30f76b966857 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -74,6 +74,8 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {}
 
 
 #if IS_ENABLED(CONFIG_HYPERV)
+extern int hyperv_init_cpuhp;
+
 extern void *hv_hypercall_pg;
 extern void  __percpu  **hyperv_pcpu_input_arg;
 
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f628e3dc150f..43b54bef5448 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -135,14 +135,32 @@ static void hv_machine_shutdown(void)
 {
if (kexec_in_progress && hv_kexec_handler)
hv_kexec_handler();
+
+   /*
+* Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
+* corrupts the old VP Assist Pages and can crash the kexec kernel.
+*/
+   if (kexec_in_progress && hyperv_init_cpuhp > 0)
+   cpuhp_remove_state(hyperv_init_cpuhp);
+
+   /* The function calls stop_other_cpus(). */
native_machine_shutdown();
+
+   /* Disable the hypercall page when there is only 1 active CPU. */
+   if (kexec_in_progress)
+   hyperv_cleanup();
 }
 
 static void hv_machine_crash_shutdown(struct pt_regs *regs)
 {
if (hv_crash_handler)
hv_crash_handler(regs);
+
+   /* The function calls crash_smp_send_stop(). */
native_machine_crash_shutdown(regs);
+
+   /* Disable the hypercall page when there is only 1 active CPU. */
+   hyperv_cleanup();
 }
 #endif /* CONFIG_KEXEC_CORE */
 #endif /* CONFIG_HYPERV */
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 502f8cd95f6d..d491fdcee61f 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2550,7 +2550,6 @@ static void hv_kexec_handler(void)
/* Make sure conn_state is set as hv_synic_cleanup checks for it */
mb();
cpuhp_remove_state(hyperv_cpuhp_online);
-   hyperv_cleanup();
 };
 
 static void hv_crash_handler(struct pt_regs *regs)
@@ -2566,7 +2565,6 @@ static void hv_crash_handler(struct pt_regs *regs)
cpu = smp_processor_id();
hv_stimer_cleanup(cpu);
hv_synic_disable_regs(cpu);
-   hyperv_cleanup();
 };
 
 static int hv_synic_suspend(void)
-- 
2.19.1
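
For context, hyperv_init_cpuhp saves the dynamic CPU-hotplug state that
hyperv_init() gets back from cpuhp_setup_state(); a rough sketch of the
pairing (simplified; callback names as in hv_init.c):

#include <linux/cpuhotplug.h>

/*
 * Sketch: cpuhp_setup_state() returns the allocated state id (>= 0),
 * and cpuhp_remove_state() later invokes the teardown callback
 * (hv_cpu_die) on every online CPU -- which is what the kexec path
 * above relies on.
 */
static int example_register(void)
{
        int cpuhp = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
                                      "x86/hyperv_init:online",
                                      hv_cpu_init, hv_cpu_die);
        if (cpuhp < 0)
                return cpuhp;

        hyperv_init_cpuhp = cpuhp;      /* stashed for the kexec path */
        return 0;
}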



RE: [PATCH] x86/hyperv: Fix kexec panic/hang issues

2020-12-21 Thread Dexuan Cui
> From: Michael Kelley 
> Sent: Monday, December 21, 2020 7:36 PM
> ...
> Since we don't *need* to do this, I think it might be less risky to just leave
> the code "as is".   But I'm OK either way.

Ok, then I'll leave it as is in v2.

Thanks,
-- Dexuan


RE: [PATCH] x86/hyperv: Fix kexec panic/hang issues

2020-12-21 Thread Dexuan Cui
> From: Michael Kelley
> Sent: Monday, December 21, 2020 3:33 PM
> From: Dexuan Cui
> Sent: Sunday, December 20, 2020 2:30 PM
> >
> > Currently the kexec kernel can panic or hang due to 2 causes:
> >
> > 1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
> > VP Assist Pages when the kexec kernel runs. We ever fixed the same issue
> 
> Spurious word "ever".  And avoid first person "we".  Perhaps:
> 
> The same issue is fixed for hibernation in commit . .  Now fix it for
> kexec.

Thanks! Will use this in v2.

> > for hibernation in
> > commit 421f090c819d ("x86/hyperv: Suspend/resume the VP assist page for
> hibernation")
> > and now let's fix it for kexec.
> 
> Is the VP Assist Page getting cleared anywhere on the panic path?  We can

It's not.

> only do it for the CPU that runs panic(), but I don't think it is getting 
> cleared
> even for that CPU.   It is cleared only in hv_cpu_die(), and that's not 
> called on
> the panic path.

IMO kdump is different from the non-kdump kexec in that the kdump kernel
runs without depending on the memory used by the first kernel, so it looks 
unnecessary to clear the first kernel's VP Assist Page (and the hypercallpage).
According to my test, the second kernel can re-enble the VP Asist Page and 
the hypercall page using different GPAs, without disabling the old pages first.
Of course, in the future Hyper-V may require the guest to disable the pages 
first
before trying to re-enabling them, so I agree we'd better clear the pages in the
first kernell like this:

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 4638a52d8eae..8022f51c9c05 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -202,7 +202,7 @@ void clear_hv_tscchange_cb(void)
 }
 EXPORT_SYMBOL_GPL(clear_hv_tscchange_cb);

-static int hv_cpu_die(unsigned int cpu)
+int hv_cpu_die(unsigned int cpu)
 {
struct hv_reenlightenment_control re_ctrl;
unsigned int new_cpu;
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 30f76b966857..d090e781d216 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -76,6 +76,8 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {}
 #if IS_ENABLED(CONFIG_HYPERV)
 extern int hyperv_init_cpuhp;

+int hv_cpu_die(unsigned int cpu);
+
 extern void *hv_hypercall_pg;
 extern void  __percpu  **hyperv_pcpu_input_arg;

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 43b54bef5448..e54f8262bfe0 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -156,6 +156,9 @@ static void hv_machine_crash_shutdown(struct pt_regs *regs)
if (hv_crash_handler)
hv_crash_handler(regs);

+   /* Only call this on the faulting CPU. */
+   hv_cpu_die(raw_smp_processor_id());
+
/* The function calls crash_smp_send_stop(). */
native_machine_crash_shutdown(regs);

> > 2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
> > are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
> > between hv_kexec_handler() and native_machine_shutdown(), the other
> CPUs
> > can still try to access the hypercall page and cause panic. The workaround
> > "hv_hypercall_pg = NULL;" in hyperv_cleanup() can't work reliably.
> 
> I would note that the comment in hv_suspend() is also incorrect on this
> point.  Setting hv_hypercall_pg to NULL does not cause subsequent
> hypercalls to fail safely.  The fast hypercalls don't test for it, and even 
> if they
> did test like hv_do_hypercall(), the test just creates a race condition.

The comment in hv_suspend() should be correct because hv_suspend()
is only called during hibernation from the syscore_ops path where only
one CPU is active, e.g. for the suspend operation, it's called from
state_store
  hibernate
hibernation_snapshot
  create_image
suspend_disable_secondary_cpus 
syscore_suspend
  hv_suspend

It's similar for the resume operation:
resume_store
  software_resume
load_image_and_restore
  hibernation_restore
resume_target_kernel
  hibernate_resume_nonboot_cpu_disable
  syscore_suspend
hv_suspend
 
> >  static void hv_machine_crash_shutdown(struct pt_regs *regs)
> >  {
> > if (hv_crash_handler)
> > hv_crash_handler(regs);
> > +
> > +   /* The function calls crash_smp_send_stop(). */
> 
> Actually, crash_smp_send_stop() or smp_send_stop() has already been
> called earlier by panic(),

This is true only when the Hyper-V host supports the feature
HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE. On an old Hyper-V host 
without the feature, ms_hyperv_init_platform() 

[PATCH] x86/hyperv: Fix kexec panic/hang issues

2020-12-20 Thread Dexuan Cui
Currently the kexec kernel can panic or hang due to 2 causes:

1) hv_cpu_die() is not called upon kexec, so the hypervisor corrupts the
VP Assist Pages when the kexec kernel runs. We ever fixed the same issue
for hibernation in
commit 421f090c819d ("x86/hyperv: Suspend/resume the VP assist page for 
hibernation")
and now let's fix it for kexec.

2) hyperv_cleanup() is called too early. In the kexec path, the other CPUs
are stopped in hv_machine_shutdown() -> native_machine_shutdown(), so
between hv_kexec_handler() and native_machine_shutdown(), the other CPUs
can still try to access the hypercall page and cause panic. The workaround
"hv_hypercall_pg = NULL;" in hyperv_cleanup() can't work reliably.
Move hyperv_cleanup() to the right place.

Signed-off-by: Dexuan Cui 
---
 arch/x86/hyperv/hv_init.c   |  4 
 arch/x86/include/asm/mshyperv.h |  2 ++
 arch/x86/kernel/cpu/mshyperv.c  | 18 ++
 drivers/hv/vmbus_drv.c  |  2 --
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index e04d90af4c27..4638a52d8eae 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,8 @@
 #include 
 #include 
 
+int hyperv_init_cpuhp;
+
 void *hv_hypercall_pg;
 EXPORT_SYMBOL_GPL(hv_hypercall_pg);
 
@@ -401,6 +404,7 @@ void __init hyperv_init(void)
 
register_syscore_ops(&hv_syscore_ops);
 
+   hyperv_init_cpuhp = cpuhp;
return;
 
 remove_cpuhp_state:
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ffc289992d1b..30f76b966857 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -74,6 +74,8 @@ static inline void hv_disable_stimer0_percpu_irq(int irq) {}
 
 
 #if IS_ENABLED(CONFIG_HYPERV)
+extern int hyperv_init_cpuhp;
+
 extern void *hv_hypercall_pg;
 extern void  __percpu  **hyperv_pcpu_input_arg;
 
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f628e3dc150f..43b54bef5448 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -135,14 +135,32 @@ static void hv_machine_shutdown(void)
 {
if (kexec_in_progress && hv_kexec_handler)
hv_kexec_handler();
+
+   /*
+* Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
+* corrupts the old VP Assist Pages and can crash the kexec kernel.
+*/
+   if (kexec_in_progress && hyperv_init_cpuhp > 0)
+   cpuhp_remove_state(hyperv_init_cpuhp);
+
+   /* The function calls stop_other_cpus(). */
native_machine_shutdown();
+
+   /* Disable the hypercall page when there is only 1 active CPU. */
+   if (kexec_in_progress)
+   hyperv_cleanup();
 }
 
 static void hv_machine_crash_shutdown(struct pt_regs *regs)
 {
if (hv_crash_handler)
hv_crash_handler(regs);
+
+   /* The function calls crash_smp_send_stop(). */
native_machine_crash_shutdown(regs);
+
+   /* Disable the hypercall page when there is only 1 active CPU. */
+   hyperv_cleanup();
 }
 #endif /* CONFIG_KEXEC_CORE */
 #endif /* CONFIG_HYPERV */
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 502f8cd95f6d..d491fdcee61f 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2550,7 +2550,6 @@ static void hv_kexec_handler(void)
/* Make sure conn_state is set as hv_synic_cleanup checks for it */
mb();
cpuhp_remove_state(hyperv_cpuhp_online);
-   hyperv_cleanup();
 };
 
 static void hv_crash_handler(struct pt_regs *regs)
@@ -2566,7 +2565,6 @@ static void hv_crash_handler(struct pt_regs *regs)
cpu = smp_processor_id();
hv_stimer_cleanup(cpu);
hv_synic_disable_regs(cpu);
-   hyperv_cleanup();
 };
 
 static int hv_synic_suspend(void)
-- 
2.19.1



RE: [PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow

2020-12-18 Thread Dexuan Cui
> From: Dexuan Cui 
> Sent: Thursday, December 17, 2020 8:08 PM
> 
> Linux VM on Hyper-V crashes with the latest mainline:
>  ...
> This is because of the recent buffer overflow detection in the
> commit 6a39e62abbaf ("lib: string.h: detect intra-object overflow in fortified
> string functions")
> 
> Here acpi_device_bus_id->bus_id can only hold 14 characters, while
> acpi_device_hid(device) returns a 22-char string
> "HYPER_V_GEN_COUNTER_V1".
> 
> Per ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID), if the ID is a
> string, it must be of the form AAA#### or NNNN####, i.e. 7 or 8
> characters.
> 
> The field bus_id in struct acpi_device_bus_id was originally defined as
> char bus_id[9], and later was enlarged to char bus_id[15] in 2007 in the
> commit bb0958544f3c ("ACPI: use more understandable bus_id for ACPI
> devices")
> 
> It looks like so far an ID string of >=15 chars is only seen in the guest
> BIOS/firmware by Hyper-V, and AFAIK the ID string
> "HYPER_V_GEN_COUNTER_V1"
> is never used by Linux VM on Hyper-V, so let's just truncate the string to
> fix the panic.
> 
> Signed-off-by: Dexuan Cui 

IMO this patch should also go to the stable trees, so please add
Cc: 

Thanks,
-- Dexuan


[PATCH] ACPI: scan: Fix a Hyper-V Linux VM panic caused by buffer overflow

2020-12-17 Thread Dexuan Cui
Linux VM on Hyper-V crashes with the latest mainline:

[4.069624] detected buffer overflow in strcpy
[4.077733] kernel BUG at lib/string.c:1149!
..
[4.085819] RIP: 0010:fortify_panic+0xf/0x11
...
[4.085819] Call Trace:
[4.085819]  acpi_device_add.cold.15+0xf2/0xfb
[4.085819]  acpi_add_single_object+0x2a6/0x690
[4.085819]  acpi_bus_check_add+0xc6/0x280
[4.085819]  acpi_ns_walk_namespace+0xda/0x1aa
[4.085819]  acpi_walk_namespace+0x9a/0xc2
[4.085819]  acpi_bus_scan+0x78/0x90
[4.085819]  acpi_scan_init+0xfa/0x248
[4.085819]  acpi_init+0x2c1/0x321
[4.085819]  do_one_initcall+0x44/0x1d0
[4.085819]  kernel_init_freeable+0x1ab/0x1f4

This is because of the recent buffer overflow detection in the
commit 6a39e62abbaf ("lib: string.h: detect intra-object overflow in fortified string functions")

Here acpi_device_bus_id->bus_id can only hold 14 characters, while
acpi_device_hid(device) returns a 22-char string
"HYPER_V_GEN_COUNTER_V1".

Per ACPI Spec v6.2, Section 6.1.5 _HID (Hardware ID), if the ID is a
string, it must be of the form AAA#### or NNNN####, i.e. 7 or 8
characters.

The field bus_id in struct acpi_device_bus_id was originally defined as
char bus_id[9], and later was enlarged to char bus_id[15] in 2007 in the
commit bb0958544f3c ("ACPI: use more understandable bus_id for ACPI devices")

It looks like so far an ID string of >=15 chars is only seen in the guest
BIOS/firmware by Hyper-V, and AFAIK the ID string "HYPER_V_GEN_COUNTER_V1"
is never used by Linux VM on Hyper-V, so let's just truncate the string to
fix the panic.

Signed-off-by: Dexuan Cui 
---
 drivers/acpi/scan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index a1b226eb2ce2..b801442b6b1b 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -674,7 +674,8 @@ int acpi_device_add(struct acpi_device *device,
}
if (!found) {
acpi_device_bus_id = new_bus_id;
-   strcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device));
+   strlcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device),
+   sizeof(acpi_device_bus_id->bus_id));
acpi_device_bus_id->instance_no = 0;
list_add_tail(_device_bus_id->node, _bus_id_list);
}
-- 
2.19.1
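
A side note on the strlcpy() used above: it always NUL-terminates the
destination and returns the full source length, so truncation can be
detected if a warning is ever wanted (hypothetical addition, not in the
posted patch):

if (strlcpy(acpi_device_bus_id->bus_id, acpi_device_hid(device),
            sizeof(acpi_device_bus_id->bus_id)) >=
    sizeof(acpi_device_bus_id->bus_id))
        pr_warn("ACPI: truncated overlong HID: %s\n",
                acpi_device_hid(device));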



RE: [PATCH 2/3] scsi: storvsc: Resolve data race in storvsc_probe()

2020-12-17 Thread Dexuan Cui
> From: Andrea Parri (Microsoft) 
> Sent: Thursday, December 17, 2020 12:33 PM
> 
> vmscsi_size_delta can be written concurrently by multiple instances of
> storvsc_probe(), corresponding to multiple synthetic IDE/SCSI devices;
> cf. storvsc_drv's probe_type == PROBE_PREFER_ASYNCHRONOUS.  Change
> the
> global variable vmscsi_size_delta to per-synthetic-IDE/SCSI-device.
> 
> Suggested-by: Dexuan Cui 
> Signed-off-by: Andrea Parri (Microsoft) 
> Cc: "James E.J. Bottomley" 
> Cc: "Martin K. Petersen" 
> Cc: linux-s...@vger.kernel.org

Reviewed-by: Dexuan Cui 


RE: [PATCH 3/3] scsi: storvsc: Validate length of incoming packet in storvsc_on_channel_callback()

2020-12-17 Thread Dexuan Cui
> From: Andrea Parri (Microsoft) 
> Sent: Thursday, December 17, 2020 12:33 PM
> 
> Check that the packet is of the expected size at least, don't copy data
> past the packet.
> 
> Reported-by: Saruhan Karademir 
> Signed-off-by: Andrea Parri (Microsoft) 
> Cc: "James E.J. Bottomley" 
> Cc: "Martin K. Petersen" 
> Cc: linux-s...@vger.kernel.org

Reviewed-by: Dexuan Cui 



RE: [PATCH 1/3] scsi: storvsc: Fix max_outstanding_req_per_channel for Win8 and newer

2020-12-17 Thread Dexuan Cui
> From: Andrea Parri (Microsoft) 
> Sent: Thursday, December 17, 2020 12:33 PM
> 
> Current code overestimates the value of max_outstanding_req_per_channel
> for Win8 and newer hosts, since vmscsi_size_delta is set to the initial
> value of sizeof(vmscsi_win8_extension) rather than zero.  This may lead
> to wrong decisions when using ring_avail_percent_lowater equals to zero.
> The estimate of max_outstanding_req_per_channel is 'exact' for Win7 and
> older hosts.  A better choice, keeping the algorithm for the estimation
> simple, is to err the other way around, i.e., to underestimate for Win7
> and older but to use the exact value for Win8 and newer.
> 
> Suggested-by: Dexuan Cui 
> Signed-off-by: Andrea Parri (Microsoft) 
> Cc: "James E.J. Bottomley" 
> Cc: "Martin K. Petersen" 
> Cc: linux-s...@vger.kernel.org

Reviewed-by: Dexuan Cui 



RE: static_branch_enable() does not work from a __init function?

2020-12-16 Thread Dexuan Cui
> From: Peter Zijlstra 
> Sent: Wednesday, December 16, 2020 2:59 AM
> ...
> So I think the reason your above module doesn't work, while the one in
> vmx_init() does work (for 5.10) should be fixed by the completely
> untested below.
> 
> I've no clue about 5.4 and no desire to investigate. That's what distro
> people are for.
> 
> Can you verify?
> 
> ---
> diff --git a/kernel/jump_label.c b/kernel/jump_label.c
> index 015ef903ce8c..c6a39d662935 100644
> --- a/kernel/jump_label.c
> +++ b/kernel/jump_label.c
> @@ -793,6 +793,7 @@ int jump_label_text_reserved(void *start, void *end)
>  static void jump_label_update(struct static_key *key)
>  {
>   struct jump_entry *stop = __stop___jump_table;
> + bool init = system_state < SYSTEM_RUNNING;
>   struct jump_entry *entry;
>  #ifdef CONFIG_MODULES
>   struct module *mod;
> @@ -804,15 +805,16 @@ static void jump_label_update(struct static_key
> *key)
> 
>   preempt_disable();
>   mod = __module_address((unsigned long)key);
> - if (mod)
> + if (mod) {
>   stop = mod->jump_entries + mod->num_jump_entries;
> + init = mod->state == MODULE_STATE_COMING;
> + }
>   preempt_enable();
>  #endif
>   entry = static_key_entries(key);
>   /* if there are no users, entry can be NULL */
>   if (entry)
> - __jump_label_update(key, entry, stop,
> - system_state < SYSTEM_RUNNING);
> + __jump_label_update(key, entry, stop, init);
>  }
> 
>  #ifdef CONFIG_STATIC_KEYS_SELFTEST

Yes, this patch fixes the issue found by the test module for both
v5.10 and v5.4. 

Thank you, Peter!

Dexuan



static_branch_enable() does not work from a __init function?

2020-12-15 Thread Dexuan Cui
Hi,
The below init_module() prints "foo: false". This is strange since
static_branch_enable() is called before the static_branch_unlikely().
This strange behavior happens to v5.10 and an old v5.4 kernel.

If I remove the "__init" marker from the init_module() function, then
I get the expected output of "foo: true"! I guess here I'm missing
something with Static Keys?

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(enable_foo);

int __init init_module(void)
{
static_branch_enable(&enable_foo);

if (static_branch_unlikely(&enable_foo))
printk("foo: true\n");
else
printk("foo: false\n");

return 0;
}

void cleanup_module(void)
{
static_branch_disable(&enable_foo);
}

MODULE_LICENSE("GPL");


PS, I originally found: in arch/x86/kvm/vmx/vmx.c: vmx_init(), it looks
like the line "static_branch_enable(&enable_evmcs);" does not take effect
in a v5.4-based kernel, but does take effect in the v5.10 kernel in the
same x86-64 virtual machine on Hyper-V, so I made the above test module
to test static_branch_enable(), and found that static_branch_enable() in
the test module does not work with both v5.10 and my v5.4 kernel, if the
__init marker is used.

Thanks,
-- Dexuan




How can a userspace program tell if the system supports the ACPI S4 state (Suspend-to-Disk)?

2020-12-11 Thread Dexuan Cui
Hi all,
It looks like Linux can hibernate even if the system does not support the ACPI
S4 state, as long as the system can shut down, so "cat /sys/power/state"
always contains "disk", unless we specify the kernel parameter "nohibernate"
or we use LOCKDOWN_HIBERNATION.

In some scenarios IMO it can still be useful if the userspace is able to detect
if the ACPI S4 state is supported or not, e.g. when a Linux guest runs on 
Hyper-V, Hyper-V uses the virtual ACPI S4 state as an indicator of the proper
support of the tool stack on the host, i.e. the guest is discouraged from 
trying hibernation if the state is not supported.

I know we can check the S4 state by 'dmesg':

# dmesg |grep ACPI: | grep support
[3.034134] ACPI: (supports S0 S4 S5)

But this method is unreliable because the kernel msg buffer can be filled
and overwritten. Is there any better method? If not, do you think the
below patch is appropriate? Thanks!

diff --git a/kernel/power/main.c b/kernel/power/main.c
index 0aefd6f57e0a..931a1526ea69 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -9,6 +9,7 @@
 #include <linux/export.h>
 #include <linux/kobject.h>
 #include <linux/string.h>
+#include <linux/acpi.h>
 #include <linux/pm-trace.h>
 #include <linux/workqueue.h>
 #include <linux/debugfs.h>
@@ -600,8 +601,12 @@ static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
s += sprintf(s,"%s ", pm_states[i]);

 #endif
-   if (hibernation_available())
-   s += sprintf(s, "disk ");
+   if (hibernation_available()) {
+   if (acpi_sleep_state_supported(ACPI_STATE_S4))
+   s += sprintf(s, "disk+ ");
+   else
+   s += sprintf(s, "disk ");
+   }
if (s != buf)
/* convert the last space to a newline */
*(s-1) = '\n';
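
With such a "disk+" token (hypothetical -- it exists only if the patch
above is accepted), a userspace check could be a simple read of
/sys/power/state, e.g. (a sketch):

/* sketch: parse /sys/power/state; "disk+" is the hypothetical new token */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[128] = "";
	FILE *f = fopen("/sys/power/state", "r");

	if (!f)
		return 1;
	fgets(buf, sizeof(buf), f);
	fclose(f);

	if (strstr(buf, "disk+"))
		printf("hibernation supported, ACPI S4 advertised\n");
	else if (strstr(buf, "disk"))
		printf("hibernation supported, ACPI S4 not advertised\n");
	else
		printf("hibernation not supported\n");
	return 0;
}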

Thanks,
-- Dexuan




RE: [PATCH] iommu/hyper-v: Fix panic on a host without the 15-bit APIC ID support

2020-12-02 Thread Dexuan Cui
> From: Thomas Gleixner 
> Sent: Wednesday, December 2, 2020 1:56 AM
> 
> On Tue, Dec 01 2020 at 16:45, Dexuan Cui wrote:
> > The commit f36a74b9345a itself is good, but it causes a panic in a
> > Linux VM that runs on a Hyper-V host that doesn't have the 15-bit
> > Extended APIC ID support:
> > kernel BUG at arch/x86/kernel/apic/io_apic.c:2408!
> 
> This has nothing to do with the 15bit APIC ID support, really.
> 
> The point is that the select() function only matches when I/O APIC ID is
> 0, which is not guaranteed. That's independent of the 15bit extended
> APIC ID feature. But the I/O-APIC ID is irrelevant on hyperv because
> there is only one.
> 
> > This happens because the Hyper-V ioapic_ir_domain (which is defined in
> > drivers/iommu/hyperv-iommu.c) can not be found. Fix the panic by
> > properly claiming the only I/O APIC emulated by Hyper-V.
> 
> We don't fix a panic. We fix a bug in the code :)
> 
> I'll amend the changelog.
> 
> Thanks,
> 
> tglx

Thank you for reworking the commit log, tglx!

Dexuan


[tip: x86/apic] iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()

2020-12-02 Thread tip-bot2 for Dexuan Cui
The following commit has been merged into the x86/apic branch of tip:

Commit-ID: 26ab12bb9d96133b7880141d68b5e01a8783de9d
Gitweb:        https://git.kernel.org/tip/26ab12bb9d96133b7880141d68b5e01a8783de9d
Author:        Dexuan Cui 
AuthorDate:    Tue, 01 Dec 2020 16:45:10 -08:00
Committer:     Thomas Gleixner 
CommitterDate: Wed, 02 Dec 2020 11:22:55 +01:00

iommu/hyper-v: Remove I/O-APIC ID check from hyperv_irq_remapping_select()

commit a491bb19f728 ("iommu/hyper-v: Implement select() method on remapping
irqdomain") restricted the irq_domain_ops::select() callback to match on
I/O-APIC index 0, which was correct until the parameter was changed to
carry the I/O APIC ID in commit f36a74b9345a.

If the ID is not 0 then the match fails. Therefore I/O-APIC init fails to
retrieve the parent irqdomain for the I/O-APIC resulting in a boot panic:

kernel BUG at arch/x86/kernel/apic/io_apic.c:2408!

Fix it by matching the I/O-APIC independent of the ID as there is only one
I/O APIC emulated by Hyper-V.

[ tglx: Amended changelog ]

Fixes: f36a74b9345a ("x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index")
Signed-off-by: Dexuan Cui 
Signed-off-by: Thomas Gleixner 
Reviewed-by: David Woodhouse 
Link: https://lore.kernel.org/r/20201202004510.1818-1-de...@microsoft.com
---
 drivers/iommu/hyperv-iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index 9438daa..1d21a0b 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -105,8 +105,8 @@ static int hyperv_irq_remapping_select(struct irq_domain *d,
   struct irq_fwspec *fwspec,
   enum irq_domain_bus_token bus_token)
 {
-   /* Claim only the first (and only) I/OAPIC */
-   return x86_fwspec_is_ioapic(fwspec) && fwspec->param[0] == 0;
+   /* Claim the only I/O APIC emulated by Hyper-V */
+   return x86_fwspec_is_ioapic(fwspec);
 }
 
 static const struct irq_domain_ops hyperv_ir_domain_ops = {


[PATCH] iommu/hyper-v: Fix panic on a host without the 15-bit APIC ID support

2020-12-01 Thread Dexuan Cui
The commit f36a74b9345a itself is good, but it causes a panic in a
Linux VM that runs on a Hyper-V host that doesn't have the 15-bit
Extended APIC ID support:
kernel BUG at arch/x86/kernel/apic/io_apic.c:2408!

This happens because the Hyper-V ioapic_ir_domain (which is defined in
drivers/iommu/hyperv-iommu.c) can not be found. Fix the panic by
properly claiming the only I/O APIC emulated by Hyper-V.

Cc: David Woodhouse 
Cc: Vitaly Kuznetsov 
Fixes: f36a74b9345a ("x86/ioapic: Use I/O-APIC ID for finding irqdomain, not index")
Signed-off-by: Dexuan Cui 
---
 drivers/iommu/hyperv-iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


This patch is for the tip.git tree's x86/apic branch.

diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index 9438daa24fdb..1d21a0b5f724 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -105,8 +105,8 @@ static int hyperv_irq_remapping_select(struct irq_domain *d,
   struct irq_fwspec *fwspec,
   enum irq_domain_bus_token bus_token)
 {
-   /* Claim only the first (and only) I/OAPIC */
-   return x86_fwspec_is_ioapic(fwspec) && fwspec->param[0] == 0;
+   /* Claim the only I/O APIC emulated by Hyper-V */
+   return x86_fwspec_is_ioapic(fwspec);
 }
 
 static const struct irq_domain_ops hyperv_ir_domain_ops = {

base-commit: d1adcfbb520c43c10fc22fcdccdd4204e014fb53
-- 
2.27.0



RE: kdump always hangs in rcu_barrier() -> wait_for_completion()

2020-11-26 Thread Dexuan Cui
> From: Paul E. McKenney 
> Sent: Thursday, November 26, 2020 1:42 PM
> 
> > > Another possibility is that rcu_state.gp_kthread is non-NULL, but that
> > > something else is preventing RCU grace periods from completing, but in
> >
> > It looks like somehow the scheduling is not working here: in
> > rcu_barrier(), if I replace the wait_for_completion() with
> > wait_for_completion_timeout(&rcu_state.barrier_completion, 30*HZ),
> > the issue persists.
> 
> Have you tried using sysreq-t to see what the various tasks are doing?

Will try it.

BTW, this is a "Generation 2" VM on Hyper-V, meaning sysrq only starts to
work after the Hyper-V para-virtualized keyboard driver loads... So, at this
early point, sysrq is not working. :-( I'll have to hack the code and use a 
virtual NMI interrupt to force the sysrq handler to be called.
 
> Having interrupts disabled on all CPUs would have the effect of disabling
> the RCU CPU stall warnings.
>   Thanx, Paul

I'm sure the interrupts are not disabled. Here the VM only has 1 virtual CPU,
and when the hang issue happens the virtual serial console is still responding
when I press Enter (it prints a new line) or Ctrl+C (it prints ^C).

Here the VM does not use the "legacy timers" (PIT, Local APIC timer, etc.)
at all. Instead, the VM uses the Hyper-V para-virtualized timers. It looks
like the Hyper-V timer never fires in the kdump kernel when the hang issue
happens. I'm looking into this... I suspect this hang issue may only be
specific to Hyper-V.

Thanks,
-- Dexuan


RE: kdump always hangs in rcu_barrier() -> wait_for_completion()

2020-11-26 Thread Dexuan Cui
> From: Paul E. McKenney 
> Sent: Thursday, November 26, 2020 7:47 AM
>  ...
> The rcu_segcblist_n_cbs() function returns non-zero because something
> invoked call_rcu() some time previously.  The ftrace facility (or just
> a printk) should help you work out where that call_rcu() is located.

call_rcu() is indeed called multiple times, but as you said, this should
be normal.

> My best guess is that the underlying bug is that you are invoking
> rcu_barrier() before the RCU grace-period kthread has been created.
> This means that RCU grace periods cannot complete, which in turn means
> that if there has been even one invocation of call_rcu() since boot,
> rcu_barrier() cannot complete, which is what you are in fact seeing.
> Please note that it is perfectly legal to invoke call_rcu() very early in
> the boot process, as in even before the call to rcu_init().  Therefore,
> if this is the case, the bug is the early call to rcu_barrier(), not
> the early calls to call_rcu().
>
> To check this, at the beginning of rcu_barrier(), check the value of
> rcu_state.gp_kthread.  If my guess is correct, it will be NULL.

Unluckily, it's not NULL here. :-)

>
> Another possibility is that rcu_state.gp_kthread is non-NULL, but that
> something else is preventing RCU grace periods from completing, but in

It looks like somehow the scheduling is not working here: in
rcu_barrier(), if I replace the wait_for_completion() with
wait_for_completion_timeout(&rcu_state.barrier_completion, 30*HZ), the
issue persists.

> that case you should see RCU CPU stall warnings.  Unless of course they
> have been disabled.
>   Thanx, Paul

I guess I didn't disable the warnings (I don't even know how to do that :)

grep RCU .config
# RCU Subsystem
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
# RCU Debugging
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=30
CONFIG_RCU_TRACE=y
CONFIG_RCU_EQS_DEBUG=y
# end of RCU Debugging

Thanks,
-- Dexuan



RE: [PATCH 1/2] x86/kexec: Use up-to-dated screen_info copy to fill boot params

2020-11-25 Thread Dexuan Cui
> From: Dexuan Cui
> Sent: Monday, November 16, 2020 7:40 PM
> > diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> > index 57c2ecf43134..ce831f9448e7 100644
> > --- a/arch/x86/kernel/kexec-bzimage64.c
> > +++ b/arch/x86/kernel/kexec-bzimage64.c
> > @@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct
> > boot_params *params,
> > params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
> >
> > /* Copying screen_info will do? */
> > -   memcpy(&params->screen_info, &boot_params.screen_info,
> > -   sizeof(struct screen_info));
> > +   memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
> >
> > /* Fill in memsize later */
> > params->screen_info.ext_mem_k = 0;
> > --
> 
> Hi Kairui,
> According to "man kexec", kdump/kexec can use 2 different syscalls to set up
> the
> kdump kernel:
> 
> -s (--kexec-file-syscall)
>Specify that the new KEXEC_FILE_LOAD syscall should be used
> exclusively.
> 
> -c (--kexec-syscall)
>Specify that the old KEXEC_LOAD syscall should be used exclusively
> (the default).
> 
> It looks I can only reproduce the call-trace
> (https://bugzilla.redhat.com/show_bug.cgi?id=1867887#c5) with
> KEXEC_FILE_LOAD:
> I did kdump tests in Ubuntu 20.04 VM and by default the VM used the
> KEXEC_LOAD syscall and I couldn't reproduce the call-trace; after I
> added the "-s" parameter to use the KEXEC_FILE_LOAD syscall, I could
> reproduce the call-trace and I can confirm your patch can eliminate
> the call-trace because the "efifb" driver doesn't even load with your
> patch.
> 
> Your patch is only for the KEXEC_FILE_LOAD syscall, and I'm sure it's
> not used in the code path of the KEXEC_LOAD syscall.
> 
> So, in the case of the KEXEC_LOAD syscall, do you know how the *kexec*
> kernel's boot_params.screen_info.lfb_base is initialized? I haven't
> figured it out yet.

FYI: in the case of the KEXEC_LOAD syscall, I think the lfb_base of the kexec
kernel is pre-setup by the kexec tool (see the function setup_linux_vesafb()):

https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/kexec/arch/i386/x86-linux-setup.c#n126
static int setup_linux_vesafb(struct x86_linux_param_header *real_mode)
{
struct fb_fix_screeninfo fix;
struct fb_var_screeninfo var;
int fd;

fd = open("/dev/fb0", O_RDONLY);
if (-1 == fd)
return -1;

if (-1 == ioctl(fd, FBIOGET_FSCREENINFO, &fix))
goto out;
if (-1 == ioctl(fd, FBIOGET_VSCREENINFO, &var))
goto out;
if (0 == strcmp(fix.id, "VESA VGA")) {
/* VIDEO_TYPE_VLFB */
real_mode->orig_video_isVGA = 0x23;
} else if (0 == strcmp(fix.id, "EFI VGA")) {
/* VIDEO_TYPE_EFI */
real_mode->orig_video_isVGA = 0x70;
} else if (arch_options.reuse_video_type) {
int err;
off_t offset = offsetof(typeof(*real_mode), orig_video_isVGA);

/* blindly try old boot time video type */
err = get_bootparam(&real_mode->orig_video_isVGA, offset, 1);
if (err)
goto out;
} else {
real_mode->orig_video_isVGA = 0;
close(fd);
return 0;
}

When a Ubuntu 20.10 VM (kexec-tools-2.0.20) runs on Hyper-V, we should
fall into the last condition, i.e. setting
"real_mode->orig_video_isVGA = 0;", so the "efifb" driver does not load
in the kdump kernel.

Ubuntu 20.04 (kexec-tools-2.0.18) is a little old in that it does not have
Kairui's patch
https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=fb5a8792e6e4ee7de7ae3e06d193ea5beaaececc
, so it re-uses the VRAM location set up by the hyperv_fb driver, which
is undesirable because the "efifb" driver doesn't know it's accessing
an "incompatible" framebuffer -- IMO this may be just a small issue,
but anyway I hope Ubuntu 20.04's kexec-tools will pick up your patch.

So, now we should cover all the combinations if we use the latest kernel
and the latest kexec-tools, and the "efifb" driver in the kdump kernel
doesn't load.

Thanks,
-- Dexuan


kdump always hangs in rcu_barrier() -> wait_for_completion()

2020-11-24 Thread Dexuan Cui
Hi,
I happened to hit a kdump hang issue in a Linux VM running on some
Hyper-V host. Please see the attached log: the kdump kernel always hangs,
even if I configure only 1 virtual CPU to the VM.

I firstly hit the issue in RHEL 8.3's 4.18.x kernel, but later I found that 
the latest upstream v5.10-rc5 also has the same issue (at least the
symptom is exactly the same), so I dug into v5.10-rc5 and found that the
kdump kernel always hangs in kernel_init() -> mark_readonly() ->
rcu_barrier() -> wait_for_completion(&rcu_state.barrier_completion).

Let's take the 1-vCPU case for example (refer to the attached log): in the
below code, somehow rcu_segcblist_n_cbs() returns true, so the call
smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1) increases
the counter by 1, and hence later the counter is still 1 after the
atomic_sub_and_test(), and the complete() is not called.

static void rcu_barrier_func(void *cpu_in)
{
...
if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
atomic_inc(&rcu_state.barrier_cpu_count);
} else {
...
}

void rcu_barrier(void)
{
...
atomic_set(&rcu_state.barrier_cpu_count, 2);
...
for_each_possible_cpu(cpu) {
rdp = per_cpu_ptr(&rcu_data, cpu);
...
if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) {
...
smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1);
...
}
}
...
if (atomic_sub_and_test(2, &rcu_state.barrier_cpu_count))
complete(_state.barrier_completion);

...
wait_for_completion(&rcu_state.barrier_completion);

Sorry for my ignorance of RCU -- I'm not sure why the rcu_segcblist_n_cbs()
returns 1 here. In the normal kernel, it returns 0, so the normal kernel
does not hang.
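
For anyone reproducing this, a trivial diagnostic (a sketch; it has to
live in kernel/rcu/tree.c, just before the wait in rcu_barrier(), since
rcu_data is static to that file) is to dump the per-CPU callback counts:

/* sketch: print per-CPU callback counts before rcu_barrier() waits */
int cpu;

for_each_possible_cpu(cpu) {
	struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

	pr_info("cpu%d: n_cbs=%ld online=%d\n", cpu,
		rcu_segcblist_n_cbs(&rdp->cblist), cpu_online(cpu));
}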

Note: in the case of the kdump kernel, if I remove the kernel parameter
console=ttyS0 OR if I build the kernel with CONFIG_HZ=250, the issue no
longer reproduces. Currently my kernel uses CONFIG_HZ=1000 and I use
console=ttyS0, so I'm able to reproduce the issue every time.

Note: the issue can not be reproduced with the same kernel binary when
the VM runs on another Hyper-V host.

It looks like there is some kind of race condition?

Looking forward to your insights!

I'm happy to test any patch or enable more tracing, if necessary. Thanks!

Thanks,
-- Dexuan


Attachment: bad-hz-1000.log


RE: [PATCH] video: hyperv_fb: Directly use the MMIO VRAM

2020-11-24 Thread Dexuan Cui
Hi Wei Liu,
Please do not pick up this patch, because actually MMIO VRAM can not work
with fb_deferred_io.

Previously I didn't test Xorg -- sorry. As soon as I tested it, I got the below
warning and the Xorg program terminated immediately:

[   28.148432] WARNING: CPU: 19 PID: 1410 at mm/vmalloc.c:383 
vmalloc_to_page+0x14b/0x150
...
[   28.192959] CPU: 19 PID: 1410 Comm: Xorg Tainted: GE 
5.10.0-rc1+ #4
...
[   28.208720] RIP: 0010:vmalloc_to_page+0x14b/0x150
...
[   28.299231] Call Trace:
[   28.301428]  fb_deferred_io_fault+0x3a/0xa0
[   28.305276]  __do_fault+0x36/0x120
[   28.308276]  handle_mm_fault+0x1144/0x1950
[   28.311963]  exc_page_fault+0x290/0x510
[   28.315551]  ? asm_exc_page_fault+0x8/0x30
[   28.319186]  asm_exc_page_fault+0x1e/0x30
[   28.322969] RIP: 0033:0x7fbeda3ec2f5

The issue is that fb_deferred_io_page() requires that the PFN be backed by a
struct page, but it looks like the MMIO address is not backed by a struct page.

So I have to drop this patch. 
Thanks Wei Hu and Michael for pointing this out!

FYI: drivers/video/fbdev/core/fb_defio.c:
static struct page *fb_deferred_io_page(struct fb_info *info, unsigned long 
offs)
{
void *screen_base = (void __force *) info->screen_base;
struct page *page;

if (is_vmalloc_addr(screen_base + offs))
page = vmalloc_to_page(screen_base + offs);
else
page = pfn_to_page((info->fix.smem_start + offs) >> PAGE_SHIFT);

return page;
}

/* this is to find and return the vmalloc-ed fb pages */
static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf)
{
unsigned long offset;
struct page *page;
struct fb_info *info = vmf->vma->vm_private_data;

offset = vmf->pgoff << PAGE_SHIFT;
if (offset >= info->fix.smem_len)
return VM_FAULT_SIGBUS;

page = fb_deferred_io_page(info, offset);
if (!page)
return VM_FAULT_SIGBUS;
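
A defensive variant of fb_deferred_io_page() makes the mismatch visible
(a sketch; the assumption is that MMIO VRAM PFNs are not covered by the
memmap, so pfn_valid() is false for them; the vmalloc path is omitted):

static struct page *fb_deferred_io_page_checked(struct fb_info *info,
						unsigned long offs)
{
	unsigned long pfn = (info->fix.smem_start + offs) >> PAGE_SHIFT;

	if (!pfn_valid(pfn))
		return NULL;	/* MMIO VRAM: no struct page backing */

	return pfn_to_page(pfn);
}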

Thanks,
-- Dexuan


[PATCH] video: hyperv_fb: Directly use the MMIO VRAM

2020-11-20 Thread Dexuan Cui
Late in 2019, 2 commits (see the 2 Fixes tags) were introduced to
mitigate the slow framebuffer issue. Now that we have fixed the
slowness issue by correctly mapping the MMIO VRAM (see
commit 5f1251a48c17 ("video: hyperv_fb: Fix the cache type when mapping
the VRAM")), we can remove most of the code introduced by the 2 commits,
i.e. we no longer need to allocate physical memory and use it to back up
the VRAM in Generation-1 VM, and we also no longer need to allocate
physical memory to back up the framebuffer in a Generation-2 VM and copy
the framebuffer to the real VRAM.

synthvid_deferred_io() is kept, because it's still desirable to send the
SYNTHVID_DIRT message only for the exact dirty rectangle, and only when
needed.

Fixes: d21987d709e8 ("video: hyperv: hyperv_fb: Support deferred IO for Hyper-V frame buffer driver")
Fixes: 3a6fb6c4255c ("video: hyperv: hyperv_fb: Use physical memory for fb on HyperV Gen 1 VMs.")
Cc: Wei Hu 
Cc: Boqun Feng 
Signed-off-by: Dexuan Cui 
---

This patch changes drivers/video/fbdev/Kconfig, but I hope this can
still go through the Hyper-V tree
https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-next
because it's unlikely to cause any build issue to other fbdev drivers
(that line was introduced by 3a6fb6c4255c only for hyperv_fb.c)

Note: this patch is based on the Hyper-V tree's hyperv-fixes branch, but
it should also apply cleanly to the branch hyperv-next if the commit
5f1251a48c17 is applied first.  This patch is for v5.11 rather than
v5.10.

 drivers/video/fbdev/Kconfig |   1 -
 drivers/video/fbdev/hyperv_fb.c | 170 ++--
 2 files changed, 9 insertions(+), 162 deletions(-)

diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
index 402e85450bb5..05b37fb3c6d6 100644
--- a/drivers/video/fbdev/Kconfig
+++ b/drivers/video/fbdev/Kconfig
@@ -2205,7 +2205,6 @@ config FB_HYPERV
select FB_CFB_COPYAREA
select FB_CFB_IMAGEBLIT
select FB_DEFERRED_IO
-   select DMA_CMA if HAVE_DMA_CONTIGUOUS && CMA
help
  This framebuffer driver supports Microsoft Hyper-V Synthetic Video.
 
diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 58c74d2356ba..8131f4e66f98 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -31,16 +31,6 @@
  * "set-vmvideo" command. For example
  * set-vmvideo -vmname name -horizontalresolution:1920 \
  * -verticalresolution:1200 -resolutiontype single
- *
- * Gen 1 VMs also support direct using VM's physical memory for framebuffer.
- * It could improve the efficiency and performance for framebuffer and VM.
- * This requires to allocate contiguous physical memory from Linux kernel's
- * CMA memory allocator. To enable this, supply a kernel parameter to give
- * enough memory space to CMA allocator for framebuffer. For example:
- *cma=130m
- * This gives 130MB memory to CMA allocator that can be allocated to
- * framebuffer. For reference, 8K resolution (7680x4320) takes about
- * 127MB memory.
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -267,16 +257,8 @@ struct hvfb_par {
/* If true, the VSC notifies the VSP on every framebuffer change */
bool synchronous_fb;
 
-   /* If true, need to copy from deferred IO mem to framebuffer mem */
-   bool need_docopy;
-
struct notifier_block hvfb_panic_nb;
 
-   /* Memory for deferred IO and frame buffer itself */
-   unsigned char *dio_vp;
-   unsigned char *mmio_vp;
-   phys_addr_t mmio_pp;
-
/* Dirty rectangle, protected by delayed_refresh_lock */
int x1, y1, x2, y2;
bool delayed_refresh;
@@ -405,21 +387,6 @@ synthvid_update(struct fb_info *info, int x1, int y1, int x2, int y2)
return 0;
 }
 
-static void hvfb_docopy(struct hvfb_par *par,
-   unsigned long offset,
-   unsigned long size)
-{
-   if (!par || !par->mmio_vp || !par->dio_vp || !par->fb_ready ||
-   size == 0 || offset >= dio_fb_size)
-   return;
-
-   if (offset + size > dio_fb_size)
-   size = dio_fb_size - offset;
-
-   memcpy(par->mmio_vp + offset, par->dio_vp + offset, size);
-}
-
-/* Deferred IO callback */
 static void synthvid_deferred_io(struct fb_info *p,
 struct list_head *pagelist)
 {
@@ -444,10 +411,6 @@ static void synthvid_deferred_io(struct fb_info *p,
y2 = end / p->fix.line_length;
miny = min_t(int, miny, y1);
maxy = max_t(int, maxy, y2);
-
-   /* Copy from dio space to mmio address */
-   if (par->fb_ready && par->need_docopy)
-   hvfb_docopy(par, start, PAGE_SIZE);
}
 
if (par->fb_ready && par->update)
@@ -704,7 +667,7 @@ static in

[PATCH] video: hyperv_fb: Fix the cache type when mapping the VRAM

2020-11-17 Thread Dexuan Cui
x86 Hyper-V used to essentially always overwrite the effective cache type
of guest memory accesses to WB. This was problematic in cases where there
is a physical device assigned to the VM, since that often requires that
the VM should have control over cache types. Thus, on newer Hyper-V since
2018, Hyper-V always honors the VM's cache type, but unexpectedly Linux VM
users start to complain that Linux VM's VRAM becomes very slow, and it
turns out that Linux VM should not map the VRAM uncacheable by ioremap().
Fix this slowness issue by using ioremap_cache().

On ARM64, ioremap_cache() is also required as the host also maps the VRAM
cacheable, otherwise VM Connect can't display properly with ioremap() or
ioremap_wc().

With this change, the VRAM on new Hyper-V is as fast as regular RAM, so
it's no longer necessary to use the hacks we added to mitigate the
slowness, i.e. we no longer need to allocate physical memory and use
it to back up the VRAM in Generation-1 VM, and we also no longer need to
allocate physical memory to back up the framebuffer in a Generation-2 VM
and copy the framebuffer to the real VRAM. A further big change will
address these for v5.11.

Fixes: 68a2d20b79b1 ("drivers/video: add Hyper-V Synthetic Video Frame Buffer Driver")
Tested-by: Boqun Feng 
Signed-off-by: Dexuan Cui 
---

Hi Wei Liu, can you please pick this up into the hyperv/linux.git tree's
hyperv-fixes branch? I really hope this patch can be in v5.10 since it
fixes a longstanding issue: https://github.com/LIS/lis-next/issues/655

 drivers/video/fbdev/hyperv_fb.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c
index 5bc86f481a78..c8b0ae676809 100644
--- a/drivers/video/fbdev/hyperv_fb.c
+++ b/drivers/video/fbdev/hyperv_fb.c
@@ -1093,7 +1093,12 @@ static int hvfb_getmem(struct hv_device *hdev, struct fb_info *info)
goto err1;
}
 
-   fb_virt = ioremap(par->mem->start, screen_fb_size);
+   /*
+* Map the VRAM cacheable for performance. This is also required for
+* VM Connect to display properly for ARM64 Linux VM, as the host also
+* maps the VRAM cacheable.
+*/
+   fb_virt = ioremap_cache(par->mem->start, screen_fb_size);
if (!fb_virt)
goto err2;
 
-- 
2.19.1



RE: [PATCH 1/2] x86/kexec: Use up-to-dated screen_info copy to fill boot params

2020-11-16 Thread Dexuan Cui
> From: Kairui Song 
> Sent: Wednesday, October 14, 2020 2:24 AM
> To: linux-kernel@vger.kernel.org
> 
> kexec_file_load now just reuses the old boot_params.screen_info.
> But if drivers have changed the hardware state, boot_param.screen_info
> could contain invalid info.
> 
> For example, the video type might no longer be VGA, or the frame buffer
> address might have changed. If the kexec kernel keeps using the old
> screen_info, the kexec'ed kernel may attempt to write to an invalid
> framebuffer memory region.
> 
> There are two screen_info structures globally available,
> boot_params.screen_info and screen_info. The latter is a copy, and
> could be updated by drivers.
> 
> So let kexec_file_load use the updated copy.
> 
> Signed-off-by: Kairui Song 
> ---
>  arch/x86/kernel/kexec-bzimage64.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
> index 57c2ecf43134..ce831f9448e7 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -200,8 +200,7 @@ setup_boot_parameters(struct kimage *image, struct
> boot_params *params,
>   params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
> 
>   /* Copying screen_info will do? */
> - memcpy(&params->screen_info, &boot_params.screen_info,
> - sizeof(struct screen_info));
> + memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
> 
>   /* Fill in memsize later */
>   params->screen_info.ext_mem_k = 0;
> --

Hi Kairui,
According to "man kexec", kdump/kexec can use 2 different syscalls to set up the
kdump kernel:

-s (--kexec-file-syscall)
   Specify that the new KEXEC_FILE_LOAD syscall should be used exclusively.

-c (--kexec-syscall)
   Specify that the old KEXEC_LOAD syscall should be used exclusively (the default).

It looks I can only reproduce the call-trace
(https://bugzilla.redhat.com/show_bug.cgi?id=1867887#c5) with KEXEC_FILE_LOAD:
I did kdump tests in Ubuntu 20.04 VM and by default the VM used the KEXEC_LOAD
syscall and I couldn't reproduce the call-trace; after I added the "-s"
parameter to use the KEXEC_FILE_LOAD syscall, I could reproduce the call-trace
and I can confirm your patch can eliminate the call-trace because the "efifb"
driver doesn't even load with your patch.

Your patch is only for the KEXEC_FILE_LOAD syscall, and I'm sure it's not
used in the code path of the KEXEC_LOAD syscall.

So, in the case of the KEXEC_LOAD syscall, do you know how the *kexec* kernel's
boot_params.screen_info.lfb_base is initialized? I haven't figured it out yet.

Thanks,
-- Dexuan


[tip: x86/apic] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

2020-11-04 Thread tip-bot2 for Dexuan Cui
The following commit has been merged into the x86/apic branch of tip:

Commit-ID: d981059e13ffa9ed03a73472e932d070323bd057
Gitweb:        https://git.kernel.org/tip/d981059e13ffa9ed03a73472e932d070323bd057
Author:        Dexuan Cui 
AuthorDate:    Mon, 02 Nov 2020 17:11:36 -08:00
Committer:     Thomas Gleixner 
CommitterDate: Wed, 04 Nov 2020 11:10:52 +01:00

x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs,
the CPUs can't be the destination of IOAPIC interrupts, because the
IOAPIC RTE's Dest Field has only 8 bits. Currently the hackery driver
drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are
only routed to CPUs that don't have >255 APIC IDs. However, there is
an issue with kdump, because the kdump kernel can run on any CPU, and
hence IOAPIC interrupts can't work if the kdump kernel runs on a CPU
with a >255 APIC ID.

The kdump issue can be fixed by the Extended Dest ID, which is introduced
recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in
struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the
support of the underlying hypervisor. The latest Hyper-V has added the
support recently: with this commit, on such a Hyper-V host, Linux VM
does not use hyperv-iommu.c because hyperv_prepare_irq_remapping()
returns -ENODEV; instead, Linux kernel's generic support of Extended Dest
ID from David is used, meaning that Linux VM is able to support up to
32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.

On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
changes with this commit: Linux VM is still able to bring up the CPUs with
> 255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still
can not go to such CPUs, and the kdump kernel still can not work properly
on such CPUs.

[ tglx: Updated comment as suggested by David ]

Signed-off-by: Dexuan Cui 
Signed-off-by: Thomas Gleixner 
Acked-by: David Woodhouse 
Link: https://lore.kernel.org/r/20201103011136.59108-1-de...@microsoft.com
---
 arch/x86/include/asm/hyperv-tlfs.h |  7 +++-
 arch/x86/kernel/cpu/mshyperv.c | 29 +-
 2 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 0ed20e8..6bf42ae 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -23,6 +23,13 @@
 #define HYPERV_CPUID_IMPLEMENT_LIMITS  0x40000005
 #define HYPERV_CPUID_NESTED_FEATURES   0x4000000A
 
+#define HYPERV_CPUID_VIRT_STACK_INTERFACE  0x40000081
+#define HYPERV_VS_INTERFACE_EAX_SIGNATURE  0x31235356  /* "VS#1" */
+
+#define HYPERV_CPUID_VIRT_STACK_PROPERTIES 0x40000082
+/* Support for the extended IOAPIC RTE format */
+#define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE   BIT(2)
+
 #define HYPERV_HYPERVISOR_PRESENT_BIT  0x80000000
 #define HYPERV_CPUID_MIN   0x40000005
 #define HYPERV_CPUID_MAX   0x40000000
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4..f628e3d 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -366,9 +366,38 @@ static void __init ms_hyperv_init_platform(void)
 #endif
 }
 
+static bool __init ms_hyperv_x2apic_available(void)
+{
+   return x2apic_supported();
+}
+
+/*
+ * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
+ * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
+ * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
+ *
+ * Note: for a VM on Hyper-V, the I/O-APIC is the only device which
+ * (logically) generates MSIs directly to the system APIC irq domain.
+ * There is no HPET, and PCI MSI/MSI-X interrupts are remapped by the
+ * pci-hyperv host bridge.
+ */
+static bool __init ms_hyperv_msi_ext_dest_id(void)
+{
+   u32 eax;
+
+   eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_INTERFACE);
+   if (eax != HYPERV_VS_INTERFACE_EAX_SIGNATURE)
+   return false;
+
+   eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
+   return eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
.name   = "Microsoft Hyper-V",
.detect = ms_hyperv_platform,
.type   = X86_HYPER_MS_HYPERV,
+   .init.x2apic_available  = ms_hyperv_x2apic_available,
+   .init.msi_ext_dest_id   = ms_hyperv_msi_ext_dest_id,
.init.init_platform = ms_hyperv_init_platform,
 };


RE: [PATCH] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

2020-11-03 Thread Dexuan Cui
> From: David Woodhouse 
> Sent: Tuesday, November 3, 2020 12:03 AM
> > +/*
> > + * If ms_hyperv_msi_ext_dest_id() returns true,
> > hyperv_prepare_irq_remapping()
> > + * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
> > + * generic support of the 15-bit APIC ID is used: see
> > __irq_msi_compose_msg().
> > + *
> > + * Note: For a VM on Hyper-V, no emulated legacy device supports PCI
> MSI/MSI-X,
> > + * and PCI MSI/MSI-X only come from the assigned physical PCIe device, and
> the
> > + * PCI MSI/MSI-X interrupts are handled by the pci-hyperv driver. Here
> despite
> > + * the word "msi" in the name "msi_ext_dest_id", actually the callback only
> > + * affects how IOAPIC interrupts are routed.
> > + */
> 
> I named it like that on purpose to make the point that the I/OAPIC is
> just a device for turning line interrupts into MSIs. Some VMMs, just
> like real hardware, really do implement their I/OAPIC emulation that
> way. It makes a lot of sense to do so if you support interrupt
> remapping.

I totally agree.
 
> FWIW I might have phrased your last paragraph in that comment as
> 
>   Note: for a VM on Hyper-V, the I/OAPIC is the only device which
>   (logically) generates MSIs directly to the system APIC irq domain.
>   There is no HPET, and PCI MSI/MSI-X interrupts are remapped by the
>   pci-hyperv host bridge.

I agree. This version is much better.
 
> But don't bother to change it; I think I've made my point quite well
> enough with https://git.kernel.org/tip/tip/c/5d5a97133 :)
> 
> --
> dwmw2

Hi David,
This patch has been in the x86/apic branch (with a line missing in the commit
log). If possible, I hope tglx can help make this change you suggested, and add
the missing line in the commit log. :-)

Thanks,
-- Dexuan


RE: [PATCH] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

2020-11-03 Thread Dexuan Cui
> From: Dexuan Cui 
> Sent: Monday, November 2, 2020 5:12 PM
> 
> ...

Hi tglx,
Now this patch is in the x86/apic branch, which is great! Thanks for the
quick action! But the third line of the below paragraph of the commit log
is missing... Sorry I just realized I should have not prefixed that line
with the ">255 APIC IDs" -- it looks like a line is ignored if it starts
with the 2 chars ">>". :-(

> On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
> changes with this commit: Linux VM is still able to bring up the CPUs with
> >255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still
> can not go to such CPUs, and the kdump kernel still can not work properly
> on such CPUs.


[tip: x86/apic] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

2020-11-03 Thread tip-bot2 for Dexuan Cui
The following commit has been merged into the x86/apic branch of tip:

Commit-ID: af2abc92c5ddf5fc5a2036bc106c4d9a80a4d5f7
Gitweb:        https://git.kernel.org/tip/af2abc92c5ddf5fc5a2036bc106c4d9a80a4d5f7
Author:        Dexuan Cui 
AuthorDate:    Mon, 02 Nov 2020 17:11:36 -08:00
Committer:     Thomas Gleixner 
CommitterDate: Tue, 03 Nov 2020 09:16:46 +01:00

x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs,
the CPUs can't be the destination of IOAPIC interrupts, because the
IOAPIC RTE's Dest Field has only 8 bits. Currently the hackery driver
drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are
only routed to CPUs that don't have >255 APIC IDs. However, there is
an issue with kdump, because the kdump kernel can run on any CPU, and
hence IOAPIC interrupts can't work if the kdump kernel runs on a CPU
with a >255 APIC ID.

The kdump issue can be fixed by the Extended Dest ID, which is introduced
recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in
struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the
support of the underlying hypervisor. The latest Hyper-V has added the
support recently: with this commit, on such a Hyper-V host, Linux VM
does not use hyperv-iommu.c because hyperv_prepare_irq_remapping()
returns -ENODEV; instead, Linux kernel's generic support of Extended Dest
ID from David is used, meaning that Linux VM is able to support up to
32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.

On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
changes with this commit: Linux VM is still able to bring up the CPUs with
can not go to such CPUs, and the kdump kernel still can not work properly
on such CPUs.

Signed-off-by: Dexuan Cui 
Signed-off-by: Thomas Gleixner 
Acked-by: David Woodhouse 
Link: https://lore.kernel.org/r/20201103011136.59108-1-de...@microsoft.com

---
 arch/x86/include/asm/hyperv-tlfs.h |  7 +++-
 arch/x86/kernel/cpu/mshyperv.c | 30 +-
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 0ed20e8..6bf42ae 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -23,6 +23,13 @@
 #define HYPERV_CPUID_IMPLEMENT_LIMITS  0x40000005
 #define HYPERV_CPUID_NESTED_FEATURES   0x4000000A
 
+#define HYPERV_CPUID_VIRT_STACK_INTERFACE  0x40000081
+#define HYPERV_VS_INTERFACE_EAX_SIGNATURE  0x31235356  /* "VS#1" */
+
+#define HYPERV_CPUID_VIRT_STACK_PROPERTIES 0x40000082
+/* Support for the extended IOAPIC RTE format */
+#define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE   BIT(2)
+
 #define HYPERV_HYPERVISOR_PRESENT_BIT  0x80000000
 #define HYPERV_CPUID_MIN   0x40000005
 #define HYPERV_CPUID_MAX   0x40000000
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4..cc4037d 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -366,9 +366,39 @@ static void __init ms_hyperv_init_platform(void)
 #endif
 }
 
+static bool __init ms_hyperv_x2apic_available(void)
+{
+   return x2apic_supported();
+}
+
+/*
+ * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
+ * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
+ * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
+ *
+ * Note: For a VM on Hyper-V, no emulated legacy device supports PCI MSI/MSI-X,
+ * and PCI MSI/MSI-X only come from the assigned physical PCIe device, and the
+ * PCI MSI/MSI-X interrupts are handled by the pci-hyperv driver. Here despite
+ * the word "msi" in the name "msi_ext_dest_id", actually the callback only
+ * affects how IOAPIC interrupts are routed.
+ */
+static bool __init ms_hyperv_msi_ext_dest_id(void)
+{
+   u32 eax;
+
+   eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_INTERFACE);
+   if (eax != HYPERV_VS_INTERFACE_EAX_SIGNATURE)
+   return false;
+
+   eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
+   return eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
.name   = "Microsoft Hyper-V",
.detect = ms_hyperv_platform,
.type   = X86_HYPER_MS_HYPERV,
+   .init.x2apic_available  = ms_hyperv_x2apic_available,
+   .init.msi_ext_dest_id   = ms_hyperv_msi_ext_dest_id,
.init.init_platform = ms_hyperv_init_platform,
 };


RE: [PATCH] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

2020-11-02 Thread Dexuan Cui
> From: Dexuan Cui 
> Sent: Monday, November 2, 2020 5:12 PM

Sorry I forgot to mention that this patch is based on tip.git's x86/apic branch.

Thanks,
-- Dexuan


[PATCH] x86/hyperv: Enable 15-bit APIC ID if the hypervisor supports it

2020-11-02 Thread Dexuan Cui
When a Linux VM runs on Hyper-V, if the VM has CPUs with >255 APIC IDs,
the CPUs can't be the destination of IOAPIC interrupts, because the
IOAPIC RTE's Dest Field has only 8 bits. Currently the hackery driver
drivers/iommu/hyperv-iommu.c is used to ensure IOAPIC interrupts are
only routed to CPUs that don't have >255 APIC IDs. However, there is
an issue with kdump, because the kdump kernel can run on any CPU, and
hence IOAPIC interrupts can't work if the kdump kernel runs on a CPU
with a >255 APIC ID.

The kdump issue can be fixed by the Extended Dest ID, which is introduced
recently by David Woodhouse (for IOAPIC, see the field virt_destid_8_14 in
struct IO_APIC_route_entry). Of course, the Extended Dest ID needs the
support of the underlying hypervisor. The latest Hyper-V has added the
support recently: with this commit, on such a Hyper-V host, Linux VM
does not use hyperv-iommu.c because hyperv_prepare_irq_remapping()
returns -ENODEV; instead, Linux kernel's generic support of Extended Dest
ID from David is used, meaning that Linux VM is able to support up to
32K CPUs, and IOAPIC interrupts can be routed to all the CPUs.

On an old Hyper-V host that doesn't support the Extended Dest ID, nothing
changes with this commit: Linux VM is still able to bring up the CPUs with
>255 APIC IDs with the help of hyperv-iommu.c, but IOAPIC interrupts still
can not go to such CPUs, and the kdump kernel still can not work properly
on such CPUs.

Signed-off-by: Dexuan Cui 
---
 arch/x86/include/asm/hyperv-tlfs.h |  7 +++
 arch/x86/kernel/cpu/mshyperv.c | 30 ++
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 0ed20e8bba9e..6bf42aed387e 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -23,6 +23,13 @@
 #define HYPERV_CPUID_IMPLEMENT_LIMITS  0x40000005
 #define HYPERV_CPUID_NESTED_FEATURES   0x4000000A
 
+#define HYPERV_CPUID_VIRT_STACK_INTERFACE  0x40000081
+#define HYPERV_VS_INTERFACE_EAX_SIGNATURE  0x31235356  /* "VS#1" */
+
+#define HYPERV_CPUID_VIRT_STACK_PROPERTIES 0x40000082
+/* Support for the extended IOAPIC RTE format */
+#define HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE   BIT(2)
+
 #define HYPERV_HYPERVISOR_PRESENT_BIT  0x80000000
 #define HYPERV_CPUID_MIN   0x40000005
 #define HYPERV_CPUID_MAX   0x40000000
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 05ef1f4550cb..cc4037d841df 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -366,9 +366,39 @@ static void __init ms_hyperv_init_platform(void)
 #endif
 }
 
+static bool __init ms_hyperv_x2apic_available(void)
+{
+   return x2apic_supported();
+}
+
+/*
+ * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
+ * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
+ * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
+ *
+ * Note: For a VM on Hyper-V, no emulated legacy device supports PCI MSI/MSI-X,
+ * and PCI MSI/MSI-X only come from the assigned physical PCIe device, and the
+ * PCI MSI/MSI-X interrupts are handled by the pci-hyperv driver. Here despite
+ * the word "msi" in the name "msi_ext_dest_id", actually the callback only
+ * affects how IOAPIC interrupts are routed.
+ */
+static bool __init ms_hyperv_msi_ext_dest_id(void)
+{
+   u32 eax;
+
+   eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_INTERFACE);
+   if (eax != HYPERV_VS_INTERFACE_EAX_SIGNATURE)
+   return false;
+
+   eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
+   return eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
.name   = "Microsoft Hyper-V",
.detect = ms_hyperv_platform,
.type   = X86_HYPER_MS_HYPERV,
+   .init.x2apic_available  = ms_hyperv_x2apic_available,
+   .init.msi_ext_dest_id   = ms_hyperv_msi_ext_dest_id,
.init.init_platform = ms_hyperv_init_platform,
 };
-- 
2.25.1



RE: irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?

2020-10-06 Thread Dexuan Cui
> From: Thomas Gleixner 
> Sent: Tuesday, October 6, 2020 11:58 AM
> > ...
> > I pass through an MSI-X-capable PCI device to the Linux VM (which has
> > only 1 virtual CPU), and the below code does *not* report any error
> > (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> > returns 0), but the code does not work: the second MSI-X interrupt is not
> > happening while the first interrupt does work fine.
> >
> > int nr_irqs = 2;
> > int i, nvec, irq;
> >
> > nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> > PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
> 
> Why should it return an error?

The above code returns -ENOSPC if num_possible_cpus() is also 1, and
returns 0 if num_possible_cpus() is 128. So it looks the above code is
not using the API correctly, and hence gets undefined results.

> > for (i = 0; i < nvec; i++) {
> > irq = pci_irq_vector(pdev, i);
> > err = request_irq(irq, test_intr, 0, "test_intr", _cxt[i]);
> > }
> 
> And why do you expect that the second interrupt works?
> 
> This is about managed interrupts and the spreading code has two vectors
> to which it can spread the interrupts. One is assigned to one half of
> the possible CPUs and the other one to the other half. Now you have only
> one CPU online so only the interrupt with has the online CPU in the
> assigned affinity mask is started up.
> 
> That's how managed interrupts work. If you don't want managed interrupts
> then don't use them.
> 
> Thanks,
> 
> tglx

Thanks for the clarification! It looks like with PCI_IRQ_AFFINITY the kernel
guarantees that the allocated interrupts are 1:1 bound to CPUs, and the
userspace is unable to change the affinities. This is very useful to support
per-CPU I/O queues.
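
A quick way to see that guarantee from userspace (a sketch; the IRQ
number 24 is hypothetical -- substitute one of the device's managed
MSI-X vectors): writing smp_affinity of a managed interrupt fails with
EIO:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/irq/24/smp_affinity", O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, "2", 1) < 0)	/* expect EIO for a managed IRQ */
		printf("write failed: %s\n", strerror(errno));
	close(fd);
	return 0;
}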

Thanks,
-- Dexuan


irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?

2020-10-06 Thread Dexuan Cui
Hi all,
I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is v5.9-rc7
and I have CONFIG_NR_CPUS=256.

The Hyper-V Host (Version 17763-10.0-1-0.1457) provides a guest firmware,
which always reports 128 Local APIC entries in the ACPI MADT table. Here
only the first Local APIC entry's "Processor Enabled" is 1 since this
Linux VM is configured to have only 1 CPU. This means: in the Linux kernel,
the "cpu_present_mask" and " cpu_online_mask " have only 1 CPU (i.e. CPU0),
while the "cpu_possible_mask" has 128 CPUs, and the "nr_cpu_ids" is 128.

I pass through an MSI-X-capable PCI device to the Linux VM (which has
only 1 virtual CPU), and the below code does *not* report any error
(i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
returns 0), but the code does not work: the second MSI-X interrupt is not
happening while the first interrupt does work fine.

int nr_irqs = 2;
int i, nvec, irq;

nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);

for (i = 0; i < nvec; i++) {
irq = pci_irq_vector(pdev, i);
err = request_irq(irq, test_intr, 0, "test_intr", _cxt[i]);
}

It turns out that pci_alloc_irq_vectors_affinity() -> ... ->
irq_create_affinity_masks() allocates an improper affinity for the second
interrupt. The below printk() shows that the second interrupt's affinity is
1-64, but only CPU0 is present in the system! As a result, later,
request_irq() -> ... -> irq_startup() -> __irq_startup_managed() returns
IRQ_STARTUP_ABORT because cpumask_any_and(aff, cpu_online_mask) is
empty (i.e. >= nr_cpu_ids), and irq_startup() *silently* fails (i.e.
"return 0;"), since __irq_startup() is only called for
IRQ_STARTUP_MANAGED and IRQ_STARTUP_NORMAL.

--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -484,6 +484,9 @@ struct irq_affinity_desc *
for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
masks[i].is_managed = 1;

+   for (i = 0; i < nvecs; i++)
+   printk("i=%d, affi = %*pbl\n", i,
+  cpumask_pr_args(&masks[i].mask));
return masks;
 }

[   43.770477] i=0, affi = 0,65-127
[   43.770484] i=1, affi = 1-64

Though here the issue happens to a Linux VM on Hyper-V, I think the same
issue can also happen to a physical machine, if the physical machine also
uses a lot of static MADT entries, of which only the entries of the present
CPUs are marked to be "Processor Enabled == 1".

I think pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range() ->
irq_calc_affinity_vectors() -> cpumask_weight(cpu_possible_mask) should
use cpu_present_mask rather than cpu_possible_mask, so here
irq_calc_affinity_vectors() would return 1, and
__pci_enable_msix_range() would immediately return -ENOSPC to avoid a
*silent* failure.
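
A sketch of that idea (illustration only, not a tested patch -- and as
the git-log note below shows, the current behavior was a deliberate
choice):

--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ irq_calc_affinity_vectors() -- line numbers omitted @@
 		get_online_cpus();
-		set_vecs = cpumask_weight(cpu_possible_mask);
+		set_vecs = cpumask_weight(cpu_present_mask);
 		put_online_cpus();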

However, git-log shows that this 2018 commit intentionally changed the
cpu_present_mask to cpu_possible_mask:
84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")

so I'm not sure whether (and how?) we should address the *silent* failure.

BTW, here I use a single-CPU VM to simplify the discussion. Actually,
if the VM has n CPUs, with the above usage of
pci_alloc_irq_vectors_affinity() (which might seem incorrect, but my point is
that it's really not good to have a silent failure, which makes it a lot
more difficult to figure out what goes wrong), it looks like only the first
n MSI-X interrupts can work, and the (n+1)'th MSI-X interrupt can not work
due to the improperly allocated affinity.

According to my tests, if we need n+1 MSI-X interrupts in such a VM that
has n CPUs, it looks we have 2 options (the second should be better):

1. Do not use the PCI_IRQ_AFFINITY flag, i.e.
pci_alloc_irq_vectors_affinity(pdev, n+1, n+1, PCI_IRQ_MSIX, NULL);

2. Use the PCI_IRQ_AFFINITY flag, and pass a struct irq_affinity affd,
which tells the API that we don't care about the first interrupt's affinity:

struct irq_affinity affd = {
.pre_vectors = 1,
...
};

pci_alloc_irq_vectors_affinity(pdev, n+1, n+1,
PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);

PS, irq_create_affinity_masks() is complicated. Let me know if you're
interested to know how it allocates the invalid affinity "1-64" for the
second MSI-X interrupt.

PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU VM,
so the issue described above can not reproduce there.

Thanks,
-- Dexuan


[PATCH v3] PCI: hv: Fix hibernation in case interrupts are not re-created

2020-10-02 Thread Dexuan Cui
pci_restore_msi_state() directly writes the MSI/MSI-X related registers
via MMIO. On a physical machine, this works perfectly; for a Linux VM
running on a hypervisor, which typically enables IOMMU interrupt remapping,
the hypervisor usually should trap and emulate the MMIO accesses in order
to re-create the necessary interrupt remapping table entries in the IOMMU,
otherwise the interrupts can not work in the VM after hibernation.

Hyper-V is different from other hypervisors in that it does not trap and
emulate the MMIO accesses, and instead it uses a para-virtualized method,
which requires the VM to call hv_compose_msi_msg() to notify the hypervisor
of the info that would be passed to the hypervisor in the case of the
trap-and-emulate method. This is not an issue to a lot of PCI device
drivers, which destroy and re-create the interrupts across hibernation, so
hv_compose_msi_msg() is called automatically. However, some PCI device
drivers (e.g. the in-tree GPU driver nouveau and the out-of-tree Nvidia
proprietary GPU driver) do not destroy and re-create MSI/MSI-X interrupts
across hibernation, so hv_pci_resume() has to call hv_compose_msi_msg(),
otherwise the PCI device drivers can no longer receive interrupts after
the VM resumes from hibernation.

Hyper-V is also different in that chip->irq_unmask() may fail in a
Linux VM running on Hyper-V (on a physical machine, chip->irq_unmask()
can not fail because unmasking an MSI/MSI-X register just means an MMIO
write): during hibernation, when a CPU is offlined, the kernel tries
to move the interrupt to the remaining CPUs that haven't been offlined
yet. In this case, hv_irq_unmask() -> hv_do_hypercall() always fails
because the vmbus channel has been closed: here the early "return" in
hv_irq_unmask() means the pci_msi_unmask_irq() is not called, i.e. the
desc->masked remains "true", so later after hibernation, the MSI interrupt
always remains masked, which is incorrect. Refer to cpu_disable_common()
-> fixup_irqs() -> irq_migrate_all_off_this_cpu() -> migrate_one_irq():

static bool migrate_one_irq(struct irq_desc *desc)
{
...
if (maskchip && chip->irq_mask)
chip->irq_mask(d);
...
err = irq_do_set_affinity(d, affinity, false);
...
if (maskchip && chip->irq_unmask)
chip->irq_unmask(d);

Fix the issue by calling pci_msi_unmask_irq() unconditionally in
hv_irq_unmask(). Also suppress the error message for hibernation because
the hypercall failure during hibernation does not matter (at this time
all the devices have been frozen). Note: the correct affinity info is
still updated into the irqdata data structure in migrate_one_irq() ->
irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
resumes, hv_pci_restore_msi_state() is able to correctly restore
the interrupt with the correct affinity.

Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
Reviewed-by: Jake Oshins 
Signed-off-by: Dexuan Cui 
---

Changes in v2:
Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
Added Jake's Reviewed-by.

Changes in v3:
Improved the commit log.
Improved the comments.
Improved the change in hv_irq_unmask(): removed the early "return"
and call pci_msi_unmask_irq() unconditionally.

 drivers/pci/controller/pci-hyperv.c | 50 +++--
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index fc4c3a15e570..a9df492fbffa 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1276,11 +1276,25 @@ static void hv_irq_unmask(struct irq_data *data)
 exit_unlock:
spin_unlock_irqrestore(&hbus->retarget_msi_interrupt_lock, flags);
 
-   if (res) {
+   /*
+* During hibernation, when a CPU is offlined, the kernel tries
+* to move the interrupt to the remaining CPUs that haven't
+* been offlined yet. In this case, the below hv_do_hypercall()
+* always fails since the vmbus channel has been closed:
+* refer to cpu_disable_common() -> fixup_irqs() ->
+* irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+*
+* Suppress the error message for hibernation because the failure
+* during hibernation does not matter (at this time all the devices
+* have been frozen). Note: the correct affinity info is still updated
+* into the irqdata data structure in migrate_one_irq() ->
+* irq_do_set_affinity() -> hv_set_affinity(), so later when the VM
+* resumes, hv_pci_restore_msi_state() is able to correctly restore
+* the interrupt with the correct affinity.
+*/
+   if (res && hbus->state != hv_pcibus_removing)
dev_err(&hbus->hdev->device,
"%s() failed: %#llx"

RE: [PATCH v2] PCI: hv: Fix hibernation in case interrupts are not re-created

2020-10-01 Thread Dexuan Cui
> From: Lorenzo Pieralisi 
> Sent: Thursday, October 1, 2020 3:13 AM
> > ...
> > I mean this is a Hyper-V specific problem, so IMO we should fix the
> > pci-hyperv driver rather than change the PCI device drivers, which 
> > work perfectly on a physical machine and on other hypervisors. 
> > Also it can be difficult or impossible to ask the authors of the 
> > aforementioned PCI device drivers to destroy and re-create 
> > MSI/MSI-X across hibernation, especially for the out-of-tree driver(s).
> 
> Good, so why did you mention PCI drivers in the commit log if they
> are not related to the problem you are fixing ?

I mentioned the names of the PCI device drivers because IMO people
want to know how the issue can reproduce (i.e. which PCI devices
are affected and which are not), so they know how to test this patch.

I'll remove the names of the unaffected PCI device drivers from the 
commit log, and only keep the name of the Nvidia GPU drivers (which
are so far the only drivers I have identified that are affected, when
Linux VM runs on Hyper-V and hibernates).
 
> > > Regardless, this commit log does not provide the information that
> > > it should.
> >
> > Hi Lozenzo, I'm happy to add more info. Can you please let me know
> > what extra info I should provide?
> 
> s/Lozenzo/Lorenzo
Sorry! Will fix the typo.
 
> The info you describe properly below, namely what the _actual_ problem
> is.

I will send v3 with the below info.
 
> > Here if hv_irq_unmask does not call pci_msi_unmask_irq(), the
> > desc->masked remains "true", so later after hibernation, the MSI
> > interrupt line always remains masked, which is incorrect.
> >
> > Here the silent failure of hv_irq_unmask() does not matter since all the
> > non-boot CPUs are being offlined (meaning all the devices have been
> > frozen). Note: the correct affinity info is still updated into the
> > irqdata data structure in migrate_one_irq() -> irq_do_set_affinity() ->
> > hv_set_affinity(), so when the VM resumes, hv_pci_resume() ->
> > hv_pci_restore_msi_state() is able to correctly restore the irqs with
> > the correct affinity.
> >
> > I hope the explanation can help clarify things. I understand this is
> > not as natural as the case that Linux runs on a physical machine, but
> > due to the unique PCI pass-through implementation of Hyper-V, IMO this
> > is the only viable fix for the problem here. BTW, this patch is only
> > confined to the pci-hyperv driver and I believe it cannot cause any
> > regression.
> 
> Understood, write this in the commit log and I won't nag you any further.

Ok. I treat it as an opportunity to practise and improve my writing :-)
 
> Side note: this issue is there because the hypcall failure triggers
> an early return from hv_irq_unmask(). 

Yes.

> Is that early return really correct ? 

Good question. IMO it's incorrect, because hv_irq_unmask() is called 
when the interrupt is activated for the first time, and when the interrupt's
affinity is being changed. In these cases, we may as well call
pci_msi_unmask_irq() unconditionally, even if the hypercall fails.

BTW, AFAIK, in practice the hypercall only fails in 2 cases:
1. The device is removed when Linux VM has not finished the device's
initialization.
2. In hibernation, the device has been disabled while the generic
hibernation code tries to migrate the interrupt, as I explained.

In both cases, the hypercall returns the same error code,
HV_STATUS_INVALID_PARAMETER (0x5).
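
For reference, a minimal sketch (not the v3 patch itself) of how the status
could be inspected inside hv_irq_unmask(), where res, params, var_size and
hbus are that function's locals; the low 16 bits of the hypercall return
value carry the status code:

	res = hv_do_hypercall(HVCALL_RETARGET_INTERRUPT | (var_size << 17),
			      params, NULL);

	/*
	 * Both failure cases above yield HV_STATUS_INVALID_PARAMETER (0x5),
	 * so the status alone cannot tell them apart; that is why v3 keys
	 * off hbus->state instead of the return value.
	 */
	if ((res & 0xffff) == HV_STATUS_INVALID_PARAMETER &&
	    hbus->state == hv_pcibus_removing)
		; /* benign: the vmbus channel is already closed */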

> Another possibility is just logging the error and let
> hv_irq_unmask() continue and call pci_msi_unmask_irq() in the exit
> path.

This is a good idea. I'll make this change in v3.
 
> Is there a hypcall return value that you can use to detect fatal vs
> non-fatal (ie hibernation) hypcall failures ?

Unluckily IMO there is not. The spec (v6.0b)'s section 10.5.4 (page 106)
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
does define some return values, but IMO they're not applicable here.

> I was confused by reading the patch since it seemed that you call
> pci_msi_unmask_irq() _only_ while hibernating, which was certainly
> a bug.
> 
> Thank you for explaining.
> 
> Lorenzo

Thanks for reviewing! I'll post v3. Looking forward to your new comments!

Thanks,
-- Dexuan


RE: [PATCH v2] PCI: hv: Fix hibernation in case interrupts are not re-created

2020-09-29 Thread Dexuan Cui
> From: Lorenzo Pieralisi 
> Sent: Monday, September 28, 2020 3:43 AM
>
> [+MarcZ - this patch needs IRQ maintainers vetting]

Sure. Hi MarcZ, please also review the patch. Thanks!

> On Tue, Sep 08, 2020 at 04:17:59PM -0700, Dexuan Cui wrote:
> > Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X
> > registers, and we must use hv_compose_msi_msg() to ask Hyper-V to
> > create the IOMMU Interrupt Remapping Table Entries. This is not an issue
> > for a lot of PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers),
> > which destroy and re-create the interrupts across hibernation, so
> > hv_compose_msi_msg() is called automatically. However, some other
> > PCI device drivers (e.g. the Nvidia driver) may not destroy and re-create
> > the interrupts across hibernation, so hv_pci_resume() has to call
> > hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> > receive MSI/MSI-X interrupts after hibernation.
>
> This looks like drivers bugs and I don't think the HV controller
> driver is where you should fix them.

IMHO this is not a PCI device driver bug, because I think a PCI device driver
is allowed to keep and re-use the MSI/MSI-X interrupts across hibernation,
otherwise we would not have pci_restore_msi_state() in pci_restore_state().
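
For context, here is pci_restore_state() abridged from drivers/pci/pci.c
(as of this thread), showing that the PCI core itself re-programs the saved
MSI/MSI-X state on restore:

void pci_restore_state(struct pci_dev *dev)
{
	if (!dev->state_saved)
		return;

	/* ... PCIe/PCI-X capability state, BARs, config space ... */

	pci_restore_msi_state(dev);	/* re-writes the MSI/MSI-X registers */

	/* ... */
	dev->state_saved = false;
}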

The in-tree open-source Nvidia GPU driver drivers/gpu/drm/nouveau is such
a PCI device driver that re-uses the MSI-X interrupts across hibernation.
The out-of-tree proprietary Nvidia GPU driver also does the same thing.
It looks some other in-tree PCI device drivers also do the same thing, though
I don't remember their names offhand.

IMO it's much better to change the pci-hyperv driver once and for all, than
to change every such existing (and future?) PCI device driver.

pci_restore_msi_state() directly writes the MSI/MSI-X related registers
in __pci_write_msi_msg() and msix_mask_irq(). On a physical machine, this
works perfectly, but for a Linux VM running on a hypervisor, which typically
enables IOMMU interrupt remapping, the hypervisor usually should trap and
emulate the write accesses to the MSI/MSI-X registers, so the hypervisor
is able to create the necessary interrupt remapping table entries in the
IOMMU, and the MSI/MSI-X interrupts can work in the VM. Hyper-V is different
from other hypervisors in that it does not trap and emulate the write
accesses, and instead it uses a para-virtualized method, which requires
the VM to call hv_compose_msi_msg() to notify the hypervisor of the info
that would be passed to the hypervisor in the case of the trap-and-emulate
method.
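
For illustration, the MSI-X branch of __pci_write_msi_msg(), abridged from
drivers/pci/msi.c: the message is written straight to the device's MSI-X
table with plain MMIO writes, which Hyper-V does not trap:

	} else if (entry->msi_attrib.is_msix) {
		void __iomem *base = pci_msix_desc_addr(entry);

		/* Plain MMIO writes: nothing notifies the hypervisor. */
		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
	}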

I mean this is a Hyper-V specific problem, so IMO we should fix the pci-hyperv
driver rather than change the PCI device drivers, which work perfectly on a
physical machine and on other hypervisors. Also it can be difficult or 
impossible to ask the authors of the aforementioned PCI device drivers to
destroy and re-create MSI/MSI-X across hibernation, especially for the
out-of-tree driver(s).

> Regardless, this commit log does not provide the information that
> it should.

Hi Lozenzo, I'm happy to add more info. Can you please let me know
what extra info I should provide?

> > Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> > Signed-off-by: Dexuan Cui 
> > Reviewed-by: Jake Oshins 
> >
> > ---
> >
> > Changes in v2:
> > Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> > Added Jake's Reviewed-by.
> >
> >  drivers/pci/controller/pci-hyperv.c | 44 +
> >  1 file changed, 44 insertions(+)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > index fc4c3a15e570..dd21afb5d62b 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
> > pbus = pdev->bus;
> > hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> >
> > +   if (hbus->state == hv_pcibus_removing) {
> > +   /*
> > +* During hibernation, when a CPU is offlined, the kernel tries
> > +* to move the interrupt to the remaining CPUs that haven't
> > +* been offlined yet. In this case, the below hv_do_hypercall()
> > +* always fails since the vmbus channel has been closed, so we
> > +* should not call the hypercall, but we still need
> > +* pci_msi_unmask_irq() to reset the mask bit in desc->masked:
> > +* see cpu_disable_common() -> fixup_irqs() ->
> > +* irq_migrate_all_off_this_cpu() -> migrate_one_irq().
> > +*/
> > +   pci_msi_unmask_irq(data);

RE: [PATCH v2] PCI: hv: Fix hibernation in case interrupts are not re-created

2020-09-21 Thread Dexuan Cui
> From: Dexuan Cui 
> Sent: Tuesday, September 8, 2020 4:18 PM
> 
> Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers,
> and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU
> Interrupt Remapping Table Entries. This is not an issue for a lot of
> PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which
> destroy and re-create the interrupts across hibernation, so
> hv_compose_msi_msg() is called automatically. However, some other PCI
> device drivers (e.g. the Nvidia driver) may not destroy and re-create
> the interrupts across hibernation, so hv_pci_resume() has to call
> hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
> receive MSI/MSI-X interrupts after hibernation.
> 
> Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
> Signed-off-by: Dexuan Cui 
> Reviewed-by: Jake Oshins 
> 
> ---
> 
> Changes in v2:
> Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
> Added Jake's Reviewed-by.
> 
>  drivers/pci/controller/pci-hyperv.c | 44 +
>  1 file changed, 44 insertions(+)

Hi Lorenzo, Bjorn,
Can you please take a look at this patch? 
I hope it still could have a chance to be in 5.9. :-)

Thanks,
-- Dexuan


RE: [PATCH 1/1] Drivers: hv: vmbus: Add timeout to vmbus_wait_for_unload

2020-09-13 Thread Dexuan Cui
> From: linux-hyperv-ow...@vger.kernel.org
>  On Behalf Of Michael Kelley
> Sent: Sunday, September 13, 2020 12:47 PM
> 
> vmbus_wait_for_unload() looks for a CHANNELMSG_UNLOAD_RESPONSE
> message
> coming from Hyper-V.  But if the message isn't found for some reason,
> the panic path gets hung forever.  Add a timeout of 10 seconds to prevent
> this.
> 
> Fixes: 415719160de3 ("Drivers: hv: vmbus: avoid scheduling in interrupt
> context in vmbus_initiate_unload()")
> Signed-off-by: Michael Kelley 

Reviewed-by: Dexuan Cui 


[PATCH net 1/2] hv_netvsc: Switch the data path at the right time during hibernation

2020-09-08 Thread Dexuan Cui
When netvsc_resume() is called, the mlx5 VF NIC has not been resumed yet,
so in the future the host might silently fail the netvsc_vf_changed()
-> netvsc_switch_datapath() call there, even if the call works now.

Call netvsc_vf_changed() in the NETDEV_CHANGE event handler: at that time
the mlx5 VF NIC has been resumed.

Fixes: 19162fd4063a ("hv_netvsc: Fix hibernation for mlx5 VF driver")
Signed-off-by: Dexuan Cui 
---
 drivers/net/hyperv/netvsc_drv.c | 11 +--
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 81c5c70b616a..4a25886e2346 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2619,7 +2619,6 @@ static int netvsc_resume(struct hv_device *dev)
struct net_device *net = hv_get_drvdata(dev);
struct net_device_context *net_device_ctx;
struct netvsc_device_info *device_info;
-   struct net_device *vf_netdev;
int ret;
 
rtnl_lock();
@@ -2632,15 +2631,6 @@ static int netvsc_resume(struct hv_device *dev)
netvsc_devinfo_put(device_info);
net_device_ctx->saved_netvsc_dev_info = NULL;
 
-   /* A NIC driver (e.g. mlx5) may keep the VF network interface across
-* hibernation, but here the data path is implicitly switched to the
-* netvsc NIC since the vmbus channel is closed and re-opened, so
-* netvsc_vf_changed() must be used to switch the data path to the VF.
-*/
-   vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev);
-   if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK)
-   ret = -EINVAL;
-
rtnl_unlock();
 
return ret;
@@ -2701,6 +2691,7 @@ static int netvsc_netdev_event(struct notifier_block *this,
return netvsc_unregister_vf(event_dev);
case NETDEV_UP:
case NETDEV_DOWN:
+   case NETDEV_CHANGE:
return netvsc_vf_changed(event_dev);
default:
return NOTIFY_DONE;
-- 
2.19.1



[PATCH net 2/2] hv_netvsc: Cache the current data path to avoid duplicate call and message

2020-09-08 Thread Dexuan Cui
The previous change "hv_netvsc: Switch the data path at the right time
during hibernation" adds the call of netvsc_vf_changed() upon
NETDEV_CHANGE, so it's necessary to avoid the duplicate call and message
when the VF is brought UP or DOWN.

Signed-off-by: Dexuan Cui 
---
 drivers/net/hyperv/hyperv_net.h |  3 +++
 drivers/net/hyperv/netvsc_drv.c | 21 -
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 2181d4538ab7..ff33f27cdcd3 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -974,6 +974,9 @@ struct net_device_context {
/* Serial number of the VF to team with */
u32 vf_serial;
 
+   /* Is the current data path through the VF NIC? */
+   bool  data_path_is_vf;
+
/* Used to temporarily save the config info across hibernation */
struct netvsc_device_info *saved_netvsc_dev_info;
 };
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 4a25886e2346..b7db3766f5b9 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2366,7 +2366,16 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
return NOTIFY_OK;
 }
 
-/* VF up/down change detected, schedule to change data path */
+/* Change the data path when VF UP/DOWN/CHANGE are detected.
+ *
+ * Typically a UP or DOWN event is followed by a CHANGE event, so
+ * net_device_ctx->data_path_is_vf is used to cache the current data path
+ * to avoid the duplicate call of netvsc_switch_datapath() and the duplicate
+ * message.
+ *
+ * During hibernation, if a VF NIC driver (e.g. mlx5) preserves the network
+ * interface, there is only the CHANGE event and no UP or DOWN event.
+ */
 static int netvsc_vf_changed(struct net_device *vf_netdev)
 {
struct net_device_context *net_device_ctx;
@@ -2383,6 +2392,10 @@ static int netvsc_vf_changed(struct net_device *vf_netdev)
if (!netvsc_dev)
return NOTIFY_DONE;
 
+   if (net_device_ctx->data_path_is_vf == vf_is_up)
+   return NOTIFY_OK;
+   net_device_ctx->data_path_is_vf = vf_is_up;
+
netvsc_switch_datapath(ndev, vf_is_up);
netdev_info(ndev, "Data path switched %s VF: %s\n",
vf_is_up ? "to" : "from", vf_netdev->name);
@@ -2624,6 +2637,12 @@ static int netvsc_resume(struct hv_device *dev)
rtnl_lock();
 
net_device_ctx = netdev_priv(net);
+
+   /* Reset the data path to the netvsc NIC before re-opening the vmbus
+* channel. Later netvsc_netdev_event() will switch the data path to
+* the VF upon the UP or CHANGE event.
+*/
+   net_device_ctx->data_path_is_vf = false;
device_info = net_device_ctx->saved_netvsc_dev_info;
 
ret = netvsc_attach(net, device_info);
-- 
2.19.1



[PATCH v2] PCI: hv: Fix hibernation in case interrupts are not re-created

2020-09-08 Thread Dexuan Cui
Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers,
and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU
Interrupt Remapping Table Entries. This is not an issue for a lot of
PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which
destroy and re-create the interrupts across hibernation, so
hv_compose_msi_msg() is called automatically. However, some other PCI
device drivers (e.g. the Nvidia driver) may not destroy and re-create
the interrupts across hibernation, so hv_pci_resume() has to call
hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
receive MSI/MSI-X interrupts after hibernation.

Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
Signed-off-by: Dexuan Cui 
Reviewed-by: Jake Oshins 

---

Changes in v2:
Fixed a typo in the comment in hv_irq_unmask. Thanks to Michael!
Added Jake's Reviewed-by.

 drivers/pci/controller/pci-hyperv.c | 44 +
 1 file changed, 44 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index fc4c3a15e570..dd21afb5d62b 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
pbus = pdev->bus;
hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
 
+   if (hbus->state == hv_pcibus_removing) {
+   /*
+* During hibernation, when a CPU is offlined, the kernel tries
+* to move the interrupt to the remaining CPUs that haven't
+* been offlined yet. In this case, the below hv_do_hypercall()
+* always fails since the vmbus channel has been closed, so we
+* should not call the hypercall, but we still need
+* pci_msi_unmask_irq() to reset the mask bit in desc->masked:
+* see cpu_disable_common() -> fixup_irqs() ->
+* irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+*/
+   pci_msi_unmask_irq(data);
+   return;
+   }
+
	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
 
	params = &hbus->retarget_msi_interrupt_params;
@@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev)
return 0;
 }
 
+static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
+{
+   struct msi_desc *entry;
+   struct irq_data *irq_data;
+
+   for_each_pci_msi_entry(entry, pdev) {
+   irq_data = irq_get_irq_data(entry->irq);
+   if (WARN_ON_ONCE(!irq_data))
+   return -EINVAL;
+
+   hv_compose_msi_msg(irq_data, >msg);
+   }
+
+   return 0;
+}
+
+/*
+ * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
+ * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
+ * emulate the accesses, we have to call hv_compose_msi_msg() to ask
+ * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
+ */
+static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
+{
+   pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
+}
+
 static int hv_pci_resume(struct hv_device *hdev)
 {
struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
@@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
 
prepopulate_bars(hbus);
 
+   hv_pci_restore_msi_state(hbus);
+
hbus->state = hv_pcibus_installed;
return 0;
 out:
-- 
2.19.1



RE: [PATCH net v2] hv_netvsc: Fix hibernation for mlx5 VF driver

2020-09-08 Thread Dexuan Cui
> From: Michael Kelley 
> Sent: Tuesday, September 8, 2020 1:49 PM
> > @@ -2635,6 +2632,15 @@ static int netvsc_resume(struct hv_device *dev)
> > netvsc_devinfo_put(device_info);
> > net_device_ctx->saved_netvsc_dev_info = NULL;
> >
> > +   /* A NIC driver (e.g. mlx5) may keep the VF network interface across
> > +* hibernation, but here the data path is implicitly switched to the
> > +* netvsc NIC since the vmbus channel is closed and re-opened, so
> > +* netvsc_vf_changed() must be used to switch the data path to the VF.
> > +*/
> > +   vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev);
> > +   if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK)
> > +   ret = -EINVAL;
> > +
> 
> I'm a little late looking at this code.  But a question:  Is it possible for
> netvsc_resume() to be called before the VF driver's resume function
> is called?  
Yes, actually this is what happens 100% of the time. :-)

Upon suspend:
1. the mlx5 driver suspends the VF NIC.
2. the pci-hyperv suspends the VF vmbus device, including closing the channel.
3. hv_netvsc suspends the netvsc vmbus device, including closing the channel.

Note: between 1) and 3), the data path is still the mlx5 VF, but obviously
the VF NIC is not working. IMO this is not an issue in practice, since
typically we don't really expect this to work once the suspending starts.

Upon resume:
1. hv_netvsc resumes the netvsc vmbus device, including opening the channel.

At this time, the data path should be the netvsc NIC since we close and re-open 
the netvsc vmbus channel, and I believe the default data path is netvsc.

With this patch, the data path is switched to the VF NIC in netvsc_resume()
because "netif_running(vf_netdev)" is true for the mlx5 VF NIC (CX-4), even
though the VF NIC device has not been resumed yet. According to my test, I
believe this switching succeeds. Note: when the host receives the VM's
NVSP_MSG4_TYPE_SWITCH_DATA_PATH request, it looks like the host doesn't check
whether the VF vmbus device and the VF PCI device are really "activated"[1];
it simply programs the FPGA GFT for the newly-requested data path, and it
doesn't send a response message to the VM indicating whether the switch
succeeded or failed.
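
For reference, netvsc_switch_datapath() abridged from
drivers/net/hyperv/netvsc.c (as of this thread): the switch request is a
one-way NVSP message sent without requesting a completion, which matches the
no-response behavior described above:

void netvsc_switch_datapath(struct net_device *ndev, bool vf)
{
	struct net_device_context *net_device_ctx = netdev_priv(ndev);
	struct hv_device *dev = net_device_ctx->device_ctx;
	struct netvsc_device *nv_dev = rtnl_dereference(net_device_ctx->nvdev);
	struct nvsp_message *init_pkt = &nv_dev->channel_init_pkt;

	memset(init_pkt, 0, sizeof(struct nvsp_message));
	init_pkt->hdr.msg_type = NVSP_MSG4_TYPE_SWITCH_DATA_PATH;
	init_pkt->msg.v4_msg.active_dp.active_datapath =
		vf ? NVSP_DATAPATH_VF : NVSP_DATAPATH_SYNTHETIC;

	/* One-way message: no completion requested, no host response. */
	vmbus_sendpacket(dev->channel, init_pkt, sizeof(struct nvsp_message),
			 (unsigned long)init_pkt, VM_PKT_DATA_INBAND, 0);
}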

So, at this time, any incoming Ethernet packets (except the broadcast/multicast
and TCP SYN packets, which always go through the netvsc NIC on Azure) cannot
be received by the VF NIC, which has not been resumed yet. IMO this is not an
issue in practice, since typically we don't really expect this to work before
the resuming is fully finished. BTW, at this time the userspace is not thawed
yet, so no application can transmit packets.

2. pci-hyperv resumes the VF vmbus device, including opening the channel.

3. the mlx5 driver resumes the VF NIC, and now everything is backed to normal.


[1] In the future, if the host starts to check if the VF vmbus/PCI devices are 
activated upon the receipt of the VM's NVSP_MSG4_TYPE_SWITCH_DATA_PATH
request, and refuses to switch the data path if they're not activated, then
we'll be in trouble, but even if that happens, this patch itself doesn't make
the situation worse.

So ideally we need a mechanism to switch the data path to the mlx5 VF NIC
*after* the mlx5 NIC is resumed. Unluckily, it looks like there is no standard
notification mechanism for this, since the mlx5 driver preserves the VF network
interface. I'll discuss with the Mellanox developer who implemented mlx5
hibernation support, and probably mlx5 should also destroy/re-create the
VF network interface across hibernation, just as the mlx4 driver does.


> If so, is it possible for netvsc_vf_changed() to find that the VF
> is not up, and hence to switch the data path away from the VF instead of
> to the VF?
> 
> Michael

When we are in netvsc_resume(), in my test "netif_running(vf_netdev)" is 
always true for the mlx5 VF NIC (CX-4), so netvsc_vf_changed() should switch
the data path to the VF.

static inline bool netif_running(const struct net_device *dev)
{
	return test_bit(__LINK_STATE_START, &dev->state);
}

The flag __LINK_STATE_START is only cleared when the NIC is brought "DOWN",
and that case is already automatically handled by netvsc_netdev_event().

For mlx4 (CX-3), net_device_ctx->vf_netdev is NULL in netvsc_resume(), so the
CX-3 VF NIC is not affected by this patch.

Thanks,
-- Dexuan


RE: [PATCH] PCI: hv: Fix hibernation in case interrupts are not re-created

2020-09-08 Thread Dexuan Cui
> From: Michael Kelley 
> Sent: Tuesday, September 8, 2020 2:16 PM
> > @@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
> > pbus = pdev->bus;
> > hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> >
> > +   if (hbus->state == hv_pcibus_removing) {
> > +   /*
> > +* During hibernatin, when a CPU is offlined, the kernel tries
> 
s/hibernatin/hibernation/
Thanks! I'll post a v2 shortly with this typo fixed, and with Jake's 
Reviewed-by.

Thanks,
-- Dexuan


[PATCH net v2] hv_netvsc: Fix hibernation for mlx5 VF driver

2020-09-07 Thread Dexuan Cui
mlx5_suspend()/resume() keep the network interface, so during hibernation
netvsc_unregister_vf() and netvsc_register_vf() are not called, and hence
netvsc_resume() should call netvsc_vf_changed() to switch the data path
back to the VF after hibernation. Note: after we close and re-open the
vmbus channel of the netvsc NIC in netvsc_suspend() and netvsc_resume(),
the data path is implicitly switched to the netvsc NIC. Similarly,
netvsc_suspend() should not call netvsc_unregister_vf(), otherwise the VF
can no longer be used after hibernation.

For mlx4, since the VF network interface is explicitly destroyed and
re-created during hibernation (see mlx4_suspend()/resume()), hv_netvsc
already explicitly switches the data path from and to the VF automatically
via netvsc_register_vf() and netvsc_unregister_vf(), so mlx4 doesn't need
this fix. Note: mlx4 can still work with the fix because in
netvsc_suspend()/resume() ndev_ctx->vf_netdev is NULL for mlx4.

Fixes: 0efeea5fb153 ("hv_netvsc: Add the support of hibernation")
Signed-off-by: Dexuan Cui 
---

Changes in v2 (Thanks Jakub Kicinski ):
Added comments in the changelog and the code about the implicit
data path switching to the netvsc when we close/re-open the vmbus
channels.
Used reverse xmas tree ordering in netvsc_remove().

 drivers/net/hyperv/netvsc_drv.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 64b0a74c1523..81c5c70b616a 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2587,8 +2587,8 @@ static int netvsc_remove(struct hv_device *dev)
 static int netvsc_suspend(struct hv_device *dev)
 {
struct net_device_context *ndev_ctx;
-   struct net_device *vf_netdev, *net;
struct netvsc_device *nvdev;
+   struct net_device *net;
int ret;
 
net = hv_get_drvdata(dev);
@@ -2604,10 +2604,6 @@ static int netvsc_suspend(struct hv_device *dev)
goto out;
}
 
-   vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
-   if (vf_netdev)
-   netvsc_unregister_vf(vf_netdev);
-
/* Save the current config info */
ndev_ctx->saved_netvsc_dev_info = netvsc_devinfo_get(nvdev);
 
@@ -2623,6 +2619,7 @@ static int netvsc_resume(struct hv_device *dev)
struct net_device *net = hv_get_drvdata(dev);
struct net_device_context *net_device_ctx;
struct netvsc_device_info *device_info;
+   struct net_device *vf_netdev;
int ret;
 
rtnl_lock();
@@ -2635,6 +2632,15 @@ static int netvsc_resume(struct hv_device *dev)
netvsc_devinfo_put(device_info);
net_device_ctx->saved_netvsc_dev_info = NULL;
 
+   /* A NIC driver (e.g. mlx5) may keep the VF network interface across
+* hibernation, but here the data path is implicitly switched to the
+* netvsc NIC since the vmbus channel is closed and re-opened, so
+* netvsc_vf_changed() must be used to switch the data path to the VF.
+*/
+   vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev);
+   if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK)
+   ret = -EINVAL;
+
rtnl_unlock();
 
return ret;
-- 
2.19.1



RE: [RESEND][PATCH v3] x86/apic/flat64: Add back the early_param("apic", parse_apic)

2020-09-07 Thread Dexuan Cui
> From: Thomas Gleixner 
> Sent: Friday, July 17, 2020 6:03 AM
> [...]
Hi, I'm very sorry for this extremely late reply -- I was sidetracked by
something else and I just had a chance to revisit the issue. Thank you tglx
for the comments and suggestions, which I think are reasonable. I realized
this patch is problematic. The good news is that the LAPIC emulation bug has
been fixed in the latest version of Hyper-V, and now kdump can work reliably.
I think the hypervisor fix would be backported to old Hyper-V versions, so
hopefully this won't be an issue over time.

Thanks,
-- Dexuan


RE: [PATCH net] hv_netvsc: Fix hibernation for mlx5 VF driver

2020-09-05 Thread Dexuan Cui
> From: Jakub Kicinski 
> Sent: Saturday, September 5, 2020 4:27 PM
> [...]
> On Fri,  4 Sep 2020 19:52:18 -0700 Dexuan Cui wrote:
> > mlx5_suspend()/resume() keep the network interface, so during hibernation
> > netvsc_unregister_vf() and netvsc_register_vf() are not called, and hence
> > netvsc_resume() should call netvsc_vf_changed() to switch the data path
> > back to the VF after hibernation.
> 
> Does suspending the system automatically switch back to the synthetic
> datapath? 
Yes. 

For mlx4, since the VF network interface is explicitly destroyed and re-created
during hibernation (i.e. suspend + resume), hv_netvsc explicitly switches the
data path from and to the VF.

For mlx5, the VF network interface persists across hibernation, so there is no
explicit switch-over, but after we close and re-open the vmbus channel of
the netvsc NIC in netvsc_suspend() and netvsc_resume(), the data path is
implicitly switched to the netvsc NIC, and with this patch netvsc_resume() ->
netvsc_vf_changed() switches the data path back to the mlx5 NIC.

> Please clarify this in the commit message and/or add a code
> comment.
I will add a comment in the commit message and the code.
 
> > @@ -2587,7 +2587,7 @@ static int netvsc_remove(struct hv_device *dev)
> >  static int netvsc_suspend(struct hv_device *dev)
> >  {
> > struct net_device_context *ndev_ctx;
> > -   struct net_device *vf_netdev, *net;
> > +   struct net_device *net;
> > struct netvsc_device *nvdev;
> > int ret;
> 
> Please keep reverse xmas tree variable ordering.

Will do.

> > @@ -2635,6 +2632,10 @@ static int netvsc_resume(struct hv_device *dev)
> > netvsc_devinfo_put(device_info);
> > net_device_ctx->saved_netvsc_dev_info = NULL;
> >
> > +   vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev);
> > +   if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK)
> > +   ret = -EINVAL;
> 
> Should you perhaps remove the VF in case of the failure?
IMO this failure actually should not happen since we're resuming the netvsc
NIC, so we're sure we have a valid pointer to the netvsc net device, and
netvsc_vf_changed() should be able to find the netvsc pointer and return
NOTIFY_OK. In case of a failure, something really bad must be happening,
and I'm not sure if it's safe to simply remove the VF, so I just return
-EINVAL for simplicity, since I believe the failure should not happen
in practice.

I would rather keep the code as-is, but I'm OK to add a WARN_ON(1) if you
think that's necessary.

Thanks,
-- Dexuan


[PATCH] Drivers: hv: vmbus: hibernation: do not hang forever in vmbus_bus_resume()

2020-09-04 Thread Dexuan Cui
After we Stop and later Start a VM that uses Accelerated Networking (NIC
SR-IOV), currently the VF vmbus device's Instance GUID can change, so after
vmbus_bus_resume() -> vmbus_request_offers(), vmbus_onoffer() can not find
the original vmbus channel of the VF, and hence we can't complete()
vmbus_connection.ready_for_resume_event in check_ready_for_resume_event(),
and the VM hangs in vmbus_bus_resume() forever.

Fix the issue by adding a timeout, so the resuming can still succeed, and
the saved state is not lost, and according to my test, the user can disable
Accelerated Networking and then will be able to SSH into the VM for
further recovery. Also prevent the VM in question from suspending again.

The host will be fixed so in future the Instance GUID will stay the same
across hibernation.

Fixes: d8bd2d442bb2 ("Drivers: hv: vmbus: Resume after fixing up old primary channels")
Signed-off-by: Dexuan Cui 
---
 drivers/hv/vmbus_drv.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 910b6e90866c..946d0aba101f 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2382,7 +2382,10 @@ static int vmbus_bus_suspend(struct device *dev)
	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)
		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
 
-   WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0);
+   if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) {
+   pr_err("Can not suspend due to a previous failed resuming\n");
+   return -EBUSY;
+   }
 
	mutex_lock(&vmbus_connection.channel_mutex);
 
@@ -2456,7 +2459,9 @@ static int vmbus_bus_resume(struct device *dev)
 
vmbus_request_offers();
 
-   wait_for_completion(&vmbus_connection.ready_for_resume_event);
+   if (wait_for_completion_timeout(
+   &vmbus_connection.ready_for_resume_event, 10 * HZ) == 0)
+   pr_err("Some vmbus device is missing after suspending?\n");
 
/* Reset the event for the next suspend. */
reinit_completion(_connection.ready_for_suspend_event);
-- 
2.19.1



[PATCH] PCI: hv: Fix hibernation in case interrupts are not re-created

2020-09-04 Thread Dexuan Cui
Hyper-V doesn't trap and emulate the accesses to the MSI/MSI-X registers,
and we must use hv_compose_msi_msg() to ask Hyper-V to create the IOMMU
Interrupt Remapping Table Entries. This is not an issue for a lot of
PCI device drivers (e.g. NVMe driver, Mellanox NIC drivers), which
destroy and re-create the interrupts across hibernation, so
hv_compose_msi_msg() is called automatically. However, some other PCI
device drivers (e.g. the Nvidia driver) may not destroy and re-create
the interrupts across hibernation, so hv_pci_resume() has to call
hv_compose_msi_msg(), otherwise the PCI device drivers can no longer
receive MSI/MSI-X interrupts after hibernation.

Fixes: ac82fc832708 ("PCI: hv: Add hibernation support")
Cc: Jake Oshins 
Signed-off-by: Dexuan Cui 
---
 drivers/pci/controller/pci-hyperv.c | 44 +
 1 file changed, 44 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index fc4c3a15e570..abefff9a20e1 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1211,6 +1211,21 @@ static void hv_irq_unmask(struct irq_data *data)
pbus = pdev->bus;
hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
 
+   if (hbus->state == hv_pcibus_removing) {
+   /*
+* During hibernatin, when a CPU is offlined, the kernel tries
+* to move the interrupt to the remaining CPUs that haven't
+* been offlined yet. In this case, the below hv_do_hypercall()
+* always fails since the vmbus channel has been closed, so we
+* should not call the hypercall, but we still need
+* pci_msi_unmask_irq() to reset the mask bit in desc->masked:
+* see cpu_disable_common() -> fixup_irqs() ->
+* irq_migrate_all_off_this_cpu() -> migrate_one_irq().
+*/
+   pci_msi_unmask_irq(data);
+   return;
+   }
+
	spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
 
	params = &hbus->retarget_msi_interrupt_params;
@@ -3372,6 +3387,33 @@ static int hv_pci_suspend(struct hv_device *hdev)
return 0;
 }
 
+static int hv_pci_restore_msi_msg(struct pci_dev *pdev, void *arg)
+{
+   struct msi_desc *entry;
+   struct irq_data *irq_data;
+
+   for_each_pci_msi_entry(entry, pdev) {
+   irq_data = irq_get_irq_data(entry->irq);
+   if (WARN_ON_ONCE(!irq_data))
+   return -EINVAL;
+
+   hv_compose_msi_msg(irq_data, >msg);
+   }
+
+   return 0;
+}
+
+/*
+ * Upon resume, pci_restore_msi_state() -> ... ->  __pci_write_msi_msg()
+ * re-writes the MSI/MSI-X registers, but since Hyper-V doesn't trap and
+ * emulate the accesses, we have to call hv_compose_msi_msg() to ask
+ * Hyper-V to re-create the IOMMU Interrupt Remapping Table Entries.
+ */
+static void hv_pci_restore_msi_state(struct hv_pcibus_device *hbus)
+{
+   pci_walk_bus(hbus->pci_bus, hv_pci_restore_msi_msg, NULL);
+}
+
 static int hv_pci_resume(struct hv_device *hdev)
 {
struct hv_pcibus_device *hbus = hv_get_drvdata(hdev);
@@ -3405,6 +3447,8 @@ static int hv_pci_resume(struct hv_device *hdev)
 
prepopulate_bars(hbus);
 
+   hv_pci_restore_msi_state(hbus);
+
hbus->state = hv_pcibus_installed;
return 0;
 out:
-- 
2.19.1



[PATCH net] hv_netvsc: Fix hibernation for mlx5 VF driver

2020-09-04 Thread Dexuan Cui
mlx5_suspend()/resume() keep the network interface, so during hibernation
netvsc_unregister_vf() and netvsc_register_vf() are not called, and hence
netvsc_resume() should call netvsc_vf_changed() to switch the data path
back to the VF after hibernation. Similarly, netvsc_suspend() should
not call netvsc_unregister_vf().

BTW, mlx4_suspend()/resume() are different in that they destroy and
re-create the network device, so netvsc_register_vf() and
netvsc_unregister_vf() are automatically called. Note: mlx4 can also work
with the changes here because in netvsc_suspend()/resume()
ndev_ctx->vf_netdev is NULL for mlx4.

Fixes: 0efeea5fb153 ("hv_netvsc: Add the support of hibernation")
Signed-off-by: Dexuan Cui 
---
 drivers/net/hyperv/netvsc_drv.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 64b0a74c1523..f896059a9588 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2587,7 +2587,7 @@ static int netvsc_remove(struct hv_device *dev)
 static int netvsc_suspend(struct hv_device *dev)
 {
struct net_device_context *ndev_ctx;
-   struct net_device *vf_netdev, *net;
+   struct net_device *net;
struct netvsc_device *nvdev;
int ret;
 
@@ -2604,10 +2604,6 @@ static int netvsc_suspend(struct hv_device *dev)
goto out;
}
 
-   vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
-   if (vf_netdev)
-   netvsc_unregister_vf(vf_netdev);
-
/* Save the current config info */
ndev_ctx->saved_netvsc_dev_info = netvsc_devinfo_get(nvdev);
 
@@ -2623,6 +2619,7 @@ static int netvsc_resume(struct hv_device *dev)
struct net_device *net = hv_get_drvdata(dev);
struct net_device_context *net_device_ctx;
struct netvsc_device_info *device_info;
+   struct net_device *vf_netdev;
int ret;
 
rtnl_lock();
@@ -2635,6 +2632,10 @@ static int netvsc_resume(struct hv_device *dev)
netvsc_devinfo_put(device_info);
net_device_ctx->saved_netvsc_dev_info = NULL;
 
+   vf_netdev = rtnl_dereference(net_device_ctx->vf_netdev);
+   if (vf_netdev && netvsc_vf_changed(vf_netdev) != NOTIFY_OK)
+   ret = -EINVAL;
+
rtnl_unlock();
 
return ret;
-- 
2.19.1



[RESEND][PATCH v3] x86/apic/flat64: Add back the early_param("apic", parse_apic)

2020-06-26 Thread Dexuan Cui
parse_apic() allows the user to try a different APIC driver than the
default one that's automatically chosen. It works for X86-32, but
doesn't work for X86-64 because it was removed in 2009 for X86-64 by
commit 7b38725318f4 ("x86: remove subarchitecture support code"),
whose changelog doesn't explicitly describe the removal for X86-64.

The patch adds back the functionality for X86-64. The intent is mainly
to work around an APIC emulation bug in Hyper-V in the case of kdump:
currently Hyper-V does not honor the disabled state of the local APICs,
so all the IOAPIC-based interrupts may not be delivered to the correct
virtual CPU, if the logical-mode APIC driver is used (the kdump
kernel usually uses the logical-mode APIC driver, since typically
only 1 CPU is active). Luckily the kdump issue can be worked around by
forcing the kdump kernel to use physical mode, before the fix to Hyper-V
becomes widely available.

The current algorithm of choosing an APIC driver is:

1. The global pointer "struct apic *apic" has a default value, i.e
"apic_default" on X86-32, and "apic_flat" on X86-64.

2. If the early_param "apic=" is specified, parse_apic() is called and
the pointer "apic" is changed if a matching APIC driver is found.

3. default_acpi_madt_oem_check() calls the acpi_madt_oem_check() method
of all APIC drivers, which may override the "apic" pointer.

4. default_setup_apic_routing() may override the "apic" pointer, e.g.
by calling the probe() method of all APIC drivers. Note: refer to the
order of the APIC drivers specified in arch/x86/kernel/apic/Makefile.

The patch is safe because if the apic= early param is not specified,
the current algorithm of choosing an APIC driver is unchanged; when the
param is specified (e.g. on X86-64, "apic=physical flat"), the kernel
still tries to find a "more suitable" APIC driver in the above step 3 and
4: e.g. if the BIOS/firmware requires that apic_x2apic_phys should be used,
the above step 4 will override the APIC driver to apic_x2apic_phys, even
if an early_param "apic=physical flat" is specified.

On Hyper-V, when a Linux VM has <= 8 virtual CPUs, if we use
"apic=physical flat", sending IPIs to multiple vCPUs is still fast because
Linux VM uses the para-virtualized IPI hypercalls: see hv_apic_init().

The patch adds the __init tag for flat_acpi_madt_oem_check() and
physflat_acpi_madt_oem_check() to avoid a warning seen with "make W=1":
flat_acpi_madt_oem_check() accesses cmdline_apic, which has a __initdata
tag.

Fixes: 7b38725318f4 ("x86: remove subarchitecture support code")
Signed-off-by: Dexuan Cui 
---

Changes in v2 (5/28/2020):
  Updated Documentation/admin-guide/kernel-parameters.txt. [Randy Dunlap]
  Changed apic_set_verbosity().
  Enhanced the changelog.

Changes in v3 (5/31/2020):
  Added the __init tag for flat_acpi_madt_oem_check() and
physflat_acpi_madt_oem_check() to avoid a warning seen with "make W=1".
  (Thanks to kbuild test robot ).

  Updated the changelog for the __init tag.

Today is 6/26/2020 and this is just a RESEND of v3, which was posted
on 5/31 (https://lkml.org/lkml/2020/5/31/198).

 .../admin-guide/kernel-parameters.txt | 11 +--
 arch/x86/kernel/apic/apic.c   | 11 +++
 arch/x86/kernel/apic/apic_flat_64.c   | 31 +--
 3 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7bc83f3d9bdf..c4503fff9348 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -341,10 +341,15 @@
Format: { quiet (default) | verbose | debug }
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
-   For X86-32, this can also be used to specify an APIC
-   driver name.
+   This can also be used to specify an APIC driver name.
Format: apic=driver_name
-   Examples: apic=bigsmp
+   Examples:
+ On X86-32:  apic=bigsmp
+ On X86-64: "apic=physical flat"
+ Note: the available driver names depend on the
+ architecture and the kernel config; the setting may
+ be overridden by the acpi_madt_oem_check() and probe()
+ methods of other APIC drivers.
 
apic_extnmi=[APIC,X86] External NMI delivery setting
Format: { bsp (default) | all | none }
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index e53dda210cd7..6f7d75b6358b 100644
--- a/arch/x86/kernel/apic/apic.c

RE: hv_hypercall_pg page permissios

2020-06-15 Thread Dexuan Cui
> From: linux-hyperv-ow...@vger.kernel.org
>  On Behalf Of Dexuan Cui
> Sent: Monday, June 15, 2020 10:42 AM
> > >
> > > Hi hch,
> > > The patch is merged into the mainine recently, but unluckily we noticed
> > > a warning with CONFIG_DEBUG_WX=y
> > >
> > > Should we revert this patch, or figure out a way to ask the DEBUG_WX
> > > code to ignore this page?
> >
> > Are you sure it is hv_hypercall_pg?
Yes, 100% sure. I printed the value of hv_hypercall_pg and it matched the
address in the warning line "x86/mm: Found insecure W+X mapping at address".

I did this experiment:
  1. export vmalloc_exec and ptdump_walk_pgd_level_checkwx.
  2. write a test module that calls them (sketched below).
  3. It turns out that every call of vmalloc_exec() triggers such a warning.
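
A minimal sketch of that test module, assuming the two symbols from step 1
have been exported with EXPORT_SYMBOL for the experiment (they are not
exported in a stock kernel):

#include <linux/module.h>
#include <linux/vmalloc.h>
#include <asm/pgtable.h>

static void *buf;

static int __init wx_test_init(void)
{
	buf = vmalloc_exec(PAGE_SIZE);	/* mapped with PAGE_KERNEL_EXEC */
	if (!buf)
		return -ENOMEM;

	ptdump_walk_pgd_level_checkwx();	/* emits the W+X warning */
	return 0;
}

static void __exit wx_test_exit(void)
{
	vfree(buf);
}

module_init(wx_test_init);
module_exit(wx_test_exit);
MODULE_LICENSE("GPL");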

vmalloc_exec() uses PAGE_KERNEL_EXEC, which is defined as
   (__PP|__RW|   0|___A|   0|___D|   0|___G)

It looks like the logic in note_page() is: for each RW page, if the NX bit is
unset, report the page as an insecure W+X mapping. IMO this explains the
warning?
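
In other words, the check in note_page() (arch/x86/mm/dump_pagetables.c)
boils down to the following predicate, shown here as a simplified sketch;
PAGE_KERNEL_EXEC (__RW set, NX clear) matches it by construction:

static bool insecure_wx(pteval_t eff)
{
	/* Writable and executable (NX clear): reported as W+X. */
	return (eff & _PAGE_RW) && !(eff & _PAGE_NX);
}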

Thanks,
-- Dexuan


RE: hv_hypercall_pg page permissios

2020-06-15 Thread Dexuan Cui
> From: Vitaly Kuznetsov 
> Sent: Monday, June 15, 2020 1:35 AM
> Dexuan Cui  writes:
> 
> >> From: linux-hyperv-ow...@vger.kernel.org
> >>  On Behalf Of Andy Lutomirski
> >> Sent: Tuesday, April 7, 2020 2:01 PM
> >> > On Apr 7, 2020, at 12:38 AM, Christoph Hellwig  wrote:
> >> >
> >> > On Tue, Apr 07, 2020 at 09:28:01AM +0200, Vitaly Kuznetsov wrote:
> >> >> Christoph Hellwig  writes:
> >> >>
> >> >>> Hi all,
> >> >>>
> >> >>> The x86 Hyper-V hypercall page (hv_hypercall_pg) is the only allocation
> >> >>> in the kernel using __vmalloc with executable permissions, and the
> >> >>> only user of PAGE_KERNEL_RX.  Is there any good reason it needs to
> >> >>> be readable?  Otherwise we could use vmalloc_exec and kill off
> >> >>> PAGE_KERNEL_RX.  Note that before 372b1e91343e6 ("drivers: hv:
> Turn
> >> off
> >> >>> write permission on the hypercall page") it was even mapped writable..
> >> >>
> >> >> [There is nothing secret in the hypercall page, by reading it you can
> >> >> figure out if you're running on Intel or AMD (VMCALL/VMMCALL) but it's
> >> >> likely not the only possible way :-)]
> >> >>
> >> >> I see no reason for hv_hypercall_pg to remain readable. I just
> >> >> smoke-tested
> >> >
> >> > Thanks, I have the same in my WIP tree, but just wanted to confirm this
> >> > makes sense.
> >>
> >> Just to make sure we’re all on the same page: x86 doesn’t normally have
> an
> >> execute-only mode. Executable memory in the kernel is readable unless you
> >> are using fancy hypervisor-based XO support.
> >
> > Hi hch,
> > The patch is merged into the mainine recently, but unluckily we noticed
> > a warning with CONFIG_DEBUG_WX=y (it looks typically this config is defined
> > by default in Linux distros, at least in Ubuntu 18.04's
> > /boot/config-4.18.0-11-generic).
> >
> > Should we revert this patch, or figure out a way to ask the DEBUG_WX code
> to
> > ignore this page?
> >
> 
> Are you sure it is hv_hypercall_pg? 
Yes, 100% sure. I printed the value of hv_hypercall_pg and it matched the
address in the warning line "x86/mm: Found insecure W+X mapping at address".

> AFAIU it shouldn't be W+X as we
> are allocating it with vmalloc_exec(). In other words, if you revert
> 78bb17f76edc, does the issue go away?
> 
> Vitaly

Yes, the warning goes away if I revert
78bb17f76edc ("x86/hyperv: use vmalloc_exec for the hypercall page")
88dca4ca5a93 ("mm: remove the pgprot argument to __vmalloc") 
(I have to revert the second as well with some manual adjustments, since
__vmalloc() has 2 parameters now.)

Thanks,
Dexuan


RE: [PATCH v3] x86/apic/flat64: Add back the early_param("apic", parse_apic)

2020-06-14 Thread Dexuan Cui
> From: Dexuan Cui 
> Sent: Sunday, May 31, 2020 9:49 AM
> To: t...@linutronix.de; mi...@redhat.com; rdun...@infradead.org;
> b...@alien8.de; h...@zytor.com; x...@kernel.org; pet...@infradead.org;
> alli...@lohutok.net; alexios.zav...@intel.com; gre...@linuxfoundation.org;
> Dexuan Cui ; na...@vmware.com; Michael Kelley
> ; Long Li 
> Cc: linux-kernel@vger.kernel.org; linux-hyp...@vger.kernel.org
> Subject: [PATCH v3] x86/apic/flat64: Add back the early_param("apic",
> parse_apic)
> 
> parse_apic() allows the user to try a different APIC driver than the
> default one that's automatically chosen. It works for X86-32, but
> doesn't work for X86-64 because it was removed in 2009 for X86-64 by
> commit 7b38725318f4 ("x86: remove subarchitecture support code"),
> whose changelog doesn't explicitly describe the removal for X86-64.
> 
> The patch adds back the functionality for X86-64. The intent is mainly
> to work around an APIC emulation bug in Hyper-V in the case of kdump:
> currently Hyper-V does not honor the disabled state of the local APICs,
> so all the IOAPIC-based interrupts may not be delivered to the correct
> virtual CPU, if the logical-mode APIC driver is used (the kdump
> kernel usually uses the logical-mode APIC driver, since typically
> only 1 CPU is active). Luckily the kdump issue can be worked around by
> forcing the kdump kernel to use physical mode, before the fix to Hyper-V
> becomes widely available.
> 
> The current algorithm of choosing an APIC driver is:
> 
> 1. The global pointer "struct apic *apic" has a default value, i.e
> "apic_default" on X86-32, and "apic_flat" on X86-64.
> 
> 2. If the early_param "apic=" is specified, parse_apic() is called and
> the pointer "apic" is changed if a matching APIC driver is found.
> 
> 3. default_acpi_madt_oem_check() calls the acpi_madt_oem_check() method
> of all APIC drivers, which may override the "apic" pointer.
> 
> 4. default_setup_apic_routing() may override the "apic" pointer, e.g.
> by calling the probe() method of all APIC drivers. Note: refer to the
> order of the APIC drivers specified in arch/x86/kernel/apic/Makefile.
> 
> The patch is safe because if the apic= early param is not specified,
> the current algorithm of choosing an APIC driver is unchanged; when the
> param is specified (e.g. on X86-64, "apic=physical flat"), the kernel
> still tries to find a "more suitable" APIC driver in the above step 3 and
> 4: e.g. if the BIOS/firmware requires that apic_x2apic_phys should be used,
> the above step 4 will override the APIC driver to apic_x2apic_phys, even
> if an early_param "apic=physical flat" is specified.
> 
> On Hyper-V, when a Linux VM has <= 8 virtual CPUs, if we use
> "apic=physical flat", sending IPIs to multiple vCPUs is still fast because
> Linux VM uses the para-virtualized IPI hypercalls: see hv_apic_init().
> 
> The patch adds the __init tag for flat_acpi_madt_oem_check() and
> physflat_acpi_madt_oem_check() to avoid a warning seen with "make W=1":
> flat_acpi_madt_oem_check() accesses cmdline_apic, which has a __initdata
> tag.
> 
> Fixes: 7b38725318f4 ("x86: remove subarchitecture support code")
> Signed-off-by: Dexuan Cui 
> ---
> 
> Changes in v2:
>   Updated Documentation/admin-guide/kernel-parameters.txt. [Randy
> Dunlap]
>   Changed apic_set_verbosity().
>   Enhanced the changelog.
> 
> Changes in v3:
>   Added the __init tag for flat_acpi_madt_oem_check() and
> physflat_acpi_madt_oem_check() to avoid a warning seen with "make W=1".
>   (Thanks to kbuild test robot ).
> 
>   Updated the changelog for the __init tag.
> 
>  .../admin-guide/kernel-parameters.txt | 11 +--
>  arch/x86/kernel/apic/apic.c   | 11 +++
>  arch/x86/kernel/apic/apic_flat_64.c   | 31 +--
>  3 files changed, 40 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt
> b/Documentation/admin-guide/kernel-parameters.txt
> index 7bc83f3d9bdf..c4503fff9348 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -341,10 +341,15 @@
>   Format: { quiet (default) | verbose | debug }
>   Change the amount of debugging information output
>   when initialising the APIC and IO-APIC components.
> - For X86-32, this can also be used to specify an APIC
> - driver name.
> + This can also be used to specify an APIC driver name.
>

RE: hv_hypercall_pg page permissios

2020-06-12 Thread Dexuan Cui
> From: linux-hyperv-ow...@vger.kernel.org
>  On Behalf Of Andy Lutomirski
> Sent: Tuesday, April 7, 2020 2:01 PM
> To: Christoph Hellwig 
> Cc: vkuznets ; x...@kernel.org;
> linux-hyp...@vger.kernel.org; linux-kernel@vger.kernel.org; KY Srinivasan
> ; Stephen Hemminger ;
> Andy Lutomirski ; Peter Zijlstra 
> Subject: Re: hv_hypercall_pg page permissios
> 
> 
> > On Apr 7, 2020, at 12:38 AM, Christoph Hellwig  wrote:
> >
> > On Tue, Apr 07, 2020 at 09:28:01AM +0200, Vitaly Kuznetsov wrote:
> >> Christoph Hellwig  writes:
> >>
> >>> Hi all,
> >>>
> >>> The x86 Hyper-V hypercall page (hv_hypercall_pg) is the only allocation
> >>> in the kernel using __vmalloc with executable permissions, and the
> >>> only user of PAGE_KERNEL_RX.  Is there any good reason it needs to
> >>> be readable?  Otherwise we could use vmalloc_exec and kill off
> >>> PAGE_KERNEL_RX.  Note that before 372b1e91343e6 ("drivers: hv: Turn off
> >>> write permission on the hypercall page") it was even mapped writable..
> >>
> >> [There is nothing secret in the hypercall page, by reading it you can
> >> figure out if you're running on Intel or AMD (VMCALL/VMMCALL) but it's
> >> likely not the only possible way :-)]
> >>
> >> I see no reason for hv_hypercall_pg to remain readable. I just
> >> smoke-tested
> >
> > Thanks, I have the same in my WIP tree, but just wanted to confirm this
> > makes sense.
> 
> Just to make sure we’re all on the same page: x86 doesn’t normally have an
> execute-only mode. Executable memory in the kernel is readable unless you
> are using fancy hypervisor-based XO support.

Hi hch,
The patch is merged into the mainine recently, but unluckily we noticed
a warning with CONFIG_DEBUG_WX=y (it looks typically this config is defined
by default in Linux distros, at least in Ubuntu 18.04's  
/boot/config-4.18.0-11-generic).

Should we revert this patch, or figure out a way to ask the DEBUG_WX code to
ignore this page?

[   19.387536] debug: unmapping init [mem 0x82713000-0x82886fff]
[   19.431766] Write protecting the kernel read-only data: 18432k
[   19.438662] debug: unmapping init [mem 0x81c02000-0x81df]
[   19.446830] debug: unmapping init [mem 0x821d6000-0x821f]
[   19.522368] [ cut here ]
[   19.527495] x86/mm: Found insecure W+X mapping at address 0xc9012000
[   19.535066] WARNING: CPU: 26 PID: 1 at arch/x86/mm/dump_pagetables.c:248 
note_page+0x639/0x690
[   19.539038] Modules linked in:
[   19.539038] CPU: 26 PID: 1 Comm: swapper/0 Not tainted 5.7.0+ #1
[   19.539038] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
Machine, BIOS 090008  12/07/2018
[   19.539038] RIP: 0010:note_page+0x639/0x690
[   19.539038] Code: fe ff ff 31 c0 e9 a0 fe ff ff 80 3d 39 d1 31 01 00 0f 85 
76 fa ff ff 48 c7 c7 98 55 0a 82 c6 05 25 d1 31 01 01 e8 f7 c9 00 00 <0f> 0b e9 
5c fa ff ff 48 83 c0 18 48 c7 45 68 00 00 00 00 48 89 45
[   19.539038] RSP: :c90003137cb0 EFLAGS: 00010282
[   19.539038] RAX:  RBX:  RCX: 0007
[   19.539038] RDX:  RSI:  RDI: 810fa9c4
[   19.539038] RBP: c90003137ea0 R08:  R09: 
[   19.539038] R10: 0001 R11:  R12: c9013000
[   19.539038] R13:  R14: c91ff000 R15: 
[   19.539038] FS:  () GS:8884dad0() 
knlGS:
[   19.539038] CS:  0010 DS:  ES:  CR0: 80050033
[   19.539038] CR2:  CR3: 02210001 CR4: 003606e0
[   19.539038] DR0:  DR1:  DR2: 
[   19.539038] DR3:  DR6: fffe0ff0 DR7: 0400
[   19.539038] Call Trace:
[   19.539038]  ptdump_pte_entry+0x39/0x40
[   19.539038]  __walk_page_range+0x5b7/0x960
[   19.539038]  walk_page_range_novma+0x7e/0xd0
[   19.539038]  ptdump_walk_pgd+0x53/0x90
[   19.539038]  ptdump_walk_pgd_level_core+0xdf/0x110
[   19.539038]  ? ptdump_walk_pgd_level_debugfs+0x40/0x40
[   19.539038]  ? hugetlb_get_unmapped_area+0x2f0/0x2f0
[   19.703692]  ? rest_init+0x24d/0x24d
[   19.703692]  ? rest_init+0x24d/0x24d
[   19.703692]  kernel_init+0x2c/0x113
[   19.703692]  ret_from_fork+0x24/0x30
[   19.703692] irq event stamp: 2840666
[   19.703692] hardirqs last  enabled at (2840665): [] 
console_unlock+0x444/0x5b0
[   19.703692] hardirqs last disabled at (2840666): [] 
trace_hardirqs_off_thunk+0x1a/0x1c
[   19.703692] softirqs last  enabled at (2840662): [] 
__do_softirq+0x366/0x490
[   19.703692] softirqs last disabled at (2840655): [] 
irq_exit+0xe8/0x100
[   19.703692] ---[ end trace 99ca90806a8e657c ]---
[   19.786235] x86/mm: Checked W+X mappings: FAILED, 1 W+X pages found.
[   19.793298] rodata_test: all tests were successful
[   19.798508] x86/mm: Checking user space page tables
[   19.818007] x86/mm: Checked W+X 

RE: [PATCH] scsi: storvsc: Remove memset before memory freeing in storvsc_suspend()

2020-06-05 Thread Dexuan Cui
> From: Denis Efremov 
> Sent: Friday, June 5, 2020 1:00 AM
> To: Dexuan Cui ; Michael Kelley
> 
> Cc: Denis Efremov ; James E . J . Bottomley
> ; Martin K . Petersen ;
> linux-hyp...@vger.kernel.org; Linux SCSI List ;
> Linux Kernel Mailing List 
> Subject: [PATCH] scsi: storvsc: Remove memset before memory freeing in
> storvsc_suspend()
> 
> Remove memset with 0 for stor_device->stor_chns in storvsc_suspend()
> before the call to kfree() as the memory contains no sensitive information.
> 
> Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> Suggested-by: Dexuan Cui 
> Signed-off-by: Denis Efremov 
> ---
>  drivers/scsi/storvsc_drv.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 072ed8728657..2d90cddd8ac2 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -2035,9 +2035,6 @@ static int storvsc_suspend(struct hv_device
> *hv_dev)
> 
>   vmbus_close(hv_dev->channel);
> 
> - memset(stor_device->stor_chns, 0,
> -num_possible_cpus() * sizeof(void *));
> -
>   kfree(stor_device->stor_chns);
>   stor_device->stor_chns = NULL;
> 
> --

Denis, thank you for fixing this!

Reviewed-by: Dexuan Cui 



RE: [PATCH] scsi: storvsc: Use kzfree() in storvsc_suspend()

2020-06-04 Thread Dexuan Cui
> From: Dexuan Cui
> Sent: Thursday, June 4, 2020 2:50 PM
> 
> > > Can you please make a v2 patch for it and Cc my corporate email
> > > "decui" (in To)?
> >
> > Yes, of course. Could I add "Suggested-by"?
> >
> > Thanks,
> > Denis
> 
> Sure.

Please also added a tag:
 
Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")

Thanks,
Dexuan


RE: [PATCH] scsi: storvsc: Use kzfree() in storvsc_suspend()

2020-06-04 Thread Dexuan Cui
> From: Denis Efremov 
> Sent: Thursday, June 4, 2020 2:43 PM
> >
> > Hi Denis,
> > When I added the function storvsc_suspend() several months ago, somehow
> > I forgot to remove the unnecessary memset(). Sorry!
> >
> > The buffer is recreated in storvsc_resume() -> storvsc_connect_to_vsp() ->
> > storvsc_channel_init() -> stor_device->stor_chns = kcalloc(...), so I 
> > believe
> > the memset() can be safely removed.
> 
> I'm not sure that I understand your description. As for me, memset with 0
> before memory freeing is required only for sensitive information and it's
> completely unrelated to memory zeroing during allocation with kzalloc/kcalloc.
> If it's not sensitive information, then the memset could be safely removed.

There is no sensitive info in the buffer here.
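
For contrast, a hypothetical example of where zero-before-free is warranted;
stor_chns is just an array of pointers, so plain kfree() suffices:

	u8 *key;

	key = kmalloc(key_len, GFP_KERNEL);	/* holds secret key material */
	/* ... use the key ... */
	kzfree(key);				/* memset(0) + kfree() */

	kfree(stor_device->stor_chns);		/* no secrets: plain kfree */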

> > Can you please make a v2 patch for it and Cc my corporate email "decui"
> > (in To)?
> 
> Yes, of course. Could I add "Suggested-by"?
> 
> Thanks,
> Denis

Sure. 

Thanks,
-- Dexuan


