Re: [PATCH net] bnxt_en: Fix VF mac address regression.

2018-09-14 Thread Siwei Liu
Ack. Looks fine to me.

-Siwei

On Fri, Sep 14, 2018 at 12:41 PM, Michael Chan
 wrote:
> The recent commit to always forward the VF MAC address to the PF for
> approval may not work if the PF driver or the firmware is older.  This
> will cause the VF driver to fail during probe:
>
>   bnxt_en :00:03.0 (unnamed net_device) (uninitialized): hwrm req_type 
> 0xf seq id 0x5 error 0x
>   bnxt_en :00:03.0 (unnamed net_device) (uninitialized): VF MAC address 
> 00:00:17:02:05:d0 not approved by the PF
>   bnxt_en :00:03.0: Unable to initialize mac address.
>   bnxt_en: probe of :00:03.0 failed with error -99
>
> We fix it by treating the error as fatal only if the VF MAC address is
> locally generated by the VF.
>
> Fixes: 707e7e966026 ("bnxt_en: Always forward VF MAC address to the PF.")
> Reported-by: Seth Forshee 
> Reported-by: Siwei Liu 
> Signed-off-by: Michael Chan 
> ---
> Please queue this for stable as well.  Thanks.
>
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c   | 9 +++--
>  drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 9 +
>  drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h | 2 +-
>  3 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index cecbb1d..177587f 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -8027,7 +8027,7 @@ static int bnxt_change_mac_addr(struct net_device *dev, 
> void *p)
> if (ether_addr_equal(addr->sa_data, dev->dev_addr))
> return 0;
>
> -   rc = bnxt_approve_mac(bp, addr->sa_data);
> +   rc = bnxt_approve_mac(bp, addr->sa_data, true);
> if (rc)
> return rc;
>
> @@ -8827,14 +8827,19 @@ static int bnxt_init_mac_addr(struct bnxt *bp)
> } else {
>  #ifdef CONFIG_BNXT_SRIOV
> struct bnxt_vf_info *vf = >vf;
> +   bool strict_approval = true;
>
> if (is_valid_ether_addr(vf->mac_addr)) {
> /* overwrite netdev dev_addr with admin VF MAC */
> memcpy(bp->dev->dev_addr, vf->mac_addr, ETH_ALEN);
> +   /* Older PF driver or firmware may not approve this
> +* correctly.
> +*/
> +   strict_approval = false;
> } else {
> eth_hw_addr_random(bp->dev);
> }
> -   rc = bnxt_approve_mac(bp, bp->dev->dev_addr);
> +   rc = bnxt_approve_mac(bp, bp->dev->dev_addr, strict_approval);
>  #endif
> }
> return rc;
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> index fcd085a..3962f6f 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> @@ -1104,7 +1104,7 @@ void bnxt_update_vf_mac(struct bnxt *bp)
> mutex_unlock(>hwrm_cmd_lock);
>  }
>
> -int bnxt_approve_mac(struct bnxt *bp, u8 *mac)
> +int bnxt_approve_mac(struct bnxt *bp, u8 *mac, bool strict)
>  {
> struct hwrm_func_vf_cfg_input req = {0};
> int rc = 0;
> @@ -1122,12 +1122,13 @@ int bnxt_approve_mac(struct bnxt *bp, u8 *mac)
> memcpy(req.dflt_mac_addr, mac, ETH_ALEN);
> rc = hwrm_send_message(bp, , sizeof(req), HWRM_CMD_TIMEOUT);
>  mac_done:
> -   if (rc) {
> +   if (rc && strict) {
> rc = -EADDRNOTAVAIL;
> netdev_warn(bp->dev, "VF MAC address %pM not approved by the 
> PF\n",
> mac);
> +   return rc;
> }
> -   return rc;
> +   return 0;
>  }
>  #else
>
> @@ -1144,7 +1145,7 @@ void bnxt_update_vf_mac(struct bnxt *bp)
>  {
>  }
>
> -int bnxt_approve_mac(struct bnxt *bp, u8 *mac)
> +int bnxt_approve_mac(struct bnxt *bp, u8 *mac, bool strict)
>  {
> return 0;
>  }
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
> index e9b20cd..2eed9ed 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.h
> @@ -39,5 +39,5 @@ int bnxt_sriov_configure(struct pci_dev *pdev, int num_vfs);
>  void bnxt_sriov_disable(struct bnxt *);
>  void bnxt_hwrm_exec_fwd_req(struct bnxt *);
>  void bnxt_update_vf_mac(struct bnxt *);
> -int bnxt_approve_mac(struct bnxt *, u8 *);
> +int bnxt_approve_mac(struct bnxt *, u8 *, bool);
>  #endif
> --
> 2.5.1
>


Re: [PATCH net-next 4/4] bnxt_en: Always forward VF MAC address to the PF.

2018-09-14 Thread Siwei Liu
This commit is toxic, if possible I hope it can be reverted and
reworked with a new patch.

First, the patch introduced backward incompatible changes to bnxt_en
VF driver that is causing issue when interoperating with the old PF
driver without this commit. In that event, VF probing fails from
within the VM:

[5.660331] Broadcom NetXtreme-C/E driver bnxt_en v1.9.1
[5.663653] bnxt_en :00:03.0 (unnamed net_device)
(uninitialized): hwrm req_type 0xf seq id 0x6 error 0x4
[5.665804] bnxt_en :00:03.0 (unnamed net_device)
(uninitialized): VF MAC address 00:01:02:03:04:05 not approved by the
PF
[5.668268] bnxt_en :00:03.0: Unable to initialize mac address.
[5.670974] bnxt_en: probe of :00:03.0 failed with error -99

Second, this commit contains driver changes to both PF and VF side,
and incorrectly assumes that both PF and VF can/should be updated at
the same time to resolve the original issue (zero VF MAC address in
'ip link show') it tried to address. In fact that is not warranted. A
potential warranted fix is for VF driver to ignore what
bnxt_approve_mac() may return when it got a valid MAC address from the
firmware. The only purpose for the bnxt_approve_mac call for this case
is a best-effort attempt to inform PF of the MAC address, instead of
failing the VF driver probe when talking to an old PF driver.

Canonical reported a similar issue a few days back due to the same cause.

https://www.spinics.net/lists/netdev/msg521428.html


Regards,
-Siwei

On Tue, May 8, 2018 at 12:18 AM, Michael Chan  wrote:
> The current code already forwards the VF MAC address to the PF, except
> in one case.  If the VF driver gets a valid MAC address from the firmware
> during probe time, it will not forward the MAC address to the PF,
> incorrectly assuming that the PF already knows the MAC address.  This
> causes "ip link show" to show zero VF MAC addresses for this case.
>
> This assumption is not correct.  Newer firmware remembers the VF MAC
> address last used by the VF and provides it to the VF driver during
> probe.  So we need to always forward the VF MAC address to the PF.
>
> The forwarded MAC address may now be the PF assigned MAC address and so we
> need to make sure we approve it for this case.
>
> Signed-off-by: Michael Chan 
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c   | 2 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index cd3ab78..dfa0839 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -8678,8 +8678,8 @@ static int bnxt_init_mac_addr(struct bnxt *bp)
> memcpy(bp->dev->dev_addr, vf->mac_addr, ETH_ALEN);
> } else {
> eth_hw_addr_random(bp->dev);
> -   rc = bnxt_approve_mac(bp, bp->dev->dev_addr);
> }
> +   rc = bnxt_approve_mac(bp, bp->dev->dev_addr);
>  #endif
> }
> return rc;
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c 
> b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> index cc21d87..a649108 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
> @@ -923,7 +923,8 @@ static int bnxt_vf_configure_mac(struct bnxt *bp, struct 
> bnxt_vf_info *vf)
> if (req->enables & 
> cpu_to_le32(FUNC_VF_CFG_REQ_ENABLES_DFLT_MAC_ADDR)) {
> if (is_valid_ether_addr(req->dflt_mac_addr) &&
> ((vf->flags & BNXT_VF_TRUST) ||
> -(!is_valid_ether_addr(vf->mac_addr {
> +!is_valid_ether_addr(vf->mac_addr) ||
> +ether_addr_equal(req->dflt_mac_addr, vf->mac_addr))) {
> ether_addr_copy(vf->vf_mac_addr, req->dflt_mac_addr);
> return bnxt_hwrm_exec_fwd_resp(bp, vf, msg_size);
> }
> --
> 1.8.3.1
>


Re: virtio_net failover and initramfs

2018-08-22 Thread Siwei Liu
On Wed, Aug 22, 2018 at 12:23 AM, Harald Hoyer  wrote:
> On 22.08.2018 09:17, Siwei Liu wrote:
>> On Tue, Aug 21, 2018 at 6:44 AM, Harald Hoyer  wrote:
>>> On 17.08.2018 21:09, Samudrala, Sridhar wrote:
>>>> On 8/17/2018 2:56 AM, Harald Hoyer wrote:
>>>>> On 17.08.2018 11:51, Harald Hoyer wrote:
>>>>>> On 16.08.2018 00:17, Siwei Liu wrote:
>>>>>>> On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
>>>>>>>  wrote:
>>>>>>>> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>>>>>>>> Are we sure all userspace apps skip and ignore slave interfaces by
>>>>>>>>> just looking at "IFLA_MASTER" attribute?
>>>>>>>>>
>>>>>>>>> When STANDBY is enabled on virtio-net, a failover master interface
>>>>>>>>> will appear, which automatically enslaves the virtio device. But it is
>>>>>>>>> found out that iSCSI (or any network boot) cannot boot strap over the
>>>>>>>>> new failover interface together with a standby virtio (without any VF
>>>>>>>>> or PT device in place).
>>>>>>>>>
>>>>>>>>> Dracut (initramfs) ends up with timeout and dropping into emergency 
>>>>>>>>> shell:
>>>>>>>>>
>>>>>>>>> [  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>>>>>>>>> timeout - starting timeout scripts
>>>>>>>>> [  228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>>>>>>>>Starting Dracut Emergency Shell...
>>>>>>>>> Generating "/run/initramfs/rdsosreport.txt"
>>>>>>>>> Entering emergency mode. Exit the shell to continue.
>>>>>>>>> Type "journalctl" to view system logs.
>>>>>>>>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB 
>>>>>>>>> stick or
>>>>>>>>> /boot
>>>>>>>>> after mounting them and attach it to a bug report.
>>>>>>>>> dracut:/# ip l sh
>>>>>>>>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
>>>>>>>>> mode DEFAULT group default qlen 1000
>>>>>>>>>   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>>>>>> 2: eth0:  mtu 1500 qdisc noqueue
>>>>>>>>> state UP mode DEFAULT group default qlen 1000
>>>>>>>>>   link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
>>>>>>>>> 3: eth1:  mtu 1500 qdisc pfifo_fast
>>>>>>>>> master eth0 state UP mode DEFAULT group default qlen 1000
>>>>>>>>>   link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>>>>>>>> dracut:/#
>>>>>>>>>
>>>>>>>>> If changing dracut code to ignore eth1 (with IFLA_MASTER attr),
>>>>>>>>> network boot starts to work.
>>>>>>>>
>>>>>>>> Does dracut by default tries to use all the interfaces that are UP?
>>>>>>>>
>>>>>>> Yes. The specific dracut cmdline of our case is "ip=dhcp
>>>>>>> netroot=iscsi:... ", but it's not specific to iscsi boot. And because
>>>>>>> of same MAC address for failover and standby, while dracut tries to
>>>>>>> run DHCP on all interfaces that are up it eventually gets same route
>>>>>>> for each interface. Those conflict route entries kill off the network
>>>>>>> connection.
>>>>>>>
>>>>>>>>> The reason is that dracut has its own means to differentiate virtual
>>>>>>>>> interfaces for network boot: it does not look at IFLA_MASTER and
>>>>>>>>> ignores slave interfaces. Instead, users have to provide explicit
>>>>>>>>> option e.g. bond=eth0,eth1 in the boot line, then dracut would know
>>>>>>>>> the config and ignore the slave interfaces.
>>>>>>>>
>>>>>>>> Isn't it possible to specify the interface that should be used for 
>>>>>>>> network
>>>>>>>> boot?
>>>>>>> 

Re: virtio_net failover and initramfs

2018-08-22 Thread Siwei Liu
On Tue, Aug 21, 2018 at 6:44 AM, Harald Hoyer  wrote:
> On 17.08.2018 21:09, Samudrala, Sridhar wrote:
>> On 8/17/2018 2:56 AM, Harald Hoyer wrote:
>>> On 17.08.2018 11:51, Harald Hoyer wrote:
>>>> On 16.08.2018 00:17, Siwei Liu wrote:
>>>>> On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
>>>>>  wrote:
>>>>>> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>>>>>> Are we sure all userspace apps skip and ignore slave interfaces by
>>>>>>> just looking at "IFLA_MASTER" attribute?
>>>>>>>
>>>>>>> When STANDBY is enabled on virtio-net, a failover master interface
>>>>>>> will appear, which automatically enslaves the virtio device. But it is
>>>>>>> found out that iSCSI (or any network boot) cannot boot strap over the
>>>>>>> new failover interface together with a standby virtio (without any VF
>>>>>>> or PT device in place).
>>>>>>>
>>>>>>> Dracut (initramfs) ends up with timeout and dropping into emergency 
>>>>>>> shell:
>>>>>>>
>>>>>>> [  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>>>>>>> timeout - starting timeout scripts
>>>>>>> [  228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>>>>>>Starting Dracut Emergency Shell...
>>>>>>> Generating "/run/initramfs/rdsosreport.txt"
>>>>>>> Entering emergency mode. Exit the shell to continue.
>>>>>>> Type "journalctl" to view system logs.
>>>>>>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick 
>>>>>>> or
>>>>>>> /boot
>>>>>>> after mounting them and attach it to a bug report.
>>>>>>> dracut:/# ip l sh
>>>>>>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
>>>>>>> mode DEFAULT group default qlen 1000
>>>>>>>   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>>>>> 2: eth0:  mtu 1500 qdisc noqueue
>>>>>>> state UP mode DEFAULT group default qlen 1000
>>>>>>>   link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
>>>>>>> 3: eth1:  mtu 1500 qdisc pfifo_fast
>>>>>>> master eth0 state UP mode DEFAULT group default qlen 1000
>>>>>>>   link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>>>>>>> dracut:/#
>>>>>>>
>>>>>>> If changing dracut code to ignore eth1 (with IFLA_MASTER attr),
>>>>>>> network boot starts to work.
>>>>>>
>>>>>> Does dracut by default tries to use all the interfaces that are UP?
>>>>>>
>>>>> Yes. The specific dracut cmdline of our case is "ip=dhcp
>>>>> netroot=iscsi:... ", but it's not specific to iscsi boot. And because
>>>>> of same MAC address for failover and standby, while dracut tries to
>>>>> run DHCP on all interfaces that are up it eventually gets same route
>>>>> for each interface. Those conflict route entries kill off the network
>>>>> connection.
>>>>>
>>>>>>> The reason is that dracut has its own means to differentiate virtual
>>>>>>> interfaces for network boot: it does not look at IFLA_MASTER and
>>>>>>> ignores slave interfaces. Instead, users have to provide explicit
>>>>>>> option e.g. bond=eth0,eth1 in the boot line, then dracut would know
>>>>>>> the config and ignore the slave interfaces.
>>>>>>
>>>>>> Isn't it possible to specify the interface that should be used for 
>>>>>> network
>>>>>> boot?
>>>>> As I understand it, one can only specify interface name for running
>>>>> DHCP but not select interface for network boot.  We want DHCP to run
>>>>> on every NIC that is up (excluding the enslaved interfaces), and only
>>>>> one of them can get a route entry to the network boot server (ie.g.
>>>>> iSCSI target).
>>>>>
>>>>>>
>>>>>>> However, with automatic creation of failover interface that assumption
>>>>>>> is no longer true. Can we change dracut to ignore all slave interface
>>&

Re: virtio_net failover and initramfs (was: Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework)

2018-08-15 Thread Siwei Liu
On Wed, Aug 15, 2018 at 12:05 PM, Samudrala, Sridhar
 wrote:
> On 8/14/2018 5:03 PM, Siwei Liu wrote:
>>
>> Are we sure all userspace apps skip and ignore slave interfaces by
>> just looking at "IFLA_MASTER" attribute?
>>
>> When STANDBY is enabled on virtio-net, a failover master interface
>> will appear, which automatically enslaves the virtio device. But it is
>> found out that iSCSI (or any network boot) cannot boot strap over the
>> new failover interface together with a standby virtio (without any VF
>> or PT device in place).
>>
>> Dracut (initramfs) ends up with timeout and dropping into emergency shell:
>>
>> [  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
>> timeout - starting timeout scripts
>> [  228.171788] dracut-initqueue[377]: Warning: Could not boot.
>>   Starting Dracut Emergency Shell...
>> Generating "/run/initramfs/rdsosreport.txt"
>> Entering emergency mode. Exit the shell to continue.
>> Type "journalctl" to view system logs.
>> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or
>> /boot
>> after mounting them and attach it to a bug report.
>> dracut:/# ip l sh
>> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
>> mode DEFAULT group default qlen 1000
>>  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> 2: eth0:  mtu 1500 qdisc noqueue
>> state UP mode DEFAULT group default qlen 1000
>>  link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
>> 3: eth1:  mtu 1500 qdisc pfifo_fast
>> master eth0 state UP mode DEFAULT group default qlen 1000
>>  link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
>> dracut:/#
>>
>> If changing dracut code to ignore eth1 (with IFLA_MASTER attr),
>> network boot starts to work.
>
>
> Does dracut by default tries to use all the interfaces that are UP?
>
Yes. The specific dracut cmdline of our case is "ip=dhcp
netroot=iscsi:... ", but it's not specific to iscsi boot. And because
of same MAC address for failover and standby, while dracut tries to
run DHCP on all interfaces that are up it eventually gets same route
for each interface. Those conflict route entries kill off the network
connection.

>
>>
>> The reason is that dracut has its own means to differentiate virtual
>> interfaces for network boot: it does not look at IFLA_MASTER and
>> ignores slave interfaces. Instead, users have to provide explicit
>> option e.g. bond=eth0,eth1 in the boot line, then dracut would know
>> the config and ignore the slave interfaces.
>
>
> Isn't it possible to specify the interface that should be used for network
> boot?
As I understand it, one can only specify interface name for running
DHCP but not select interface for network boot.  We want DHCP to run
on every NIC that is up (excluding the enslaved interfaces), and only
one of them can get a route entry to the network boot server (ie.g.
iSCSI target).

>
>
>>
>> However, with automatic creation of failover interface that assumption
>> is no longer true. Can we change dracut to ignore all slave interface
>> by checking  IFLA_MASTER? I don't think so. It has a large impact to
>> existing configs.
>
>
> What is the issue with checking for IFLA_MASTER? I guess this is used with
> team/bonding setups.
That should be discussed within and determined by the dracut
community. But the current dracut code doesn't check IFLA_MASTER for
team or bonding specifically. I guess this change might have broader
impact to existing userspace that might be already relying on the
current behaviour.

Thanks,
-Siwei

>
>
>>
>> What's a feasible solution then? Check the driver name for failover as
>> well?
>>
>> Thanks,
>> -Siwei
>>
>>
>>
>> On Tue, May 22, 2018 at 10:38 AM, Jiri Pirko  wrote:
>>>
>>> Tue, May 22, 2018 at 06:52:21PM CEST, m...@redhat.com wrote:
>>>>
>>>> On Tue, May 22, 2018 at 05:45:01PM +0200, Jiri Pirko wrote:
>>>>>
>>>>> Tue, May 22, 2018 at 05:32:30PM CEST, m...@redhat.com wrote:
>>>>>>
>>>>>> On Tue, May 22, 2018 at 05:13:43PM +0200, Jiri Pirko wrote:
>>>>>>>
>>>>>>> Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote:
>>>>>>>>
>>>>>>>> On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote:
>>>>>>>>>
>>>>>>>>> Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote:
>>>>>>>>>>
>>>>>>>>>> O

virtio_net failover and initramfs (was: Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework)

2018-08-14 Thread Siwei Liu
Are we sure all userspace apps skip and ignore slave interfaces by
just looking at "IFLA_MASTER" attribute?

When STANDBY is enabled on virtio-net, a failover master interface
will appear, which automatically enslaves the virtio device. But it is
found out that iSCSI (or any network boot) cannot boot strap over the
new failover interface together with a standby virtio (without any VF
or PT device in place).

Dracut (initramfs) ends up with timeout and dropping into emergency shell:

[  228.170425] dracut-initqueue[377]: Warning: dracut-initqueue
timeout - starting timeout scripts
[  228.171788] dracut-initqueue[377]: Warning: Could not boot.
 Starting Dracut Emergency Shell...
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.
dracut:/# ip l sh
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0:  mtu 1500 qdisc noqueue
state UP mode DEFAULT group default qlen 1000
link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff\
3: eth1:  mtu 1500 qdisc pfifo_fast
master eth0 state UP mode DEFAULT group default qlen 1000
link/ether 9a:46:22:ae:33:54 brd ff:ff:ff:ff:ff:ff
dracut:/#

If changing dracut code to ignore eth1 (with IFLA_MASTER attr),
network boot starts to work.

The reason is that dracut has its own means to differentiate virtual
interfaces for network boot: it does not look at IFLA_MASTER and
ignores slave interfaces. Instead, users have to provide explicit
option e.g. bond=eth0,eth1 in the boot line, then dracut would know
the config and ignore the slave interfaces.

However, with automatic creation of failover interface that assumption
is no longer true. Can we change dracut to ignore all slave interface
by checking  IFLA_MASTER? I don't think so. It has a large impact to
existing configs.

What's a feasible solution then? Check the driver name for failover as well?

Thanks,
-Siwei



On Tue, May 22, 2018 at 10:38 AM, Jiri Pirko  wrote:
> Tue, May 22, 2018 at 06:52:21PM CEST, m...@redhat.com wrote:
>>On Tue, May 22, 2018 at 05:45:01PM +0200, Jiri Pirko wrote:
>>> Tue, May 22, 2018 at 05:32:30PM CEST, m...@redhat.com wrote:
>>> >On Tue, May 22, 2018 at 05:13:43PM +0200, Jiri Pirko wrote:
>>> >> Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote:
>>> >> >On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote:
>>> >> >> Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote:
>>> >> >> >On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote:
>>> >> >> >> Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote:
>>> >> >> >> >On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote:
>>> >> >> >> >> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>>> >> >> >> >> >Tue, May 22, 2018 at 04:06:18AM CEST, 
>>> >> >> >> >> >sridhar.samudr...@intel.com wrote:
>>> >> >> >> >> >>Use the registration/notification framework supported by the 
>>> >> >> >> >> >>generic
>>> >> >> >> >> >>failover infrastructure.
>>> >> >> >> >> >>
>>> >> >> >> >> >>Signed-off-by: Sridhar Samudrala 
>>> >> >> >> >> >
>>> >> >> >> >> >In previous patchset versions, the common code did
>>> >> >> >> >> >netdev_rx_handler_register() and netdev_upper_dev_link() etc
>>> >> >> >> >> >(netvsc_vf_join()). Now, this is still done in netvsc. Why?
>>> >> >> >> >> >
>>> >> >> >> >> >This should be part of the common "failover" code.
>>> >> >> >> >> >
>>> >> >> >> >>
>>> >> >> >> >> Also note that in the current patchset you use IFF_FAILOVER 
>>> >> >> >> >> flag for
>>> >> >> >> >> master, yet for the slave you use IFF_SLAVE. That is wrong.
>>> >> >> >> >> IFF_FAILOVER_SLAVE should be used.
>>> >> >> >> >
>>> >> >> >> >Or drop IFF_FAILOVER_SLAVE and set both IFF_FAILOVER and 
>>> >> >> >> >IFF_SLAVE?
>>> >> >> >>
>>> >> >> >> No. IFF_SLAVE is for bonding.
>>> >> >> >
>>> >> >> >What breaks if we reuse it for failover?
>>> >> >>
>>> >> >> This is exposed to userspace. IFF_SLAVE is expected for bonding 
>>> >> >> slaves.
>>> >> >> And failover slave is not a bonding slave.
>>> >> >
>>> >> >That does not really answer the question.  I'd claim it's sufficiently
>>> >> >like a bond slave for IFF_SLAVE to make sense.
>>> >> >
>>> >> >In fact you will find that netvsc already sets IFF_SLAVE, and so
>>> >>
>>> >> netvsc does the whole failover thing in a wrong way. This patchset is
>>> >> trying to fix it.
>>> >
>>> >Maybe, but we don't need gratuitous changes either, especially if they
>>> >break userspace.
>>>
>>> What do you mean by the "break"? It was a mistake to reuse IFF_SLAVE at
>>> the first place, lets fix it. If some userspace depends on that flag, it
>>> is broken anyway.
>>>
>>>
>>> >
>>> >> >does e.g. the eql driver.
>>> >> >
>>> >> >The advantage of using IFF_SLAVE is that userspace 

Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Siwei Liu
On Mon, Jun 11, 2018 at 8:22 AM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 17:42:21 -0700
> Siwei Liu  wrote:
>
>> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
>>  wrote:
>> > On Fri, 8 Jun 2018 16:44:12 -0700
>> > Siwei Liu  wrote:
>> >
>> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>> >>  wrote:
>> >> > On Fri, 8 Jun 2018 15:25:59 -0700
>> >> > Siwei Liu  wrote:
>> >> >
>> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >> >>  wrote:
>> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> >> > "Michael S. Tsirkin"  wrote:
>> >> >> >
>> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
>> >> >> >> > wrote:
>> >> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> >> > >object with function callbacks (see callback hell).
>> >> >> >> >
>> >> >> >> > Why just a library? It should do a common things. I think it 
>> >> >> >> > should be a
>> >> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> >> > functionality into multiple drivers. That is kind of backwards 
>> >> >> >> > attitude.
>> >> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 
>> >> >> >> > 3-netdev
>> >> >> >> > model.
>> >> >> >>
>> >> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> >> handling of renames.
>> >> >> >>
>> >> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> >> concerned about risk of breaking some userspace.
>> >> >> >>
>> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> >> address, and you said then "why not use existing network namespaces
>> >> >> >> rather than inventing a new abstraction". So how about it then? Do 
>> >> >> >> you
>> >> >> >> want to find a way to use namespaces to hide the PV device for 
>> >> >> >> netvsc
>> >> >> >> compatibility?
>> >> >> >>
>> >> >> >
>> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough 
>> >> >> > distro's and
>> >> >> > startups that all demand eth0 always be present. And VF may come and 
>> >> >> > go.
>> >> >> > After this history, there is a strong motivation not to change how 
>> >> >> > kernel
>> >> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> >> > existing userspace.
>> >> >> >
>> >> >> > With virtio you can  work it out with the distro's yourself.
>> >> >> > There is no pre-existing semantics to deal with.
>> >> >> >
>> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >> >>
>> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> >> around renaming devices, customizing udev rules and et al. Why
>> >> >> inheriting VF's name important? To allow existing config/setup around
>> >> >> VF continues to work across kernel feature upgrade. Most of network
>> >> >> config files in all distros are based on interface names. Few are MAC
>> >> >> address based but making lower slaves hidden would cover the rest. And
>> >> >> most importantly, preserving the same level of user experience as
>> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> >> exposed. This is esse

Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Siwei Liu
On Fri, Jun 8, 2018 at 6:29 PM, Jakub Kicinski  wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> IMHO it may be a better idea to look at the VF as acceleration for the
> PV rather than PV a migration vehicle from the VF.  Hence we should

I'm basically talking about two use cases not about solutions or
implementations specifically. As said, the one I'm looking into needs
to migrate a pre-failover VF setup to 1-netdev failover model in a
transparent manner. There's no point to switch PCI address back and
forth in the backend to set where to bind the PV or the VF, as you
have no ways to predict what guest kernel will be running until its
fully loaded. Supporting a VF on new location binding to existing PV
might be nice, but not directly relevant to those who don't need this
side feature than migration itself.

Having said that, while I somewhat agree both use cases should have
its own place in the picture, I don't think judging one better than
the other or vice versa is logical IMHO.

> continue to follow the naming of PV, like the current implementation
> does implicitly by linking to PV's struct device.

The current implementation may only work with new userspace, even so
the eth0/eth0nsby naming is not consistenly persisted due to races in
bus probing. The naming part should be fixed.

-Siwei


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700
> Siwei Liu  wrote:
>
>> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>>  wrote:
>> > On Fri, 8 Jun 2018 15:25:59 -0700
>> > Siwei Liu  wrote:
>> >
>> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >>  wrote:
>> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> > "Michael S. Tsirkin"  wrote:
>> >> >
>> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
>> >> >> > wrote:
>> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> > >object with function callbacks (see callback hell).
>> >> >> >
>> >> >> > Why just a library? It should do a common things. I think it should 
>> >> >> > be a
>> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> > functionality into multiple drivers. That is kind of backwards 
>> >> >> > attitude.
>> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> >> > model.
>> >> >>
>> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> handling of renames.
>> >> >>
>> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> concerned about risk of breaking some userspace.
>> >> >>
>> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> address, and you said then "why not use existing network namespaces
>> >> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> >> compatibility?
>> >> >>
>> >> >
>> >> > Netvsc can't work with 3 dev model. MS has worked with enough distro's 
>> >> > and
>> >> > startups that all demand eth0 always be present. And VF may come and go.
>> >> > After this history, there is a strong motivation not to change how 
>> >> > kernel
>> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> > existing userspace.
>> >> >
>> >> > With virtio you can  work it out with the distro's yourself.
>> >> > There is no pre-existing semantics to deal with.
>> >> >
>> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >>
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> The host only guarantees that the PV device will be on the same network.
> It does not 

Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 15:25:59 -0700
> Siwei Liu  wrote:
>
>> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>>  wrote:
>> > On Wed, 6 Jun 2018 15:30:27 +0300
>> > "Michael S. Tsirkin"  wrote:
>> >
>> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>> >> > >The net failover should be a simple library, not a virtual
>> >> > >object with function callbacks (see callback hell).
>> >> >
>> >> > Why just a library? It should do a common things. I think it should be a
>> >> > virtual object. Looks like your patch again splits the common
>> >> > functionality into multiple drivers. That is kind of backwards attitude.
>> >> > I don't get it. We should rather focus on fixing the mess the
>> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> >> > model.
>> >>
>> >> So it seems that at least one benefit for netvsc would be better
>> >> handling of renames.
>> >>
>> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> concerned about risk of breaking some userspace.
>> >>
>> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> address, and you said then "why not use existing network namespaces
>> >> rather than inventing a new abstraction". So how about it then? Do you
>> >> want to find a way to use namespaces to hide the PV device for netvsc
>> >> compatibility?
>> >>
>> >
>> > Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> > startups that all demand eth0 always be present. And VF may come and go.
>> > After this history, there is a strong motivation not to change how kernel
>> > behaves. Switching to 3 device model would be perceived as breaking
>> > existing userspace.
>> >
>> > With virtio you can  work it out with the distro's yourself.
>> > There is no pre-existing semantics to deal with.
>> >
>> > For the virtio, I don't see the need for IFF_HIDDEN.
>>
>> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> that flag, as well as the 1-netdev model, is to have a means to
>> inherit the interface name from the VF, and to eliminate playing hacks
>> around renaming devices, customizing udev rules and et al. Why
>> inheriting VF's name important? To allow existing config/setup around
>> VF continues to work across kernel feature upgrade. Most of network
>> config files in all distros are based on interface names. Few are MAC
>> address based but making lower slaves hidden would cover the rest. And
>> most importantly, preserving the same level of user experience as
>> using raw VF interface once getting all ndo_ops and ethtool_ops
>> exposed. This is essential to realize transparent live migration that
>> users dont have to learn and be aware of the undertaken.
>
> Inheriting the VF name will fail in the migration scenario.
> It is perfectly reasonable to migrate a guest to another machine where
> the VF PCI address is different. And since current udev/systemd model
> is to base network device name off of PCI address, the device will change
> name when guest is migrated.
>
The scenario of having VF on a different PCI address on post migration
is essentially equal to plugging in a new NIC. Why it has to pair with
the original PV? A sepearte PV device should be in place to pair the
new VF.


> On Azure, the VF maybe removed (by host) at any time and then later
> reattached. There is no guarantee that VF will show back up at
> the same synthetic PCI address. It will likely have a different
> PCI domain value.

This is something QEMU can do and make sure the PCI address is
consistent after migration.

-Siwei


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
 wrote:
>
>
> On 6/6/2018 2:24 PM, Stephen Hemminger wrote:
>>
>> On Wed, 6 Jun 2018 15:30:27 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:

 Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>
> The net failover should be a simple library, not a virtual
> object with function callbacks (see callback hell).

 Why just a library? It should do a common things. I think it should be a
 virtual object. Looks like your patch again splits the common
 functionality into multiple drivers. That is kind of backwards attitude.
 I don't get it. We should rather focus on fixing the mess the
 introduction of netvsc-bonding caused and switch netvsc to 3-netdev
 model.
>>>
>>> So it seems that at least one benefit for netvsc would be better
>>> handling of renames.
>>>
>>> Question is how can this change to 3-netdev happen?  Stephen is
>>> concerned about risk of breaking some userspace.
>>>
>>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>>> address, and you said then "why not use existing network namespaces
>>> rather than inventing a new abstraction". So how about it then? Do you
>>> want to find a way to use namespaces to hide the PV device for netvsc
>>> compatibility?
>>>
>> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
>> startups that all demand eth0 always be present. And VF may come and go.
>> After this history, there is a strong motivation not to change how kernel
>> behaves. Switching to 3 device model would be perceived as breaking
>> existing userspace.
>
>
> I think it should be possible for netvsc to work with 3 dev model if the
> only
> requirement is that eth0 will always be present. With net_failover, you will
> see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an
> issue
> if somehow there is userspace requirement that there can be only 2 netdevs,
> not 3
> when VF is plugged.
>
> eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc
> device
> and the IP address gets configured on eth0. Will this be an issue?
>
Did you realize that the eth0 name in the current 3-netdev code can't
be consistently persisted across reboot, if you have more than one VFs
to pair with? On one boot it got eth0/eth0nsby, on the next it may get
eth1/eth1nsby on the same interface.

It won't be useable by default until you add some custom udev rules.

-Siwei

>
>
>>
>> With virtio you can  work it out with the distro's yourself.
>> There is no pre-existing semantics to deal with.
>>
>> For the virtio, I don't see the need for IFF_HIDDEN.
>> With 3-dev model as long as you mark the PV and VF devices
>> as slaves, then userspace knows to leave them alone. Assuming userspace
>> is already able to deal with team and bond devices.
>> Any time you introduce new UAPI behavior something breaks.
>>
>> On the rename front, I really don't care if VF can be renamed. And for
>> netvsc want to allow the PV device to be renamed. Udev developers want
>> that
>> but have not found a stable/persistent value to expose to userspace
>> to allow it.
>
>


Re: [PATCH net] failover: eliminate callback hell

2018-06-08 Thread Siwei Liu
On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
 wrote:
> On Wed, 6 Jun 2018 15:30:27 +0300
> "Michael S. Tsirkin"  wrote:
>
>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:
>> > >The net failover should be a simple library, not a virtual
>> > >object with function callbacks (see callback hell).
>> >
>> > Why just a library? It should do a common things. I think it should be a
>> > virtual object. Looks like your patch again splits the common
>> > functionality into multiple drivers. That is kind of backwards attitude.
>> > I don't get it. We should rather focus on fixing the mess the
>> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>> > model.
>>
>> So it seems that at least one benefit for netvsc would be better
>> handling of renames.
>>
>> Question is how can this change to 3-netdev happen?  Stephen is
>> concerned about risk of breaking some userspace.
>>
>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> address, and you said then "why not use existing network namespaces
>> rather than inventing a new abstraction". So how about it then? Do you
>> want to find a way to use namespaces to hide the PV device for netvsc
>> compatibility?
>>
>
> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> startups that all demand eth0 always be present. And VF may come and go.
> After this history, there is a strong motivation not to change how kernel
> behaves. Switching to 3 device model would be perceived as breaking
> existing userspace.
>
> With virtio you can  work it out with the distro's yourself.
> There is no pre-existing semantics to deal with.
>
> For the virtio, I don't see the need for IFF_HIDDEN.

I have a somewhat different view regarding IFF_HIDDEN. The purpose of
that flag, as well as the 1-netdev model, is to have a means to
inherit the interface name from the VF, and to eliminate playing hacks
around renaming devices, customizing udev rules and et al. Why
inheriting VF's name important? To allow existing config/setup around
VF continues to work across kernel feature upgrade. Most of network
config files in all distros are based on interface names. Few are MAC
address based but making lower slaves hidden would cover the rest. And
most importantly, preserving the same level of user experience as
using raw VF interface once getting all ndo_ops and ethtool_ops
exposed. This is essential to realize transparent live migration that
users dont have to learn and be aware of the undertaken.

It's unfair to say all virtio use cases don't need IFF_HIDDEN. A few
use cases don't care about getting slaves exposed, the 3-netdev model
would work for them. For the rest, the pre-existing semantics to them
is the single VF device model they've already dealt with. This is
nothing different than having Azure stick to the 2-netdev model
because of existing user base IMHO.

-Siwei


> With 3-dev model as long as you mark the PV and VF devices
> as slaves, then userspace knows to leave them alone. Assuming userspace
> is already able to deal with team and bond devices.
> Any time you introduce new UAPI behavior something breaks.
>
> On the rename front, I really don't care if VF can be renamed. And for
> netvsc want to allow the PV device to be renamed. Udev developers want that
> but have not found a stable/persistent value to expose to userspace
> to allow it.


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-29 Thread Siwei Liu
On Sat, Apr 28, 2018 at 2:42 AM, Jiri Pirko  wrote:
> Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>>This patch enables virtio_net to switch over to a VF datapath when a VF
>>netdev is present with the same MAC address. It allows live migration
>>of a VM with a direct attached VF without the need to setup a bond/team
>>between a VF and virtio net device in the guest.
>>
>>The hypervisor needs to enable only one datapath at any time so that
>>packets don't get looped back to the VM over the other datapath. When a VF
>>is plugged, the virtio datapath link state can be marked as down. The
>>hypervisor needs to unplug the VF device from the guest on the source host
>>and reset the MAC filter of the VF to initiate failover of datapath to
>>virtio before starting the migration. After the migration is completed,
>>the destination hypervisor sets the MAC filter on the VF and plugs it back
>>to the guest to switch over to VF datapath.
>>
>>It uses the generic failover framework that provides 2 functions to create
>>and destroy a master failover netdev. When STANDBY feature is enabled, an
>>additional netdev(failover netdev) is created that acts as a master device
>>and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>>is marked as 'standby' netdev and a passthru device with the same MAC is
>>registered as 'primary' netdev.
>>
>>This patch is based on the discussion initiated by Jesse on this thread.
>>https://marc.info/?l=linux-virtualization=151189725224231=2
>>
>
> When I enabled the standby feature (hardcoded), I have 2 netdevices now:
> 4: ens3:  mtu 1500 qdisc noqueue state UP 
> group default qlen 1000
> link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
> inet6 fe80::5054:ff:feb2:a7f1/64 scope link
>valid_lft forever preferred_lft forever
> 5: ens3n_sby:  mtu 1500 qdisc fq_codel state 
> UP group default qlen 1000
> link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
> inet6 fe80::5054:ff:feb2:a7f1/64 scope link
>valid_lft forever preferred_lft forever
>
> However, it seems to confuse my initscripts on Fedora:
> [root@test1 ~]# ifup ens3
> ./network-functions: line 78: [: /etc/dhcp/dhclient-ens3: binary operator 
> expected
> ./network-functions: line 80: [: /etc/dhclient-ens3: binary operator expected
> ./network-functions: line 69: [: /var/lib/dhclient/dhclient-ens3: binary 
> operator expected
>
You should teach Fedora and all cloud vendors to upgrade their
initscripts and other userspace tools, no?

> Determining IP information for ens3
> ens3n_sby...Cannot find device "ens3n_sby.pid"
> Cannot find device "ens3n_sby.lease"
>  failed.
>
> I tried to change the standby device mac:
> ip link set ens3n_sby addr 52:54:00:b2:a7:f2
> [root@test1 ~]# ifup ens3
>
> Determining IP information for ens3... done.
> [root@test1 ~]#
>
> But now the network does not work. I think that the mac change on
> standby device should be probably refused, no?
>
> When I change the mac back, all works fine.
>
>
> Now I try to change mac of the failover master:
> [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
> RTNETLINK answers: Operation not supported
>
> That I did expect to work. I would expect this would change the mac of
> the master and both standby and primary slaves.
>
>
> Now I tried to add a primary pci device. I don't have any fancy VF on my
> test setup, but I expected the good old 8139cp to work:
> [root@test1 ~]# ethtool -i ens9
> driver: 8139cp
> 
> [root@test1 ~]# ip link set ens9 addr 52:54:00:b2:a7:f1
>
> I see no message in dmesg, so I guess the failover module did not
> enslave this netdev. The mac change is not monitored. I would expect
> that it is and whenever a device changes mac to the failover one, it
> should be enslaved and whenever it changes mac back to something else,
> it should be released - the primary one ofcourse.
>
>
>
> [...]
>
>>+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>+size_t len)
>>+{
>>+  struct virtnet_info *vi = netdev_priv(dev);
>>+  int ret;
>>+
>>+  if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
>>+  return -EOPNOTSUPP;
>>+
>>+  ret = snprintf(buf, len, "_sby");
>
> please avoid the "_".
>
> [...]


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-27 Thread Siwei Liu
On Thu, Apr 26, 2018 at 4:42 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Thu, Apr 26, 2018 at 03:14:46PM -0700, Siwei Liu wrote:
>> On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
>> >> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> >> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <m...@redhat.com> 
>> >> >> wrote:
>> >> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin 
>> >> >> >> <m...@redhat.com> wrote:
>> >> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> >> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >> >> >> >>
>> >> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger 
>> >> >> >> >> > wrote:
>> >> >> >> >> > > > >
>> >> >> >> >> > > > >I will NAK patches to change to common code for netvsc 
>> >> >> >> >> > > > >especially the
>> >> >> >> >> > > > >three device model.  MS worked hard with distro vendors 
>> >> >> >> >> > > > >to support transparent
>> >> >> >> >> > > > >mode, ans we really can't have a new model; or do 
>> >> >> >> >> > > > >backport.
>> >> >> >> >> > > > >
>> >> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> >> >> > > >
>> >> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar 
>> >> >> >> >> > > > oddities.
>> >> >> >> >> > >
>> >> >> >> >> > > The network device model is a userspace API, and DPDK is a 
>> >> >> >> >> > > userspace application.
>> >> >> >> >> >
>> >> >> >> >> > It is userspace but are you sure dpdk is actually poking at 
>> >> >> >> >> > netdevs?
>> >> >> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >> >> >
>> >> >> >> >> > > You can't go breaking userspace even if you don't like the 
>> >> >> >> >> > > application.
>> >> >> >> >> >
>> >> >> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the 
>> >> >> >> >> > userspace
>> >> >> >> >> > API at all.
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux 
>> >> >> >> >> network devices
>> >> >> >> >> to look for Linux netvsc device and the paired VF device and 
>> >> >> >> >> setup the
>> >> >> >> >> DPDK environment.  This setup creates a DPDK failsafe 
>> >> >> >> >> (bondingish) instance
>> >> >> >> >> and sets up TAP support over the Linux netvsc device as well as 
>> >> >> >> >> the Mellanox
>> >> >> >> >> VF device.
>> >> >> >> >>
>> >> >> >> >> So it depends on existing 2 device model. You can't go to a 3 
>> >> >> >> >> device model
>> >> >> >> >> or start hiding devices from userspace.
>> >> >> >> >
>> >> >> >> > Okay so how does the existing patch break that? IIUC does not go 
>> >> >> >>

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-26 Thread Siwei Liu
On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
>> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <m...@redhat.com> 
>> >> >> wrote:
>> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >> >> >>
>> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> >> >> > > > >
>> >> >> >> > > > >I will NAK patches to change to common code for netvsc 
>> >> >> >> > > > >especially the
>> >> >> >> > > > >three device model.  MS worked hard with distro vendors to 
>> >> >> >> > > > >support transparent
>> >> >> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> >> >> > > > >
>> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> >> > > >
>> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar 
>> >> >> >> > > > oddities.
>> >> >> >> > >
>> >> >> >> > > The network device model is a userspace API, and DPDK is a 
>> >> >> >> > > userspace application.
>> >> >> >> >
>> >> >> >> > It is userspace but are you sure dpdk is actually poking at 
>> >> >> >> > netdevs?
>> >> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >> >
>> >> >> >> > > You can't go breaking userspace even if you don't like the 
>> >> >> >> > > application.
>> >> >> >> >
>> >> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the 
>> >> >> >> > userspace
>> >> >> >> > API at all.
>> >> >> >> >
>> >> >> >>
>> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux 
>> >> >> >> network devices
>> >> >> >> to look for Linux netvsc device and the paired VF device and setup 
>> >> >> >> the
>> >> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
>> >> >> >> instance
>> >> >> >> and sets up TAP support over the Linux netvsc device as well as the 
>> >> >> >> Mellanox
>> >> >> >> VF device.
>> >> >> >>
>> >> >> >> So it depends on existing 2 device model. You can't go to a 3 
>> >> >> >> device model
>> >> >> >> or start hiding devices from userspace.
>> >> >> >
>> >> >> > Okay so how does the existing patch break that? IIUC does not go to
>> >> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >> >
>> >> >> >> Also, I am working on associating netvsc and VF device based on 
>> >> >> >> serial number
>> >> >> >> rather than MAC address. The serial number is how Windows works 
>> >> >> >> now, and it makes
>> >> >> >> sense for Linux and Windows to use the same mechanism if possible.
>> >> >> >
>> >> >> > Maybe we should support same for virtio ...
>> >> >> > Which serial do you mean? From vpd?
>> >> >> >
>> >> >> > I guess you will want to keep supporting MAC for 

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-25 Thread Siwei Liu
On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
>> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <m...@redhat.com> 
>> >> wrote:
>> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >> >>
>> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> >> > > > >
>> >> >> > > > >I will NAK patches to change to common code for netvsc 
>> >> >> > > > >especially the
>> >> >> > > > >three device model.  MS worked hard with distro vendors to 
>> >> >> > > > >support transparent
>> >> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> >> > > > >
>> >> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> >> > > >
>> >> >> > > > Sorry, but nobody here cares about dpdk or other similar 
>> >> >> > > > oddities.
>> >> >> > >
>> >> >> > > The network device model is a userspace API, and DPDK is a 
>> >> >> > > userspace application.
>> >> >> >
>> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> >> > AFAIK it's normally banging device registers directly.
>> >> >> >
>> >> >> > > You can't go breaking userspace even if you don't like the 
>> >> >> > > application.
>> >> >> >
>> >> >> > Could you please explain how is the proposed patchset breaking
>> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the 
>> >> >> > userspace
>> >> >> > API at all.
>> >> >> >
>> >> >>
>> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
>> >> >> devices
>> >> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
>> >> >> instance
>> >> >> and sets up TAP support over the Linux netvsc device as well as the 
>> >> >> Mellanox
>> >> >> VF device.
>> >> >>
>> >> >> So it depends on existing 2 device model. You can't go to a 3 device 
>> >> >> model
>> >> >> or start hiding devices from userspace.
>> >> >
>> >> > Okay so how does the existing patch break that? IIUC does not go to
>> >> > a 3 device model since netvsc calls failover_register directly.
>> >> >
>> >> >> Also, I am working on associating netvsc and VF device based on serial 
>> >> >> number
>> >> >> rather than MAC address. The serial number is how Windows works now, 
>> >> >> and it makes
>> >> >> sense for Linux and Windows to use the same mechanism if possible.
>> >> >
>> >> > Maybe we should support same for virtio ...
>> >> > Which serial do you mean? From vpd?
>> >> >
>> >> > I guess you will want to keep supporting MAC for old hypervisors?
>> >> >
>> >> > It all seems like a reasonable thing to support in the generic core.
>> >>
>> >> That's the reason why I chose explicit identifier rather than rely on
>> >> MAC address to bind/pair a device. MAC address can change. Even if it
>> >> can't, malicious guest user can fake MAC address to skip binding.
>> >>
>> >> -Siwei
>> >
>> > Address should be sampled at device creation to prevent this
>> > kind of hack. Not that it buys the malicious user much:
>> > if you can poke at MAC addresses you probably already can
>> > break networking.
>>
>> I don't understand why poking at MAC address may potentially break
>> networking.
>

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-25 Thread Siwei Liu
On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
>> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> >> On Mon, 23 Apr 2018 20:24:56 +0300
>> >> "Michael S. Tsirkin" <m...@redhat.com> wrote:
>> >>
>> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> >> > > > >
>> >> > > > >I will NAK patches to change to common code for netvsc especially 
>> >> > > > >the
>> >> > > > >three device model.  MS worked hard with distro vendors to support 
>> >> > > > >transparent
>> >> > > > >mode, ans we really can't have a new model; or do backport.
>> >> > > > >
>> >> > > > >Plus, DPDK is now dependent on existing model.
>> >> > > >
>> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> >> > >
>> >> > > The network device model is a userspace API, and DPDK is a userspace 
>> >> > > application.
>> >> >
>> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> >> > AFAIK it's normally banging device registers directly.
>> >> >
>> >> > > You can't go breaking userspace even if you don't like the 
>> >> > > application.
>> >> >
>> >> > Could you please explain how is the proposed patchset breaking
>> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> >> > API at all.
>> >> >
>> >>
>> >> The DPDK has a device driver vdev_netvsc which scans the Linux network 
>> >> devices
>> >> to look for Linux netvsc device and the paired VF device and setup the
>> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) 
>> >> instance
>> >> and sets up TAP support over the Linux netvsc device as well as the 
>> >> Mellanox
>> >> VF device.
>> >>
>> >> So it depends on existing 2 device model. You can't go to a 3 device model
>> >> or start hiding devices from userspace.
>> >
>> > Okay so how does the existing patch break that? IIUC does not go to
>> > a 3 device model since netvsc calls failover_register directly.
>> >
>> >> Also, I am working on associating netvsc and VF device based on serial 
>> >> number
>> >> rather than MAC address. The serial number is how Windows works now, and 
>> >> it makes
>> >> sense for Linux and Windows to use the same mechanism if possible.
>> >
>> > Maybe we should support same for virtio ...
>> > Which serial do you mean? From vpd?
>> >
>> > I guess you will want to keep supporting MAC for old hypervisors?
>> >
>> > It all seems like a reasonable thing to support in the generic core.
>>
>> That's the reason why I chose explicit identifier rather than rely on
>> MAC address to bind/pair a device. MAC address can change. Even if it
>> can't, malicious guest user can fake MAC address to skip binding.
>>
>> -Siwei
>
> Address should be sampled at device creation to prevent this
> kind of hack. Not that it buys the malicious user much:
> if you can poke at MAC addresses you probably already can
> break networking.

I don't understand why poking at MAC address may potentially break
networking. Unlike VF, passthrough PCI endpoint device has its freedom
to change the MAC address. Even on a VF setup it's not neccessarily
always safe to assume the VF's MAC address cannot or shouldn't be
changed. That depends on the specific need whether the host admin
wants to restrict guest from changing the MAC address, although in
most cases it's true.

I understand we can use the perm_addr to distinguish. But as said,
this will pose limitation of flexible configuration where one can
assign VFs with identical MAC address at all while each VF belongs to
different PF and/or different subnet for e.g. load balancing. And
furthermore, the QEMU device model never uses MAC address to be
interpreted as an identifier, which requires to be unique per VM
instance. Why we're introducing this inconsistency?

-Siwei

>
>
>
>
>>
>> >
>> > --
>> > MST


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Siwei Liu
On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin  wrote:
> On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
>> On Mon, 23 Apr 2018 20:24:56 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
>> > > > >
>> > > > >I will NAK patches to change to common code for netvsc especially the
>> > > > >three device model.  MS worked hard with distro vendors to support 
>> > > > >transparent
>> > > > >mode, ans we really can't have a new model; or do backport.
>> > > > >
>> > > > >Plus, DPDK is now dependent on existing model.
>> > > >
>> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
>> > >
>> > > The network device model is a userspace API, and DPDK is a userspace 
>> > > application.
>> >
>> > It is userspace but are you sure dpdk is actually poking at netdevs?
>> > AFAIK it's normally banging device registers directly.
>> >
>> > > You can't go breaking userspace even if you don't like the application.
>> >
>> > Could you please explain how is the proposed patchset breaking
>> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
>> > API at all.
>> >
>>
>> The DPDK has a device driver vdev_netvsc which scans the Linux network 
>> devices
>> to look for Linux netvsc device and the paired VF device and setup the
>> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
>> and sets up TAP support over the Linux netvsc device as well as the Mellanox
>> VF device.
>>
>> So it depends on existing 2 device model. You can't go to a 3 device model
>> or start hiding devices from userspace.
>
> Okay so how does the existing patch break that? IIUC does not go to
> a 3 device model since netvsc calls failover_register directly.
>
>> Also, I am working on associating netvsc and VF device based on serial number
>> rather than MAC address. The serial number is how Windows works now, and it 
>> makes
>> sense for Linux and Windows to use the same mechanism if possible.
>
> Maybe we should support same for virtio ...
> Which serial do you mean? From vpd?
>
> I guess you will want to keep supporting MAC for old hypervisors?
>
> It all seems like a reasonable thing to support in the generic core.

That's the reason why I chose explicit identifier rather than rely on
MAC address to bind/pair a device. MAC address can change. Even if it
can't, malicious guest user can fake MAC address to skip binding.

-Siwei


>
> --
> MST


Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-19 Thread Siwei Liu
On Wed, Apr 18, 2018 at 11:10 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
>
> On 4/18/2018 10:07 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 10:00:51PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>>>
>>>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>>>
>>>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>>>> that I'd like to share thus revive the discussion.
>>>>>>
>>>>>> First of all, as illustrated in the reply below, cloud service
>>>>>> providers require transparent live migration. Specifically, the main
>>>>>> target of our case is to support SR-IOV live migration via kernel
>>>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>>>> because this use case is not appealing enough for the mainline to
>>>>>> adopt, I will shut up and not continue discussing, although
>>>>>> technically it's entirely possible (and there's precedent in other
>>>>>> implementation) to do so to benefit any cloud service providers.
>>>>>>
>>>>>> If it's just the implementation of hiding netdev itself needs to be
>>>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>>>> API, that's completely fine and we can look into that. However, the
>>>>>> specific issue needs to be undestood beforehand is to make transparent
>>>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>>>> from the lower netdev, which needs some games with uevents and name
>>>>>> space reservation. So far I don't think it's been well discussed.
>>>>>>
>>>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>>>> model currently missed to address the core problem of live migration:
>>>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>>>> and hardware offloading states. Only general network state (IP
>>>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>>>> save and apply those hardware specific configs before or after
>>>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>>>> part of this patch series will be able to solve that problem naturally
>>>>>> by making all hardware specific configurations go through the central
>>>>>> bypass driver, such that hardware configurations can be replayed when
>>>>>> new VF or passthrough gets plugged back in. Although that
>>>>>> corresponding function hasn't been implemented today, I'd like to
>>>>>> refresh everyone's mind that is the core problem any live migration
>>>>>> proposal should have addressed.
>>>>>>
>>>>>> If it would make things more clear to defer netdev hiding until all
>>>>>> functionalities regarding centralizing and replay are implemented,
>>>>>> we'd take advices like that and move on to implementing those features
>>>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>>>> the work for hiding lower netdev at that point. Think it would be the
>>>>>> best to make everyone understand the big picture in advance before
>>>>>> going too far.
>>>>>
>>>>> I think we should get the 3-netdev model integrated and add any
>>>>> additional
>>>>> ndo_ops/ethool ops that we would like to support/migrate before looking
>>>>> into
>>>>> hiding the lower netdevs.
>>>>
>>>> Once they are exposed, I don't think we'll be able to hide them -
>>>> they will be a kernel ABI.
>>>>
>>>> Do you think everyone needs to hide the SRIOV device?
>>>> Or that only some users need this?
>>>
>>> Hyper-V is currently supporting live migration without hiding the SR-IOV
>>> device. So i don't
>>> think it is a hard requirement.
>>
>> OK, fine.
>>
>>> And also,  as we don't yet have a consensus on 

Re: [virtio-dev] Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-19 Thread Siwei Liu
On Wed, Apr 18, 2018 at 10:00 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
> On 4/18/2018 9:41 PM, Michael S. Tsirkin wrote:
>>
>> On Wed, Apr 18, 2018 at 04:33:34PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 4/17/2018 5:26 PM, Siwei Liu wrote:
>>>>
>>>> I ran this with a few folks offline and gathered some good feedbacks
>>>> that I'd like to share thus revive the discussion.
>>>>
>>>> First of all, as illustrated in the reply below, cloud service
>>>> providers require transparent live migration. Specifically, the main
>>>> target of our case is to support SR-IOV live migration via kernel
>>>> upgrade while keeping the userspace of old distros unmodified. If it's
>>>> because this use case is not appealing enough for the mainline to
>>>> adopt, I will shut up and not continue discussing, although
>>>> technically it's entirely possible (and there's precedent in other
>>>> implementation) to do so to benefit any cloud service providers.
>>>>
>>>> If it's just the implementation of hiding netdev itself needs to be
>>>> improved, such as implementing it as attribute flag or adding linkdump
>>>> API, that's completely fine and we can look into that. However, the
>>>> specific issue needs to be undestood beforehand is to make transparent
>>>> SR-IOV to be able to take over the name (so inherit all the configs)
>>>> from the lower netdev, which needs some games with uevents and name
>>>> space reservation. So far I don't think it's been well discussed.
>>>>
>>>> One thing in particular I'd like to point out is that the 3-netdev
>>>> model currently missed to address the core problem of live migration:
>>>> migration of hardware specific feature/state, for e.g. ethtool configs
>>>> and hardware offloading states. Only general network state (IP
>>>> address, gateway, for eg.) associated with the bypass interface can be
>>>> migrated. As a follow-up work, bypass driver can/should be enhanced to
>>>> save and apply those hardware specific configs before or after
>>>> migration as needed. The transparent 1-netdev model being proposed as
>>>> part of this patch series will be able to solve that problem naturally
>>>> by making all hardware specific configurations go through the central
>>>> bypass driver, such that hardware configurations can be replayed when
>>>> new VF or passthrough gets plugged back in. Although that
>>>> corresponding function hasn't been implemented today, I'd like to
>>>> refresh everyone's mind that is the core problem any live migration
>>>> proposal should have addressed.
>>>>
>>>> If it would make things more clear to defer netdev hiding until all
>>>> functionalities regarding centralizing and replay are implemented,
>>>> we'd take advices like that and move on to implementing those features
>>>> as follow-up patches. Once all needed features get done, we'd resume
>>>> the work for hiding lower netdev at that point. Think it would be the
>>>> best to make everyone understand the big picture in advance before
>>>> going too far.
>>>
>>> I think we should get the 3-netdev model integrated and add any
>>> additional
>>> ndo_ops/ethool ops that we would like to support/migrate before looking
>>> into
>>> hiding the lower netdevs.
>>
>> Once they are exposed, I don't think we'll be able to hide them -
>> they will be a kernel ABI.
>>
>> Do you think everyone needs to hide the SRIOV device?
>> Or that only some users need this?
>
>
> Hyper-V is currently supporting live migration without hiding the SR-IOV
> device. So i don't
> think it is a hard requirement. And also,  as we don't yet have a consensus
This is a vague point as Hyper-V is mostly Windows oriented: the
target users don't change adapter settings in device manager much as
it's hidden too deep already. Actually it does not address the general
case for SR-IOV live migration but just a subset, why are we making
such comparison?

Note it's always the hard requirement for live migration that *all
states* should be migrated no matter what the implementation it is
going to be. The current 3-netdev model is remote to be useful for
real world scenario and it has no advantage compared to using
userspace generic bonding.

-Siwei

> on how to hide
> the lower netdevs, we could make it as another feature bit to hide lower
> netdevs once
> we have an acceptable solution.
>


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-17 Thread Siwei Liu
I ran this with a few folks offline and gathered some good feedbacks
that I'd like to share thus revive the discussion.

First of all, as illustrated in the reply below, cloud service
providers require transparent live migration. Specifically, the main
target of our case is to support SR-IOV live migration via kernel
upgrade while keeping the userspace of old distros unmodified. If it's
because this use case is not appealing enough for the mainline to
adopt, I will shut up and not continue discussing, although
technically it's entirely possible (and there's precedent in other
implementation) to do so to benefit any cloud service providers.

If it's just the implementation of hiding netdev itself needs to be
improved, such as implementing it as attribute flag or adding linkdump
API, that's completely fine and we can look into that. However, the
specific issue needs to be undestood beforehand is to make transparent
SR-IOV to be able to take over the name (so inherit all the configs)
from the lower netdev, which needs some games with uevents and name
space reservation. So far I don't think it's been well discussed.

One thing in particular I'd like to point out is that the 3-netdev
model currently missed to address the core problem of live migration:
migration of hardware specific feature/state, for e.g. ethtool configs
and hardware offloading states. Only general network state (IP
address, gateway, for eg.) associated with the bypass interface can be
migrated. As a follow-up work, bypass driver can/should be enhanced to
save and apply those hardware specific configs before or after
migration as needed. The transparent 1-netdev model being proposed as
part of this patch series will be able to solve that problem naturally
by making all hardware specific configurations go through the central
bypass driver, such that hardware configurations can be replayed when
new VF or passthrough gets plugged back in. Although that
corresponding function hasn't been implemented today, I'd like to
refresh everyone's mind that is the core problem any live migration
proposal should have addressed.

If it would make things more clear to defer netdev hiding until all
functionalities regarding centralizing and replay are implemented,
we'd take advices like that and move on to implementing those features
as follow-up patches. Once all needed features get done, we'd resume
the work for hiding lower netdev at that point. Think it would be the
best to make everyone understand the big picture in advance before
going too far.

Thanks, comments welcome.

-Siwei


On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <losewe...@gmail.com> wrote:
> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <da...@davemloft.net> wrote:
>> From: Siwei Liu <losewe...@gmail.com>
>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>
>>> And I assume everyone here understands the use case for live
>>> migration (in the context of providing cloud service) is very
>>> different, and we have to hide the netdevs. If not, I'm more than
>>> happy to clarify.
>>
>> I think you still need to clarify.
>
> OK. The short answer is cloud users really want *transparent* live migration.
>
> By being transparent it means they don't and shouldn't care about the
> existence and the occurence of live migration, but they do if
> userspace toolstack and libraries have to be updated or modified,
> which means potential dependency brokeness of their applications. They
> don't like any change to the userspace envinroment (existing apps
> lift-and-shift, no recompilation, no re-packaging, no re-certification
> needed), while no one barely cares about ABI or API compatibility in
> the kernel level, as long as their applications don't break.
>
> I agree the current bypass solution for SR-IOV live migration requires
> guest cooperation. Though it doesn't mean guest *userspace*
> cooperation. As a matter of fact, techinically it shouldn't invovle
> userspace at all to get SR-IOV migration working. It's the kernel that
> does the real work. If I understand the goal of this in-kernel
> approach correctly, it was meant to save userspace from modification
> or corresponding toolstack support, as those additional 2 interfaces
> is more a side product of this approach, rather than being neccessary
> for users to be aware of. All what the user needs to deal with is one
> single interface, and that's what they care about. It's more a trouble
> than help when they see 2 extra interfaces are present. Management
> tools in the old distros don't recoginze them and try to bring up
> those extra interfaces for its own. Various odd warnings start to spew
> out, and there's a lot of caveats for the users to get around...
>
> On the other hand, if we "teach" those cloud users to update the
> userspace toolstack just for trading a feature th

Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Siwei Liu
On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin  wrote:
> On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala  wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala 
>> > ---
>>
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
> Interesting. Does this mean udev must act within a specific time window
> then?

Sighs, lots of hacks. Why propgating this from driver to a common
module. We really need a clean solution.

-Siwei


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Siwei Liu
On Tue, Apr 10, 2018 at 8:43 AM, Jiri Pirko  wrote:
> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:
>>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com 
>>> > > > > wrote:
>>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>>> > > > > > > sridhar.samudr...@intel.com wrote:
>>> > > > > [...]
>>> > > > >
>>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device 
>>> > > > > > > > *bypass_netdev,
>>> > > > > > > > +   struct net_device 
>>> > > > > > > > *child_netdev)
>>> > > > > > > > +{
>>> > > > > > > > +  struct virtnet_bypass_info *vbi;
>>> > > > > > > > +  bool backup;
>>> > > > > > > > +
>>> > > > > > > > +  vbi = netdev_priv(bypass_netdev);
>>> > > > > > > > +  backup = (child_netdev->dev.parent == 
>>> > > > > > > > bypass_netdev->dev.parent);
>>> > > > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>> > > > > > > > +  rtnl_dereference(vbi->active_netdev)) {
>>> > > > > > > > +  netdev_info(bypass_netdev,
>>> > > > > > > > +  "%s attempting to join bypass dev 
>>> > > > > > > > when %s already present\n",
>>> > > > > > > > +  child_netdev->name, backup ? 
>>> > > > > > > > "backup" : "active");
>>> > > > > > > Bypass module should check if there is already some other netdev
>>> > > > > > > enslaved and refuse right there.
>>> > > > > > This will work for virtio-net with 3 netdev model, but this check 
>>> > > > > > has to be done by netvsc
>>> > > > > > as its bypass_netdev is same as the backup_netdev.
>>> > > > > > Will add a flag while registering with the bypass module to 
>>> > > > > > indicate if the driver is doing
>>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check 
>>> > > > > > can be done in bypass module
>>> > > > > > for 3 netdev scenario.
>>> > > > > Just let me undestand it clearly. What I expect the difference 
>>> > > > > would be
>>> > > > > between 2netdev and3 netdev model is this:
>>> > > > > 2netdev:
>>> > > > >  bypass_master
>>> > > > > /
>>> > > > >/
>>> > > > > VF_slave
>>> > > > >
>>> > > > > 3netdev:
>>> > > > >  bypass_master
>>> > > > > / \
>>> > > > >/   \
>>> > > > > VF_slave   backup_slave
>>> > > > >
>>> > > > > Is that correct? If not, how does it look like?
>>> > > > >
>>> > > > >
>>> > > > Looks correct.
>>> > > > VF_slave and backup_slave are the original netdevs and are present in 
>>> > > > both the models.
>>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave 
>>> > > > and backup_slave are
>>> > > > marked as the 2 slaves of this new netdev.
>>> > > You say it looks correct and in another sentence you provide completely
>>> > > different description. Could you please look again?
>>> > >
>>> > To be exact, 2 netdev model with netvsc looks like this.
>>> >
>>> > netvsc_netdev
>>> >   /
>>> >  /
>>> > VF_slave
>>> >
>>> > With virtio_net, 3 netdev model
>>> >
>>> >   bypass_netdev
>>> >   / \
>>> >  /   \
>>> > VF_slave   virtio_net netdev
>>> Could you also mark the original netdev which is there now? is it
>>> bypass_netdev or virtio_net_netdev ?
>>
>> bypass_netdev
>> / \
>>/   \
>>VF_slave   virtio_net netdev (original)
>
> That does not make sense.
> 1) You diverge from the behaviour of the netvsc, where the original
>netdev is a master of the VF
> 2) If the original netdev is a slave, you cannot have any IP address
>configured on it (well you could, but the rx_handler would eat every
>incoming packet). So you will break the user bacause he would have to
>move the configuration to the new master device.

That's exactly the point why I need to hide the lower netdev slaves
and trying the align the naming of the bypass with where IP was
configured on the original netdev. The current 3-netdev model is not
"transparent" by any means.

-Siwei

> This only makes sense that the original netdev becomes the master for both
> netvsc and virtio_net.


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-10 Thread Siwei Liu
On Sun, Apr 8, 2018 at 9:32 AM, David Miller <da...@davemloft.net> wrote:
> From: Siwei Liu <losewe...@gmail.com>
> Date: Fri, 6 Apr 2018 19:32:05 -0700
>
>> And I assume everyone here understands the use case for live
>> migration (in the context of providing cloud service) is very
>> different, and we have to hide the netdevs. If not, I'm more than
>> happy to clarify.
>
> I think you still need to clarify.

OK. The short answer is cloud users really want *transparent* live migration.

By being transparent it means they don't and shouldn't care about the
existence and the occurence of live migration, but they do if
userspace toolstack and libraries have to be updated or modified,
which means potential dependency brokeness of their applications. They
don't like any change to the userspace envinroment (existing apps
lift-and-shift, no recompilation, no re-packaging, no re-certification
needed), while no one barely cares about ABI or API compatibility in
the kernel level, as long as their applications don't break.

I agree the current bypass solution for SR-IOV live migration requires
guest cooperation. Though it doesn't mean guest *userspace*
cooperation. As a matter of fact, techinically it shouldn't invovle
userspace at all to get SR-IOV migration working. It's the kernel that
does the real work. If I understand the goal of this in-kernel
approach correctly, it was meant to save userspace from modification
or corresponding toolstack support, as those additional 2 interfaces
is more a side product of this approach, rather than being neccessary
for users to be aware of. All what the user needs to deal with is one
single interface, and that's what they care about. It's more a trouble
than help when they see 2 extra interfaces are present. Management
tools in the old distros don't recoginze them and try to bring up
those extra interfaces for its own. Various odd warnings start to spew
out, and there's a lot of caveats for the users to get around...

On the other hand, if we "teach" those cloud users to update the
userspace toolstack just for trading a feature they don't need, no one
is likely going to embrace the change. As such there's just no real
value of adopting this in-kernel bypass facility for any cloud service
provider. It does not look more appealing than just configure generic
bonding using its own set of daemons or scripts. But again, cloud
users don't welcome that facility. And basically it would get to
nearly the same set of problems if leaving userspace alone.

IMHO we're not hiding the devices, think it the way we're adding a
feature transparent to user. Those auto-managed slaves are ones users
don't care about much. And user is still able to see and configure the
lower netdevs if they really desires to do so. But generally the
target user for this feature won't need to know that. Why they care
how many interfaces a VM virtually has rather than how many interfaces
are actually _useable_ to them??

Thanks,
-Siwei


>
> netdevs are netdevs.  If they have special attributes, mark them as
> such and the tools base their actions upon that.
>
> "Hiding", or changing classes, doesn't make any sense to me still.


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:
> On Mon, 9 Apr 2018 15:30:42 -0700
> Siwei Liu <losewe...@gmail.com> wrote:
>
>> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <and...@lunn.ch> wrote:
>> >> No, implementation wise I'd avoid changing the class on the fly. What
>> >> I'm looking to is a means to add a secondary class or class aliasing
>> >> mechanism for netdevs that allows mapping for a kernel device
>> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> >> creating symlinks between these two namespaces as an analogy. All
>> >> userspace visible netdevs today will have both a kernel name and a
>> >> userspace visible name, having one (/class/net) referecing the other
>> >> (/class/net-kernel) in its own namespace. The newly introduced
>> >> IFF_AUTO_MANAGED device will have a kernel name only
>> >> (/class/net-kernel). As a result, the existing applications using
>> >> /class/net don't break, while we're adding the kernel namespace that
>> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> >> at all.
>> >
>> > My gut feeling is this whole scheme will not fly. You really should be
>> > talking to GregKH.
>>
>> Will do. Before spreading it out loudly I'd run it within netdev to
>> clarify the need for why not exposing the lower netdevs is critical
>> for cloud service providers in the face of introducing a new feature,
>> and we are not hiding anything but exposing it in a way that don't
>> break existing userspace applications while introducing feature is
>> possible with the limitation of keeping old userspace still.
>>
>> >
>> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
>> > A device can start out as a normal device, and will change to being
>> > automatic later, when the user on top of it probes.
>>
>> Sure. In whatever form it's still a netdev, and changing the namespace
>> should be more dynamic than changing the class.
>>
>> Thanks a lot,
>> -Siwei
>>
>> >
>> > Andrew
>
> Also, remember for netdev's /sys is really a third class API.
> The primary API's are netlink and ioctl. Also why not use existing
> network namespaces rather than inventing a new abstraction?

Because we want to leave old userspace unmodified while making SR-IOV
live migration transparent to users. Specifically, we'd want old udevd
to skip through uevents for the lower netdevs, while also making new
udevd able to name the bypass_master interface by referencing the pci
slot information which is only present in the sysfs entry for
IFF_AUTO_MANAGED net device.

The problem of using network namespace is that, no sysfs entry will be
around for IFF_AUTO_MANAGED netdev if we isolate it out to a separate
netns, unless we generalize the scope for what netns is designed for
(isolation I mean). For auto-managed netdevs we don't neccessarily
wants strict isolation, but rather a way of sticking to 1-netdev model
for strict backward compatibility requiement of the old userspace,
while exposing the information in a way new userspace understands.

Thanks,
-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn  wrote:
>> No, implementation wise I'd avoid changing the class on the fly. What
>> I'm looking to is a means to add a secondary class or class aliasing
>> mechanism for netdevs that allows mapping for a kernel device
>> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> creating symlinks between these two namespaces as an analogy. All
>> userspace visible netdevs today will have both a kernel name and a
>> userspace visible name, having one (/class/net) referecing the other
>> (/class/net-kernel) in its own namespace. The newly introduced
>> IFF_AUTO_MANAGED device will have a kernel name only
>> (/class/net-kernel). As a result, the existing applications using
>> /class/net don't break, while we're adding the kernel namespace that
>> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> at all.
>
> My gut feeling is this whole scheme will not fly. You really should be
> talking to GregKH.

Will do. Before spreading it out loudly I'd run it within netdev to
clarify the need for why not exposing the lower netdevs is critical
for cloud service providers in the face of introducing a new feature,
and we are not hiding anything but exposing it in a way that don't
break existing userspace applications while introducing feature is
possible with the limitation of keeping old userspace still.

>
> Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> A device can start out as a normal device, and will change to being
> automatic later, when the user on top of it probes.

Sure. In whatever form it's still a netdev, and changing the namespace
should be more dynamic than changing the class.

Thanks a lot,
-Siwei

>
> Andrew


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn  wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class thus a
>> separate device namespace e.g. /sys/class/net-kernel for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, i think you need to device_del(dev). modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.

No, implementation wise I'd avoid changing the class on the fly. What
I'm looking to is a means to add a secondary class or class aliasing
mechanism for netdevs that allows mapping for a kernel device
namespace (/class/net-kernel) to userspace (/class/net). Imagine
creating symlinks between these two namespaces as an analogy. All
userspace visible netdevs today will have both a kernel name and a
userspace visible name, having one (/class/net) referecing the other
(/class/net-kernel) in its own namespace. The newly introduced
IFF_AUTO_MANAGED device will have a kernel name only
(/class/net-kernel). As a result, the existing applications using
/class/net don't break, while we're adding the kernel namespace that
allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
at all.

As this requires changing the internals of driver model core it's a
rather big hammer approach I'd think. If there exists a better
implementation than this to allow adding a separate layer of in-kernel
device namespace, I'd more than welcome to hear about.

>
> And i don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?

list_netdevice() is still called. I think with the current RFC patch,
I've added two lists for netdevs under the kernel namespace:
dev_cmpl_list and name_cmpl_hlist. As a result of that, all userspace
netdevs get registered will be added to two types of lists: the
userspace list for e.g. dev_list, and also the kernelspace list e.g.
dev_cmpl_list (I can rename it to something more accurate). The
IFF_AUTO_MANAGED device will be only added to kernelspace list e.g.
dev_cmpl_list.

Hope all your questions are answered.

Thanks,
-Siwei


>
>Andrew


Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-06 Thread Siwei Liu
(click the wrong reply button again, sorry)


On Thu, Apr 5, 2018 at 8:31 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 04/04/2018 10:02, Siwei Liu wrote:
>>> pci_bus_num is almost always a bug if not done within
>>> a context of a PCI host, bridge, etc.
>>>
>>> In particular this will not DTRT if run before guest assigns bus
>>> numbers.
>>>
>> I was seeking means to reserve a specific pci bus slot from drivers,
>> and update the driver when guest assigns the bus number but it seems
>> there's no low-hanging fruits. Because of that reason the bus_num is
>> only obtained until it's really needed (during get_config) and I
>> assume at that point the pci bus assignment is already done. I know
>> the current one is not perfect, but we need that information (PCI
>> bus:slot.func number) to name the guest device correctly.
>
> Can you use the -device "id", and look it up as
>
> devices = container_get(qdev_get_machine(), "/peripheral");
> return object_resolve_path_component(devices, id);


No. The problem of using device id is that the vfio device may come
and go at any time, this is particularly true when live migration is
happening. There's no gurantee we can get the bus:device.func info if
that device is gone. Currently the binding between vfio and virtio-net
is weakly coupled through the backup property, there's no better way
than specifying the bus id and addr property directly.

Regards,
-Siwei

>
> ?
>
> Thanks,
>
> Paolo


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-06 Thread Siwei Liu
On Wed, Apr 4, 2018 at 10:37 AM, David Miller  wrote:
> From: David Ahern 
> Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
> I agree on this.

I'm completely fine of having an API for inspection purpose. The thing
is that we'd perhaps need to go for the namespace approach, for which
I think everyone seems to agree not to fiddle with the ":" prefix, but
rather have a new class of network subsystem under /sys/class thus a
separate device namespace e.g. /sys/class/net-kernel for those
auto-managed lower netdevs is needed.

And I assume everyone here understands the use case for live migration
(in the context of providing cloud service) is very different, and we
have to hide the netdevs. If not, I'm more than happy to clarify.

With that in mind, if having a new class of net-kernel namespace, we
can name the kernel device elaborately which is not neccessarily equal
to the device name exposed to userspace. For example, we can use
driver name as the prefix as opposed to "eth" or ":eth". And we don't
need to have auto-managed netdevs locked into the ":" prefix at all (I
intentionally left it out in the this RFC patch to ask for comments on
the namespace solution which is much cleaner). That said, an userpsace
named device through udev may call something like ens3 and
switch1-port2, but in the kernel-net namespace, it may look like
ixgbevf0 and mlxsw1p2.

So if we all agree introducing a new namespace is the rigth thing to
do, `ip link' will no longer serve the purpose of displaying the
information for kernel-net devnames for the sake of avoiding ambiguity
and namespace collision: it's entirely possible the ip link name could
collide with a kernel-net devname, it's become unclear which name of a
netdev object the command is expected to operate on. That's why I
thought showing the kernel-only netdevs using a separate subcommand
makes more sense.

Thoughts and comments? Please let me know.

Thanks,
-Siwei

>
> What I really don't understand still is the use case... really.
>
> So there are control netdevs, what exactly is the problem with that?
>
> Are we not exporting enough information for applications to handle
> these devices sanely?  If so, then's let add that information.
>
> We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
> Another alternative is to add an interface flag like IFF_CONTROL or
> similar, and that probably is much nicer.
>
> Hiding the devices means that we acknowledge that applications are
> currently broken with control netdevs... and we want them to stay
> broken!
>
> That doesn't sound like a good plan to me.
>
> So let's fix handling of control netdevs instead of hiding them.
>
> Thanks.


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-06 Thread Siwei Liu
(put discussions back on list which got accidentally removed)

On Tue, Apr 3, 2018 at 4:08 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:
> On Tue, 3 Apr 2018 12:04:38 -0700
> Siwei Liu <losewe...@gmail.com> wrote:
>
>> On Tue, Apr 3, 2018 at 10:35 AM, Stephen Hemminger
>> <step...@networkplumber.org> wrote:
>> > On Sun,  1 Apr 2018 05:13:09 -0400
>> > Si-Wei Liu <si-wei@oracle.com> wrote:
>> >
>> >> Hidden netdevice is not visible to userspace such that
>> >> typical network utilites e.g. ip, ifconfig and et al,
>> >> cannot sense its existence or configure it. Internally
>> >> hidden netdev may associate with an upper level netdev
>> >> that userspace has access to. Although userspace cannot
>> >> manipulate the lower netdev directly, user may control
>> >> or configure the underlying hidden device through the
>> >> upper-level netdev. For identification purpose, the
>> >> kobject for hidden netdev still presents in the sysfs
>> >> hierarchy, however, no uevent message will be generated
>> >> when the sysfs entry is created, modified or destroyed.
>> >>
>> >> For that end, a separate namescope needs to be carved
>> >> out for IFF_HIDDEN netdevs. As of now netdev name that
>> >> starts with colon i.e. ':' is invalid in userspace,
>> >> since socket ioctls such as SIOCGIFCONF use ':' as the
>> >> separator for ifname. The absence of namescope started
>> >> with ':' can rightly be used as the namescope for
>> >> the kernel-only IFF_HIDDEN netdevs.
>> >>
>> >> Signed-off-by: Si-Wei Liu <si-wei@oracle.com>
>> >> ---
>> >
>> > I understand the use case. I proposed using . as a prefix before
>> > but that ran into resistance. Using colon seems worse.
>>
>> Using dot (.) can't be good because it would cause namespace collision
>> and thus breaking apps when you hide the device. Imagine a user really
>> wants to add a link with the same name as the one hidden and it starts
>> with a dot. It would fail, and users don't know its just because the
>> name starts with dot. IMHO users should be agnostic of (the namespace
>> of) hidden device at all if what they pick is a valid name.
>>
>> ":" is an invalid prefix to userspace, there's no such problem if
>> being used to construct the namescope for hidden devices.
>>
>> However, technically, just as what I alluded to in the reply earlier,
>> it might really be consistent to put this under a separeate namespace
>> instead than fiddling with name prefix. But I am just not sure if that
>> is a big hammer and would like to earn enough feedback and attention
>> before going that way too quickly.
>>
>>
>> >
>> > Rather than playing with names and all the issues that can cause,
>> > why not make it an attribute flag of the device in netlink.
>>
>> Atrribute flag doesn't help. It's a matter of namespace.
>>
>> Regards,
>> -Siwei
>
> In Vyatta, we used names like ".isatap" for devices that would clutter up
> the user experience. They are naturally not visible by simple scans of
> /sys/class/net, and there was a patch to ignore them in iproute2.
> It was a hack which worked but not really worth upstreaming.
>
> The question is if this a security feature then it needs to be more

I don't expect the namespace to be a security aspect of feature, but
rather a way to make old userspace unmodified  to work with a new
feature. And, we're going to add API to expose the netdev info for the
invisible IFF_AUTO_MANAGED links anyway. We don't need to make it
secure and all hidden under the dark to be honest.

> robust than just name prefix. Plus it took years to handle network
> namespaces everywhere; this kind of flag would start same problems.
>
> Network namespaces work but have the problem namespaces only weakly
> support hierarchy and nesting. I prefer the namespace approach
> because it fits better and has less impact.

Great, thanks!

-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Siwei Liu
On Wed, Apr 4, 2018 at 10:21 AM, David Ahern <dsah...@gmail.com> wrote:
> On 4/4/18 1:36 AM, Siwei Liu wrote:
>> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern <dsah...@gmail.com> wrote:
>>> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>>>
>>>>> There are other use cases that want to hide a device from userspace. I
>>>>
>>>> What usecases do you have in mind?
>>>
>>> As mentioned in a previous response some kernel drivers create control
>>> netdevs. Just as in this case users should not be mucking with it, and
>>> S/W like lldpd should ignore it.
>>>
>>>>
>>>>> would prefer a better solution than playing games with name prefixes and
>>>>> one that includes an API for users to list all devices -- even ones
>>>>> hidden by default.
>>>>
>>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>>> for userspace issues. Why can't the netdevice be visible always and
>>>> userspace would know what is it and what should it do with it?
>>>>
>>>> Once we start with hiding, there are other things related to that which
>>>> appear. Like who can see what, levels of visibility etc...
>>>>
>>>
>>> I would not advocate for any API that does not allow users to have full
>>> introspection. The intent is to hide the netdev by default but have an
>>> option to see it.
>>
>> I'm fine with having a link dump API to inspect the hidden netdev. As
>> said, the name for hidden netdevs should be in a separate device
>> namespace, and we did not even get closer to what it should look like
>> as I don't want to make it just an option for ip link. Perhaps a new
>> set of sub-commands of, say, 'ip device'.
>
> It is a netdev so there is no reason to have a separate ip command to
> inspect it. 'ip link' is the right place.

If you're still thinking the visibility is part of link attribute
rather than a separate namespace, I'd say we are trying to solve
essentially different problems, and I really don't understand your
proposal in solving that problem to be honest.

So, let's step back on studying your case if that's the right thing
and let's talk about concrete examples.

-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 6:04 PM, David Ahern  wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.

I'm still not sure I understand your case: why you want to hide the
control netdev, as I assume those devices could choose either to
silently ignore the request, or fail loudly against user operations?
Is it creating issues already, or what problem you want to solve if
not making the netdev invisible. Why couldn't lldpd check some
specific flag and ignore the control netdevice (can you please give an
example of a concrete driver for control netdevice *in tree*).

And I'm completely lost why you want an API to make a hidden netdev
visible again for these control devices.

Thanks,
-Siwei


>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.


Re: [virtio-dev] Re: [RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 5:20 AM, Michael S. Tsirkin  wrote:
> On Sun, Apr 01, 2018 at 05:13:10AM -0400, Si-Wei Liu wrote:
>> diff --git a/include/uapi/linux/virtio_net.h 
>> b/include/uapi/linux/virtio_net.h
>> index aa40664..0827b7e 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -80,6 +80,8 @@ struct virtio_net_config {
>>   __u16 max_virtqueue_pairs;
>>   /* Default maximum transmit unit advice */
>>   __u16 mtu;
>> + /* Device at bus:slot.function backed up by virtio_net */
>> + __u16 bsf2backup;
>>  } __attribute__((packed));
>
> I'm not sure this is a good interface.  This isn't unique even on some
> PCI systems, not to speak of non-PCI ones.

Are you suggesting adding PCI address domain besides to make it
universally unique? And what the non-PCI device you envisioned that
the main target, essetially live migration, can/should cover? Or is
there better option in your mind already?

Thanks,
-Siwei
>
>>  /*
>> --
>> 1.8.3.1
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>


Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 5:25 AM, Michael S. Tsirkin  wrote:
> On Sun, Apr 01, 2018 at 05:13:08AM -0400, Si-Wei Liu wrote:
>> @@ -896,6 +898,68 @@ void qmp_device_del(const char *id, Error **errp)
>>  }
>>  }
>>
>> +int pci_get_busdevfn_by_id(const char *id, uint16_t *busnr,
>> +   uint16_t *devfn, Error **errp)
>> +{
>> +uint16_t busnum = 0, slot = 0, func = 0;
>> +const char *pc, *pd, *pe;
>> +Error *local_err = NULL;
>> +ObjectClass *class;
>> +char value[1024];
>> +BusState *bus;
>> +uint64_t u64;
>> +
>> +if (!(pc = strchr(id, ':'))) {
>> +error_setg(errp, "Invalid id: backup=%s, "
>> +   "correct format should be backup="
>> +   "':[.]'", id);
>> +return -1;
>> +}
>> +get_opt_name(value, sizeof(value), id, ':');
>> +if (pc != id + 1) {
>> +bus = qbus_find(value, errp);
>> +if (!bus)
>> +return -1;
>> +
>> +class = object_get_class(OBJECT(bus));
>> +if (class != object_class_by_name(TYPE_PCI_BUS) &&
>> +class != object_class_by_name(TYPE_PCIE_BUS)) {
>> +error_setg(errp, "%s is not a device on pci bus", id);
>> +return -1;
>> +}
>> +busnum = (uint16_t)pci_bus_num(PCI_BUS(bus));
>> +}
>
> pci_bus_num is almost always a bug if not done within
> a context of a PCI host, bridge, etc.
>
> In particular this will not DTRT if run before guest assigns bus
> numbers.
>
I was seeking means to reserve a specific pci bus slot from drivers,
and update the driver when guest assigns the bus number but it seems
there's no low-hanging fruits. Because of that reason the bus_num is
only obtained until it's really needed (during get_config) and I
assume at that point the pci bus assignment is already done. I know
the current one is not perfect, but we need that information (PCI
bus:slot.func number) to name the guest device correctly.

Regards,
-Siwei
>
>> +
>> +if (!devfn)
>> +goto out;
>> +
>> +pd = strchr(pc, '.');
>> +pe = get_opt_name(value, sizeof(value), pc + 1, '.');
>> +if (pe != pc + 1) {
>> +parse_option_number("slot", value, , _err);
>> +if (local_err) {
>> +error_propagate(errp, local_err);
>> +return -1;
>> +}
>> +slot = (uint16_t)u64;
>> +}
>> +if (pd && *(pd + 1) != '\0') {
>> +parse_option_number("function", pd, , _err);
>> +if (local_err) {
>> +error_propagate(errp, local_err);
>> +return -1;
>> +}
>> +func = (uint16_t)u64;
>> +}
>> +
>> +out:
>> +if (busnr)
>> +*busnr = busnum;
>> +if (devfn)
>> +*devfn = ((slot & 0x1F) << 3) | (func & 0x7);
>> +return 0;
>> +}
>> +
>>  BlockBackend *blk_by_qdev_id(const char *id, Error **errp)
>>  {
>>  DeviceState *dev;
>> --
>> 1.8.3.1
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 11:19 PM, Jiri Pirko  wrote:
> Wed, Apr 04, 2018 at 03:04:26AM CEST, dsah...@gmail.com wrote:
>>On 4/3/18 9:42 AM, Jiri Pirko wrote:

 There are other use cases that want to hide a device from userspace. I
>>>
>>> What usecases do you have in mind?
>>
>>As mentioned in a previous response some kernel drivers create control
>>netdevs. Just as in this case users should not be mucking with it, and
>
> virtio_net. Any other drivers?

netvsc if factoring out virtio_bypass to a common driver.

>
>
>>S/W like lldpd should ignore it.
>
> It's just a matter of identification of the netdevs, so the user knows
> what to do.
>
>
>>
>>>
 would prefer a better solution than playing games with name prefixes and
 one that includes an API for users to list all devices -- even ones
 hidden by default.
>>>
>>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>>> for userspace issues. Why can't the netdevice be visible always and
>>> userspace would know what is it and what should it do with it?
>>>
>>> Once we start with hiding, there are other things related to that which
>>> appear. Like who can see what, levels of visibility etc...
>>>
>>
>>I would not advocate for any API that does not allow users to have full
>>introspection. The intent is to hide the netdev by default but have an
>>option to see it.
>
> As an administrator, I want to see all by default. I think it is
> reasonable requirements. Again, this awfully smells like a workaround...

If the requirement is just for dumping the link info i.e. perform
read-only operation on the hidden netdev, it's completely fine.
However, I am not a big fan of creating a weird mechanism to allow
user deliberately manipulate the visibility (hide/unhide) of a netdev
in any case at any time. This is subject to becoming a slippery slope
to work around any software issue that should get fixed in the right
place.

Let's treat IFF_HIDDEN as a means to hide auto-managed netdevices. If
it's just the name is misleading, I can get it renamed to something
like IFF_AUTO_MANAGED which might reflect its nature more properly.

Thanks,
-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Siwei Liu
On Tue, Apr 3, 2018 at 6:04 PM, David Ahern  wrote:
> On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>>
>> What usecases do you have in mind?
>
> As mentioned in a previous response some kernel drivers create control
> netdevs. Just as in this case users should not be mucking with it, and
> S/W like lldpd should ignore it.
>
>>
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>>
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>>
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>>
>
> I would not advocate for any API that does not allow users to have full
> introspection. The intent is to hide the netdev by default but have an
> option to see it.

I'm fine with having a link dump API to inspect the hidden netdev. As
said, the name for hidden netdevs should be in a separate device
namespace, and we did not even get closer to what it should look like
as I don't want to make it just an option for ip link. Perhaps a new
set of sub-commands of, say, 'ip device'.

-Siwei


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread Siwei Liu
On Tue, Apr 3, 2018 at 8:42 AM, Jiri Pirko  wrote:
> Sun, Apr 01, 2018 at 06:11:29PM CEST, dsah...@gmail.com wrote:
>>On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>>> Hidden netdevice is not visible to userspace such that
>>> typical network utilites e.g. ip, ifconfig and et al,
>>> cannot sense its existence or configure it. Internally
>>> hidden netdev may associate with an upper level netdev
>>> that userspace has access to. Although userspace cannot
>>> manipulate the lower netdev directly, user may control
>>> or configure the underlying hidden device through the
>>> upper-level netdev. For identification purpose, the
>>> kobject for hidden netdev still presents in the sysfs
>>> hierarchy, however, no uevent message will be generated
>>> when the sysfs entry is created, modified or destroyed.
>>>
>>> For that end, a separate namescope needs to be carved
>>> out for IFF_HIDDEN netdevs. As of now netdev name that
>>> starts with colon i.e. ':' is invalid in userspace,
>>> since socket ioctls such as SIOCGIFCONF use ':' as the
>>> separator for ifname. The absence of namescope started
>>> with ':' can rightly be used as the namescope for
>>> the kernel-only IFF_HIDDEN netdevs.
>>>
>>> Signed-off-by: Si-Wei Liu 
>>> ---
>>>  include/linux/netdevice.h   |  12 ++
>>>  include/net/net_namespace.h |   2 +
>>>  net/core/dev.c  | 281 
>>> ++--
>>>  net/core/net_namespace.c|   1 +
>>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>>
>>
>>There are other use cases that want to hide a device from userspace. I
>
> What usecases do you have in mind?

Hope you're not staring at me and shouting. :)

I think we had discussed a lot, and if the common goal is to merge two
drivers rather than diverge, there's no better way than to hide the
lower devices from all existing userspace management utiliies
(NetworManager, cloud-init). This does not mean loss of visibility as
we can add new API or CLI later on to get those missing ones exposed
as needed, in a way existing userspace apps don't break while new apps
aware of the feature know where to get it. This requirement is
critical to cloud providers, which I wouldn't repeat enough why it
drove me crazy if not seeing this resolved.

Thanks,
-Siwei

>
>>would prefer a better solution than playing games with name prefixes and
>>one that includes an API for users to list all devices -- even ones
>>hidden by default.
>
> Netdevice hiding feels a bit scarry for me. This smells like a workaround
> for userspace issues. Why can't the netdevice be visible always and
> userspace would know what is it and what should it do with it?
>
> Once we start with hiding, there are other things related to that which
> appear. Like who can see what, levels of visibility etc...
>
>
>>
>>https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>>
>>https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>>
>>Also, why are you suggesting that the device should still be visible via
>>/sysfs? That leads to inconsistent views of networking state - /sys
>>shows a device but a link dump does not.


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread Siwei Liu
On Sun, Apr 1, 2018 at 9:11 AM, David Ahern  wrote:
> On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilities e.g. ip, ifconfig and et al,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>>
>> For that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now netdev name that
>> starts with colon i.e. ':' is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. The absence of namescope started
>> with ':' can rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>>
>> Signed-off-by: Si-Wei Liu 
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c  | 281 
>> ++--
>>  net/core/net_namespace.c|   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>>
>
> There are other use cases that want to hide a device from userspace.

Can you elaborate your case in more details? Looking at the links
below I realize that the purpose of hiding devices in your case is
quite different from the our migration case. Particularly, I don't
like the part of elaborately allowing user to manipulate the link's
visibility - things fall apart easily while live migration is on
going. And, why doing additional check for invisible links in every
for_each_netdev() and its friends. This is effectively changing
semantics of internal APIs that exist for decades.

> I would prefer a better solution than playing games with name prefixes and

This part is intentionally left to be that way and I would anticipate
feedback before going too far. The idea in my mind was that I need a
completely new device namespace underneath (or namescope, which is !=
netns) for all netdevs: kernel-only IFF_HIDDEN network devices and
those not. The current namespace for devname is already exposed to
userspace and visible in the sysfs hierarchy, but for backwards
compatibility reasons it's necessary to keep the old udevd still able
to reference the entry of IFF_HIDDEN netdev under the /sys/net
directory. By using the ':' prefix it has the benefit of minimal
changes without introducing another namespace or the accompanied
complexity in managing these two separate namespaces.

Technically, I can create a separate sysfs directory for the new
namescope, say /sys/net-kernel, which includes all netdev entries like
':eth0' and 'ens3', and which can be referenced from /sys/net. It
would make the /sys/net consistent with the view of userspace
utilities, but I am not even sure if that's an overkill, and would
like to gather more feedback before moving to that model.

> one that includes an API for users to list all devices -- even ones

What kind of API you would like to query for hidden devices?
rtnetlink? a private socket API? or something else?

For our case, the sysfs interface is what we need and is sufficient,
since udev is the main target we'd like to support to make the naming
of virtio_bypass consistent and compatible.

> hidden by default.
>
> https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
> https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
> Also, why are you suggesting that the device should still be visible via
> /sysfs? That leads to inconsistent views of networking state - /sys
> shows a device but a link dump does not.

See my clarifications above. I don't mind kernel-only netdevs being
visible via sysfs, as that way we get a good trade-off between
backwards compatibility and visibility. There's still kobject created
there right. Bottom line is that all kernel devices and its life-cycle
uevents are made invisible to userpace network utilities, and I think
it simply gets to the goal of not breaking existing apps while being
able to add new features.

-Siwei


Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-13 Thread Siwei Liu
On Tue, Mar 13, 2018 at 5:28 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
> On 3/12/2018 3:44 PM, Siwei Liu wrote:
>>
>> Apologies, still some comments going. Please see inline.
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudr...@intel.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF
>>> is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host
>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed,
>>> the destination hypervisor sets the MAC filter on the VF and plugs it
>>> back
>>> to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and
>>> a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization=151189725224231=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeb...@intel.com>
>>> ---
>>>   drivers/net/virtio_net.c | 683
>>> ++-
>>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>>   #include 
>>>   #include 
>>>   #include 
>>> +#include 
>>> +#include 
>>>   #include 
>>>   #include 
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>>  u32 speed;
>>>
>>>  unsigned long guest_offloads;
>>> +
>>> +   /* upper netdev created when BACKUP feature enabled */
>>> +   struct net_device *bypass_netdev;
>>>   };
>>>
>>>   struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev,
>>> struct netdev_bpf *xdp)
>>>  }
>>>   }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> + size_t len)
>>> +{
>>> +   struct virtnet_info *vi = netdev_priv(dev);
>>> +   int ret;
>>> +
>>> +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> +   return -EOPNOTSUPP;
>>> +
>>> +   ret = snprintf(buf, len, "_bkup");
>>> +   if (ret >= len)
>>> +   return -EOPNOTSUPP;
>>> +
>>> +   return 0;
>>> +}
>>> +
>>>   static const struct net_device_ops virtnet_netdev = {
>>>  .ndo_open= virtnet_open,
>>>  .ndo_stop= virtnet_close,
>>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev =
>>> {
>>>  .ndo_xdp_xmit   = virtnet_xdp_xmit,
>>>  .ndo_xdp_flush  = virtnet_xdp_flush,
>>>  .ndo_features_check = passthru_features_check,
>>> +   .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>>>   };
>>>
>>>   static void virtnet_config_changed_work(struct work_struct *work)
>>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device
>>> *vdev)
>>>  return 0;
>>>   }
>>>
>>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>>> + * When BACKUP feature is enabled, an addit

Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-12 Thread Siwei Liu
Apologies, still some comments going. Please see inline.

On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
 wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization=151189725224231=2
>
> Signed-off-by: Sridhar Samudrala 
> Signed-off-by: Alexander Duyck 
> Reviewed-by: Jesse Brandeburg 
> ---
>  drivers/net/virtio_net.c | 683 
> ++-
>  1 file changed, 682 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bcd13fe906ca..f2860d86c952 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>
> @@ -206,6 +208,9 @@ struct virtnet_info {
> u32 speed;
>
> unsigned long guest_offloads;
> +
> +   /* upper netdev created when BACKUP feature enabled */
> +   struct net_device *bypass_netdev;
>  };
>
>  struct padded_vnet_hdr {
> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct 
> netdev_bpf *xdp)
> }
>  }
>
> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> + size_t len)
> +{
> +   struct virtnet_info *vi = netdev_priv(dev);
> +   int ret;
> +
> +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> +   return -EOPNOTSUPP;
> +
> +   ret = snprintf(buf, len, "_bkup");
> +   if (ret >= len)
> +   return -EOPNOTSUPP;
> +
> +   return 0;
> +}
> +
>  static const struct net_device_ops virtnet_netdev = {
> .ndo_open= virtnet_open,
> .ndo_stop= virtnet_close,
> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
> .ndo_xdp_xmit   = virtnet_xdp_xmit,
> .ndo_xdp_flush  = virtnet_xdp_flush,
> .ndo_features_check = passthru_features_check,
> +   .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>  };
>
>  static void virtnet_config_changed_work(struct work_struct *work)
> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device 
> *vdev)
> return 0;
>  }
>
> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
> + * is created that acts as a master device and tracks the state of the
> + * 2 lower netdevs. The original virtio_net netdev is registered as
> + * 'backup' netdev and a passthru device with the same MAC is registered
> + * as 'active' netdev.
> + */
> +
> +/* bypass state maintained when BACKUP feature is enabled */
> +struct virtnet_bypass_info {
> +   /* passthru netdev with same MAC */
> +   struct net_device __rcu *active_netdev;
> +
> +   /* virtio_net netdev */
> +   struct net_device __rcu *backup_netdev;
> +
> +   /* active netdev stats */
> +   struct rtnl_link_stats64 active_stats;
> +
> +   /* backup netdev stats */
> +   struct rtnl_link_stats64 backup_stats;
> +
> +   /* aggregated stats */
> +   struct rtnl_link_stats64 bypass_stats;
> +
> +   /* spinlock while updating stats */
> +   spinlock_t stats_lock;
> +};
> +
> +static void virtnet_bypass_child_open(struct net_device *dev,
> + struct net_device *child_netdev)
> +{
> +   int err = dev_open(child_netdev);
> +
> +   if (err)
> +   netdev_warn(dev, "unable to open slave: %s: %d\n",
> +   child_netdev->name, err);
> +}

Why ignoring the error?? i.e. virtnet_bypass_child_open should have
return value. I don't 

Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-12 Thread Siwei Liu
On Sat, Mar 3, 2018 at 8:04 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 03:56:31PM -0800, Siwei Liu wrote:
>> On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
>> >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> >> <sridhar.samudr...@intel.com> wrote:
>> >> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> >> > netdev is present with the same MAC address. It allows live migration
>> >> > of a VM with a direct attached VF without the need to setup a bond/team
>> >> > between a VF and virtio net device in the guest.
>> >> >
>> >> > The hypervisor needs to enable only one datapath at any time so that
>> >> > packets don't get looped back to the VM over the other datapath. When a 
>> >> > VF
>> >> > is plugged, the virtio datapath link state can be marked as down. The
>> >> > hypervisor needs to unplug the VF device from the guest on the source 
>> >> > host
>> >> > and reset the MAC filter of the VF to initiate failover of datapath to
>> >> > virtio before starting the migration. After the migration is completed,
>> >> > the destination hypervisor sets the MAC filter on the VF and plugs it 
>> >> > back
>> >> > to the guest to switch over to VF datapath.
>> >> >
>> >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> >> > created that acts as a master device and tracks the state of the 2 lower
>> >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev 
>> >> > and a
>> >> > passthru device with the same MAC is registered as 'active' netdev.
>> >> >
>> >> > This patch is based on the discussion initiated by Jesse on this thread.
>> >> > https://marc.info/?l=linux-virtualization=151189725224231=2
>> >> >
>> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> >> > Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>> >> > Reviewed-by: Jesse Brandeburg <jesse.brandeb...@intel.com>
>> >> > ---
>> >> >  drivers/net/virtio_net.c | 683 
>> >> > ++-
>> >> >  1 file changed, 682 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> > index bcd13fe906ca..f2860d86c952 100644
>> >> > --- a/drivers/net/virtio_net.c
>> >> > +++ b/drivers/net/virtio_net.c
>> >> > @@ -30,6 +30,8 @@
>> >> >  #include 
>> >> >  #include 
>> >> >  #include 
>> >> > +#include 
>> >> > +#include 
>> >> >  #include 
>> >> >  #include 
>> >> >
>> >> > @@ -206,6 +208,9 @@ struct virtnet_info {
>> >> > u32 speed;
>> >> >
>> >> > unsigned long guest_offloads;
>> >> > +
>> >> > +   /* upper netdev created when BACKUP feature enabled */
>> >> > +   struct net_device *bypass_netdev;
>> >> >  };
>> >> >
>> >> >  struct padded_vnet_hdr {
>> >> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, 
>> >> > struct netdev_bpf *xdp)
>> >> > }
>> >> >  }
>> >> >
>> >> > +static int virtnet_get_phys_port_name(struct net_device *dev, char 
>> >> > *buf,
>> >> > + size_t len)
>> >> > +{
>> >> > +   struct virtnet_info *vi = netdev_priv(dev);
>> >> > +   int ret;
>> >> > +
>> >> > +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> >> > +   return -EOPNOTSUPP;
>> >> > +
>> >> > +   ret = snprintf(buf, len, "_bkup");
>> >> > +   if (ret >= len)
>> >> > +   return -EOPNOTSUPP;
>> >> > +
>> >> > +   return 0;
>> >> > +}
>> >> > +
>> >>
>> >> What if the systemd/udevd is not new enough to enforce the
>> >> 

Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 3:12 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
> On 3/2/2018 1:11 PM, Siwei Liu wrote:
>>
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudr...@intel.com> wrote:
>>>
>>> This patch enables virtio_net to switch over to a VF datapath when a VF
>>> netdev is present with the same MAC address. It allows live migration
>>> of a VM with a direct attached VF without the need to setup a bond/team
>>> between a VF and virtio net device in the guest.
>>>
>>> The hypervisor needs to enable only one datapath at any time so that
>>> packets don't get looped back to the VM over the other datapath. When a
>>> VF
>>> is plugged, the virtio datapath link state can be marked as down. The
>>> hypervisor needs to unplug the VF device from the guest on the source
>>> host
>>> and reset the MAC filter of the VF to initiate failover of datapath to
>>> virtio before starting the migration. After the migration is completed,
>>> the destination hypervisor sets the MAC filter on the VF and plugs it
>>> back
>>> to the guest to switch over to VF datapath.
>>>
>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>>> created that acts as a master device and tracks the state of the 2 lower
>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and
>>> a
>>> passthru device with the same MAC is registered as 'active' netdev.
>>>
>>> This patch is based on the discussion initiated by Jesse on this thread.
>>> https://marc.info/?l=linux-virtualization=151189725224231=2
>>>
>>> Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>>> Reviewed-by: Jesse Brandeburg <jesse.brandeb...@intel.com>
>>> ---
>>>   drivers/net/virtio_net.c | 683
>>> ++-
>>>   1 file changed, 682 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index bcd13fe906ca..f2860d86c952 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -30,6 +30,8 @@
>>>   #include 
>>>   #include 
>>>   #include 
>>> +#include 
>>> +#include 
>>>   #include 
>>>   #include 
>>>
>>> @@ -206,6 +208,9 @@ struct virtnet_info {
>>>  u32 speed;
>>>
>>>  unsigned long guest_offloads;
>>> +
>>> +   /* upper netdev created when BACKUP feature enabled */
>>> +   struct net_device *bypass_netdev;
>>>   };
>>>
>>>   struct padded_vnet_hdr {
>>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev,
>>> struct netdev_bpf *xdp)
>>>  }
>>>   }
>>>
>>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>>> + size_t len)
>>> +{
>>> +   struct virtnet_info *vi = netdev_priv(dev);
>>> +   int ret;
>>> +
>>> +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>>> +   return -EOPNOTSUPP;
>>> +
>>> +   ret = snprintf(buf, len, "_bkup");
>>> +   if (ret >= len)
>>> +   return -EOPNOTSUPP;
>>> +
>>> +   return 0;
>>> +}
>>> +
>>
>> What if the systemd/udevd is not new enough to enforce the
>> n naming? Would virtio_bypass get a different name
>> than the original virtio_net? Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and ensalving
>> the VF?
>
>
> If udev doesn't support renaming of the devices,  then the upper bypass
> device
> should get the original name and the lower virtio netdev will get the next
> name.

If you got two virtio-net's (say e.g. eth0 and eth1) before the
update, the virtio-bypass interface on the first virtio gets eth0 and
the backup netdev would get eth1? Then the IP address originally on
eth1 gets configurd on the backup interface?

> Hopefully the distros updating the kernel will also move to the new
> systemd/udev.

This is not reliable. I'd opt for a new udev API for this, and fall
back to what it was if don't see fit.

>
>
>
>>
>>>   static const struct net_device_ops virtnet_netdev = {
>>>  .ndo_open 

Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote:
>> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
>> <sridhar.samudr...@intel.com> wrote:
>> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > netdev is present with the same MAC address. It allows live migration
>> > of a VM with a direct attached VF without the need to setup a bond/team
>> > between a VF and virtio net device in the guest.
>> >
>> > The hypervisor needs to enable only one datapath at any time so that
>> > packets don't get looped back to the VM over the other datapath. When a VF
>> > is plugged, the virtio datapath link state can be marked as down. The
>> > hypervisor needs to unplug the VF device from the guest on the source host
>> > and reset the MAC filter of the VF to initiate failover of datapath to
>> > virtio before starting the migration. After the migration is completed,
>> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > to the guest to switch over to VF datapath.
>> >
>> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> > created that acts as a master device and tracks the state of the 2 lower
>> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> > passthru device with the same MAC is registered as 'active' netdev.
>> >
>> > This patch is based on the discussion initiated by Jesse on this thread.
>> > https://marc.info/?l=linux-virtualization=151189725224231=2
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> > Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>> > Reviewed-by: Jesse Brandeburg <jesse.brandeb...@intel.com>
>> > ---
>> >  drivers/net/virtio_net.c | 683 
>> > ++-
>> >  1 file changed, 682 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> > index bcd13fe906ca..f2860d86c952 100644
>> > --- a/drivers/net/virtio_net.c
>> > +++ b/drivers/net/virtio_net.c
>> > @@ -30,6 +30,8 @@
>> >  #include 
>> >  #include 
>> >  #include 
>> > +#include 
>> > +#include 
>> >  #include 
>> >  #include 
>> >
>> > @@ -206,6 +208,9 @@ struct virtnet_info {
>> > u32 speed;
>> >
>> > unsigned long guest_offloads;
>> > +
>> > +   /* upper netdev created when BACKUP feature enabled */
>> > +   struct net_device *bypass_netdev;
>> >  };
>> >
>> >  struct padded_vnet_hdr {
>> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, 
>> > struct netdev_bpf *xdp)
>> > }
>> >  }
>> >
>> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>> > + size_t len)
>> > +{
>> > +   struct virtnet_info *vi = netdev_priv(dev);
>> > +   int ret;
>> > +
>> > +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>> > +   return -EOPNOTSUPP;
>> > +
>> > +   ret = snprintf(buf, len, "_bkup");
>> > +   if (ret >= len)
>> > +   return -EOPNOTSUPP;
>> > +
>> > +   return 0;
>> > +}
>> > +
>>
>> What if the systemd/udevd is not new enough to enforce the
>> n naming? Would virtio_bypass get a different name
>> than the original virtio_net?
>
> You mean people using ethX names? Any hardware config change breaks
> these, I don't think that can be helped.

I don't like the way to rely on .ndo_get_phys_port_name - it's fragile
and it does not completely solve the problem it tries to address.
Imagine what can end up with if getting an old udevd, or users already
have exsiting explicit udev rules around phys_port_name. It does not
give you the an ack in saying "yes, I know you're the bypass and
you're the backup, please continue and I will give you both correct
names", or an unacknowlegment saying "no, I don't know what these
extra interfaces are, please go back and leave the VF device alone".
We need new udev API for both feature negotiation and naming, or may
even completely hide the lower interfaces.

>
>> Should we detect this earlier and fall
>> back to legacy mode without creating the bypass netdev and ensalving
>> the VF?
>
> I don't think we can do this with existing kernel/userspace APIs.

That's why I ever said to make udev aware of this new type of combined
device instead of doing hacks here and there around.

Regards,
-Siwei

>
> --
> MST


Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 1:31 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote:
>> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>> >>
>> >>
>> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> >> > > The design limits things to a 1:1 relationship since we just have the
>> >> > > child and backup pointers, but I don't think I am seeing exception
>> >> > > handling to prevent us from overwriting the child pointers so there
>> >> > > may be a leak there.
>> >> > >
>> >> > > Thanks.
>> >> > >
>> >> > > - Alex
>> >> > In fact maintaining a list in that case would be nicer, and
>> >> > just use an arbitrary one.
>> >> > E.g. one can see how a user wanting to swap device 1 for device 2
>> >> > might first add device 2 with same MAC then drop device 1.
>> >>
>> >> It should be possible to swap VF1 with VF2 by
>> >> 1.- enabling virtio link
>> >> 2.- unplugging VF1
>> >> 3.- plugging VF2
>> >> 4.- disabling virtio link
>> >>
>> >
>> > True, but it isn't hard to avoid breakage if user
>> > swapped steps 2 and 3. No need to make it more
>> > fragile that it has to be.
>>
>> The migration case, VF2 is associated with another PF on another
>> machine (destination), I wonder how it is possible.
>
> E.g. you want to remove the PF so you unplug the VF
> then add another VF of the same PF.
>
>> Even with local plugging of VF2 on the same PF, the MAC address
>> requirement (VF1's == VF2's) would fail the MAC address assignment on
>> VF2.
>>
>> -Siwei
>
> Why would it fail? These are separate cards.

OK. I realized that you may talk about assigning a VF on a diffferent
PF (VF1 on PF1 while VF2 on PF2). And we might assign a pass-through
device rather than a VF. Yes, it's indeed possible that may happen but
I take it as a further step down (another patch maybe) as it would
involve changes to notify the network with gratuituious ARP and/or
unsolicited ND advertisement of the MAC address association with the
new port.

-Siwei

>
>> >
>> > --
>> > MST
>> >
>> > -
>> > To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
>> > For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>> >


Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala
 wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to enable only one datapath at any time so that
> packets don't get looped back to the VM over the other datapath. When a VF
> is plugged, the virtio datapath link state can be marked as down. The
> hypervisor needs to unplug the VF device from the guest on the source host
> and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed,
> the destination hypervisor sets the MAC filter on the VF and plugs it back
> to the guest to switch over to VF datapath.
>
> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
> created that acts as a master device and tracks the state of the 2 lower
> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
> passthru device with the same MAC is registered as 'active' netdev.
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization=151189725224231=2
>
> Signed-off-by: Sridhar Samudrala 
> Signed-off-by: Alexander Duyck 
> Reviewed-by: Jesse Brandeburg 
> ---
>  drivers/net/virtio_net.c | 683 
> ++-
>  1 file changed, 682 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bcd13fe906ca..f2860d86c952 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>
> @@ -206,6 +208,9 @@ struct virtnet_info {
> u32 speed;
>
> unsigned long guest_offloads;
> +
> +   /* upper netdev created when BACKUP feature enabled */
> +   struct net_device *bypass_netdev;
>  };
>
>  struct padded_vnet_hdr {
> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct 
> netdev_bpf *xdp)
> }
>  }
>
> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
> + size_t len)
> +{
> +   struct virtnet_info *vi = netdev_priv(dev);
> +   int ret;
> +
> +   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> +   return -EOPNOTSUPP;
> +
> +   ret = snprintf(buf, len, "_bkup");
> +   if (ret >= len)
> +   return -EOPNOTSUPP;
> +
> +   return 0;
> +}
> +

What if the systemd/udevd is not new enough to enforce the
n naming? Would virtio_bypass get a different name
than the original virtio_net? Should we detect this earlier and fall
back to legacy mode without creating the bypass netdev and ensalving
the VF?

>  static const struct net_device_ops virtnet_netdev = {
> .ndo_open= virtnet_open,
> .ndo_stop= virtnet_close,
> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = {
> .ndo_xdp_xmit   = virtnet_xdp_xmit,
> .ndo_xdp_flush  = virtnet_xdp_flush,
> .ndo_features_check = passthru_features_check,
> +   .ndo_get_phys_port_name = virtnet_get_phys_port_name,
>  };
>
>  static void virtnet_config_changed_work(struct work_struct *work)
> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device 
> *vdev)
> return 0;
>  }
>
> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
> + * When BACKUP feature is enabled, an additional netdev(bypass netdev)
> + * is created that acts as a master device and tracks the state of the
> + * 2 lower netdevs. The original virtio_net netdev is registered as
> + * 'backup' netdev and a passthru device with the same MAC is registered
> + * as 'active' netdev.
> + */
> +
> +/* bypass state maintained when BACKUP feature is enabled */
> +struct virtnet_bypass_info {
> +   /* passthru netdev with same MAC */
> +   struct net_device __rcu *active_netdev;
> +
> +   /* virtio_net netdev */
> +   struct net_device __rcu *backup_netdev;
> +
> +   /* active netdev stats */
> +   struct rtnl_link_stats64 active_stats;
> +
> +   /* backup netdev stats */
> +   struct rtnl_link_stats64 backup_stats;
> +
> +   /* aggregated stats */
> +   struct rtnl_link_stats64 bypass_stats;
> +
> +   /* spinlock while updating stats */
> +   spinlock_t stats_lock;
> +};
> +
> +static void virtnet_bypass_child_open(struct net_device *dev,
> + struct net_device *child_netdev)
> +{
> +   int err = dev_open(child_netdev);
> +
> +   if (err)
> +   

Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 11:42 AM, Michael S. Tsirkin  wrote:
> On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote:
>> >Yeah, this code essentially calls out the "shareable" code with a
>> >comment at the start and end of the section what defines the
>> >virtio_bypass functionality. It would just be a matter of mostly
>> >cutting and pasting to put it into a separate driver module.
>>
>> Please put it there and unite the use of it with netvsc.
>
> Surely, adding this to other drivers (e.g. might this be handy for xen
> too?) can be left for a separate patchset. Let's get one device merged
> first.
Agreed.

-Siwei

>
> --
> MST
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>


Re: [virtio-dev] Re: [PATCH v4 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-03-02 Thread Siwei Liu
On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin  wrote:
> On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote:
>>
>>
>> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote:
>> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote:
>> > > The design limits things to a 1:1 relationship since we just have the
>> > > child and backup pointers, but I don't think I am seeing exception
>> > > handling to prevent us from overwriting the child pointers so there
>> > > may be a leak there.
>> > >
>> > > Thanks.
>> > >
>> > > - Alex
>> > In fact maintaining a list in that case would be nicer, and
>> > just use an arbitrary one.
>> > E.g. one can see how a user wanting to swap device 1 for device 2
>> > might first add device 2 with same MAC then drop device 1.
>>
>> It should be possible to swap VF1 with VF2 by
>> 1.- enabling virtio link
>> 2.- unplugging VF1
>> 3.- plugging VF2
>> 4.- disabling virtio link
>>
>
> True, but it isn't hard to avoid breakage if user
> swapped steps 2 and 3. No need to make it more
> fragile that it has to be.

The migration case, VF2 is associated with another PF on another
machine (destination), I wonder how it is possible.

Even with local plugging of VF2 on the same PF, the MAC address
requirement (VF1's == VF2's) would fail the MAC address assignment on
VF2.

-Siwei

>
> --
> MST
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Siwei Liu
On Fri, Feb 23, 2018 at 2:38 PM, Jiri Pirko  wrote:
> Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote:
>
> [...]
>

 No, that's not what I was talking about of course. I thought you
 mentioned the upgrade scenario this patch would like to address is to
 use the bypass interface "to take the place of the original virtio,
 and get udev to rename the bypass to what the original virtio_net
 was". That is one of the possible upgrade paths for sure. However the
 upgrade path I was seeking is to use the bypass interface to take the
 place of original VF interface while retaining the name and network
 configs, which generally can be done simply with kernel upgrade. It
 would become limiting as this patch makes the bypass interface share
 the same virtio pci device with virito backup. Can this bypass
 interface be made general to take place of any pci device other than
 virtio-net? This will be more helpful as the cloud users who has
 existing setup on VF interface don't have to recreate it on virtio-net
 and VF separately again.
>
> How that could work? If you have the VF netdev with all configuration
> including IPs and routes and whatever - now you want to do migration
> so you add virtio_net and do some weird in-driver bonding with it. But
> then, VF disappears and the VF netdev with that and also all
> configuration it had.
> I don't think this scenario is valid.

We are talking about making udev aware of the new virtio-bypass to
rebind the name of the old VF interface with supposedly virtio-bypass
*post the kernel upgrade*. Of course, this needs virtio-net backend to
supply the [bdf] info where the VF/PT device was located.

-Siwei


>
>
>>>
>>>
>>> Yes. This sounds interesting. Looks like you want an existing VM image with
>>> VF only configuration to get transparent live migration support by adding
>>> virtio_net with BACKUP feature.  We may need another feature bit to switch
>>> between these 2 options.
>>
>>Yes, that's what I was thinking about. I have been building something
>>like this before, and would like to get back after merging with your
>>patch.


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Siwei Liu
On Wed, Feb 21, 2018 at 6:35 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
> On 2/21/2018 5:59 PM, Siwei Liu wrote:
>>
>> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>>
>>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <losewe...@gmail.com> wrote:
>>>>
>>>> I haven't checked emails for days and did not realize the new revision
>>>> had already came out. And thank you for the effort, this revision
>>>> really looks to be a step forward towards our use case and is close to
>>>> what we wanted to do. A few questions in line.
>>>>
>>>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>>>> <alexander.du...@gmail.com> wrote:
>>>>>
>>>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>>>>>
>>>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> Ppatch 2 is in response to the community request for a 3 netdev
>>>>>>> solution.  However, it creates some issues we'll get into in a
>>>>>>> moment.
>>>>>>> It extends virtio_net to use alternate datapath when available and
>>>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>>>> an additional 'bypass' netdev that acts as a master device and
>>>>>>> controls
>>>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
>>>>>>> netdev
>>>>>>> as default for transmits when it is available with link up and
>>>>>>> running.
>>>>>>
>>>>>> Thank you do doing this.
>>>>>>
>>>>>>> We noticed a couple of issues with this approach during testing.
>>>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>>>virtio pci device, udev tries to rename both of them with the same
>>>>>>> name
>>>>>>>and the 2nd rename will fail. This would be OK as long as the
>>>>>>> first netdev
>>>>>>>to be renamed is the 'bypass' netdev, but the order in which udev
>>>>>>> gets
>>>>>>>to rename the 2 netdevs is not reliable.
>>>>>>
>>>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>>>> struct device?
>>>>>
>>>>> The basic idea of all this is that we wanted this to work with an
>>>>> existing VM image that was using virtio. As such we were trying to
>>>>> make it so that the bypass interface takes the place of the original
>>>>> virtio and get udev to rename the bypass to what the original
>>>>> virtio_net was.
>>>>
>>>> Could it made it also possible to take over the config from VF instead
>>>> of virtio on an existing VM image? And get udev rename the bypass
>>>> netdev to what the original VF was. I don't say tightly binding the
>>>> bypass master to only virtio or VF, but I think we should provide both
>>>> options to support different upgrade paths. Possibly we could tweak
>>>> the device tree layout to reuse the same PCI slot for the master
>>>> bypass netdev, such that udev would not get confused when renaming the
>>>> device. The VF needs to use a different function slot afterwards.
>>>> Perhaps we might need to a special multiseat like QEMU device for that
>>>> purpose?
>>>>
>>>> Our case we'll upgrade the config from VF to virtio-bypass directly.
>>>
>>> So if I am understanding what you are saying you are wanting to flip
>>> the backup interface from the virtio to a VF. The problem is that
>>> becomes a bit of a vendor lock-in solution since it would rely on a
>>> specific VF driver. I would agree with Jiri that we don't want to go
>>> down that path. We don't want every VF out there firing up its own
>>> separate bond. Ideally you

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Siwei Liu
On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <losewe...@gmail.com> wrote:
>> I haven't checked emails for days and did not realize the new revision
>> had already came out. And thank you for the effort, this revision
>> really looks to be a step forward towards our use case and is close to
>> what we wanted to do. A few questions in line.
>>
>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>> <alexander.du...@gmail.com> wrote:
>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>> Ppatch 2 is in response to the community request for a 3 netdev
>>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>>> It extends virtio_net to use alternate datapath when available and
>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>> as default for transmits when it is available with link up and running.
>>>>
>>>> Thank you do doing this.
>>>>
>>>>> We noticed a couple of issues with this approach during testing.
>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>   virtio pci device, udev tries to rename both of them with the same name
>>>>>   and the 2nd rename will fail. This would be OK as long as the first 
>>>>> netdev
>>>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>>   to rename the 2 netdevs is not reliable.
>>>>
>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>> struct device?
>>>
>>> The basic idea of all this is that we wanted this to work with an
>>> existing VM image that was using virtio. As such we were trying to
>>> make it so that the bypass interface takes the place of the original
>>> virtio and get udev to rename the bypass to what the original
>>> virtio_net was.
>>
>> Could it made it also possible to take over the config from VF instead
>> of virtio on an existing VM image? And get udev rename the bypass
>> netdev to what the original VF was. I don't say tightly binding the
>> bypass master to only virtio or VF, but I think we should provide both
>> options to support different upgrade paths. Possibly we could tweak
>> the device tree layout to reuse the same PCI slot for the master
>> bypass netdev, such that udev would not get confused when renaming the
>> device. The VF needs to use a different function slot afterwards.
>> Perhaps we might need to a special multiseat like QEMU device for that
>> purpose?
>>
>> Our case we'll upgrade the config from VF to virtio-bypass directly.
>
> So if I am understanding what you are saying you are wanting to flip
> the backup interface from the virtio to a VF. The problem is that
> becomes a bit of a vendor lock-in solution since it would rely on a
> specific VF driver. I would agree with Jiri that we don't want to go
> down that path. We don't want every VF out there firing up its own
> separate bond. Ideally you want the hypervisor to be able to manage
> all of this which is why it makes sense to have virtio manage this and
> why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of original VF interface while retaining the name and network
configs, which generally can be done simply with kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with virito backup. Can this bypass
interface be made general to take place of any pci device other

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Siwei Liu
I haven't checked emails for days and did not realize the new revision
had already came out. And thank you for the effort, this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions in line.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
 wrote:
> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>> Ppatch 2 is in response to the community request for a 3 netdev
>>> solution.  However, it creates some issues we'll get into in a moment.
>>> It extends virtio_net to use alternate datapath when available and
>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>> an additional 'bypass' netdev that acts as a master device and controls
>>> 2 slave devices.  The original virtio_net netdev is registered as
>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> associated with the same 'pci' device.  The user accesses the network
>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>> as default for transmits when it is available with link up and running.
>>
>> Thank you do doing this.
>>
>>> We noticed a couple of issues with this approach during testing.
>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>   virtio pci device, udev tries to rename both of them with the same name
>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>   to rename the 2 netdevs is not reliable.
>>
>> Out of curiosity - why do you link the master netdev to the virtio
>> struct device?
>
> The basic idea of all this is that we wanted this to work with an
> existing VM image that was using virtio. As such we were trying to
> make it so that the bypass interface takes the place of the original
> virtio and get udev to rename the bypass to what the original
> virtio_net was.

Could it made it also possible to take over the config from VF instead
of virtio on an existing VM image? And get udev rename the bypass
netdev to what the original VF was. I don't say tightly binding the
bypass master to only virtio or VF, but I think we should provide both
options to support different upgrade paths. Possibly we could tweak
the device tree layout to reuse the same PCI slot for the master
bypass netdev, such that udev would not get confused when renaming the
device. The VF needs to use a different function slot afterwards.
Perhaps we might need to a special multiseat like QEMU device for that
purpose?

Our case we'll upgrade the config from VF to virtio-bypass directly.

>
>> FWIW two solutions that immediately come to mind is to export "backup"
>> as phys_port_name of the backup virtio link and/or assign a name to the
>> master like you are doing already.  I think team uses team%d and bond
>> uses bond%d, soft naming of master devices seems quite natural in this
>> case.
>
> I figured I had overlooked something like that.. Thanks for pointing
> this out. Okay so I think the phys_port_name approach might resolve
> the original issue. If I am reading things correctly what we end up
> with is the master showing up as "ens1" for example and the backup
> showing up as "ens1nbackup". Am I understanding that right?
>
> The problem with the team/bond%d approach is that it creates a new
> netdevice and so it would require guest configuration changes.
>
>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>> link is quite neat.
>
> I agree. For non-"backup" virio_net devices would it be okay for us to
> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> behavior could be maintained although the function still exists.
>
>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>   system after live migration, the user will see 2 virtio_net netdevs.
>>
>> That's necessary and expected, all configuration applies to the master
>> so master must exist.
>
> With the naming issue resolved this is the only item left outstanding.
> This becomes a matter of form vs function.
>
> The main complaint about the "3 netdev" solution is a bit confusing to
> have the 2 netdevs present if the VF isn't there. The idea is that
> having the extra "master" netdev there if there isn't really a bond is
> a bit ugly.

Is it this uglier in terms of user experience rather than
functionality? I don't want it dynamically changed between 2-netdev
and 3-netdev depending on the presence of VF. That gets back to my
original question and suggestion earlier: why not just hide the lower
netdevs from udev renaming and such? Which important observability
benefits users may get if exposing the lower netdevs?

Thanks,
-Siwei

>
> The downside of the "2 netdev" solution is that you have to deal 

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-26 Thread Siwei Liu
On Fri, Jan 26, 2018 at 8:51 AM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
>
>
> On 1/26/2018 12:14 AM, Siwei Liu wrote:
>>
>> On Tue, Jan 23, 2018 at 2:58 PM, Michael S. Tsirkin <m...@redhat.com>
>> wrote:
>>>
>>> On Tue, Jan 23, 2018 at 12:24:47PM -0800, Siwei Liu wrote:
>>>>
>>>> On Mon, Jan 22, 2018 at 1:41 PM, Michael S. Tsirkin <m...@redhat.com>
>>>> wrote:
>>>>>
>>>>> On Mon, Jan 22, 2018 at 12:27:14PM -0800, Siwei Liu wrote:
>>>>>>
>>>>>> First off, as mentioned in another thread, the model of stacking up
>>>>>> virt-bond functionality over virtio seems a wrong direction to me.
>>>>>> Essentially the migration process would need to carry over all guest
>>>>>> side configurations previously done on the VF/PT and get them moved to
>>>>>> the new device being it virtio or VF/PT.
>>>>>
>>>>> I might be wrong but I don't see why we should worry about this
>>>>> usecase.
>>>>> Whoever has a bond configured already has working config for migration.
>>>>> We are trying to help people who don't, not convert existig users.
>>>>
>>>> That has been placed in the view of cloud providers that the imported
>>>> images from the store must be able to run unmodified thus no
>>>> additional setup script is allowed (just as Stephen mentioned in
>>>> another mail). Cloud users don't care about live migration themselves
>>>> but the providers are required to implement such automation mechanism
>>>> to make this process transparent if at all possible. The user does not
>>>> care about the device underneath being VF or not, but they do care
>>>> about consistency all across and the resulting performance
>>>> acceleration in making VF the prefered datapath. It is not quite
>>>> peculiar user cases but IMHO *any* approach proposed for live
>>>> migration should be able to persist the state including network config
>>>> e.g. as simple as MTU. Actually this requirement has nothing to do
>>>> with virtio but our target users are live migration agnostic, being it
>>>> tracking DMA through dirty pages, using virtio as the helper, or
>>>> whatsoever, the goal of persisting configs across remains same.
>>>
>>> So the patching being discussed here will mostly do exactly that if your
>>> original config was simply a single virtio net device.
>>>
>> True, but I don't see the patch being discussed starts with good
>> foundation of supporting the same for VF/PT device. That is the core
>> of the issue.
>
>
>>> What kind of configs do your users have right now?
>>
>> Any configs be it generic or driver specific that the VF/PT device
>> supports and have been enabled/configured. General network configs
>> (MAC, IP address, VLAN, MTU, iptables rules), ethtool settings
>> (hardware offload, # of queues and ring entris, RSC options, rss
>> rxfh-indir table, rx-flow-hash, et al) , bpf/XDP program being run, tc
>> flower offload, just to name a few. As cloud providers we don't limit
>> users from applying driver specific tuning to the NIC/VF, and
>> sometimes this is essential to achieving best performance for their
>> workload. We've seen cases like tuning coalescing parameters for
>> getting low latency, changing rx-flow-hash function for better VXLAN
>> throughput, or even adopting quite advanced NIC features such as flow
>> director or cloud filter. We don't expect users to compromise even a
>> little bit on these. That is once we turn on live migration for the VF
>> or pass through devices in the VM, it all takes place under the hood,
>> users (guest admins, applications) don't have to react upon it or even
>> notice the change. I should note that the majority of live migrations
>> take place between machines with completely identical hardware, it's
>> more critical than necessary to keep the config as-is across the move,
>> stealth while quiet.
>
>
> This usecase is much more complicated and different than what this patch is
> trying
> to address.

Yep, it is technically difficult, but as cloud providers we would like
to take actions to address use case for our own if no one else is
willing to do so. However we're not seeking complicated design or
messing up the others such as your use case. As this is the first time
a real patch of the PV failover approach, although having be discussed
for years, posted to the mailing lis

Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-26 Thread Siwei Liu
On Tue, Jan 23, 2018 at 2:58 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Tue, Jan 23, 2018 at 12:24:47PM -0800, Siwei Liu wrote:
>> On Mon, Jan 22, 2018 at 1:41 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
>> > On Mon, Jan 22, 2018 at 12:27:14PM -0800, Siwei Liu wrote:
>> >> First off, as mentioned in another thread, the model of stacking up
>> >> virt-bond functionality over virtio seems a wrong direction to me.
>> >> Essentially the migration process would need to carry over all guest
>> >> side configurations previously done on the VF/PT and get them moved to
>> >> the new device being it virtio or VF/PT.
>> >
>> > I might be wrong but I don't see why we should worry about this usecase.
>> > Whoever has a bond configured already has working config for migration.
>> > We are trying to help people who don't, not convert existig users.
>>
>> That has been placed in the view of cloud providers that the imported
>> images from the store must be able to run unmodified thus no
>> additional setup script is allowed (just as Stephen mentioned in
>> another mail). Cloud users don't care about live migration themselves
>> but the providers are required to implement such automation mechanism
>> to make this process transparent if at all possible. The user does not
>> care about the device underneath being VF or not, but they do care
>> about consistency all across and the resulting performance
>> acceleration in making VF the prefered datapath. It is not quite
>> peculiar user cases but IMHO *any* approach proposed for live
>> migration should be able to persist the state including network config
>> e.g. as simple as MTU. Actually this requirement has nothing to do
>> with virtio but our target users are live migration agnostic, being it
>> tracking DMA through dirty pages, using virtio as the helper, or
>> whatsoever, the goal of persisting configs across remains same.
>
> So the patching being discussed here will mostly do exactly that if your
> original config was simply a single virtio net device.
>

True, but I don't see the patch being discussed starts with good
foundation of supporting the same for VF/PT device. That is the core
of the issue.

>
> What kind of configs do your users have right now?

Any configs be it generic or driver specific that the VF/PT device
supports and have been enabled/configured. General network configs
(MAC, IP address, VLAN, MTU, iptables rules), ethtool settings
(hardware offload, # of queues and ring entris, RSC options, rss
rxfh-indir table, rx-flow-hash, et al) , bpf/XDP program being run, tc
flower offload, just to name a few. As cloud providers we don't limit
users from applying driver specific tuning to the NIC/VF, and
sometimes this is essential to achieving best performance for their
workload. We've seen cases like tuning coalescing parameters for
getting low latency, changing rx-flow-hash function for better VXLAN
throughput, or even adopting quite advanced NIC features such as flow
director or cloud filter. We don't expect users to compromise even a
little bit on these. That is once we turn on live migration for the VF
or pass through devices in the VM, it all takes place under the hood,
users (guest admins, applications) don't have to react upon it or even
notice the change. I should note that the majority of live migrations
take place between machines with completely identical hardware, it's
more critical than necessary to keep the config as-is across the move,
stealth while quiet.

As you see generic bond or bridge cannot suffice the need. That's why
we need a new customized virt bond driver, and tailor it for VM live
migration specifically. Leveraging para-virtual e.g. virtio net device
as the backup path is one thing, tracking through driver config
changes in order to re-config as necessary is another. I would think
without considering the latter, the proposal being discussed is rather
incomplete, and remote to be useful in production.

>
>
>> >
>> >> Without the help of a new
>> >> upper layer bond driver that enslaves virtio and VF/PT devices
>> >> underneath, virtio will be overloaded with too much specifics being a
>> >> VF/PT backup in the future.
>> >
>> > So this paragraph already includes at least two conflicting
>> > proposals. On the one hand you want a separate device for
>> > the virtual bond, on the other you are saying a separate
>> > driver.
>>
>> Just to be crystal clear: separate virtual bond device (netdev ops,
>> not necessarily bus device) for VM migration specifically with a
>> separate driver.
>
> Okay, but note that any config someone had on a virtio 

Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-23 Thread Siwei Liu
On Mon, Jan 22, 2018 at 1:41 PM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Mon, Jan 22, 2018 at 12:27:14PM -0800, Siwei Liu wrote:
>> First off, as mentioned in another thread, the model of stacking up
>> virt-bond functionality over virtio seems a wrong direction to me.
>> Essentially the migration process would need to carry over all guest
>> side configurations previously done on the VF/PT and get them moved to
>> the new device being it virtio or VF/PT.
>
> I might be wrong but I don't see why we should worry about this usecase.
> Whoever has a bond configured already has working config for migration.
> We are trying to help people who don't, not convert existig users.

That has been placed in the view of cloud providers that the imported
images from the store must be able to run unmodified thus no
additional setup script is allowed (just as Stephen mentioned in
another mail). Cloud users don't care about live migration themselves
but the providers are required to implement such automation mechanism
to make this process transparent if at all possible. The user does not
care about the device underneath being VF or not, but they do care
about consistency all across and the resulting performance
acceleration in making VF the prefered datapath. It is not quite
peculiar user cases but IMHO *any* approach proposed for live
migration should be able to persist the state including network config
e.g. as simple as MTU. Actually this requirement has nothing to do
with virtio but our target users are live migration agnostic, being it
tracking DMA through dirty pages, using virtio as the helper, or
whatsoever, the goal of persisting configs across remains same.

>
>> Without the help of a new
>> upper layer bond driver that enslaves virtio and VF/PT devices
>> underneath, virtio will be overloaded with too much specifics being a
>> VF/PT backup in the future.
>
> So this paragraph already includes at least two conflicting
> proposals. On the one hand you want a separate device for
> the virtual bond, on the other you are saying a separate
> driver.

Just to be crystal clear: separate virtual bond device (netdev ops,
not necessarily bus device) for VM migration specifically with a
separate driver.

>
> Further, the reason to have a separate *driver* was that
> some people wanted to share code with netvsc - and that
> one does not create a separate device, which you can't
> change without breaking existing configs.

I'm not sure I understand this statement. netvsc is already another
netdev being created than the enslaved VF netdev, why it bothers? In
the Azure case, the stock image to be imported does not bind to a
specific driver but only MAC address. And people just deal with the
new virt-bond netdev rather than the underlying virtio and VF. And
both these two underlying netdevs should be made invisible to prevent
userspace script from getting them misconfigured IMHO.

A separate driver was for code sharing for sure, only just netvsc but
could be other para-virtual devices floating around: any PV can serve
as the side channel and the backup path for VF/PT. Once we get the new
driver working atop virtio we may define ops and/or protocol needed to
talk to various other PV frontend that may implement the side channel
of its own for datapath switching (e.g. virtio is one of them, Xen PV
frontend can be another). I just don't like to limit the function to
virtio only and we have to duplicate code then it starts to scatter
around all over the places.

I understand right now we start it as simple so it may just be fine
that the initial development activities center around virtio. However,
from cloud provider/vendor perspective I don't see the proposed scheme
limits to virtio only. Any other PV driver which has the plan to
support the same scheme can benefit. The point is that we shouldn't be
limiting the scheme to virtio specifics so early which is hard to have
it promoted to a common driver once we get there.

>
> So some people want a fully userspace-configurable switchdev, and that
> already exists at some level, and maybe it makes sense to add more
> features for performance.
>
> But the point was that some host configurations are very simple,
> and it probably makes sense to pass this information to the guest
> and have guest act on it directly. Let's not conflate the two.

It may be fine to push some of the configurations from host but that
perhaps doesn't cover all the cases: how is it possible for the host
to save all network states and configs done by the guest before
migration. Some of the configs might come from future guest which is
unknown to host. Anyhow the bottom line is that the guest must be able
to act on those configuration request changes automatically without
involving users intervention.

Regards,
-Siwei

>
> --
> MST


Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-22 Thread Siwei Liu
First off, as mentioned in another thread, the model of stacking up
virt-bond functionality over virtio seems a wrong direction to me.
Essentially the migration process would need to carry over all guest
side configurations previously done on the VF/PT and get them moved to
the new device being it virtio or VF/PT. Without the help of a new
upper layer bond driver that enslaves virtio and VF/PT devices
underneath, virtio will be overloaded with too much specifics being a
VF/PT backup in the future. I hope you're already aware of the issue
in longer term and move to that model as soon as possible. See more
inline.

On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
 wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. The VF datapath is only used
> for unicast traffic. Broadcasts/multicasts go via virtio datapath so that
> east-west broadcasts don't use the PCI bandwidth.

Why not making an this an option/optimization rather than being the
only means? The problem of east-west broadcast eating PCI bandwidth
depends on specifics of the (virtual) network setup, while some users
won't want to lose VF's merits such as latency. Why restricting
broadcast/multicast xmit to virtio only which potentially regresses
the performance against raw VF?

> It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to unplug the VF device from the guest on the source
> host and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed, the
> destination hypervisor sets the MAC filter on the VF and plugs it back to
> the guest to switch over to VF datapath.

Is there a host side patch (planned) for this MAC filter switching
process? As said in another thread, that simple script won't work for
macvtap backend.

Thanks,
-Siwei

>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization=151189725224231=2
>
> Signed-off-by: Sridhar Samudrala 
> Reviewed-by: Jesse Brandeburg 
> ---
>  drivers/net/virtio_net.c | 307 
> ++-
>  1 file changed, 305 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f149a160a8c5..0e58d364fde9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>
> @@ -120,6 +122,15 @@ struct receive_queue {
> struct xdp_rxq_info xdp_rxq;
>  };
>
> +struct virtnet_vf_pcpu_stats {
> +   u64 rx_packets;
> +   u64 rx_bytes;
> +   u64 tx_packets;
> +   u64 tx_bytes;
> +   struct u64_stats_sync   syncp;
> +   u32 tx_dropped;
> +};
> +
>  struct virtnet_info {
> struct virtio_device *vdev;
> struct virtqueue *cvq;
> @@ -182,6 +193,10 @@ struct virtnet_info {
> u32 speed;
>
> unsigned long guest_offloads;
> +
> +   /* State to manage the associated VF interface. */
> +   struct net_device __rcu *vf_netdev;
> +   struct virtnet_vf_pcpu_stats __percpu *vf_stats;
>  };
>
>  struct padded_vnet_hdr {
> @@ -1314,16 +1329,53 @@ static int xmit_skb(struct send_queue *sq, struct 
> sk_buff *skb)
> return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
>  }
>
> +/* Send skb on the slave VF device. */
> +static int virtnet_vf_xmit(struct net_device *dev, struct net_device 
> *vf_netdev,
> +  struct sk_buff *skb)
> +{
> +   struct virtnet_info *vi = netdev_priv(dev);
> +   unsigned int len = skb->len;
> +   int rc;
> +
> +   skb->dev = vf_netdev;
> +   skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +   rc = dev_queue_xmit(skb);
> +   if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) {
> +   struct virtnet_vf_pcpu_stats *pcpu_stats
> +   = this_cpu_ptr(vi->vf_stats);
> +
> +   u64_stats_update_begin(_stats->syncp);
> +   pcpu_stats->tx_packets++;
> +   pcpu_stats->tx_bytes += len;
> +   u64_stats_update_end(_stats->syncp);
> +   } else {
> +   this_cpu_inc(vi->vf_stats->tx_dropped);
> +   }
> +
> +   return rc;
> +}
> +
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
> struct virtnet_info *vi = netdev_priv(dev);
> int qnum = skb_get_queue_mapping(skb);
> struct send_queue *sq = >sq[qnum];
> +   struct net_device *vf_netdev;
> int err;
> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> bool kick = 

Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device

2018-01-22 Thread Siwei Liu
Apologies I didn't notice that the discussion was mistakenly taken
offline. Post it back.

-Siwei

On Sat, Jan 13, 2018 at 7:25 AM, Siwei Liu <losewe...@gmail.com> wrote:
> On Thu, Jan 11, 2018 at 12:32 PM, Samudrala, Sridhar
> <sridhar.samudr...@intel.com> wrote:
>> On 1/8/2018 9:22 AM, Siwei Liu wrote:
>>>
>>> On Sat, Jan 6, 2018 at 2:33 AM, Samudrala, Sridhar
>>> <sridhar.samudr...@intel.com> wrote:
>>>>
>>>> On 1/5/2018 9:07 AM, Siwei Liu wrote:
>>>>>
>>>>> On Thu, Jan 4, 2018 at 8:22 AM, Samudrala, Sridhar
>>>>> <sridhar.samudr...@intel.com> wrote:
>>>>>>
>>>>>> On 1/3/2018 10:28 AM, Alexander Duyck wrote:
>>>>>>>
>>>>>>> On Wed, Jan 3, 2018 at 10:14 AM, Samudrala, Sridhar
>>>>>>> <sridhar.samudr...@intel.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/3/2018 8:59 AM, Alexander Duyck wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Jan 2, 2018 at 6:16 PM, Jakub Kicinski <kubak...@wp.pl>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue,  2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
>>>>>>>>>>>
>>>>>>>>>>> This patch series enables virtio to switch over to a VF datapath
>>>>>>>>>>> when
>>>>>>>>>>> a
>>>>>>>>>>> VF
>>>>>>>>>>> netdev is present with the same MAC address. It allows live
>>>>>>>>>>> migration
>>>>>>>>>>> of
>>>>>>>>>>> a VM
>>>>>>>>>>> with a direct attached VF without the need to setup a bond/team
>>>>>>>>>>> between
>>>>>>>>>>> a
>>>>>>>>>>> VF and virtio net device in the guest.
>>>>>>>>>>>
>>>>>>>>>>> The hypervisor needs to unplug the VF device from the guest on the
>>>>>>>>>>> source
>>>>>>>>>>> host and reset the MAC filter of the VF to initiate failover of
>>>>>>>>>>> datapath
>>>>>>>>>>> to
>>>>>>>>>>> virtio before starting the migration. After the migration is
>>>>>>>>>>> completed,
>>>>>>>>>>> the
>>>>>>>>>>> destination hypervisor sets the MAC filter on the VF and plugs it
>>>>>>>>>>> back
>>>>>>>>>>> to
>>>>>>>>>>> the guest to switch over to VF datapath.
>>>>>>>>>>>
>>>>>>>>>>> It is based on netvsc implementation and it may be possible to
>>>>>>>>>>> make
>>>>>>>>>>> this
>>>>>>>>>>> code
>>>>>>>>>>> generic and move it to a common location that can be shared by
>>>>>>>>>>> netvsc
>>>>>>>>>>> and virtio.
>>>>>>>>>>>
>>>>>>>>>>> This patch series is based on the discussion initiated by Jesse on
>>>>>>>>>>> this
>>>>>>>>>>> thread.
>>>>>>>>>>> https://marc.info/?l=linux-virtualization=151189725224231=2
>>>>>>>>>>
>>>>>>>>>> How does the notion of a device which is both a bond and a leg of a
>>>>>>>>>> bond fit with Alex's recent discussions about feature propagation?
>>>>>>>>>> Which propagation rules will apply to VirtIO master?  Meaning of
>>>>>>>>>> the
>>>>>>>>>> flags on a software upper device may be different.  Why muddy the
>>>>>>>>>> architecture like this and not introduce a synthetic bond device?
>>>>>>>>>
>>>>>>>>> It doesn't really fit with the notion I had. I think there may have
>>>>>>>>> been a bit of a disconnect as I have been out for the last week or
>>>>>

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-22 Thread Siwei Liu
On Wed, Dec 20, 2017 at 8:52 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
> On Wed, 20 Dec 2017 18:16:30 -0800, Siwei Liu wrote:
>> > The plan is to remove the delay and do the naming in the kernel.
>> > This was suggested by Lennart since udev is only doing naming policy
>> > because kernel names were not repeatable.
>> >
>> > This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
>> >
>> > Patch is pending.
>>
>> While it's good to show VF with specific naming to indicate
>> enslavement, I wonder wouldn't it be better to hide this netdev at all
>> from the user space? IMHO this extra device is useless when being
>> enslaved and we may delegate controls (e.g. ethtool) over to the
>> para-virtual device instead? That way it's possible to eliminate the
>> possibility of additional udev setup or modification?
>>
>> I'm not sure if this  is consistent with Windows guest or not, but I
>> don't find it _Linux_ user friendly that ethtool doesn't work on the
>> composite interface any more, and I have to end up with finding out
>> the correct enslaved VF I must operate on.
>
> Hiding "low level" netdevs comes up from time to time, and is more
> widely applicable than just to VF bonds.  We should find a generic
> solution to that problem.

Wholeheartedly agreed.

Be it a generic virtual bond or virtio-net specific one, there should
be some common code between netvsc and virtio for this type of work.
For avoiding duplicated bugs, consistent (Linux) user experience,
future code refactoring/management, and whatever...

One thing worth to note is that, unlike the Hyper-V netvsc backend
there's currently no equivalent (fine-grained) Linux ndo_* driver
interface that is able to move around MAC address/VLAN filters at
run-time specifically. The OID_RECEIVE_FILTER_MOVE_FILTER request I
mean. That translates to one substantial difference in VF
plumbing/unplumbing sequence: you cannot move the MAC address around
to paravirt device until VF is fully unplugged out of the guest OS. I
don't know what backend changes to be proposed for virtio-net as
helper, but the common code needs to work with both flavors of data
path switching backends and do its job correctly.

Regards,
-Siwei


Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-20 Thread Siwei Liu
On Tue, Dec 19, 2017 at 10:41 AM, Stephen Hemminger
 wrote:
> On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> David Miller  wrote:
>
>> From: Stephen Hemminger 
>> Date: Tue, 19 Dec 2017 09:55:48 -0800
>>
>> > could be 10ms, just enough to let udev do its renaming
>>
>> Please, move to some kind of notification or event based handling of
>> this problem.
>>
>> No delay is safe, what if userspace gets swapped out or whatever
>> else might make userspace stall unexpectedly?
>>
>
> The plan is to remove the delay and do the naming in the kernel.
> This was suggested by Lennart since udev is only doing naming policy
> because kernel names were not repeatable.
>
> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
>
> Patch is pending.

While it's good to show VF with specific naming to indicate
enslavement, I wonder wouldn't it be better to hide this netdev at all
from the user space? IMHO this extra device is useless when being
enslaved and we may delegate controls (e.g. ethtool) over to the
para-virtual device instead? That way it's possible to eliminate the
possibility of additional udev setup or modification?

I'm not sure if this  is consistent with Windows guest or not, but I
don't find it _Linux_ user friendly that ethtool doesn't work on the
composite interface any more, and I have to end up with finding out
the correct enslaved VF I must operate on.

Regards,
-Siwei