[patch net-next v4] net: virtio_net: implement exact header length guest feature

2023-03-09 Thread Jiri Pirko
From: Jiri Pirko 

Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
set implicates that device benefits from knowing the exact size
of the header. For compatibility, to signal to the device that
the header is reliable driver also needs to set this feature.
Without this feature set by driver, device has to figure
out the header size itself.

Quoting the original virtio spec:
"hdr_len is a hint to the device as to how much of the header needs to
 be kept to copy into each packet"

"a hint" might not be clear for the reader what does it mean, if it is
"maybe like that" of "exactly like that". This feature just makes it
crystal clear and let the device count on the hdr_len being filled up
by the exact length of header.

Also note the spec already has following note about hdr_len:
"Due to various bugs in implementations, this field is not useful
 as a guarantee of the transport header size."

Without this feature the device needs to parse the header in core
data path handling. Accurate information helps the device to eliminate
such header parsing and directly use the hardware accelerators
for GSO operation.

virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
The driver already complies to fill the correct value. Introduce the
feature and advertise it.

Note that virtio spec also includes following note for device
implementation:
"Caution should be taken by the implementation so as to prevent
 a malicious driver from attacking the device by setting
 an incorrect hdr_len."

There is a plan to support this feature in our emulated device.
A device of SolidRun offers this feature bit. They claim this feature
will save the device a few cycles for every GSO packet.

Link: 
https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-230006x3
Signed-off-by: Jiri Pirko 
Reviewed-by: Parav Pandit 
Reviewed-by: Alvaro Karsz 
Acked-by: Michael S. Tsirkin 
Acked-by: Willem de Bruijn 
---
v3->v4:
- fixed double "which when"
v2->v3:
- changed the first paragraph in patch description according to
  Michael's suggestion
- added Link tag with link to the spec
v1->v2:
- extended patch description
---
 drivers/net/virtio_net.c| 6 --
 include/uapi/linux/virtio_net.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fb5e68ed3ec2..e85b03988733 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_UFO,
VIRTIO_NET_F_GUEST_CSUM,
VIRTIO_NET_F_GUEST_USO4,
-   VIRTIO_NET_F_GUEST_USO6
+   VIRTIO_NET_F_GUEST_USO6,
+   VIRTIO_NET_F_GUEST_HDRLEN
 };
 
 #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
@@ -4213,7 +4214,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY, \
-   VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL
+   VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL, \
+   VIRTIO_NET_F_GUEST_HDRLEN
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index b4062bed186a..12c1c9699935 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -61,6 +61,7 @@
 #define VIRTIO_NET_F_GUEST_USO655  /* Guest can handle USOv6 in. */
 #define VIRTIO_NET_F_HOST_USO  56  /* Host can handle USO in. */
 #define VIRTIO_NET_F_HASH_REPORT  57   /* Supports hash report */
+#define VIRTIO_NET_F_GUEST_HDRLEN  59  /* Guest provides the exact hdr_len 
value. */
 #define VIRTIO_NET_F_RSS 60/* Supports RSS RX steering */
 #define VIRTIO_NET_F_RSC_EXT 61/* extended coalescing info */
 #define VIRTIO_NET_F_STANDBY 62/* Act as standby for another device
-- 
2.39.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next v3] net: virtio_net: implement exact header length guest feature

2023-02-22 Thread Jiri Pirko
Wed, Feb 22, 2023 at 12:24:18PM CET, m...@redhat.com wrote:
>On Wed, Feb 22, 2023 at 09:06:38AM +0100, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
>> which when 
>
>which when which when is probably unintentional :)

I copy-pasted your text :)

Anyway, I guess net-next is closed anyway. Will fix after merge
window closes.


>
>>set implicates that device benefits from knowing the exact
>> size of the header. For compatibility, to signal to the device that
>> the header is reliable driver also needs to set this feature.
>> Without this feature set by driver, device has to figure
>> out the header size itself.
>> 
>> Quoting the original virtio spec:
>> "hdr_len is a hint to the device as to how much of the header needs to
>>  be kept to copy into each packet"
>> 
>> "a hint" might not be clear for the reader what does it mean, if it is
>> "maybe like that" of "exactly like that". This feature just makes it
>> crystal clear and let the device count on the hdr_len being filled up
>> by the exact length of header.
>> 
>> Also note the spec already has following note about hdr_len:
>> "Due to various bugs in implementations, this field is not useful
>>  as a guarantee of the transport header size."
>> 
>> Without this feature the device needs to parse the header in core
>> data path handling. Accurate information helps the device to eliminate
>> such header parsing and directly use the hardware accelerators
>> for GSO operation.
>> 
>> virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> The driver already complies to fill the correct value. Introduce the
>> feature and advertise it.
>> 
>> Note that virtio spec also includes following note for device
>> implementation:
>> "Caution should be taken by the implementation so as to prevent
>>  a malicious driver from attacking the device by setting
>>  an incorrect hdr_len."
>> 
>> There is a plan to support this feature in our emulated device.
>> A device of SolidRun offers this feature bit. They claim this feature
>> will save the device a few cycles for every GSO packet.
>> 
>> Link: 
>> https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-230006x3
>> Signed-off-by: Jiri Pirko 
>> Reviewed-by: Parav Pandit 
>> Reviewed-by: Alvaro Karsz 
>> Acked-by: Michael S. Tsirkin 
>> ---
>> v2->v3:
>> - changed the first paragraph in patch description according to
>>   Michael's suggestion
>> - added Link tag with link to the spec
>> v1->v2:
>> - extended patch description
>> ---
>>  drivers/net/virtio_net.c| 6 --
>>  include/uapi/linux/virtio_net.h | 1 +
>>  2 files changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index fb5e68ed3ec2..e85b03988733 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
>>  VIRTIO_NET_F_GUEST_UFO,
>>  VIRTIO_NET_F_GUEST_CSUM,
>>  VIRTIO_NET_F_GUEST_USO4,
>> -VIRTIO_NET_F_GUEST_USO6
>> +VIRTIO_NET_F_GUEST_USO6,
>> +VIRTIO_NET_F_GUEST_HDRLEN
>>  };
>>  
>>  #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
>> @@ -4213,7 +4214,8 @@ static struct virtio_device_id id_table[] = {
>>  VIRTIO_NET_F_CTRL_MAC_ADDR, \
>>  VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>>  VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY, \
>> -VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL
>> +VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL, \
>> +VIRTIO_NET_F_GUEST_HDRLEN
>>  
>>  static unsigned int features[] = {
>>  VIRTNET_FEATURES,
>> diff --git a/include/uapi/linux/virtio_net.h 
>> b/include/uapi/linux/virtio_net.h
>> index b4062bed186a..12c1c9699935 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -61,6 +61,7 @@
>>  #define VIRTIO_NET_F_GUEST_USO6 55  /* Guest can handle USOv6 in. */
>>  #define VIRTIO_NET_F_HOST_USO   56  /* Host can handle USO in. */
>>  #define VIRTIO_NET_F_HASH_REPORT  57/* Supports hash report */
>> +#define VIRTIO_NET_F_GUEST_HDRLEN  59   /* Guest provides the exact 
>> hdr_len value. */
>>  #define VIRTIO_NET_F_RSS  60/* Supports RSS RX steering */
>>  #define VIRTIO_NET_F_RSC_EXT  61/* extended coalescing info */
>>  #define VIRTIO_NET_F_STANDBY  62/* Act as standby for another 
>> device
>> -- 
>> 2.39.0
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[patch net-next v3] net: virtio_net: implement exact header length guest feature

2023-02-22 Thread Jiri Pirko
From: Jiri Pirko 

Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
which when set implicates that device benefits from knowing the exact
size of the header. For compatibility, to signal to the device that
the header is reliable driver also needs to set this feature.
Without this feature set by driver, device has to figure
out the header size itself.

Quoting the original virtio spec:
"hdr_len is a hint to the device as to how much of the header needs to
 be kept to copy into each packet"

"a hint" might not be clear for the reader what does it mean, if it is
"maybe like that" of "exactly like that". This feature just makes it
crystal clear and let the device count on the hdr_len being filled up
by the exact length of header.

Also note the spec already has following note about hdr_len:
"Due to various bugs in implementations, this field is not useful
 as a guarantee of the transport header size."

Without this feature the device needs to parse the header in core
data path handling. Accurate information helps the device to eliminate
such header parsing and directly use the hardware accelerators
for GSO operation.

virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
The driver already complies to fill the correct value. Introduce the
feature and advertise it.

Note that virtio spec also includes following note for device
implementation:
"Caution should be taken by the implementation so as to prevent
 a malicious driver from attacking the device by setting
 an incorrect hdr_len."

There is a plan to support this feature in our emulated device.
A device of SolidRun offers this feature bit. They claim this feature
will save the device a few cycles for every GSO packet.

Link: 
https://docs.oasis-open.org/virtio/virtio/v1.2/cs01/virtio-v1.2-cs01.html#x1-230006x3
Signed-off-by: Jiri Pirko 
Reviewed-by: Parav Pandit 
Reviewed-by: Alvaro Karsz 
Acked-by: Michael S. Tsirkin 
---
v2->v3:
- changed the first paragraph in patch description according to
  Michael's suggestion
- added Link tag with link to the spec
v1->v2:
- extended patch description
---
 drivers/net/virtio_net.c| 6 --
 include/uapi/linux/virtio_net.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fb5e68ed3ec2..e85b03988733 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_UFO,
VIRTIO_NET_F_GUEST_CSUM,
VIRTIO_NET_F_GUEST_USO4,
-   VIRTIO_NET_F_GUEST_USO6
+   VIRTIO_NET_F_GUEST_USO6,
+   VIRTIO_NET_F_GUEST_HDRLEN
 };
 
 #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
@@ -4213,7 +4214,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY, \
-   VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL
+   VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL, \
+   VIRTIO_NET_F_GUEST_HDRLEN
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index b4062bed186a..12c1c9699935 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -61,6 +61,7 @@
 #define VIRTIO_NET_F_GUEST_USO655  /* Guest can handle USOv6 in. */
 #define VIRTIO_NET_F_HOST_USO  56  /* Host can handle USO in. */
 #define VIRTIO_NET_F_HASH_REPORT  57   /* Supports hash report */
+#define VIRTIO_NET_F_GUEST_HDRLEN  59  /* Guest provides the exact hdr_len 
value. */
 #define VIRTIO_NET_F_RSS 60/* Supports RSS RX steering */
 #define VIRTIO_NET_F_RSC_EXT 61/* extended coalescing info */
 #define VIRTIO_NET_F_STANDBY 62/* Act as standby for another device
-- 
2.39.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next v2] net: virtio_net: implement exact header length guest feature

2023-02-21 Thread Jiri Pirko
Tue, Feb 21, 2023 at 06:21:16PM CET, m...@redhat.com wrote:
>On Tue, Feb 21, 2023 at 03:47:41PM +0100, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
>> set implicates that the driver provides the exact size of the header.
>
>OK but I feel this is not the important point. The important points are:
>- this bit means device needs this info
>- driver also has to set this bit
>For example one might replace above with:
>
>   Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
>   which when set implicates that device benefits from knowing the exact
>   size of the header. For compatiblity, to signal to the device that the 
> header
>   is reliable driver also needs to set this feature.
>   Without this feature set by driver, device has to figure
>   out the header size itself.
>
>and the below is ok.
>
>> Quoting the original virtio spec:
>> "hdr_len is a hint to the device as to how much of the header needs to
>>  be kept to copy into each packet"
>> 
>> "a hint" might not be clear for the reader what does it mean, if it is
>> "maybe like that" of "exactly like that". This feature just makes it
>> crystal clear and let the device count on the hdr_len being filled up
>> by the exact length of header.
>> 
>> Also note the spec already has following note about hdr_len:
>> "Due to various bugs in implementations, this field is not useful
>>  as a guarantee of the transport header size."
>> 
>> Without this feature the device needs to parse the header in core
>> data path handling. Accurate information helps the device to eliminate
>> such header parsing and directly use the hardware accelerators
>> for GSO operation.
>> 
>> virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> The driver already complies to fill the correct value. Introduce the
>> feature and advertise it.
>> 
>> Note that virtio spec also includes following note for device
>> implementation:
>> "Caution should be taken by the implementation so as to prevent
>>  a malicious driver from attacking the device by setting
>>  an incorrect hdr_len."
>> 
>> There is a plan to support this feature in our emulated device.
>> A device of SolidRun offers this feature bit. They claim this feature
>> will save the device a few cycles for every GSO packet.
>> 
>> Signed-off-by: Jiri Pirko 
>
>I'm fine with patch itself. with commit log tweak:
>
>Acked-by: Michael S. Tsirkin 

Okay. Will do. I will put link to the spec as well

Thanks!

>
>
>> ---
>> v1->v2:
>> - extended patch description
>> ---
>>  drivers/net/virtio_net.c| 6 --
>>  include/uapi/linux/virtio_net.h | 1 +
>>  2 files changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index fb5e68ed3ec2..e85b03988733 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
>>  VIRTIO_NET_F_GUEST_UFO,
>>  VIRTIO_NET_F_GUEST_CSUM,
>>  VIRTIO_NET_F_GUEST_USO4,
>> -VIRTIO_NET_F_GUEST_USO6
>> +VIRTIO_NET_F_GUEST_USO6,
>> +VIRTIO_NET_F_GUEST_HDRLEN
>>  };
>>  
>>  #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
>> @@ -4213,7 +4214,8 @@ static struct virtio_device_id id_table[] = {
>>  VIRTIO_NET_F_CTRL_MAC_ADDR, \
>>  VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>>  VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY, \
>> -VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL
>> +VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL, \
>> +VIRTIO_NET_F_GUEST_HDRLEN
>>  
>>  static unsigned int features[] = {
>>  VIRTNET_FEATURES,
>> diff --git a/include/uapi/linux/virtio_net.h 
>> b/include/uapi/linux/virtio_net.h
>> index b4062bed186a..12c1c9699935 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -61,6 +61,7 @@
>>  #define VIRTIO_NET_F_GUEST_USO6 55  /* Guest can handle USOv6 in. */
>>  #define VIRTIO_NET_F_HOST_USO   56  /* Host can handle USO in. */
>>  #define VIRTIO_NET_F_HASH_REPORT  57/* Supports hash report */
>> +#define VIRTIO_NET_F_GUEST_HDRLEN  59   /* Guest provides the exact 
>> hdr_len value. */
>>  #define VIRTIO_NET_F_RSS  60/* Supports RSS RX steering */
>>  #define VIRTIO_NET_F_RSC_EXT  61/* extended coalescing info */
>>  #define VIRTIO_NET_F_STANDBY  62/* Act as standby for another 
>> device
>> -- 
>> 2.39.0
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next v2] net: virtio_net: implement exact header length guest feature

2023-02-21 Thread Jiri Pirko
Tue, Feb 21, 2023 at 05:12:33PM CET, willemdebruijn.ker...@gmail.com wrote:
>Jiri Pirko wrote:
>> Tue, Feb 21, 2023 at 04:11:53PM CET, willemdebruijn.ker...@gmail.com wrote:
>> >Jiri Pirko wrote:
>> >> From: Jiri Pirko 
>> >> 
>> >> Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
>> >> set implicates that the driver provides the exact size of the header.
>> >> 
>> >> Quoting the original virtio spec:
>> >> "hdr_len is a hint to the device as to how much of the header needs to
>> >>  be kept to copy into each packet"
>> >> 
>> >> "a hint" might not be clear for the reader what does it mean, if it is
>> >> "maybe like that" of "exactly like that". This feature just makes it
>> >> crystal clear and let the device count on the hdr_len being filled up
>> >> by the exact length of header.
>> >> 
>> >> Also note the spec already has following note about hdr_len:
>> >> "Due to various bugs in implementations, this field is not useful
>> >>  as a guarantee of the transport header size."
>> >> 
>> >> Without this feature the device needs to parse the header in core
>> >> data path handling. Accurate information helps the device to eliminate
>> >> such header parsing and directly use the hardware accelerators
>> >> for GSO operation.
>> >> 
>> >> virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> >> The driver already complies to fill the correct value. Introduce the
>> >> feature and advertise it.
>> >> 
>> >> Note that virtio spec also includes following note for device
>> >> implementation:
>> >> "Caution should be taken by the implementation so as to prevent
>> >>  a malicious driver from attacking the device by setting
>> >>  an incorrect hdr_len."
>> >> 
>> >> There is a plan to support this feature in our emulated device.
>> >> A device of SolidRun offers this feature bit. They claim this feature
>> >> will save the device a few cycles for every GSO packet.
>> >> 
>> >> Signed-off-by: Jiri Pirko 
>> >> ---
>> >> v1->v2:
>> >> - extended patch description
>> >
>> >Is the expectation that in-kernel devices support this feature, and
>> >if so how would it affect them? If I read the spec correctly, devices
>> 
>> Well, the tap driver actually trusts the hdr_len to be of correct header
>> size nowadays.
>
>tap_get_user performs basic bounds checking on the length passed.

Sure. It trusts the hdr_len, but it sanitizes the input.


> 
>> 
>> >still need to be careful against malicious drivers, so cannot assume
>> >much beyond what they do today (i.e., a hint).
>> 
>> Malicious how? There is upper limit of size in tap which is checked.
>> I assume that for hw implementation, that would be the same.
>
>A device cannot blindly trust a hdr_len passed from a driver. We have
>had bugs in the kernel with this before, such as the one fixed in
>commit 57031eb79490 ("packet: round up linear to header len").
>
>> But anyway, this discussion would be rather part of the spec/device
>> patch, don't you think?
>
>I disagree. If it's not much effort to make a commit self-documenting
>that is preferable. And if not, then an explicit reference to an
>authoratitive external reference is preferable over "it is trivial to
>look it up".

Sorry, I don't follow. What exactly do you want me to do?


> 
>> 
>> >
>> >Might be good to point to the definition commit:
>> >https://github.com/oasis-tcs/virtio-spec/commit/4f1981a1ff46b7aeb801c4c524ff76e93d9ce022
>> 
>> There were couple of fixes to the spec since then, that's why I didn't
>> include it. It is trivial to look it up in the spec.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next v2] net: virtio_net: implement exact header length guest feature

2023-02-21 Thread Jiri Pirko
Tue, Feb 21, 2023 at 04:11:53PM CET, willemdebruijn.ker...@gmail.com wrote:
>Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
>> set implicates that the driver provides the exact size of the header.
>> 
>> Quoting the original virtio spec:
>> "hdr_len is a hint to the device as to how much of the header needs to
>>  be kept to copy into each packet"
>> 
>> "a hint" might not be clear for the reader what does it mean, if it is
>> "maybe like that" of "exactly like that". This feature just makes it
>> crystal clear and let the device count on the hdr_len being filled up
>> by the exact length of header.
>> 
>> Also note the spec already has following note about hdr_len:
>> "Due to various bugs in implementations, this field is not useful
>>  as a guarantee of the transport header size."
>> 
>> Without this feature the device needs to parse the header in core
>> data path handling. Accurate information helps the device to eliminate
>> such header parsing and directly use the hardware accelerators
>> for GSO operation.
>> 
>> virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> The driver already complies to fill the correct value. Introduce the
>> feature and advertise it.
>> 
>> Note that virtio spec also includes following note for device
>> implementation:
>> "Caution should be taken by the implementation so as to prevent
>>  a malicious driver from attacking the device by setting
>>  an incorrect hdr_len."
>> 
>> There is a plan to support this feature in our emulated device.
>> A device of SolidRun offers this feature bit. They claim this feature
>> will save the device a few cycles for every GSO packet.
>> 
>> Signed-off-by: Jiri Pirko 
>> ---
>> v1->v2:
>> - extended patch description
>
>Is the expectation that in-kernel devices support this feature, and
>if so how would it affect them? If I read the spec correctly, devices

Well, the tap driver actually trusts the hdr_len to be of correct header
size nowadays.


>still need to be careful against malicious drivers, so cannot assume
>much beyond what they do today (i.e., a hint).

Malicious how? There is upper limit of size in tap which is checked.
I assume that for hw implementation, that would be the same.

But anyway, this discussion would be rather part of the spec/device
patch, don't you think?


>
>Might be good to point to the definition commit:
>https://github.com/oasis-tcs/virtio-spec/commit/4f1981a1ff46b7aeb801c4c524ff76e93d9ce022

There were couple of fixes to the spec since then, that's why I didn't
include it. It is trivial to look it up in the spec.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[patch net-next v2] net: virtio_net: implement exact header length guest feature

2023-02-21 Thread Jiri Pirko
From: Jiri Pirko 

Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
set implicates that the driver provides the exact size of the header.

Quoting the original virtio spec:
"hdr_len is a hint to the device as to how much of the header needs to
 be kept to copy into each packet"

"a hint" might not be clear for the reader what does it mean, if it is
"maybe like that" of "exactly like that". This feature just makes it
crystal clear and let the device count on the hdr_len being filled up
by the exact length of header.

Also note the spec already has following note about hdr_len:
"Due to various bugs in implementations, this field is not useful
 as a guarantee of the transport header size."

Without this feature the device needs to parse the header in core
data path handling. Accurate information helps the device to eliminate
such header parsing and directly use the hardware accelerators
for GSO operation.

virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
The driver already complies to fill the correct value. Introduce the
feature and advertise it.

Note that virtio spec also includes following note for device
implementation:
"Caution should be taken by the implementation so as to prevent
 a malicious driver from attacking the device by setting
 an incorrect hdr_len."

There is a plan to support this feature in our emulated device.
A device of SolidRun offers this feature bit. They claim this feature
will save the device a few cycles for every GSO packet.

Signed-off-by: Jiri Pirko 
---
v1->v2:
- extended patch description
---
 drivers/net/virtio_net.c| 6 --
 include/uapi/linux/virtio_net.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fb5e68ed3ec2..e85b03988733 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_UFO,
VIRTIO_NET_F_GUEST_CSUM,
VIRTIO_NET_F_GUEST_USO4,
-   VIRTIO_NET_F_GUEST_USO6
+   VIRTIO_NET_F_GUEST_USO6,
+   VIRTIO_NET_F_GUEST_HDRLEN
 };
 
 #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
@@ -4213,7 +4214,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY, \
-   VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL
+   VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL, \
+   VIRTIO_NET_F_GUEST_HDRLEN
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index b4062bed186a..12c1c9699935 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -61,6 +61,7 @@
 #define VIRTIO_NET_F_GUEST_USO655  /* Guest can handle USOv6 in. */
 #define VIRTIO_NET_F_HOST_USO  56  /* Host can handle USO in. */
 #define VIRTIO_NET_F_HASH_REPORT  57   /* Supports hash report */
+#define VIRTIO_NET_F_GUEST_HDRLEN  59  /* Guest provides the exact hdr_len 
value. */
 #define VIRTIO_NET_F_RSS 60/* Supports RSS RX steering */
 #define VIRTIO_NET_F_RSC_EXT 61/* extended coalescing info */
 #define VIRTIO_NET_F_STANDBY 62/* Act as standby for another device
-- 
2.39.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next] net: virtio_net: implement exact header length guest feature

2023-02-21 Thread Jiri Pirko
Tue, Feb 21, 2023 at 02:39:31PM CET, alvaro.ka...@solid-run.com wrote:
>Hi,
>
>Our device offers this feature bit as well.
>I don't have concrete numbers, but this feature will save our device a
>few cycles for every GSO packet.

Cool. I will try to include this into the patch description. Thanks!

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next] net: virtio_net: implement exact header length guest feature

2023-02-21 Thread Jiri Pirko
Mon, Feb 20, 2023 at 11:43:52PM CET, m...@redhat.com wrote:
>On Mon, Feb 20, 2023 at 02:56:08PM +0100, Jiri Pirko wrote:
>> Mon, Feb 20, 2023 at 01:55:33PM CET, m...@redhat.com wrote:
>> >On Mon, Feb 20, 2023 at 09:35:00AM +0100, Jiri Pirko wrote:
>> >> Fri, Feb 17, 2023 at 02:47:36PM CET, m...@redhat.com wrote:
>> >> >On Fri, Feb 17, 2023 at 01:53:55PM +0100, Jiri Pirko wrote:
>> >> >> Fri, Feb 17, 2023 at 01:22:01PM CET, m...@redhat.com wrote:
>> >> >> >On Fri, Feb 17, 2023 at 01:15:47PM +0100, Jiri Pirko wrote:
>> >> >> >> From: Jiri Pirko 
>> >> >> >> 
>> >> >> >> virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> >> >> >> 
>> >> >> >> Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which 
>> >> >> >> when
>> >> >> >> set implicates that the driver provides the exact size of the 
>> >> >> >> header.
>> >> >> >> 
>> >> >> >> The driver already complies to fill the correct value. Introduce the
>> >> >> >> feature and advertise it.
>> >> >> >> 
>> >> >> >> Signed-off-by: Jiri Pirko 
>> >> >> >
>> >> >> >Could you add a bit of motivation just for the record?
>> >> >> >Does this improve performance for some card? By how much?
>> >> >> >Expected to help some future card?
>> >> >> 
>> >> >> I can get that info, but isn't that rather something to be appended to
>> >> >> the virtio-spec patch? I mean, the feature is there, this is just
>> >> >> implementing it in one driver.
>> >> >
>> >> >It is more like using it in the driver.  It's not like we have to use
>> >> >everything - it could be useful for e.g. dpdk but not linux.
>> >> >Implementing it in the Linux driver has support costs - for example what
>> >> >if there's a bug and sometimes the length is incorrect?
>> >> >We'll be breaking things.
>> >> 
>> >> I understand. To my understanding this feature just fixes the original
>> >> ambiguity in the virtio spec.
>> >> 
>> >> Quoting the original virtio spec:
>> >> "hdr_len is a hint to the device as to how much of the header needs to
>> >>  be kept to copy into each packet"
>> >> 
>> >> "a hint" might not be clear for the reader what does it mean, if it is
>> >> "maybe like that" of "exactly like that". This feature just makes it
>> >> crystal clear.
>> >> 
>> >> If you look at the tap implementation, it uses hdr_len to alloc
>> >> skb linear part. No hint, it counts with the provided value.
>> >> So if the driver is currently not precise, it breaks tap.
>> >
>> >Well that's only for gso though right?
>> 
>> Yep.
>> 
>> 
>> >And making it bigger than necessary works fine ...
>> 
>> Well yeah. But tap does not do that, does it? it uses hdr_len directly.
>> 
>
>I mean if hdr_len is bigger than necessary tap does work.
>
>
>> >
>> >> I will add this to the patch description and send v2.
>> >> 
>> >
>> >I feel this does not answer the question yet, or maybe I am being dense.
>> >My point was not about making hdr_len precise.  My point was that we are
>> >making a change here for no apparent reason. I am guessing you are not
>> >doing it for fun - so why? Is there a device with this feature bit
>> >you are aware of?
>> 
>> Afaik real hw which does emulation of virtio_net would benefit from
>> that, our hw including.
>
>OK so do you have hardware which exposes this feature?

I believe it is not implemented yet, but it most certainly will be in
near future.

>That is the bit I am missing. Maybe mention the make
>in the commit log so
>we know where to turn if we need to make changes here?
>Or "under development" if it is not on the market yet.

Will put a note in the commit log.


>
>> 
>> >
>> >
>> >
>> >> 
>> >> >
>> >> >The patch was submitted by Marvell but they never bothered with
>> >> >using it in Linux. I guess they are using it for something else?
>> >> >CC Vitaly who put this in.
>> 

Re: [patch net-next] net: virtio_net: implement exact header length guest feature

2023-02-21 Thread Jiri Pirko
Tue, Feb 21, 2023 at 03:38:10AM CET, jasow...@redhat.com wrote:
>
>在 2023/2/20 21:56, Jiri Pirko 写道:
>> Mon, Feb 20, 2023 at 01:55:33PM CET, m...@redhat.com wrote:
>> > On Mon, Feb 20, 2023 at 09:35:00AM +0100, Jiri Pirko wrote:
>> > > Fri, Feb 17, 2023 at 02:47:36PM CET, m...@redhat.com wrote:
>> > > > On Fri, Feb 17, 2023 at 01:53:55PM +0100, Jiri Pirko wrote:
>> > > > > Fri, Feb 17, 2023 at 01:22:01PM CET, m...@redhat.com wrote:
>> > > > > > On Fri, Feb 17, 2023 at 01:15:47PM +0100, Jiri Pirko wrote:
>> > > > > > > From: Jiri Pirko 
>> > > > > > > 
>> > > > > > > virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> > > > > > > 
>> > > > > > > Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which 
>> > > > > > > when
>> > > > > > > set implicates that the driver provides the exact size of the 
>> > > > > > > header.
>> > > > > > > 
>> > > > > > > The driver already complies to fill the correct value. Introduce 
>> > > > > > > the
>> > > > > > > feature and advertise it.
>> > > > > > > 
>> > > > > > > Signed-off-by: Jiri Pirko 
>> > > > > > Could you add a bit of motivation just for the record?
>> > > > > > Does this improve performance for some card? By how much?
>> > > > > > Expected to help some future card?
>> > > > > I can get that info, but isn't that rather something to be appended 
>> > > > > to
>> > > > > the virtio-spec patch? I mean, the feature is there, this is just
>> > > > > implementing it in one driver.
>> > > > It is more like using it in the driver.  It's not like we have to use
>> > > > everything - it could be useful for e.g. dpdk but not linux.
>> > > > Implementing it in the Linux driver has support costs - for example 
>> > > > what
>> > > > if there's a bug and sometimes the length is incorrect?
>> > > > We'll be breaking things.
>> > > I understand. To my understanding this feature just fixes the original
>> > > ambiguity in the virtio spec.
>> > > 
>> > > Quoting the original virtio spec:
>> > > "hdr_len is a hint to the device as to how much of the header needs to
>> > >   be kept to copy into each packet"
>> > > 
>> > > "a hint" might not be clear for the reader what does it mean, if it is
>> > > "maybe like that" of "exactly like that". This feature just makes it
>> > > crystal clear.
>> > > 
>> > > If you look at the tap implementation, it uses hdr_len to alloc
>> > > skb linear part. No hint, it counts with the provided value.
>> > > So if the driver is currently not precise, it breaks tap.
>> > Well that's only for gso though right?
>> Yep.
>> 
>> 
>> > And making it bigger than necessary works fine ...
>> Well yeah. But tap does not do that, does it? it uses hdr_len directly.
>
>
>tap_get_user() limit the head length:
>
>
>static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
>    struct iov_iter *from, int noblock)
>{
>    int good_linear = SKB_MAX_HEAD(TAP_RESERVE);
>...
>
>
>> 
>> 
>> > > I will add this to the patch description and send v2.
>> > > 
>> > I feel this does not answer the question yet, or maybe I am being dense.
>> > My point was not about making hdr_len precise.  My point was that we are
>> > making a change here for no apparent reason. I am guessing you are not
>> > doing it for fun - so why? Is there a device with this feature bit
>> > you are aware of?
>> Afaik real hw which does emulation of virtio_net would benefit from
>> that, our hw including.
>
>
>Note that driver can choose to no negotiate this feature, so malicious
>drivers can still try to use illegal value.

That's probably why the spec says:
5.1.6.2.2
...
Note: Caution should be taken by the implementation so as to prevent a 
malicious driver from attacking
the device by setting an incorrect hdr_len.

And that is exactly what tun does by caping the linear size to
SKB_MAX_HEAD(TAP_RESERVE)



>
>Thanks
>
>
>> 
>> 
>> > 
>> > 
>> > > >

Re: [patch net-next] net: virtio_net: implement exact header length guest feature

2023-02-20 Thread Jiri Pirko
Mon, Feb 20, 2023 at 01:55:33PM CET, m...@redhat.com wrote:
>On Mon, Feb 20, 2023 at 09:35:00AM +0100, Jiri Pirko wrote:
>> Fri, Feb 17, 2023 at 02:47:36PM CET, m...@redhat.com wrote:
>> >On Fri, Feb 17, 2023 at 01:53:55PM +0100, Jiri Pirko wrote:
>> >> Fri, Feb 17, 2023 at 01:22:01PM CET, m...@redhat.com wrote:
>> >> >On Fri, Feb 17, 2023 at 01:15:47PM +0100, Jiri Pirko wrote:
>> >> >> From: Jiri Pirko 
>> >> >> 
>> >> >> virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> >> >> 
>> >> >> Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
>> >> >> set implicates that the driver provides the exact size of the header.
>> >> >> 
>> >> >> The driver already complies to fill the correct value. Introduce the
>> >> >> feature and advertise it.
>> >> >> 
>> >> >> Signed-off-by: Jiri Pirko 
>> >> >
>> >> >Could you add a bit of motivation just for the record?
>> >> >Does this improve performance for some card? By how much?
>> >> >Expected to help some future card?
>> >> 
>> >> I can get that info, but isn't that rather something to be appended to
>> >> the virtio-spec patch? I mean, the feature is there, this is just
>> >> implementing it in one driver.
>> >
>> >It is more like using it in the driver.  It's not like we have to use
>> >everything - it could be useful for e.g. dpdk but not linux.
>> >Implementing it in the Linux driver has support costs - for example what
>> >if there's a bug and sometimes the length is incorrect?
>> >We'll be breaking things.
>> 
>> I understand. To my understanding this feature just fixes the original
>> ambiguity in the virtio spec.
>> 
>> Quoting the original virtio spec:
>> "hdr_len is a hint to the device as to how much of the header needs to
>>  be kept to copy into each packet"
>> 
>> "a hint" might not be clear for the reader what does it mean, if it is
>> "maybe like that" of "exactly like that". This feature just makes it
>> crystal clear.
>> 
>> If you look at the tap implementation, it uses hdr_len to alloc
>> skb linear part. No hint, it counts with the provided value.
>> So if the driver is currently not precise, it breaks tap.
>
>Well that's only for gso though right?

Yep.


>And making it bigger than necessary works fine ...

Well yeah. But tap does not do that, does it? it uses hdr_len directly.


>
>> I will add this to the patch description and send v2.
>> 
>
>I feel this does not answer the question yet, or maybe I am being dense.
>My point was not about making hdr_len precise.  My point was that we are
>making a change here for no apparent reason. I am guessing you are not
>doing it for fun - so why? Is there a device with this feature bit
>you are aware of?

Afaik real hw which does emulation of virtio_net would benefit from
that, our hw including.


>
>
>
>> 
>> >
>> >The patch was submitted by Marvell but they never bothered with
>> >using it in Linux. I guess they are using it for something else?
>> >CC Vitaly who put this in.
>> >
>> >> 
>> >> >
>> >> >thanks!
>> >> >
>> >> >
>> >> >> ---
>> >> >>  drivers/net/virtio_net.c| 6 --
>> >> >>  include/uapi/linux/virtio_net.h | 1 +
>> >> >>  2 files changed, 5 insertions(+), 2 deletions(-)
>> >> >> 
>> >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> >> index fb5e68ed3ec2..e85b03988733 100644
>> >> >> --- a/drivers/net/virtio_net.c
>> >> >> +++ b/drivers/net/virtio_net.c
>> >> >> @@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
>> >> >>VIRTIO_NET_F_GUEST_UFO,
>> >> >>VIRTIO_NET_F_GUEST_CSUM,
>> >> >>VIRTIO_NET_F_GUEST_USO4,
>> >> >> -  VIRTIO_NET_F_GUEST_USO6
>> >> >> +  VIRTIO_NET_F_GUEST_USO6,
>> >> >> +  VIRTIO_NET_F_GUEST_HDRLEN
>> >> >>  };
>> >> >>  
>> >> >>  #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) 
>> >> >> | \
>> >> >> @@ -4213,7 +4214,8 @@ static struct virtio_device_id id

Re: [patch net-next] net: virtio_net: implement exact header length guest feature

2023-02-20 Thread Jiri Pirko
Fri, Feb 17, 2023 at 02:47:36PM CET, m...@redhat.com wrote:
>On Fri, Feb 17, 2023 at 01:53:55PM +0100, Jiri Pirko wrote:
>> Fri, Feb 17, 2023 at 01:22:01PM CET, m...@redhat.com wrote:
>> >On Fri, Feb 17, 2023 at 01:15:47PM +0100, Jiri Pirko wrote:
>> >> From: Jiri Pirko 
>> >> 
>> >> virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> >> 
>> >> Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
>> >> set implicates that the driver provides the exact size of the header.
>> >> 
>> >> The driver already complies to fill the correct value. Introduce the
>> >> feature and advertise it.
>> >> 
>> >> Signed-off-by: Jiri Pirko 
>> >
>> >Could you add a bit of motivation just for the record?
>> >Does this improve performance for some card? By how much?
>> >Expected to help some future card?
>> 
>> I can get that info, but isn't that rather something to be appended to
>> the virtio-spec patch? I mean, the feature is there, this is just
>> implementing it in one driver.
>
>It is more like using it in the driver.  It's not like we have to use
>everything - it could be useful for e.g. dpdk but not linux.
>Implementing it in the Linux driver has support costs - for example what
>if there's a bug and sometimes the length is incorrect?
>We'll be breaking things.

I understand. To my understanding this feature just fixes the original
ambiguity in the virtio spec.

Quoting the original virtio spec:
"hdr_len is a hint to the device as to how much of the header needs to
 be kept to copy into each packet"

"a hint" might not be clear for the reader what does it mean, if it is
"maybe like that" of "exactly like that". This feature just makes it
crystal clear.

If you look at the tap implementation, it uses hdr_len to alloc
skb linear part. No hint, it counts with the provided value.
So if the driver is currently not precise, it breaks tap.

I will add this to the patch description and send v2.


>
>The patch was submitted by Marvell but they never bothered with
>using it in Linux. I guess they are using it for something else?
>CC Vitaly who put this in.
>
>> 
>> >
>> >thanks!
>> >
>> >
>> >> ---
>> >>  drivers/net/virtio_net.c| 6 --
>> >>  include/uapi/linux/virtio_net.h | 1 +
>> >>  2 files changed, 5 insertions(+), 2 deletions(-)
>> >> 
>> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> index fb5e68ed3ec2..e85b03988733 100644
>> >> --- a/drivers/net/virtio_net.c
>> >> +++ b/drivers/net/virtio_net.c
>> >> @@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
>> >>   VIRTIO_NET_F_GUEST_UFO,
>> >>   VIRTIO_NET_F_GUEST_CSUM,
>> >>   VIRTIO_NET_F_GUEST_USO4,
>> >> - VIRTIO_NET_F_GUEST_USO6
>> >> + VIRTIO_NET_F_GUEST_USO6,
>> >> + VIRTIO_NET_F_GUEST_HDRLEN
>> >>  };
>> >>  
>> >>  #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
>> >> @@ -4213,7 +4214,8 @@ static struct virtio_device_id id_table[] = {
>> >>   VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> >>   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> >>   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY, \
>> >> - VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL
>> >> + VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL, \
>> >> + VIRTIO_NET_F_GUEST_HDRLEN
>> >>  
>> >>  static unsigned int features[] = {
>> >>   VIRTNET_FEATURES,
>> >> diff --git a/include/uapi/linux/virtio_net.h 
>> >> b/include/uapi/linux/virtio_net.h
>> >> index b4062bed186a..12c1c9699935 100644
>> >> --- a/include/uapi/linux/virtio_net.h
>> >> +++ b/include/uapi/linux/virtio_net.h
>> >> @@ -61,6 +61,7 @@
>> >>  #define VIRTIO_NET_F_GUEST_USO6  55  /* Guest can handle USOv6 in. */
>> >>  #define VIRTIO_NET_F_HOST_USO56  /* Host can handle USO in. */
>> >>  #define VIRTIO_NET_F_HASH_REPORT  57 /* Supports hash report */
>> >> +#define VIRTIO_NET_F_GUEST_HDRLEN  59/* Guest provides the exact 
>> >> hdr_len value. */
>> >>  #define VIRTIO_NET_F_RSS   60/* Supports RSS RX steering */
>> >>  #define VIRTIO_NET_F_RSC_EXT   61/* extended coalescing info */
>> >>  #define VIRTIO_NET_F_STANDBY   62/* Act as standby for another 
>> >> device
>> >> -- 
>> >> 2.39.0
>> >
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next] net: virtio_net: implement exact header length guest feature

2023-02-17 Thread Jiri Pirko
Fri, Feb 17, 2023 at 01:22:01PM CET, m...@redhat.com wrote:
>On Fri, Feb 17, 2023 at 01:15:47PM +0100, Jiri Pirko wrote:
>> From: Jiri Pirko 
>> 
>> virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).
>> 
>> Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
>> set implicates that the driver provides the exact size of the header.
>> 
>> The driver already complies to fill the correct value. Introduce the
>> feature and advertise it.
>> 
>> Signed-off-by: Jiri Pirko 
>
>Could you add a bit of motivation just for the record?
>Does this improve performance for some card? By how much?
>Expected to help some future card?

I can get that info, but isn't that rather something to be appended to
the virtio-spec patch? I mean, the feature is there, this is just
implementing it in one driver.


>
>thanks!
>
>
>> ---
>>  drivers/net/virtio_net.c| 6 --
>>  include/uapi/linux/virtio_net.h | 1 +
>>  2 files changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index fb5e68ed3ec2..e85b03988733 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
>>  VIRTIO_NET_F_GUEST_UFO,
>>  VIRTIO_NET_F_GUEST_CSUM,
>>  VIRTIO_NET_F_GUEST_USO4,
>> -VIRTIO_NET_F_GUEST_USO6
>> +VIRTIO_NET_F_GUEST_USO6,
>> +VIRTIO_NET_F_GUEST_HDRLEN
>>  };
>>  
>>  #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
>> @@ -4213,7 +4214,8 @@ static struct virtio_device_id id_table[] = {
>>  VIRTIO_NET_F_CTRL_MAC_ADDR, \
>>  VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>>  VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY, \
>> -VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL
>> +VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL, \
>> +VIRTIO_NET_F_GUEST_HDRLEN
>>  
>>  static unsigned int features[] = {
>>  VIRTNET_FEATURES,
>> diff --git a/include/uapi/linux/virtio_net.h 
>> b/include/uapi/linux/virtio_net.h
>> index b4062bed186a..12c1c9699935 100644
>> --- a/include/uapi/linux/virtio_net.h
>> +++ b/include/uapi/linux/virtio_net.h
>> @@ -61,6 +61,7 @@
>>  #define VIRTIO_NET_F_GUEST_USO6 55  /* Guest can handle USOv6 in. */
>>  #define VIRTIO_NET_F_HOST_USO   56  /* Host can handle USO in. */
>>  #define VIRTIO_NET_F_HASH_REPORT  57/* Supports hash report */
>> +#define VIRTIO_NET_F_GUEST_HDRLEN  59   /* Guest provides the exact 
>> hdr_len value. */
>>  #define VIRTIO_NET_F_RSS  60/* Supports RSS RX steering */
>>  #define VIRTIO_NET_F_RSC_EXT  61/* extended coalescing info */
>>  #define VIRTIO_NET_F_STANDBY  62/* Act as standby for another 
>> device
>> -- 
>> 2.39.0
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[patch net-next] net: virtio_net: implement exact header length guest feature

2023-02-17 Thread Jiri Pirko
From: Jiri Pirko 

virtio_net_hdr_from_skb() fills up hdr_len to skb_headlen(skb).

Virtio spec introduced a feature VIRTIO_NET_F_GUEST_HDRLEN which when
set implicates that the driver provides the exact size of the header.

The driver already complies to fill the correct value. Introduce the
feature and advertise it.

Signed-off-by: Jiri Pirko 
---
 drivers/net/virtio_net.c| 6 --
 include/uapi/linux/virtio_net.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fb5e68ed3ec2..e85b03988733 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -62,7 +62,8 @@ static const unsigned long guest_offloads[] = {
VIRTIO_NET_F_GUEST_UFO,
VIRTIO_NET_F_GUEST_CSUM,
VIRTIO_NET_F_GUEST_USO4,
-   VIRTIO_NET_F_GUEST_USO6
+   VIRTIO_NET_F_GUEST_USO6,
+   VIRTIO_NET_F_GUEST_HDRLEN
 };
 
 #define GUEST_OFFLOAD_GRO_HW_MASK ((1ULL << VIRTIO_NET_F_GUEST_TSO4) | \
@@ -4213,7 +4214,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY, \
-   VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL
+   VIRTIO_NET_F_RSS, VIRTIO_NET_F_HASH_REPORT, VIRTIO_NET_F_NOTF_COAL, \
+   VIRTIO_NET_F_GUEST_HDRLEN
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index b4062bed186a..12c1c9699935 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -61,6 +61,7 @@
 #define VIRTIO_NET_F_GUEST_USO655  /* Guest can handle USOv6 in. */
 #define VIRTIO_NET_F_HOST_USO  56  /* Host can handle USO in. */
 #define VIRTIO_NET_F_HASH_REPORT  57   /* Supports hash report */
+#define VIRTIO_NET_F_GUEST_HDRLEN  59  /* Guest provides the exact hdr_len 
value. */
 #define VIRTIO_NET_F_RSS 60/* Supports RSS RX steering */
 #define VIRTIO_NET_F_RSC_EXT 61/* extended coalescing info */
 #define VIRTIO_NET_F_STANDBY 62/* Act as standby for another device
-- 
2.39.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] virtio-net: Maintain reverse cleanup order

2023-02-02 Thread Jiri Pirko
Thu, Feb 02, 2023 at 04:10:56PM CET, pa...@nvidia.com wrote:
>
>> From: Jiri Pirko 
>> Sent: Thursday, February 2, 2023 7:26 AM
>> 
>> Thu, Feb 02, 2023 at 06:00:38AM CET, pa...@nvidia.com wrote:
>> >To easily audit the code, better to keep the device stop() sequence to
>> >be mirror of the device open() sequence.
>> >
>> >Signed-off-by: Parav Pandit 
>> 
>> Reviewed-by: Jiri Pirko 
>> 
>> If this is not fixing bug (which I believe is the case), you should target 
>> it to net-
>> next ([patch net-next] ..).
>> 
>Yes. Right. First one was fix for net-rc, second was for net-next. And 2nd 
>depends on the first to avoid merge conflicts.
>So, I was unsure how to handle it.
>Can you please suggest?

1) Send the fix to -net
2) Wait until -net is merged into -net-next
3) Send the second patch to -net-next

>
>
>> 
>> >---
>> > drivers/net/virtio_net.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> >diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index
>> >b7d0b54c3bb0..1f8168e0f64d 100644
>> >--- a/drivers/net/virtio_net.c
>> >+++ b/drivers/net/virtio_net.c
>> >@@ -2279,9 +2279,9 @@ static int virtnet_close(struct net_device *dev)
>> >cancel_delayed_work_sync(>refill);
>> >
>> >for (i = 0; i < vi->max_queue_pairs; i++) {
>> >+   virtnet_napi_tx_disable(>sq[i].napi);
>> >napi_disable(>rq[i].napi);
>> >xdp_rxq_info_unreg(>rq[i].xdp_rxq);
>> >-   virtnet_napi_tx_disable(>sq[i].napi);
>> >}
>> >
>> >return 0;
>> >--
>> >2.26.2
>> >
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] virtio-net: Maintain reverse cleanup order

2023-02-02 Thread Jiri Pirko
Thu, Feb 02, 2023 at 06:00:38AM CET, pa...@nvidia.com wrote:
>To easily audit the code, better to keep the device stop()
>sequence to be mirror of the device open() sequence.
>
>Signed-off-by: Parav Pandit 

Reviewed-by: Jiri Pirko 

If this is not fixing bug (which I believe is the case), you should
target it to net-next ([patch net-next] ..).


>---
> drivers/net/virtio_net.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>index b7d0b54c3bb0..1f8168e0f64d 100644
>--- a/drivers/net/virtio_net.c
>+++ b/drivers/net/virtio_net.c
>@@ -2279,9 +2279,9 @@ static int virtnet_close(struct net_device *dev)
>   cancel_delayed_work_sync(>refill);
> 
>   for (i = 0; i < vi->max_queue_pairs; i++) {
>+  virtnet_napi_tx_disable(>sq[i].napi);
>   napi_disable(>rq[i].napi);
>   xdp_rxq_info_unreg(>rq[i].xdp_rxq);
>-  virtnet_napi_tx_disable(>sq[i].napi);
>   }
> 
>   return 0;
>-- 
>2.26.2
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2] virtio-net: Keep stop() to follow mirror sequence of open()

2023-02-02 Thread Jiri Pirko
Thu, Feb 02, 2023 at 06:00:37AM CET, pa...@nvidia.com wrote:
>Cited commit in fixes tag frees rxq xdp info while RQ NAPI is
>still enabled and packet processing may be ongoing.
>
>Follow the mirror sequence of open() in the stop() callback.
>This ensures that when rxq info is unregistered, no rx
>packet processing is ongoing.
>
>Fixes: 754b8a21a96d ("virtio_net: setup xdp_rxq_info")
>Signed-off-by: Parav Pandit 

Reviewed-by: Jiri Pirko 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next 10/19] pds_core: devlink params for enabling VIF support

2022-11-29 Thread Jiri Pirko
Tue, Nov 29, 2022 at 12:39:22AM CET, k...@kernel.org wrote:
>On Tue, 29 Nov 2022 00:29:42 +0100 Andrew Lunn wrote:
>> > How about:
>> >DEVLINK_PARAM_GENERIC_ID_ENABLE_LIVE_MIGRATION  
>> 
>> Much better.
>
>+1, although I care much less about the define name which is stupidly
>long anyway and more about the actual value that the user will see

We have patches that introduce live migration as a generic port function
capability bit. It is an attribute of the function.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net v4] failover: allow name change on IFF_UP slave interfaces

2019-03-29 Thread Jiri Pirko
Fri, Mar 29, 2019 at 04:15:12PM CET, step...@networkplumber.org wrote:
>On Thu, 28 Mar 2019 19:47:27 -0400
>Si-Wei Liu  wrote:
>
>> +if (unlikely(dev->flags & IFF_UP)) {
>> +struct netdev_notifier_change_info change_info;
>> +
>> +change_info.flags_changed = 0;
>
>Simpler to use structure initialization, which also avoid any chance
>of unititialized fields.
>
>   struct netdev_notifier_change_info change_info
>   = { .flags_changed =  0 };
 
In fact, you can do just:
struct netdev_notifier_change_info change_info = {};
to achieve the same.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net v3] failover: allow name change on IFF_UP slave interfaces

2019-03-27 Thread Jiri Pirko
Wed, Mar 27, 2019 at 12:48:13AM CET, si-wei@oracle.com wrote:
>When a netdev appears through hot plug then gets enslaved by a failover
>master that is already up and running, the slave will be opened
>right away after getting enslaved. Today there's a race that userspace
>(udev) may fail to rename the slave if the kernel (net_failover)
>opens the slave earlier than when the userspace rename happens.
>Unlike bond or team, the primary slave of failover can't be renamed by
>userspace ahead of time, since the kernel initiated auto-enslavement is
>unable to, or rather, is never meant to be synchronized with the rename
>request from userspace.
>
>As the failover slave interfaces are not designed to be operated
>directly by userspace apps: IP configuration, filter rules with
>regard to network traffic passing and etc., should all be done on master
>interface. In general, userspace apps only care about the
>name of master interface, while slave names are less important as long
>as admin users can see reliable names that may carry
>other information describing the netdev. For e.g., they can infer that
>"ens3nsby" is a standby slave of "ens3", while for a
>name like "eth0" they can't tell which master it belongs to.
>
>Historically the name of IFF_UP interface can't be changed because
>there might be admin script or management software that is already
>relying on such behavior and assumes that the slave name can't be
>changed once UP. But failover is special: with the in-kernel
>auto-enslavement mechanism, the userspace expectation for device
>enumeration and bring-up order is already broken. Previously initramfs
>and various userspace config tools were modified to bypass failover
>slaves because of auto-enslavement and duplicate MAC address. Similarly,
>in case that users care about seeing reliable slave name, the new type
>of failover slaves needs to be taken care of specifically in userspace
>anyway.
>
>It's less risky to lift up the rename restriction on failover slave
>which is already UP. Although it's possible this change may potentially
>break userspace component (most likely configuration scripts or
>management software) that assumes slave name can't be changed while
>UP, it's relatively a limited and controllable set among all userspace
>components, which can be fixed specifically to listen for the rename
>and/or link down/up events on failover slaves. Userspace component
>interacting with slaves is expected to be changed to operate on failover
>master interface instead, as the failover slave is dynamic in nature
>which may come and go at any point.  The goal is to make the role of
>failover slaves less relevant, and userspace components should only
>deal with failover master in the long run.
>
>Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
>Signed-off-by: Si-Wei Liu 
>Reviewed-by: Liran Alon 
>
>--
>v1 -> v2:
>- Drop configurable module parameter (Sridhar)
>
>v2 -> v3:
>- Drop additional IFF_SLAVE_RENAME_OK flag (Sridhar)
>- Send down and up events around rename (Michael S. Tsirkin)
>---
> net/core/dev.c | 37 ++---
> 1 file changed, 34 insertions(+), 3 deletions(-)
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 722d50d..3e0cd80 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -1171,6 +1171,7 @@ int dev_get_valid_name(struct net *net, struct 
>net_device *dev,
> int dev_change_name(struct net_device *dev, const char *newname)
> {
>   unsigned char old_assign_type;
>+  bool reopen_needed = false;
>   char oldname[IFNAMSIZ];
>   int err = 0;
>   int ret;
>@@ -1180,8 +1181,24 @@ int dev_change_name(struct net_device *dev, const char 
>*newname)
>   BUG_ON(!dev_net(dev));
> 
>   net = dev_net(dev);
>-  if (dev->flags & IFF_UP)
>-  return -EBUSY;
>+
>+  /* Allow failover slave to rename even when
>+   * it is up and running.
>+   *
>+   * Failover slaves are special, since userspace
>+   * might rename the slave after the interface
>+   * has been brought up and running due to
>+   * auto-enslavement.
>+   *
>+   * Failover users don't actually care about slave
>+   * name change, as they are only expected to operate
>+   * on master interface directly.
>+   */
>+  if (dev->flags & IFF_UP) {
>+  if (likely(!(dev->priv_flags & IFF_FAILOVER_SLAVE)))
>+  return -EBUSY;
>+  reopen_needed = true;
>+  }
> 
>   write_seqcount_begin(_rename_seq);
> 
>@@ -1198,6 +1215,9 @@ int dev_change_name(struct net_device *dev, const char 
>*newname)
>   return err;
>   }
> 
>+  if (reopen_needed)
>+  dev_close(dev);

Ugh. Don't dev_close/dev_open on name change.


>+
>   if (oldname[0] && !strchr(oldname, '%'))
>   netdev_info(dev, "renamed from %s\n", oldname);
> 
>@@ -1210,7 +1230,9 @@ int dev_change_name(struct net_device *dev, const char 
>*newname)
>   

Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-06 Thread Jiri Pirko
Tue, Mar 05, 2019 at 01:50:59AM CET, si-wei@oracle.com wrote:
>When a netdev appears through hot plug then gets enslaved by a failover
>master that is already up and running, the slave will be opened
>right away after getting enslaved. Today there's a race that userspace
>(udev) may fail to rename the slave if the kernel (net_failover)
>opens the slave earlier than when the userspace rename happens.
>Unlike bond or team, the primary slave of failover can't be renamed by
>userspace ahead of time, since the kernel initiated auto-enslavement is
>unable to, or rather, is never meant to be synchronized with the rename
>request from userspace.
>
>As the failover slave interfaces are not designed to be operated
>directly by userspace apps: IP configuration, filter rules with
>regard to network traffic passing and etc., should all be done on master
>interface. In general, userspace apps only care about the
>name of master interface, while slave names are less important as long
>as admin users can see reliable names that may carry
>other information describing the netdev. For e.g., they can infer that
>"ens3nsby" is a standby slave of "ens3", while for a
>name like "eth0" they can't tell which master it belongs to.
>
>Historically the name of IFF_UP interface can't be changed because
>there might be admin script or management software that is already
>relying on such behavior and assumes that the slave name can't be
>changed once UP. But failover is special: with the in-kernel
>auto-enslavement mechanism, the userspace expectation for device
>enumeration and bring-up order is already broken. Previously initramfs
>and various userspace config tools were modified to bypass failover
>slaves because of auto-enslavement and duplicate MAC address. Similarly,
>in case that users care about seeing reliable slave name, the new type
>of failover slaves needs to be taken care of specifically in userspace
>anyway.
>
>For that to work, now introduce a module-level tunable,
>"slave_rename_ok" that allows users to lift up the rename restriction on
>failover slave which is already UP. Although it's possible this change
>potentially break userspace component (most likely configuration scripts
>or management software) that assumes slave name can't be changed while
>UP, it's relatively a limited and controllable set among all userspace
>components, which can be fixed specifically to work with the new naming
>behavior of the failover slave. Userspace component interacting with
>slaves should be changed to operate on failover master instead, as the
>failover slave is dynamic in nature which may come and go at any point.
>The goal is to make the role of failover slaves less relevant, and
>all userspace should only deal with master in the long run. The default
>for the "slave_rename_ok" is set to true(1). If userspace doesn't have
>the right support in place meanwhile users don't care about reliable
>userspace naming, the value can be set to false(0).
>
>Signed-off-by: si-wei@oracle.com
>Reviewed-by: Liran Alon 
>---
> include/linux/netdevice.h |  3 +++
> net/core/dev.c|  3 ++-
> net/core/failover.c   | 11 +--
> 3 files changed, 14 insertions(+), 3 deletions(-)
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index 857f8ab..6d9e4e0 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1487,6 +1487,7 @@ struct net_device_ops {
>  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
>  * @IFF_FAILOVER: device is a failover master device
>  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>+ * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
>  */
> enum netdev_priv_flags {
>   IFF_802_1Q_VLAN = 1<<0,
>@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
>   IFF_NO_RX_HANDLER   = 1<<26,
>   IFF_FAILOVER= 1<<27,
>   IFF_FAILOVER_SLAVE  = 1<<28,
>+  IFF_SLAVE_RENAME_OK = 1<<29,
> };
> 
> #define IFF_802_1Q_VLAN   IFF_802_1Q_VLAN
>@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
> #define IFF_FAILOVER  IFF_FAILOVER
> #define IFF_FAILOVER_SLAVEIFF_FAILOVER_SLAVE
>+#define IFF_SLAVE_RENAME_OK   IFF_SLAVE_RENAME_OK
> 
> /**
>  *struct net_device - The DEVICE structure.
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 722d50d..ae070de 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char 
>*newname)
>   BUG_ON(!dev_net(dev));
> 
>   net = dev_net(dev);
>-  if (dev->flags & IFF_UP)
>+  if (dev->flags & IFF_UP &&
>+  !(dev->priv_flags & IFF_SLAVE_RENAME_OK))
>   return -EBUSY;
> 
>   write_seqcount_begin(_rename_seq);
>diff --git a/net/core/failover.c b/net/core/failover.c
>index 4a92a98..1fd8bbb 100644

Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-26 Thread Jiri Pirko
Sat, May 26, 2018 at 09:22:18AM CEST, sridhar.samudr...@intel.com wrote:
>On 5/25/2018 4:28 PM, Stephen Hemminger wrote:
>> On Fri, 25 May 2018 16:11:47 -0700
>> "Samudrala, Sridhar"  wrote:
>> 
>> > On 5/25/2018 3:34 PM, Stephen Hemminger wrote:
>> > > On Thu, 24 May 2018 09:55:14 -0700
>> > > Sridhar Samudrala  wrote:
>> > > > --- a/drivers/net/hyperv/Kconfig
>> > > > +++ b/drivers/net/hyperv/Kconfig
>> > > > @@ -2,5 +2,6 @@ config HYPERV_NET
>> > > >tristate "Microsoft Hyper-V virtual network driver"
>> > > >depends on HYPERV
>> > > >select UCS2_STRING
>> > > > +  select FAILOVER
>> > > When I take a working kernel config, add the patches then do
>> > > make oldconfig
>> > > 
>> > > It is not autoselecting FAILOVER, it prompts me for it. This means
>> > > if user says no then a non-working netvsc device is made.
>> > I see
>> >  Generic failover module (FAILOVER) [M/y/?] (NEW)
>> > 
>> > So the user is given an option to either build as a Module or part of the
>> > kernel. 'n' is not an option.
>> With most libraries there is no prompt at all.
>
>Not sure what you meant by this.
>Without any patches applied, i had a .config file with HYPERV_NET configured
>as a module.
>Then after applying the first 2 patches in this series, i did a
>  make oldconfig
>and i see the above prompt.
>
>Are you saying that on some distros, 'make oldconfig creates a .config
>file without any prompt and FAILOVER is not getting selected even when 
>HYPERV_NET
>is enabled?
>
>

Well the thing is that for a user, it makes no sense to select
"FAILOVER" by hand. It is a lib, so it should be only select it by a
user. It has no sense to have it turned on by hand - no lib user.
You can achieve that by simply removing "help" for the Kconfig
item. Same thing for "NET_FAILOVER".
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-26 Thread Jiri Pirko
Sat, May 26, 2018 at 12:37:44AM CEST, step...@networkplumber.org wrote:
>On Thu, 24 May 2018 09:55:13 -0700
>Sridhar Samudrala  wrote:
>
>
>> +spin_lock(_lock);
>
>Since register is not in fast path, this should be a mutex?

I don't get it. Why would you prefer mutex over spinlock here?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-23 Thread Jiri Pirko
Tue, May 22, 2018 at 10:54:29PM CEST, sridhar.samudr...@intel.com wrote:
>
>
>On 5/22/2018 9:12 AM, Jiri Pirko wrote:
>> Fixing the subj, sorry about that.
>> 
>> Tue, May 22, 2018 at 05:46:21PM CEST, m...@redhat.com wrote:
>> > On Tue, May 22, 2018 at 05:36:14PM +0200, Jiri Pirko wrote:
>> > > Tue, May 22, 2018 at 05:28:42PM CEST, sridhar.samudr...@intel.com wrote:
>> > > > On 5/22/2018 2:08 AM, Jiri Pirko wrote:
>> > > > > Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>> > > > > > Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com 
>> > > > > > wrote:
>> > > > > > > Use the registration/notification framework supported by the 
>> > > > > > > generic
>> > > > > > > failover infrastructure.
>> > > > > > > 
>> > > > > > > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> > > > > > In previous patchset versions, the common code did
>> > > > > > netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> > > > > > (netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> > > > > > 
>> > > > > > This should be part of the common "failover" code.
>> > > > Based on Stephen's feedback on earlier patches, i tried to minimize 
>> > > > the changes to
>> > > > netvsc and only commonize the notifier and the main event handler 
>> > > > routine.
>> > > > Another complication is that netvsc does part of registration in a 
>> > > > delayed workqueue.
>> > > :( This kind of degrades the whole efford of having single solution
>> > > in "failover" module. I think that common parts, as
>> > > netdev_rx_handler_register() and others certainly is should be inside
>> > > the common module. This is not a good time to minimize changes. Let's do
>> > > the thing properly and fix the netvsc mess now.
>> > > 
>> > > 
>> > > > It should be possible to move some of the code from net_failover.c to 
>> > > > generic
>> > > > failover.c in future if Stephen is ok with it.
>> > > > 
>> > > > 
>> > > > > Also note that in the current patchset you use IFF_FAILOVER flag for
>> > > > > master, yet for the slave you use IFF_SLAVE. That is wrong.
>> > > > > IFF_FAILOVER_SLAVE should be used.
>> > > > Not sure which code you are referring to.  I only set 
>> > > > IFF_FAILOVER_SLAVE
>> > > > in patch 3.
>> > > The existing netvsc driver.
>> > We really can't change netvsc's flags now, even if it's interface is
>> > messy, it's being used in the field. We can add a flag that makes netvsc
>> > behave differently, and if this flag also allows enhanced functionality
>> > userspace will gradually switch.
>> Okay, although in this case, it really does not make much sense, so be
>> it. Leave the netvsc set the ->priv flag to IFF_SLAVE as it is doing
>> now. (This once-wrong-forever-wrong policy is flustrating me).
>> 
>> But since this patchset introduces private flag IFF_FAILOVER and
>> IFF_FAILOVER_SLAVE, and we set IFF_FAILOVER to the netvsc netdev
>> instance, we should also set IFF_FAILOVER_SLAVE to the enslaved VF
>> netdevice to get at least some consistency between virtio_net and
>> netvsc.
>
>OK. I can make this change to set/unset IFF_FAILOVER_SLAVE in the netvsc
>register/unregister routines so that it is consistent with virtio_net.
>
>Based on your discussion with mst, i think we can even remove IFF_SLAVE
>setting on netvsc as it should not impact userspace.  If Stephen is OK
>we can make this change too.
>
>Do you see any other items that need to be resolved for this series to go in
>this merge window?

As I wrote previously, the common code including rx_handler registration
and setting of flags and master link should be done in a common code,
moved away from netvsc code.

Thanks.


>
>
>
>> 
>> > Anything breaking userspace I fully expect Stephen to nack and
>> > IMO with good reason.
>> > 
>> > -- 
>> > MST
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 06:52:21PM CEST, m...@redhat.com wrote:
>On Tue, May 22, 2018 at 05:45:01PM +0200, Jiri Pirko wrote:
>> Tue, May 22, 2018 at 05:32:30PM CEST, m...@redhat.com wrote:
>> >On Tue, May 22, 2018 at 05:13:43PM +0200, Jiri Pirko wrote:
>> >> Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote:
>> >> >On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote:
>> >> >> Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote:
>> >> >> >On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote:
>> >> >> >> Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote:
>> >> >> >> >On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote:
>> >> >> >> >> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>> >> >> >> >> >Tue, May 22, 2018 at 04:06:18AM CEST, 
>> >> >> >> >> >sridhar.samudr...@intel.com wrote:
>> >> >> >> >> >>Use the registration/notification framework supported by the 
>> >> >> >> >> >>generic
>> >> >> >> >> >>failover infrastructure.
>> >> >> >> >> >>
>> >> >> >> >> >>Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> >> >> >> >> >
>> >> >> >> >> >In previous patchset versions, the common code did
>> >> >> >> >> >netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> >> >> >> >> >(netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> >> >> >> >> >
>> >> >> >> >> >This should be part of the common "failover" code.
>> >> >> >> >> >
>> >> >> >> >> 
>> >> >> >> >> Also note that in the current patchset you use IFF_FAILOVER flag 
>> >> >> >> >> for
>> >> >> >> >> master, yet for the slave you use IFF_SLAVE. That is wrong.
>> >> >> >> >> IFF_FAILOVER_SLAVE should be used.
>> >> >> >> >
>> >> >> >> >Or drop IFF_FAILOVER_SLAVE and set both IFF_FAILOVER and IFF_SLAVE?
>> >> >> >> 
>> >> >> >> No. IFF_SLAVE is for bonding.
>> >> >> >
>> >> >> >What breaks if we reuse it for failover?
>> >> >> 
>> >> >> This is exposed to userspace. IFF_SLAVE is expected for bonding slaves.
>> >> >> And failover slave is not a bonding slave.
>> >> >
>> >> >That does not really answer the question.  I'd claim it's sufficiently
>> >> >like a bond slave for IFF_SLAVE to make sense.
>> >> >
>> >> >In fact you will find that netvsc already sets IFF_SLAVE, and so
>> >> 
>> >> netvsc does the whole failover thing in a wrong way. This patchset is
>> >> trying to fix it.
>> >
>> >Maybe, but we don't need gratuitous changes either, especially if they
>> >break userspace.
>> 
>> What do you mean by the "break"? It was a mistake to reuse IFF_SLAVE at
>> the first place, lets fix it. If some userspace depends on that flag, it
>> is broken anyway.
>> 
>> 
>> >
>> >> >does e.g. the eql driver.
>> >> >
>> >> >The advantage of using IFF_SLAVE is that userspace knows to skip it.  If
>> >> 
>> >> The userspace should know how to skip other types of slaves - team,
>> >> bridge, ovs, etc.
>> >> The "master link" should be the one to look at.
>> >> 
>> >
>> >How should existing userspace know which ones to skip and which one is
>> >the master?  Right now userspace seems to assume whatever does not have
>> >IFF_SLAVE should be looked at. Are you saying that's not the right thing
>> 
>> Why do you say so? What do you mean by "looked at"? Certainly not.
>> IFLA_MASTER is the attribute that should be looked at, nothing else.
>> 
>> 
>> >to do and userspace should be fixed? What should userspace do in
>> >your opinion that will be forward compatible with future kernels?
>> >
>> >> 
>> >> >we don't set IFF_SLAVE existing userspace tries to use the lowerde

Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-22 Thread Jiri Pirko
Fixing the subj, sorry about that.

Tue, May 22, 2018 at 05:46:21PM CEST, m...@redhat.com wrote:
>On Tue, May 22, 2018 at 05:36:14PM +0200, Jiri Pirko wrote:
>> Tue, May 22, 2018 at 05:28:42PM CEST, sridhar.samudr...@intel.com wrote:
>> >
>> >On 5/22/2018 2:08 AM, Jiri Pirko wrote:
>> >> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>> >> > Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com wrote:
>> >> > > Use the registration/notification framework supported by the generic
>> >> > > failover infrastructure.
>> >> > > 
>> >> > > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> >> > In previous patchset versions, the common code did
>> >> > netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> >> > (netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> >> > 
>> >> > This should be part of the common "failover" code.
>> >
>> >Based on Stephen's feedback on earlier patches, i tried to minimize the 
>> >changes to
>> >netvsc and only commonize the notifier and the main event handler routine.
>> >Another complication is that netvsc does part of registration in a delayed 
>> >workqueue.
>> 
>> :( This kind of degrades the whole efford of having single solution
>> in "failover" module. I think that common parts, as
>> netdev_rx_handler_register() and others certainly is should be inside
>> the common module. This is not a good time to minimize changes. Let's do
>> the thing properly and fix the netvsc mess now.
>> 
>> 
>> >
>> >It should be possible to move some of the code from net_failover.c to 
>> >generic
>> >failover.c in future if Stephen is ok with it.
>> >
>> >
>> >> > 
>> >> Also note that in the current patchset you use IFF_FAILOVER flag for
>> >> master, yet for the slave you use IFF_SLAVE. That is wrong.
>> >> IFF_FAILOVER_SLAVE should be used.
>> >
>> >Not sure which code you are referring to.  I only set IFF_FAILOVER_SLAVE
>> >in patch 3.
>> 
>> The existing netvsc driver.
>
>We really can't change netvsc's flags now, even if it's interface is
>messy, it's being used in the field. We can add a flag that makes netvsc
>behave differently, and if this flag also allows enhanced functionality
>userspace will gradually switch.

Okay, although in this case, it really does not make much sense, so be
it. Leave the netvsc set the ->priv flag to IFF_SLAVE as it is doing
now. (This once-wrong-forever-wrong policy is flustrating me).

But since this patchset introduces private flag IFF_FAILOVER and
IFF_FAILOVER_SLAVE, and we set IFF_FAILOVER to the netvsc netdev
instance, we should also set IFF_FAILOVER_SLAVE to the enslaved VF
netdevice to get at least some consistency between virtio_net and
netvsc.


>
>Anything breaking userspace I fully expect Stephen to nack and
>IMO with good reason.
>
>-- 
>MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 05:32:30PM CEST, m...@redhat.com wrote:
>On Tue, May 22, 2018 at 05:13:43PM +0200, Jiri Pirko wrote:
>> Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote:
>> >On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote:
>> >> Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote:
>> >> >On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote:
>> >> >> Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote:
>> >> >> >On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote:
>> >> >> >> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>> >> >> >> >Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com 
>> >> >> >> >wrote:
>> >> >> >> >>Use the registration/notification framework supported by the 
>> >> >> >> >>generic
>> >> >> >> >>failover infrastructure.
>> >> >> >> >>
>> >> >> >> >>Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> >> >> >> >
>> >> >> >> >In previous patchset versions, the common code did
>> >> >> >> >netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> >> >> >> >(netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> >> >> >> >
>> >> >> >> >This should be part of the common "failover" code.
>> >> >> >> >
>> >> >> >> 
>> >> >> >> Also note that in the current patchset you use IFF_FAILOVER flag for
>> >> >> >> master, yet for the slave you use IFF_SLAVE. That is wrong.
>> >> >> >> IFF_FAILOVER_SLAVE should be used.
>> >> >> >
>> >> >> >Or drop IFF_FAILOVER_SLAVE and set both IFF_FAILOVER and IFF_SLAVE?
>> >> >> 
>> >> >> No. IFF_SLAVE is for bonding.
>> >> >
>> >> >What breaks if we reuse it for failover?
>> >> 
>> >> This is exposed to userspace. IFF_SLAVE is expected for bonding slaves.
>> >> And failover slave is not a bonding slave.
>> >
>> >That does not really answer the question.  I'd claim it's sufficiently
>> >like a bond slave for IFF_SLAVE to make sense.
>> >
>> >In fact you will find that netvsc already sets IFF_SLAVE, and so
>> 
>> netvsc does the whole failover thing in a wrong way. This patchset is
>> trying to fix it.
>
>Maybe, but we don't need gratuitous changes either, especially if they
>break userspace.

What do you mean by the "break"? It was a mistake to reuse IFF_SLAVE at
the first place, lets fix it. If some userspace depends on that flag, it
is broken anyway.


>
>> >does e.g. the eql driver.
>> >
>> >The advantage of using IFF_SLAVE is that userspace knows to skip it.  If
>> 
>> The userspace should know how to skip other types of slaves - team,
>> bridge, ovs, etc.
>> The "master link" should be the one to look at.
>> 
>
>How should existing userspace know which ones to skip and which one is
>the master?  Right now userspace seems to assume whatever does not have
>IFF_SLAVE should be looked at. Are you saying that's not the right thing

Why do you say so? What do you mean by "looked at"? Certainly not.
IFLA_MASTER is the attribute that should be looked at, nothing else.


>to do and userspace should be fixed? What should userspace do in
>your opinion that will be forward compatible with future kernels?
>
>> 
>> >we don't set IFF_SLAVE existing userspace tries to use the lowerdev.
>> 
>> Each master type has a IFF_ master flag and IFF_ slave flag.
>
>Could you give some examples please?

enum netdev_priv_flags {
IFF_EBRIDGE = 1<<1,
IFF_BRIDGE_PORT = 1<<9,
IFF_OPENVSWITCH = 1<<20,
IFF_OVS_DATAPATH= 1<<10,
IFF_L3MDEV_MASTER   = 1<<18,
IFF_L3MDEV_SLAVE= 1<<21,
IFF_TEAM= 1<<22,
IFF_TEAM_PORT   = 1<<13,
};


>
>> In private
>> flag. I don't see no reason to break this pattern here.
>
>Other masters are setup from userspace, this one is set up automatically
>by kernel. So the bar is higher, we need an interface that existing
>userspace knows about.  We can't just say "oh if userspace set this up
>it should know to skip lowerdevs".
>
>Otherwise multiple interfaces with same mac tend to confuse userspace.

No difference, really.
Regardless who does the setup, and independent userspace deamon should
react accordingly.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Shepherd request (P83): Multipath TCP: Present Use Cases and an Upstream Future

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 05:28:42PM CEST, sridhar.samudr...@intel.com wrote:
>
>On 5/22/2018 2:08 AM, Jiri Pirko wrote:
>> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>> > Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com wrote:
>> > > Use the registration/notification framework supported by the generic
>> > > failover infrastructure.
>> > > 
>> > > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> > In previous patchset versions, the common code did
>> > netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> > (netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> > 
>> > This should be part of the common "failover" code.
>
>Based on Stephen's feedback on earlier patches, i tried to minimize the 
>changes to
>netvsc and only commonize the notifier and the main event handler routine.
>Another complication is that netvsc does part of registration in a delayed 
>workqueue.

:( This kind of degrades the whole efford of having single solution
in "failover" module. I think that common parts, as
netdev_rx_handler_register() and others certainly is should be inside
the common module. This is not a good time to minimize changes. Let's do
the thing properly and fix the netvsc mess now.


>
>It should be possible to move some of the code from net_failover.c to generic
>failover.c in future if Stephen is ok with it.
>
>
>> > 
>> Also note that in the current patchset you use IFF_FAILOVER flag for
>> master, yet for the slave you use IFF_SLAVE. That is wrong.
>> IFF_FAILOVER_SLAVE should be used.
>
>Not sure which code you are referring to.  I only set IFF_FAILOVER_SLAVE
>in patch 3.

The existing netvsc driver.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 03:39:33PM CEST, m...@redhat.com wrote:
>On Tue, May 22, 2018 at 03:26:26PM +0200, Jiri Pirko wrote:
>> Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote:
>> >On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote:
>> >> Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote:
>> >> >On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote:
>> >> >> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>> >> >> >Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com 
>> >> >> >wrote:
>> >> >> >>Use the registration/notification framework supported by the generic
>> >> >> >>failover infrastructure.
>> >> >> >>
>> >> >> >>Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> >> >> >
>> >> >> >In previous patchset versions, the common code did
>> >> >> >netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> >> >> >(netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> >> >> >
>> >> >> >This should be part of the common "failover" code.
>> >> >> >
>> >> >> 
>> >> >> Also note that in the current patchset you use IFF_FAILOVER flag for
>> >> >> master, yet for the slave you use IFF_SLAVE. That is wrong.
>> >> >> IFF_FAILOVER_SLAVE should be used.
>> >> >
>> >> >Or drop IFF_FAILOVER_SLAVE and set both IFF_FAILOVER and IFF_SLAVE?
>> >> 
>> >> No. IFF_SLAVE is for bonding.
>> >
>> >What breaks if we reuse it for failover?
>> 
>> This is exposed to userspace. IFF_SLAVE is expected for bonding slaves.
>> And failover slave is not a bonding slave.
>
>That does not really answer the question.  I'd claim it's sufficiently
>like a bond slave for IFF_SLAVE to make sense.
>
>In fact you will find that netvsc already sets IFF_SLAVE, and so

netvsc does the whole failover thing in a wrong way. This patchset is
trying to fix it.

>does e.g. the eql driver.
>
>The advantage of using IFF_SLAVE is that userspace knows to skip it.  If

The userspace should know how to skip other types of slaves - team,
bridge, ovs, etc. The "master link" should be the one to look at.


>we don't set IFF_SLAVE existing userspace tries to use the lowerdev.

Each master type has a IFF_ master flag and IFF_ slave flag. In private
flag. I don't see no reason to break this pattern here.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 03:17:37PM CEST, m...@redhat.com wrote:
>On Tue, May 22, 2018 at 03:14:22PM +0200, Jiri Pirko wrote:
>> Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote:
>> >On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote:
>> >> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>> >> >Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com wrote:
>> >> >>Use the registration/notification framework supported by the generic
>> >> >>failover infrastructure.
>> >> >>
>> >> >>Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> >> >
>> >> >In previous patchset versions, the common code did
>> >> >netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> >> >(netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> >> >
>> >> >This should be part of the common "failover" code.
>> >> >
>> >> 
>> >> Also note that in the current patchset you use IFF_FAILOVER flag for
>> >> master, yet for the slave you use IFF_SLAVE. That is wrong.
>> >> IFF_FAILOVER_SLAVE should be used.
>> >
>> >Or drop IFF_FAILOVER_SLAVE and set both IFF_FAILOVER and IFF_SLAVE?
>> 
>> No. IFF_SLAVE is for bonding.
>
>What breaks if we reuse it for failover?

This is exposed to userspace. IFF_SLAVE is expected for bonding slaves.
And failover slave is not a bonding slave.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 03:12:40PM CEST, m...@redhat.com wrote:
>On Tue, May 22, 2018 at 11:08:53AM +0200, Jiri Pirko wrote:
>> Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>> >Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com wrote:
>> >>Use the registration/notification framework supported by the generic
>> >>failover infrastructure.
>> >>
>> >>Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> >
>> >In previous patchset versions, the common code did
>> >netdev_rx_handler_register() and netdev_upper_dev_link() etc
>> >(netvsc_vf_join()). Now, this is still done in netvsc. Why?
>> >
>> >This should be part of the common "failover" code.
>> >
>> 
>> Also note that in the current patchset you use IFF_FAILOVER flag for
>> master, yet for the slave you use IFF_SLAVE. That is wrong.
>> IFF_FAILOVER_SLAVE should be used.
>
>Or drop IFF_FAILOVER_SLAVE and set both IFF_FAILOVER and IFF_SLAVE?

No. IFF_SLAVE is for bonding.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 11:06:37AM CEST, j...@resnulli.us wrote:
>Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com wrote:
>>Use the registration/notification framework supported by the generic
>>failover infrastructure.
>>
>>Signed-off-by: Sridhar Samudrala 
>
>In previous patchset versions, the common code did
>netdev_rx_handler_register() and netdev_upper_dev_link() etc
>(netvsc_vf_join()). Now, this is still done in netvsc. Why?
>
>This should be part of the common "failover" code.
>

Also note that in the current patchset you use IFF_FAILOVER flag for
master, yet for the slave you use IFF_SLAVE. That is wrong.
IFF_FAILOVER_SLAVE should be used.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 04:06:18AM CEST, sridhar.samudr...@intel.com wrote:
>Use the registration/notification framework supported by the generic
>failover infrastructure.
>
>Signed-off-by: Sridhar Samudrala 

In previous patchset versions, the common code did
netdev_rx_handler_register() and netdev_upper_dev_link() etc
(netvsc_vf_join()). Now, this is still done in netvsc. Why?

This should be part of the common "failover" code.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v11 3/5] net: Introduce net_failover driver

2018-05-22 Thread Jiri Pirko
Tue, May 22, 2018 at 04:06:19AM CEST, sridhar.samudr...@intel.com wrote:
>The net_failover driver provides an automated failover mechanism via APIs
>to create and destroy a failover master netdev and mananges a primary and
>standby slave netdevs that get registered via the generic failover
>infrastructure.
>
>The failover netdev acts a master device and controls 2 slave devices. The
>original paravirtual interface gets registered as 'standby' slave netdev and
>a passthru/vf device with the same MAC gets registered as 'primary' slave
>netdev. Both 'standby' and 'failover' netdevs are associated with the same
>'pci' device. The user accesses the network interface via 'failover' netdev.
>The 'failover' netdev chooses 'primary' netdev as default for transmits when
>it is available with link up and running.
>
>This can be used by paravirtual drivers to enable an alternate low latency
>datapath. It also enables hypervisor controlled live migration of a VM with
>direct attached VF by failing over to the paravirtual datapath when the VF
>is unplugged.
>
>Signed-off-by: Sridhar Samudrala 

[...]

>+
>+The net_failover driver provides an automated failover mechanism via APIs
>+to create and destroy a failover master netdev and mananges a primary and

s/mananges/manages/
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-05-02 Thread Jiri Pirko
Wed, May 02, 2018 at 07:51:12PM CEST, sridhar.samudr...@intel.com wrote:
>
>
>On 5/2/2018 9:15 AM, Jiri Pirko wrote:
>> Sat, Apr 28, 2018 at 11:06:01AM CEST, j...@resnulli.us wrote:
>> > Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>> [...]
>> 
>> 
>> > > +
>> > > +err = netdev_rx_handler_register(slave_dev, 
>> > > net_failover_handle_frame,
>> > > + failover_dev);
>> > > +if (err) {
>> > > +netdev_err(slave_dev, "can not register failover rx 
>> > > handler (err = %d)\n",
>> > > +   err);
>> > > +goto err_handler_register;
>> > > +}
>> > > +
>> > > +err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> > Please use netdev_master_upper_dev_link().
>> Don't forget to fillup struct netdev_lag_upper_info - 
>> NETDEV_LAG_TX_TYPE_ACTIVEBACKUP
>> 
>> 
>> Also, please call netdev_lower_state_changed() when the active slave
>> device changes from primary->backup of backup->primary and whenever link
>> state of a slave changes
>> 
>Sure. will look into it.  Do you think this will help with the issue
>you saw with having to change mac on standy twice to get the init scripts
>working? We are now going to block changing the mac on both standby and
>failover.

I don't see any relation to that.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-05-02 Thread Jiri Pirko
Wed, May 02, 2018 at 07:51:12PM CEST, sridhar.samudr...@intel.com wrote:
>
>
>On 5/2/2018 9:15 AM, Jiri Pirko wrote:
>> Sat, Apr 28, 2018 at 11:06:01AM CEST, j...@resnulli.us wrote:
>> > Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>> [...]
>> 
>> 
>> > > +
>> > > +err = netdev_rx_handler_register(slave_dev, 
>> > > net_failover_handle_frame,
>> > > + failover_dev);
>> > > +if (err) {
>> > > +netdev_err(slave_dev, "can not register failover rx 
>> > > handler (err = %d)\n",
>> > > +   err);
>> > > +goto err_handler_register;
>> > > +}
>> > > +
>> > > +err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> > Please use netdev_master_upper_dev_link().
>> Don't forget to fillup struct netdev_lag_upper_info - 
>> NETDEV_LAG_TX_TYPE_ACTIVEBACKUP
>> 
>> 
>> Also, please call netdev_lower_state_changed() when the active slave
>> device changes from primary->backup of backup->primary and whenever link
>> state of a slave changes
>> 
>Sure. will look into it.  Do you think this will help with the issue
>you saw with having to change mac on standy twice to get the init scripts
>working? We are now going to block changing the mac on both standby and
>failover.
>
>Also, i was wondering if we should set dev->flags to IFF_MASTER on failover
>and IFF_SLAVE on primary and standby. netvsc does this.

No. Don't set it. It is wrong.



>Does this help with the init scripts and network manager to skip slave
>devices for dhcp requests?
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-05-02 Thread Jiri Pirko
Sat, Apr 28, 2018 at 11:06:01AM CEST, j...@resnulli.us wrote:
>Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:

[...]


>>+
>>+ err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
>>+  failover_dev);
>>+ if (err) {
>>+ netdev_err(slave_dev, "can not register failover rx handler 
>>(err = %d)\n",
>>+err);
>>+ goto err_handler_register;
>>+ }
>>+
>>+ err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>
>Please use netdev_master_upper_dev_link().

Don't forget to fillup struct netdev_lag_upper_info - 
NETDEV_LAG_TX_TYPE_ACTIVEBACKUP


Also, please call netdev_lower_state_changed() when the active slave
device changes from primary->backup of backup->primary and whenever link
state of a slave changes

[...]
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-05-02 Thread Jiri Pirko
Wed, May 02, 2018 at 05:34:44PM CEST, sridhar.samudr...@intel.com wrote:
>On 5/2/2018 12:50 AM, Jiri Pirko wrote:
>> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>> > > > > Now I try to change mac of the failover master:
>> > > > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> > > > > RTNETLINK answers: Operation not supported
>> > > > > 
>> > > > > That I did expect to work. I would expect this would change the mac 
>> > > > > of
>> > > > > the master and both standby and primary slaves.
>> > > > If a VF is untrusted, a VM will not able to change its MAC and moreover
>> > > Note that at this point, I have no VF. So I'm not sure why you mention
>> > > that.
>> > > 
>> > > 
>> > > > in this mode we are assuming that the hypervisor has assigned the MAC 
>> > > > and
>> > > > guest is not expected to change the MAC.
>> > > Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>> > > mac and all works fine. How is this different? Change mac on "failover
>> > > instance" should work and should propagate the mac down to its slaves.
>> > > 
>> > > 
>> > > > For the initial implementation, i would propose not allowing the guest 
>> > > > to
>> > > > change the MAC of failover or standby dev.
>> > > I see no reason for such restriction.
>> > > 
>> > It is true that a VM user can change mac address of a normal virtio-net 
>> > interface,
>> > however when it is in STANDBY mode i think we should not allow this change 
>> > specifically
>> > because we are creating a failover instance based on a MAC that is 
>> > assigned by the
>> > hypervisor.
>> > 
>> > Moreover,  in a cloud environment i would think that PF/hypervisor assigns 
>> > a MAC to
>> > the VF and it cannot be changed by the guest.
>> So that is easy. You allow the change of the mac and in the "failover"
>> mac change implementation you propagate the change down to slaves. If
>> one slave does not support the change, you bail out. And since VF does
>> not allow it as you say, once it will be enslaved, the mac change could
>> not be done. Seems like a correct behavior to me and is in-sync with how
>> bond/team behaves.
>
>If we allow the mac to be changed when the VF is not yet plugged in
>and the guest changes the mac, then VF cannot join the failover when
>it is hot plugged with the original mac assigned by the hypervisor.
>Specifically there could be timing window during the live migration where
>VF would be unplugged at the source and if we allow the guest to change the
>failover mac, then VF would not be able to register with the failover after
>the VM is migrated to the destination.

Okay. So as suggested by mst, just forbid the mac changes all together.


>
>> 
>> > So for the initial implementation, do you see any issues with having this 
>> > restriction
>> > in STANDBY mode.
>> > 
>> > 
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-05-02 Thread Jiri Pirko
Wed, May 02, 2018 at 05:47:27PM CEST, m...@redhat.com wrote:
>On Wed, May 02, 2018 at 09:50:21AM +0200, Jiri Pirko wrote:
>> Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudr...@intel.com wrote:
>> >On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>> >> 
>> >> > > Now I try to change mac of the failover master:
>> >> > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> >> > > RTNETLINK answers: Operation not supported
>> >> > > 
>> >> > > That I did expect to work. I would expect this would change the mac of
>> >> > > the master and both standby and primary slaves.
>> >> > If a VF is untrusted, a VM will not able to change its MAC and moreover
>> >> Note that at this point, I have no VF. So I'm not sure why you mention
>> >> that.
>> >> 
>> >> 
>> >> > in this mode we are assuming that the hypervisor has assigned the MAC 
>> >> > and
>> >> > guest is not expected to change the MAC.
>> >> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>> >> mac and all works fine. How is this different? Change mac on "failover
>> >> instance" should work and should propagate the mac down to its slaves.
>> >> 
>> >> 
>> >> > For the initial implementation, i would propose not allowing the guest 
>> >> > to
>> >> > change the MAC of failover or standby dev.
>> >> I see no reason for such restriction.
>> >> 
>> >
>> >It is true that a VM user can change mac address of a normal virtio-net 
>> >interface,
>> >however when it is in STANDBY mode i think we should not allow this change 
>> >specifically
>> >because we are creating a failover instance based on a MAC that is assigned 
>> >by the
>> >hypervisor.
>> >
>> >Moreover,  in a cloud environment i would think that PF/hypervisor assigns 
>> >a MAC to
>> >the VF and it cannot be changed by the guest.
>> 
>> So that is easy. You allow the change of the mac and in the "failover"
>> mac change implementation you propagate the change down to slaves. If
>> one slave does not support the change, you bail out. And since VF does
>
>I wish people would say primary/standby and not "VF" :)

Sure, sorry.


>
>> not allow it as you say, once it will be enslaved, the mac change could
>> not be done. Seems like a correct behavior to me
>
>
>what if primary does not allow mac changes and is attached after
>mac is changed on standy?

Mac should be changed on failover. In that case, the primary would have
a different mac and therefore it won't get enslaved.

>
>
>> and is in-sync with how
>> bond/team behaves.
>
>I think in the end virtio can just block MAC changes for simplicity.
>
>I personally don't see softmac as a must have feature in v1,
>we can add it later.

Okay.


>
>What's the situation with init scripts and whether it's
>possible to make them work well would be a better question.
>
>> 
>> >
>> >So for the initial implementation, do you see any issues with having this 
>> >restriction
>> >in STANDBY mode.
>> >
>> >
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-05-02 Thread Jiri Pirko
Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudr...@intel.com wrote:
>On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>> 
>> > > Now I try to change mac of the failover master:
>> > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> > > RTNETLINK answers: Operation not supported
>> > > 
>> > > That I did expect to work. I would expect this would change the mac of
>> > > the master and both standby and primary slaves.
>> > If a VF is untrusted, a VM will not able to change its MAC and moreover
>> Note that at this point, I have no VF. So I'm not sure why you mention
>> that.
>> 
>> 
>> > in this mode we are assuming that the hypervisor has assigned the MAC and
>> > guest is not expected to change the MAC.
>> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>> mac and all works fine. How is this different? Change mac on "failover
>> instance" should work and should propagate the mac down to its slaves.
>> 
>> 
>> > For the initial implementation, i would propose not allowing the guest to
>> > change the MAC of failover or standby dev.
>> I see no reason for such restriction.
>> 
>
>It is true that a VM user can change mac address of a normal virtio-net 
>interface,
>however when it is in STANDBY mode i think we should not allow this change 
>specifically
>because we are creating a failover instance based on a MAC that is assigned by 
>the
>hypervisor.
>
>Moreover,  in a cloud environment i would think that PF/hypervisor assigns a 
>MAC to
>the VF and it cannot be changed by the guest.

So that is easy. You allow the change of the mac and in the "failover"
mac change implementation you propagate the change down to slaves. If
one slave does not support the change, you bail out. And since VF does
not allow it as you say, once it will be enslaved, the mac change could
not be done. Seems like a correct behavior to me and is in-sync with how
bond/team behaves.


>
>So for the initial implementation, do you see any issues with having this 
>restriction
>in STANDBY mode.
>
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-05-01 Thread Jiri Pirko
Mon, Apr 30, 2018 at 09:26:34PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/30/2018 12:12 AM, Jiri Pirko wrote:
>> Mon, Apr 30, 2018 at 05:00:33AM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/28/2018 1:24 AM, Jiri Pirko wrote:
>> > > Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>> > > > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > > > netdev is present with the same MAC address. It allows live migration
>> > > > of a VM with a direct attached VF without the need to setup a bond/team
>> > > > between a VF and virtio net device in the guest.
>> > > > 
>> > > > The hypervisor needs to enable only one datapath at any time so that
>> > > > packets don't get looped back to the VM over the other datapath. When 
>> > > > a VF
>> > > Why? Both datapaths could be enabled at a time. Why the loop on
>> > > hypervisor side would be a problem. This in not an issue for
>> > > bonding/team as well.
>> > Somehow the hypervisor needs to make sure that the broadcasts/multicasts 
>> > from the VM
>> > sent over the VF datapath don't get looped back to the VM via the 
>> > virtio-net datapth.
>> Why? Please see below.
>> 
>> 
>> > This can happen if both datapaths are enabled at the same time.
>> > 
>> > I would think this is an issue even with bonding/team as well when 
>> > virtio-net and
>> > VF are backed by the same PF.
>> > 
>> > 
>> I believe that the scenario is the same as on an ordinary nic/swich
>> network:
>> 
>> ...
>> 
>>host
>> bond0
>>/ \
>>  eth0   eth1
>>   |   |
>> ...
>>   |   |
>>   p1  p2
>> 
>>switch
>> 
>> ...
>> 
>> It is perfectly valid to p1 and p2 be up and "bridged" together. Bond
>> has to cope with loop-backed frames. "Failover driver" should too,
>> it's the same scenario.
>
>OK. So looks like we should be able to handle this by returning 
>RX_HANDLER_EXACT
>for frames received on standby device when primary is present.

Yep.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-30 Thread Jiri Pirko
Mon, Apr 30, 2018 at 06:16:58AM CEST, sridhar.samudr...@intel.com wrote:
>On 4/28/2018 2:42 AM, Jiri Pirko wrote:
>> Fri, Apr 27, 2018 at 07:06:59PM CEST,sridhar.samudr...@intel.com  wrote:
>> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > netdev is present with the same MAC address. It allows live migration
>> > of a VM with a direct attached VF without the need to setup a bond/team
>> > between a VF and virtio net device in the guest.
>> > 
>> > The hypervisor needs to enable only one datapath at any time so that
>> > packets don't get looped back to the VM over the other datapath. When a VF
>> > is plugged, the virtio datapath link state can be marked as down. The
>> > hypervisor needs to unplug the VF device from the guest on the source host
>> > and reset the MAC filter of the VF to initiate failover of datapath to
>> > virtio before starting the migration. After the migration is completed,
>> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > to the guest to switch over to VF datapath.
>> > 
>> > It uses the generic failover framework that provides 2 functions to create
>> > and destroy a master failover netdev. When STANDBY feature is enabled, an
>> > additional netdev(failover netdev) is created that acts as a master device
>> > and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>> > is marked as 'standby' netdev and a passthru device with the same MAC is
>> > registered as 'primary' netdev.
>> > 
>> > This patch is based on the discussion initiated by Jesse on this thread.
>> > https://marc.info/?l=linux-virtualization=151189725224231=2
>> > 
>> When I enabled the standby feature (hardcoded), I have 2 netdevices now:
>> 4: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
>> group default qlen 1000
>>  link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
>>  inet6 fe80::5054:ff:feb2:a7f1/64 scope link
>> valid_lft forever preferred_lft forever
>> 5: ens3n_sby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel 
>> state UP group default qlen 1000
>>  link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
>>  inet6 fe80::5054:ff:feb2:a7f1/64 scope link
>> valid_lft forever preferred_lft forever
>> 
>> However, it seems to confuse my initscripts on Fedora:
>> [root@test1 ~]# ifup ens3
>> ./network-functions: line 78: [: /etc/dhcp/dhclient-ens3: binary operator 
>> expected
>> ./network-functions: line 80: [: /etc/dhclient-ens3: binary operator expected
>> ./network-functions: line 69: [: /var/lib/dhclient/dhclient-ens3: binary 
>> operator expected
>> 
>> Determining IP information for ens3
>> ens3n_sby...Cannot find device "ens3n_sby.pid"
>> Cannot find device "ens3n_sby.lease"
>>   failed.
>> 
>> I tried to change the standby device mac:
>> ip link set ens3n_sby addr 52:54:00:b2:a7:f2
>> [root@test1 ~]# ifup ens3
>> 
>> Determining IP information for ens3... done.
>> [root@test1 ~]#
>> 
>> But now the network does not work. I think that the mac change on
>> standby device should be probably refused, no?
>
>Yes. we should block changing standby device mac.
>
>> When I change the mac back, all works fine.
>
>This is strange. So you had to change the standby device mac twice to
>get dhcp working.

Yes. First to some other mac to not to confuse initscripts, second back
to the "failover mac" in order to make frames go through (I'm guessing
that virtio_net also has mac filter for incoming frames).


>
>I do see NetworkManager trying to get dhcp address on standby device, but

Not using NM. Just good old Fedora initscripts.


>i don't see any issue with connectivity.  To be totally transparent, we
>need to only expose one netdev.

Yep.


>
>> 
>> Now I try to change mac of the failover master:
>> [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> RTNETLINK answers: Operation not supported
>> 
>> That I did expect to work. I would expect this would change the mac of
>> the master and both standby and primary slaves.
>
>If a VF is untrusted, a VM will not able to change its MAC and moreover

Note that at this point, I have no VF. So I'm not sure why you mention
that.


>in this mode we are assuming that the hypervisor has assigned the MAC and
>guest is not expected to change the MAC.

Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
mac and all works fine. How is this differe

Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-30 Thread Jiri Pirko
Mon, Apr 30, 2018 at 05:00:33AM CEST, sridhar.samudr...@intel.com wrote:
>
>On 4/28/2018 1:24 AM, Jiri Pirko wrote:
>> Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>> > This patch enables virtio_net to switch over to a VF datapath when a VF
>> > netdev is present with the same MAC address. It allows live migration
>> > of a VM with a direct attached VF without the need to setup a bond/team
>> > between a VF and virtio net device in the guest.
>> > 
>> > The hypervisor needs to enable only one datapath at any time so that
>> > packets don't get looped back to the VM over the other datapath. When a VF
>> Why? Both datapaths could be enabled at a time. Why the loop on
>> hypervisor side would be a problem. This in not an issue for
>> bonding/team as well.
>
>Somehow the hypervisor needs to make sure that the broadcasts/multicasts from 
>the VM
>sent over the VF datapath don't get looped back to the VM via the virtio-net 
>datapth.

Why? Please see below.


>This can happen if both datapaths are enabled at the same time.
>
>I would think this is an issue even with bonding/team as well when virtio-net 
>and
>VF are backed by the same PF.
>
>

I believe that the scenario is the same as on an ordinary nic/swich
network:

...

  host
 
   bond0
  / \
eth0   eth1
 |   |
...
 |   |
 p1  p2

  switch

...

It is perfectly valid to p1 and p2 be up and "bridged" together. Bond
has to cope with loop-backed frames. "Failover driver" should too,
it's the same scenario.


>> 
>> 
>> > is plugged, the virtio datapath link state can be marked as down. The
>> > hypervisor needs to unplug the VF device from the guest on the source host
>> > and reset the MAC filter of the VF to initiate failover of datapath to
>> "reset the MAC filter of the VF" - you mean "set the VF mac"?
>
>Yes.  the PF should take away the MAC address assigned to the VF so that the PF
>starts receiving those packets.

Okay, got it. Please put this in the description.


>
>> 
>> 
>> > virtio before starting the migration. After the migration is completed,
>> > the destination hypervisor sets the MAC filter on the VF and plugs it back
>> > to the guest to switch over to VF datapath.
>> > 
>> > It uses the generic failover framework that provides 2 functions to create
>> > and destroy a master failover netdev. When STANDBY feature is enabled, an
>> > additional netdev(failover netdev) is created that acts as a master device
>> > and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>> > is marked as 'standby' netdev and a passthru device with the same MAC is
>> > registered as 'primary' netdev.
>> > 
>> > This patch is based on the discussion initiated by Jesse on this thread.
>> > https://marc.info/?l=linux-virtualization=151189725224231=2
>> [...]
>> 
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit

2018-04-30 Thread Jiri Pirko
Mon, Apr 30, 2018 at 04:47:03AM CEST, sridhar.samudr...@intel.com wrote:
>On 4/28/2018 12:50 AM, Jiri Pirko wrote:
>> Fri, Apr 27, 2018 at 07:06:57PM CEST,sridhar.samudr...@intel.com  wrote:
>> > This feature bit can be used by hypervisor to indicate virtio_net device to
>> > act as a standby for another device with the same MAC address.
>> > 
>> > VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
>> > 
>> > Signed-off-by: Sridhar Samudrala<sridhar.samudr...@intel.com>
>> > ---
>> > drivers/net/virtio_net.c| 2 +-
>> > include/uapi/linux/virtio_net.h | 3 +++
>> > 2 files changed, 4 insertions(+), 1 deletion(-)
>> > 
>> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> > index 3b5991734118..51a085b1a242 100644
>> > --- a/drivers/net/virtio_net.c
>> > +++ b/drivers/net/virtio_net.c
>> > @@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
>> >VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>> >VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> >VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> > -  VIRTIO_NET_F_SPEED_DUPLEX
>> > +  VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
>> This is not part of current qemu master (head 
>> 6f0c4706b35dead265509115ddbd2a8d1af516c1)
>> Were I can find the qemu code?
>> 
>> Also, I think it makes sense to push HW (qemu HW in this case) first
>> and only then the driver.
>
>I had sent qemu patch with a couple of earlier versions of this patchset.
>Will include it when i send out v10.

The point was, don't you want to push it to qemu first? Did you at least
send RFC to qemu?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-29 Thread Jiri Pirko
Sun, Apr 29, 2018 at 10:56:30AM CEST, losewe...@gmail.com wrote:
>On Sat, Apr 28, 2018 at 2:42 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>>>This patch enables virtio_net to switch over to a VF datapath when a VF
>>>netdev is present with the same MAC address. It allows live migration
>>>of a VM with a direct attached VF without the need to setup a bond/team
>>>between a VF and virtio net device in the guest.
>>>
>>>The hypervisor needs to enable only one datapath at any time so that
>>>packets don't get looped back to the VM over the other datapath. When a VF
>>>is plugged, the virtio datapath link state can be marked as down. The
>>>hypervisor needs to unplug the VF device from the guest on the source host
>>>and reset the MAC filter of the VF to initiate failover of datapath to
>>>virtio before starting the migration. After the migration is completed,
>>>the destination hypervisor sets the MAC filter on the VF and plugs it back
>>>to the guest to switch over to VF datapath.
>>>
>>>It uses the generic failover framework that provides 2 functions to create
>>>and destroy a master failover netdev. When STANDBY feature is enabled, an
>>>additional netdev(failover netdev) is created that acts as a master device
>>>and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>>>is marked as 'standby' netdev and a passthru device with the same MAC is
>>>registered as 'primary' netdev.
>>>
>>>This patch is based on the discussion initiated by Jesse on this thread.
>>>https://marc.info/?l=linux-virtualization=151189725224231=2
>>>
>>
>> When I enabled the standby feature (hardcoded), I have 2 netdevices now:
>> 4: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
>> group default qlen 1000
>> link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
>> inet6 fe80::5054:ff:feb2:a7f1/64 scope link
>>valid_lft forever preferred_lft forever
>> 5: ens3n_sby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel 
>> state UP group default qlen 1000
>> link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
>> inet6 fe80::5054:ff:feb2:a7f1/64 scope link
>>valid_lft forever preferred_lft forever
>>
>> However, it seems to confuse my initscripts on Fedora:
>> [root@test1 ~]# ifup ens3
>> ./network-functions: line 78: [: /etc/dhcp/dhclient-ens3: binary operator 
>> expected
>> ./network-functions: line 80: [: /etc/dhclient-ens3: binary operator expected
>> ./network-functions: line 69: [: /var/lib/dhclient/dhclient-ens3: binary 
>> operator expected
>>
>You should teach Fedora and all cloud vendors to upgrade their
>initscripts and other userspace tools, no?

I just wanted to point out that the conversion from "nostandby" to
"standby" isn't always that smooth as claimed. The claim was "no change
for the current user" iirc.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF
>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to
>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>It uses the generic failover framework that provides 2 functions to create
>and destroy a master failover netdev. When STANDBY feature is enabled, an
>additional netdev(failover netdev) is created that acts as a master device
>and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>is marked as 'standby' netdev and a passthru device with the same MAC is
>registered as 'primary' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization=151189725224231=2
>

When I enabled the standby feature (hardcoded), I have 2 netdevices now:
4: ens3:  mtu 1500 qdisc noqueue state UP 
group default qlen 1000
link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
inet6 fe80::5054:ff:feb2:a7f1/64 scope link 
   valid_lft forever preferred_lft forever
5: ens3n_sby:  mtu 1500 qdisc fq_codel state 
UP group default qlen 1000
link/ether 52:54:00:b2:a7:f1 brd ff:ff:ff:ff:ff:ff
inet6 fe80::5054:ff:feb2:a7f1/64 scope link 
   valid_lft forever preferred_lft forever

However, it seems to confuse my initscripts on Fedora:
[root@test1 ~]# ifup ens3
./network-functions: line 78: [: /etc/dhcp/dhclient-ens3: binary operator 
expected
./network-functions: line 80: [: /etc/dhclient-ens3: binary operator expected
./network-functions: line 69: [: /var/lib/dhclient/dhclient-ens3: binary 
operator expected

Determining IP information for ens3
ens3n_sby...Cannot find device "ens3n_sby.pid"
Cannot find device "ens3n_sby.lease"
 failed.

I tried to change the standby device mac:
ip link set ens3n_sby addr 52:54:00:b2:a7:f2
[root@test1 ~]# ifup ens3

Determining IP information for ens3... done.
[root@test1 ~]#

But now the network does not work. I think that the mac change on
standby device should be probably refused, no?

When I change the mac back, all works fine.


Now I try to change mac of the failover master:
[root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
RTNETLINK answers: Operation not supported

That I did expect to work. I would expect this would change the mac of
the master and both standby and primary slaves.


Now I tried to add a primary pci device. I don't have any fancy VF on my
test setup, but I expected the good old 8139cp to work:
[root@test1 ~]# ethtool -i ens9
driver: 8139cp

[root@test1 ~]# ip link set ens9 addr 52:54:00:b2:a7:f1

I see no message in dmesg, so I guess the failover module did not
enslave this netdev. The mac change is not monitored. I would expect
that it is and whenever a device changes mac to the failover one, it
should be enslaved and whenever it changes mac back to something else,
it should be released - the primary one ofcourse.



[...]

>+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>+size_t len)
>+{
>+  struct virtnet_info *vi = netdev_priv(dev);
>+  int ret;
>+
>+  if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_STANDBY))
>+  return -EOPNOTSUPP;
>+
>+  ret = snprintf(buf, len, "_sby");

please avoid the "_".

[...]
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>   the failover module provides interfaces to create/destroy additional
>   master netdev and all the slave events are managed internally.
>net_failover_create()
>net_failover_destroy()
>   A failover netdev is created that acts a master device and controls 2
>   slave devices. The original virtio_net netdev is registered as 'standby'
>   netdev and a passthru/vf device with the same MAC gets registered as
>   'primary' netdev. Both 'standby' and 'primary' netdevs are associated
>   with the same 'pci' device.  The user accesses the network interface via
>   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>   default for transmits when it is available with link up and running.
>2. For existing netvsc driver that uses 2 netdev model, no master netdev
>   is created. The paravirtual driver registers each instance of netvsc
>   as a 'failover' netdev  along with a set of ops to manage the slave
>   events. There is no 'standby' netdev in this model. A passthru/vf device
>   with the same MAC gets registered as 'primary' netdev.
>net_failover_register()
>net_failover_unregister()
>

First of all, I like this v9 very much. Nice progress!
Couple of notes inlined.


>Signed-off-by: Sridhar Samudrala 
>---
> include/linux/netdevice.h  |  16 +
> include/net/net_failover.h |  62 
> net/Kconfig|  10 +
> net/core/Makefile  |   1 +
> net/core/net_failover.c| 892 +
> 5 files changed, 981 insertions(+)
> create mode 100644 include/net/net_failover.h
> create mode 100644 net/core/net_failover.c

[...]


>+static int net_failover_slave_register(struct net_device *slave_dev)
>+{
>+  struct net_failover_info *nfo_info;
>+  struct net_failover_ops *nfo_ops;
>+  struct net_device *failover_dev;
>+  bool slave_is_standby;
>+  u32 orig_mtu;
>+  int err;
>+
>+  ASSERT_RTNL();
>+
>+  failover_dev = net_failover_get_bymac(slave_dev->perm_addr, _ops);
>+  if (!failover_dev)
>+  goto done;
>+
>+  if (failover_dev->type != slave_dev->type)
>+  goto done;
>+
>+  if (nfo_ops && nfo_ops->slave_register)
>+  return nfo_ops->slave_register(slave_dev, failover_dev);
>+
>+  nfo_info = netdev_priv(failover_dev);
>+  slave_is_standby = (slave_dev->dev.parent == failover_dev->dev.parent);

No parentheses needed.


>+  if (slave_is_standby ? rtnl_dereference(nfo_info->standby_dev) :
>+  rtnl_dereference(nfo_info->primary_dev)) {
>+  netdev_err(failover_dev, "%s attempting to register as slave 
>dev when %s already present\n",
>+ slave_dev->name,
>+ slave_is_standby ? "standby" : "primary");
>+  goto done;
>+  }
>+
>+  /* We want to allow only a direct attached VF device as a primary
>+   * netdev. As there is no easy way to check for a VF device, restrict
>+   * this to a pci device.
>+   */
>+  if (!slave_is_standby && (!slave_dev->dev.parent ||
>+!dev_is_pci(slave_dev->dev.parent)))

Yeah, this is good for now.


>+  goto done;
>+
>+  if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
>+  vlan_uses_dev(failover_dev)) {
>+  netdev_err(failover_dev, "Device %s is VLAN challenged and 
>failover device has VLAN set up\n",
>+ failover_dev->name);
>+  goto done;
>+  }
>+
>+  /* Align MTU of slave with failover dev */
>+  orig_mtu = slave_dev->mtu;
>+  err = dev_set_mtu(slave_dev, failover_dev->mtu);
>+  if (err) {
>+  netdev_err(failover_dev, "unable to change mtu of %s to %u 
>register failed\n",
>+ slave_dev->name, failover_dev->mtu);
>+  goto done;
>+  }
>+
>+  dev_hold(slave_dev);
>+
>+  if (netif_running(failover_dev)) {
>+  err = dev_open(slave_dev);
>+  if (err && (err != -EBUSY)) {
>+  netdev_err(failover_dev, "Opening slave %s failed 
>err:%d\n",
>+ slave_dev->name, err);
>+  goto err_dev_open;
>+  }
>+  }
>+
>+  netif_addr_lock_bh(failover_dev);
>+  dev_uc_sync_multiple(slave_dev, failover_dev);
>+  dev_uc_sync_multiple(slave_dev, failover_dev);
>+  netif_addr_unlock_bh(failover_dev);
>+

Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:59PM CEST, sridhar.samudr...@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF

Why? Both datapaths could be enabled at a time. Why the loop on
hypervisor side would be a problem. This in not an issue for
bonding/team as well.


>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to

"reset the MAC filter of the VF" - you mean "set the VF mac"?


>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>It uses the generic failover framework that provides 2 functions to create
>and destroy a master failover netdev. When STANDBY feature is enabled, an
>additional netdev(failover netdev) is created that acts as a master device
>and tracks the state of the 2 lower netdevs. The original virtio_net netdev
>is marked as 'standby' netdev and a passthru device with the same MAC is
>registered as 'primary' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization=151189725224231=2

[...]

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>   the failover module provides interfaces to create/destroy additional
>   master netdev and all the slave events are managed internally.
>net_failover_create()
>net_failover_destroy()
>   A failover netdev is created that acts a master device and controls 2
>   slave devices. The original virtio_net netdev is registered as 'standby'
>   netdev and a passthru/vf device with the same MAC gets registered as
>   'primary' netdev. Both 'standby' and 'primary' netdevs are associated
>   with the same 'pci' device.  The user accesses the network interface via

'standby' and 'primary' netdevs are not associated with the same 'pci'
device.
"Primary" is the VF netdevice and "standby" is virtio_net. Each
associated with different pci device.


>   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>   default for transmits when it is available with link up and running.
>2. For existing netvsc driver that uses 2 netdev model, no master netdev
>   is created. The paravirtual driver registers each instance of netvsc
>   as a 'failover' netdev  along with a set of ops to manage the slave
>   events. There is no 'standby' netdev in this model. A passthru/vf device
>   with the same MAC gets registered as 'primary' netdev.
>net_failover_register()
>net_failover_unregister()

[...]
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 1/4] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit

2018-04-28 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:57PM CEST, sridhar.samudr...@intel.com wrote:
>This feature bit can be used by hypervisor to indicate virtio_net device to
>act as a standby for another device with the same MAC address.
>
>VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
>
>Signed-off-by: Sridhar Samudrala 
>---
> drivers/net/virtio_net.c| 2 +-
> include/uapi/linux/virtio_net.h | 3 +++
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>index 3b5991734118..51a085b1a242 100644
>--- a/drivers/net/virtio_net.c
>+++ b/drivers/net/virtio_net.c
>@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
>   VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>   VIRTIO_NET_F_CTRL_MAC_ADDR, \
>   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>-  VIRTIO_NET_F_SPEED_DUPLEX
>+  VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY


This is not part of current qemu master (head 
6f0c4706b35dead265509115ddbd2a8d1af516c1)
Were I can find the qemu code?

Also, I think it makes sense to push HW (qemu HW in this case) first
and only then the driver.



> 
> static unsigned int features[] = {
>   VIRTNET_FEATURES,
>diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>index 5de6ed37695b..a3715a3224c1 100644
>--- a/include/uapi/linux/virtio_net.h
>+++ b/include/uapi/linux/virtio_net.h
>@@ -57,6 +57,9 @@
>* Steering */
> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
> 
>+#define VIRTIO_NET_F_STANDBY62/* Act as standby for another device
>+   * with the same MAC.
>+   */
> #define VIRTIO_NET_F_SPEED_DUPLEX 63  /* Device set linkspeed and duplex */
> 
> #ifndef VIRTIO_NET_NO_LEGACY
>-- 
>2.14.3
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 0/4] Enable virtio_net to act as a standby for a passthru device

2018-04-27 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:53:01PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/27/2018 10:45 AM, Jiri Pirko wrote:
>> Fri, Apr 27, 2018 at 07:06:56PM CEST, sridhar.samudr...@intel.com wrote:

[...]

>> 
>> No changes in v9?
>
>I listed v9 updates at the start of the message.

Hmm, odd. I expected that at the end, in the changelog among other Vs
changes.

Will review this patchset tomorrow.
Thanks!
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 2/4] net: Introduce generic failover module

2018-04-27 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>   the failover module provides interfaces to create/destroy additional
>   master netdev and all the slave events are managed internally.
>net_failover_create()
>net_failover_destroy()
>   A failover netdev is created that acts a master device and controls 2
>   slave devices. The original virtio_net netdev is registered as 'standby'
>   netdev and a passthru/vf device with the same MAC gets registered as
>   'primary' netdev. Both 'standby' and 'primary' netdevs are associated
>   with the same 'pci' device.  The user accesses the network interface via
>   'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>   default for transmits when it is available with link up and running.
>2. For existing netvsc driver that uses 2 netdev model, no master netdev
>   is created. The paravirtual driver registers each instance of netvsc
>   as a 'failover' netdev  along with a set of ops to manage the slave
>   events. There is no 'standby' netdev in this model. A passthru/vf device
>   with the same MAC gets registered as 'primary' netdev.
>net_failover_register()
>net_failover_unregister()
>
>Signed-off-by: Sridhar Samudrala 
>---
> include/linux/netdevice.h  |  16 +
> include/net/net_failover.h |  62 
> net/Kconfig|  10 +
> net/core/Makefile  |   1 +
> net/core/net_failover.c| 892 +
> 5 files changed, 981 insertions(+)
> create mode 100644 include/net/net_failover.h
> create mode 100644 net/core/net_failover.c

checkpatch says:

_exportax/0002-net-Introduce-generic-failover-module.patch
--
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#92: 
new file mode 100644

Please add an entry to the MAINTAINERS file.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v9 0/4] Enable virtio_net to act as a standby for a passthru device

2018-04-27 Thread Jiri Pirko
Fri, Apr 27, 2018 at 07:06:56PM CEST, sridhar.samudr...@intel.com wrote:
>v9:
>Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET 
>are enabled. (stephen)
>
>Tested live migration with virtio-net/AVF(i40evf) configured in 
>failover mode while running iperf in background.
>Build tested netvsc module.
>
>The main motivation for this patch is to enable cloud service providers
>to provide an accelerated datapath to virtio-net enabled VMs in a 
>transparent manner with no/minimal guest userspace changes. This also
>enables hypervisor controlled live migration to be supported with VMs that
>have direct attached SR-IOV VF devices.
>
>Patch 1 introduces a new feature bit VIRTIO_NET_F_STANDBY that can be
>used by hypervisor to indicate that virtio_net interface should act as
>a standby for another device with the same MAC address.
>
>Patch 2 introduces a failover module that provides a generic interface for 
>paravirtual drivers to listen for netdev register/unregister/link change
>events from pci ethernet devices with the same MAC and takeover their
>datapath. The notifier and event handling code is based on the existing
>netvsc implementation. It provides 2 sets of interfaces to paravirtual 
>drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.
>
>Patch 3 extends virtio_net to use alternate datapath when available and
>registered. When STANDBY feature is enabled, virtio_net driver creates
>an additional 'failover' netdev that acts as a master device and controls
>2 slave devices.  The original virtio_net netdev is registered as
>'standby' netdev and a passthru/vf device with the same MAC gets
>registered as 'primary' netdev. Both 'standby' and 'primary' netdevs are
>associated with the same 'pci' device.  The user accesses the network
>interface via 'failover' netdev. The 'failover' netdev chooses 'primary'
>netdev as default for transmits when it is available with link up and
>running.
>
>Patch 4 refactors netvsc to use the registration/notification framework
>supported by failover module.
>
>As this patch series is initially focusing on usecases where hypervisor 
>fully controls the VM networking and the guest is not expected to directly 
>configure any hardware settings, it doesn't expose all the ndo/ethtool ops
>that are supported by virtio_net at this time. To support additional usecases,
>it should be possible to enable additional ops later by caching the state
>in virtio netdev and replaying when the 'primary' netdev gets registered. 
> 
>The hypervisor needs to enable only one datapath at any time so that packets
>don't get looped back to the VM over the other datapath. When a VF is
>plugged, the virtio datapath link state can be marked as down.
>At the time of live migration, the hypervisor needs to unplug the VF device
>from the guest on the source host and reset the MAC filter of the VF to
>initiate failover of datapath to virtio before starting the migration. After
>the migration is completed, the destination hypervisor sets the MAC filter
>on the VF and plugs it back to the guest to switch over to VF datapath.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization=151189725224231=2


No changes in v9?


>
>v8:
>- Made the failover managment routines more robust by updating the feature 
>  bits/other fields in the failover netdev when slave netdevs are 
>  registered/unregistered. (mst)
>- added support for handling vlans.
>- Limited the changes in netvsc to only use the notifier/event/lookups
>  from the failover module. The slave register/unregister/link-change 
>  handlers are only updated to use the getbymac routine to get the 
>  upper netdev. There is no change in their functionality. (stephen)
>- renamed structs/function/file names to use net_failover prefix. (mst)
>
>v7
>- Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
>  (jiri, mst)
>- re-arranged dev_open() and dev_set_mtu() calls in the register routines
>  so that they don't get called for 2-netdev model. (stephen)
>- fixed select_queue() routine to do queue selection based on VF if it is
>  registered as primary. (stephen)
>-  minor bugfixes
>
>v6 RFC:
>  Simplified virtio_net changes by moving all the ndo_ops of the 
>  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
>  avoided 2 phase registration(driver + instances).
>  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
>  replaced mutex with a spinlock
>
>v5 RFC:
>  Based on Jiri's comments, moved the common functionality to a 'bypass'
>  module so that the same notifier and event handlers to handle child
>  register/unregister/link change events can be shared between virtio_net
>  and netvsc.
>  Improved error handling based on Siwei's comments.
>v4:
>- Based on the review comments on the v3 version of the RFC patch and
>  Jakub's suggestion for the naming issue with 3 netdev solution,
>  proposed 3 netdev in-driver bonding solution for 

Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-23 Thread Jiri Pirko
Mon, Apr 23, 2018 at 07:04:06PM CEST, step...@networkplumber.org wrote:
>On Fri, 20 Apr 2018 18:00:58 +0200
>Jiri Pirko <j...@resnulli.us> wrote:
>
>> Fri, Apr 20, 2018 at 05:28:02PM CEST, step...@networkplumber.org wrote:
>> >On Thu, 19 Apr 2018 18:42:04 -0700
>> >Sridhar Samudrala <sridhar.samudr...@intel.com> wrote:
>> >  
>> >> Use the registration/notification framework supported by the generic
>> >> failover infrastructure.
>> >> 
>> >> Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>  
>> >
>> >Do what you want to other devices but leave netvsc alone.
>> >Adding these failover ops does not reduce the code size, and really is
>> >no benefit.  The netvsc device driver needs to be backported to several
>> >other distributions and doing this makes that harder.  
>> 
>> We should not care about the backport burden when we are trying to make
>> things right. And things are not right. The current netvsc approach is
>> just plain wrong shortcut. It should have been done in a generic way
>> from the very beginning. We are just trying to fix this situation.
>> 
>> Moreover, I believe that part of the fix is to convert netvsc to 3
>> netdev solution too. 2 netdev model is wrong.
>> 
>> 
>> >
>> >I will NAK patches to change to common code for netvsc especially the
>> >three device model.  MS worked hard with distro vendors to support 
>> >transparent
>> >mode, ans we really can't have a new model; or do backport.
>> >
>> >Plus, DPDK is now dependent on existing model.  
>> 
>> Sorry, but nobody here cares about dpdk or other similar oddities.
>
>The network device model is a userspace API, and DPDK is a userspace 
>application.
>You can't go breaking userspace even if you don't like the application.

I don't understand how you can break anything by exposing
just-another-netdevice. If you break it, it is already broken...

And how you can break anything in userspace by doing refactoring inside
the kernel is puzzling me even more...
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-20 Thread Jiri Pirko
Fri, Apr 20, 2018 at 05:28:02PM CEST, step...@networkplumber.org wrote:
>On Thu, 19 Apr 2018 18:42:04 -0700
>Sridhar Samudrala  wrote:
>
>> Use the registration/notification framework supported by the generic
>> failover infrastructure.
>> 
>> Signed-off-by: Sridhar Samudrala 
>
>Do what you want to other devices but leave netvsc alone.
>Adding these failover ops does not reduce the code size, and really is
>no benefit.  The netvsc device driver needs to be backported to several
>other distributions and doing this makes that harder.

We should not care about the backport burden when we are trying to make
things right. And things are not right. The current netvsc approach is
just plain wrong shortcut. It should have been done in a generic way
from the very beginning. We are just trying to fix this situation.

Moreover, I believe that part of the fix is to convert netvsc to 3
netdev solution too. 2 netdev model is wrong.


>
>I will NAK patches to change to common code for netvsc especially the
>three device model.  MS worked hard with distro vendors to support transparent
>mode, ans we really can't have a new model; or do backport.
>
>Plus, DPDK is now dependent on existing model.

Sorry, but nobody here cares about dpdk or other similar oddities.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-19 Thread Jiri Pirko
Thu, Apr 19, 2018 at 06:08:58AM CEST, m...@redhat.com wrote:
>On Wed, Apr 18, 2018 at 10:32:06PM +0200, Jiri Pirko wrote:
>> >> >> > With regards to alternate names for 'active', you suggested 
>> >> >> > 'stolen', but i
>> >> >> > am not too happy with it.
>> >> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 
>> >> >> > 'passthru'
>> >> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> >> I think "stolen" is quite appropriate since it describes the modus
>> >> >> operandi. The bypass master steals some netdevice according to some
>> >> >> match.
>> >> >> 
>> >> >> But I don't insist on "stolen". Just sounds right.
>> >> >
>> >> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this 
>> >> >feature, So i think
>> >> >'backup' name is consistent.
>> >> 
>> >> It perhaps makes sense from the view of virtio device. However, as I
>> >> described couple of times, for master/slave device the name "backup" is
>> >> highly misleading.
>> >
>> >virtio is the backup. You are supposed to use another
>> >(typically passthrough) device, if that fails use virtio.
>> >It does seem appropriate to me. If you like, we can
>> >change that to "standby".  Active I don't like either. "main"?
>> 
>> Sounds much better, yes.
>
>Excuse me, which of the versions are better in your eyes?

standby is okay. main/primary is fine too.

>
>
>> 
>> >
>> >In fact would failover be better than bypass?
>> 
>> Also, much better.
>> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-19 Thread Jiri Pirko
Thu, Apr 19, 2018 at 12:46:11AM CEST, sridhar.samudr...@intel.com wrote:
>On 4/18/2018 1:32 PM, Jiri Pirko wrote:
>> > > > > > > You still use "active"/"backup" names which is highly misleading 
>> > > > > > > as
>> > > > > > > it has completely different meaning that in bond for example.
>> > > > > > > I noted that in my previous review already. Please change it.
>> > > > > > I guess the issue is with only the 'active'  name. 'backup' should 
>> > > > > > be fine as it also
>> > > > > > matches with the BACKUP feature bit we are adding to virtio_net.
>> > > > > I think that "backup" is also misleading. Both "active" and "backup"
>> > > > > mean a *state* of slaves. This should be named differently.
>> > > > > 
>> > > > > 
>> > > > > 
>> > > > > > With regards to alternate names for 'active', you suggested 
>> > > > > > 'stolen', but i
>> > > > > > am not too happy with it.
>> > > > > > netvsc uses vf_netdev, are you OK with this? Or another option is 
>> > > > > > 'passthru'
>> > > > > No. The netdev could be any netdevice. It does not have to be a "VF".
>> > > > > I think "stolen" is quite appropriate since it describes the modus
>> > > > > operandi. The bypass master steals some netdevice according to some
>> > > > > match.
>> > > > > 
>> > > > > But I don't insist on "stolen". Just sounds right.
>> > > > We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this 
>> > > > feature, So i think
>> > > > 'backup' name is consistent.
>> > > It perhaps makes sense from the view of virtio device. However, as I
>> > > described couple of times, for master/slave device the name "backup" is
>> > > highly misleading.
>> > virtio is the backup. You are supposed to use another
>> > (typically passthrough) device, if that fails use virtio.
>> > It does seem appropriate to me. If you like, we can
>> > change that to "standby".  Active I don't like either. "main"?
>> Sounds much better, yes.
>
>OK. Will change backup to 'standby'.
>'main' is fine, what about 'primary'?

Primary is also bonding terminology. But in this case, I think it would
fit. The primary slave is used as the active one whenever the link is
up.


>
>
>> 
>> 
>> > In fact would failover be better than bypass?
>> Also, much better.
>
>So do we want to change all 'bypass' references to 'failover' including
>the filenames.(net/core/failover.c and include/net/failover.h)
>
>
>
>
>
>> 
>> 
>> > 
>> > > > The intent is to restrict the 'active' netdev to be a VF. If there is 
>> > > > a way to check that
>> > > > a PCI device is a VF in the guest kernel, we could restrict 'active' 
>> > > > netdev to be a VF.
>> > > > 
>> > > > Will look for any suggestions in the next day or two. If i don't get 
>> > > > any, i will go
>> > > > with 'stolen'
>> > > > 
>> > > > 
>> > > > 
>> > > > 
>> > > > > +
>> > > > > +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> > > > > +  struct bypass_ops 
>> > > > > **ops)
>> > > > > +{
>> > > > > +struct bypass_master *bypass_master;
>> > > > > +struct net_device *bypass_netdev;
>> > > > > +
>> > > > > +spin_lock(_lock);
>> > > > > +list_for_each_entry(bypass_master, _master_list, list) {
>> > > > > > > As I wrote the last time, you don't need this list, spinlock.
>> > > > > > > You can do just something like:
>> > > > > > >for_each_net(net) {
>> > > > > > >for_each_netdev(net, dev) {
>> > > > > > >  if (netif_is_bypass_master(dev)) {
>> > > > > > This function returns the upper netdev as well as the ops 
>> > > > > > associated
>> > > > > > with that netdev.
>> > > > > >

Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-18 Thread Jiri Pirko
Wed, Apr 18, 2018 at 09:46:04PM CEST, m...@redhat.com wrote:
>On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudr...@intel.com wrote:
>> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudr...@intel.com wrote:
>> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudr...@intel.com 
>> >> > > wrote:
>> >> > > > This provides a generic interface for paravirtual drivers to listen
>> >> > > > for netdev register/unregister/link change events from pci ethernet
>> >> > > > devices with the same MAC and takeover their datapath. The notifier 
>> >> > > > and
>> >> > > > event handling code is based on the existing netvsc implementation.
>> >> > > > 
>> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, 
>> >> > > > no
>> >> > > > master netdev is created. The paravirtual driver registers each 
>> >> > > > bypass
>> >> > > > instance along with a set of ops to manage the slave events.
>> >> > > >   bypass_master_register()
>> >> > > >   bypass_master_unregister()
>> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this 
>> >> > > > model,
>> >> > > > the bypass module provides interfaces to create/destroy additional 
>> >> > > > master
>> >> > > > netdev and all the slave events are managed internally.
>> >> > > >bypass_master_create()
>> >> > > >bypass_master_destroy()
>> >> > > > 
>> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> >> > > > ---
>> >> > > > include/linux/netdevice.h |  14 +
>> >> > > > include/net/bypass.h  |  96 ++
>> >> > > > net/Kconfig   |  18 +
>> >> > > > net/core/Makefile |   1 +
>> >> > > > net/core/bypass.c | 844 
>> >> > > > ++
>> >> > > > 5 files changed, 973 insertions(+)
>> >> > > > create mode 100644 include/net/bypass.h
>> >> > > > create mode 100644 net/core/bypass.c
>> >> > > > 
>> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> >> > > > index cf44503ea81a..587293728f70 100644
>> >> > > > --- a/include/linux/netdevice.h
>> >> > > > +++ b/include/linux/netdevice.h
>> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> >> > > > IFF_PHONY_HEADROOM  = 1<<24,
>> >> > > > IFF_MACSEC  = 1<<25,
>> >> > > > IFF_NO_RX_HANDLER   = 1<<26,
>> >> > > > +   IFF_BYPASS  = 1 << 27,
>> >> > > > +   IFF_BYPASS_SLAVE= 1 << 28,
>> >> > > I wonder, why you don't follow the existing coding style... Also, 
>> >> > > please
>> >> > > add these to into the comment above.
>> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I 
>> >> > can switch back
>> >> > to the existing coding style to be consistent.
>> >> Please do.
>> >> 
>> >> 
>> >> > > 
>> >> > > > };
>> >> > > > 
>> >> > > > #define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
>> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> >> > > > #define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
>> >> > > > #define IFF_MACSEC  IFF_MACSEC
>> >> > > > #define IFF_NO_RX_HANDLER   IFF_NO_RX_HANDLER
>> >> > > > +#define IFF_BYPASS IFF_BYPASS
>> >> > > > +#define IFF_BYPASS_SLAVE   IFF_BYPASS_SLAVE
>> >> > > > 
>> >> > > > /**
>> 

Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-18 Thread Jiri Pirko
Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudr...@intel.com wrote:
>> > > > This provides a generic interface for paravirtual drivers to listen
>> > > > for netdev register/unregister/link change events from pci ethernet
>> > > > devices with the same MAC and takeover their datapath. The notifier and
>> > > > event handling code is based on the existing netvsc implementation.
>> > > > 
>> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > > > master netdev is created. The paravirtual driver registers each bypass
>> > > > instance along with a set of ops to manage the slave events.
>> > > >   bypass_master_register()
>> > > >   bypass_master_unregister()
>> > > > 2. new virtio_net based solution that uses 3 netdev model. In this 
>> > > > model,
>> > > > the bypass module provides interfaces to create/destroy additional 
>> > > > master
>> > > > netdev and all the slave events are managed internally.
>> > > >bypass_master_create()
>> > > >bypass_master_destroy()
>> > > > 
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> > > > ---
>> > > > include/linux/netdevice.h |  14 +
>> > > > include/net/bypass.h  |  96 ++
>> > > > net/Kconfig   |  18 +
>> > > > net/core/Makefile |   1 +
>> > > > net/core/bypass.c | 844 
>> > > > ++
>> > > > 5 files changed, 973 insertions(+)
>> > > > create mode 100644 include/net/bypass.h
>> > > > create mode 100644 net/core/bypass.c
>> > > > 
>> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > > > index cf44503ea81a..587293728f70 100644
>> > > > --- a/include/linux/netdevice.h
>> > > > +++ b/include/linux/netdevice.h
>> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> > > >IFF_PHONY_HEADROOM  = 1<<24,
>> > > >IFF_MACSEC  = 1<<25,
>> > > >IFF_NO_RX_HANDLER   = 1<<26,
>> > > > +  IFF_BYPASS  = 1 << 27,
>> > > > +  IFF_BYPASS_SLAVE= 1 << 28,
>> > > I wonder, why you don't follow the existing coding style... Also, please
>> > > add these to into the comment above.
>> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can 
>> > switch back
>> > to the existing coding style to be consistent.
>> Please do.
>> 
>> 
>> > > 
>> > > > };
>> > > > 
>> > > > #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
>> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > > > #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED
>> > > > #define IFF_MACSEC IFF_MACSEC
>> > > > #define IFF_NO_RX_HANDLER  IFF_NO_RX_HANDLER
>> > > > +#define IFF_BYPASSIFF_BYPASS
>> > > > +#define IFF_BYPASS_SLAVE  IFF_BYPASS_SLAVE
>> > > > 
>> > > > /**
>> > > >*   struct net_device - The DEVICE structure.
>> > > > @@ -4308,6 +4312,16 @@ static inline bool 
>> > > > netif_is_rxfh_configured(const struct net_device *dev)
>> > > >return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > > > }
>> > > > 
>> > > > +static inline bool netif_is_bypass_master(const struct net_device 
>> > > > *dev)
>> > > > +{
>> > > > +  return dev->priv_flags & IFF_BYPASS;
>> > > > +}
>> > > > +
>> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > > > +{
>> > > > +  return dev->priv_flags & IFF_BYPASS_SLAVE;
>>

Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-18 Thread Jiri Pirko
Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudr...@intel.com wrote:
>> > This provides a generic interface for paravirtual drivers to listen
>> > for netdev register/unregister/link change events from pci ethernet
>> > devices with the same MAC and takeover their datapath. The notifier and
>> > event handling code is based on the existing netvsc implementation.
>> > 
>> > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > master netdev is created. The paravirtual driver registers each bypass
>> > instance along with a set of ops to manage the slave events.
>> >  bypass_master_register()
>> >  bypass_master_unregister()
>> > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> > the bypass module provides interfaces to create/destroy additional master
>> > netdev and all the slave events are managed internally.
>> >   bypass_master_create()
>> >   bypass_master_destroy()
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> > ---
>> > include/linux/netdevice.h |  14 +
>> > include/net/bypass.h  |  96 ++
>> > net/Kconfig   |  18 +
>> > net/core/Makefile |   1 +
>> > net/core/bypass.c | 844 
>> > ++
>> > 5 files changed, 973 insertions(+)
>> > create mode 100644 include/net/bypass.h
>> > create mode 100644 net/core/bypass.c
>> > 
>> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > index cf44503ea81a..587293728f70 100644
>> > --- a/include/linux/netdevice.h
>> > +++ b/include/linux/netdevice.h
>> > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> >IFF_PHONY_HEADROOM  = 1<<24,
>> >IFF_MACSEC  = 1<<25,
>> >IFF_NO_RX_HANDLER   = 1<<26,
>> > +  IFF_BYPASS  = 1 << 27,
>> > +  IFF_BYPASS_SLAVE= 1 << 28,
>> I wonder, why you don't follow the existing coding style... Also, please
>> add these to into the comment above.
>
>To avoid checkpatch warnings. If it is OK to ignore these warnings, I can 
>switch back
>to the existing coding style to be consistent.

Please do.


>
>> 
>> 
>> > };
>> > 
>> > #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
>> > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED
>> > #define IFF_MACSEC IFF_MACSEC
>> > #define IFF_NO_RX_HANDLER  IFF_NO_RX_HANDLER
>> > +#define IFF_BYPASSIFF_BYPASS
>> > +#define IFF_BYPASS_SLAVE  IFF_BYPASS_SLAVE
>> > 
>> > /**
>> >   *struct net_device - The DEVICE structure.
>> > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const 
>> > struct net_device *dev)
>> >return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > }
>> > 
>> > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> > +{
>> > +  return dev->priv_flags & IFF_BYPASS;
>> > +}
>> > +
>> > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > +{
>> > +  return dev->priv_flags & IFF_BYPASS_SLAVE;
>> > +}
>> > +
>> > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() 
>> > */
>> > static inline void netif_keep_dst(struct net_device *dev)
>> > {
>> > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> > new file mode 100644
>> > index ..86b02cb894cf
>> > --- /dev/null
>> > +++ b/include/net/bypass.h
>> > @@ -0,0 +1,96 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +/* Copyright (c) 2018, Intel Corporation. */
>> > +
>> > +#ifndef _NET_BYPASS_H
>> > +#define _NET_BYPASS_H
>> > +
>> > +#include 
>> > +
>> > +struct bypass_ops {
>> > +  int (*slave_pre_register)(struct net_device *slave_netdev,
>> > +struct net_device *bypass_netdev);
>> > +  int (*slave_join)(struct net_device *slave_ne

Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-11 Thread Jiri Pirko
Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. existing netvsc driver that uses 2 netdev model. In this model, no
>master netdev is created. The paravirtual driver registers each bypass
>instance along with a set of ops to manage the slave events.
> bypass_master_register()
> bypass_master_unregister()
>2. new virtio_net based solution that uses 3 netdev model. In this model,
>the bypass module provides interfaces to create/destroy additional master
>netdev and all the slave events are managed internally.
>  bypass_master_create()
>  bypass_master_destroy()
>
>Signed-off-by: Sridhar Samudrala 
>---
> include/linux/netdevice.h |  14 +
> include/net/bypass.h  |  96 ++
> net/Kconfig   |  18 +
> net/core/Makefile |   1 +
> net/core/bypass.c | 844 ++
> 5 files changed, 973 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index cf44503ea81a..587293728f70 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>   IFF_PHONY_HEADROOM  = 1<<24,
>   IFF_MACSEC  = 1<<25,
>   IFF_NO_RX_HANDLER   = 1<<26,
>+  IFF_BYPASS  = 1 << 27,
>+  IFF_BYPASS_SLAVE= 1 << 28,

I wonder, why you don't follow the existing coding style... Also, please
add these to into the comment above.


> };
> 
> #define IFF_802_1Q_VLAN   IFF_802_1Q_VLAN
>@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> #define IFF_RXFH_CONFIGURED   IFF_RXFH_CONFIGURED
> #define IFF_MACSECIFF_MACSEC
> #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
>+#define IFF_BYPASSIFF_BYPASS
>+#define IFF_BYPASS_SLAVE  IFF_BYPASS_SLAVE
> 
> /**
>  *struct net_device - The DEVICE structure.
>@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const 
>struct net_device *dev)
>   return dev->priv_flags & IFF_RXFH_CONFIGURED;
> }
> 
>+static inline bool netif_is_bypass_master(const struct net_device *dev)
>+{
>+  return dev->priv_flags & IFF_BYPASS;
>+}
>+
>+static inline bool netif_is_bypass_slave(const struct net_device *dev)
>+{
>+  return dev->priv_flags & IFF_BYPASS_SLAVE;
>+}
>+
> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> static inline void netif_keep_dst(struct net_device *dev)
> {
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index ..86b02cb894cf
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,96 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include 
>+
>+struct bypass_ops {
>+  int (*slave_pre_register)(struct net_device *slave_netdev,
>+struct net_device *bypass_netdev);
>+  int (*slave_join)(struct net_device *slave_netdev,
>+struct net_device *bypass_netdev);
>+  int (*slave_pre_unregister)(struct net_device *slave_netdev,
>+  struct net_device *bypass_netdev);
>+  int (*slave_release)(struct net_device *slave_netdev,
>+   struct net_device *bypass_netdev);
>+  int (*slave_link_change)(struct net_device *slave_netdev,
>+   struct net_device *bypass_netdev);
>+  rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_master {
>+  struct list_head list;
>+  struct net_device __rcu *bypass_netdev;
>+  struct bypass_ops __rcu *ops;
>+};
>+
>+/* bypass state */
>+struct bypass_info {
>+  /* passthru netdev with same MAC */
>+  struct net_device __rcu *active_netdev;

You still use "active"/"backup" names which is highly misleading as
it has completely different meaning that in bond for example.
I noted that in my previous review already. Please change it.


>+
>+  /* virtio_net netdev */
>+  struct net_device __rcu *backup_netdev;
>+
>+  /* active netdev stats */
>+  struct rtnl_link_stats64 active_stats;
>+
>+  /* backup netdev stats */
>+  struct rtnl_link_stats64 backup_stats;
>+
>+  /* aggregated stats */
>+  struct rtnl_link_stats64 bypass_stats;
>+
>+  /* spinlock while updating stats */
>+  spinlock_t stats_lock;
>+};
>+
>+#if 

Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-11 Thread Jiri Pirko
Wed, Apr 11, 2018 at 04:45:07PM CEST, m...@redhat.com wrote:
>On Wed, Apr 11, 2018 at 10:03:32AM +0200, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudr...@intel.com wrote:
>> >On 4/10/2018 11:03 PM, Jiri Pirko wrote:
>> >> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudr...@intel.com wrote:
>> >> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com 
>> >> > > wrote:
>> >> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> >> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com 
>> >> > > > > wrote:
>> >> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> >> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, 
>> >> > > > > > > sridhar.samudr...@intel.com wrote:
>> >> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> >> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, 
>> >> > > > > > > > > sridhar.samudr...@intel.com wrote:
>> >> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> >> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>> >> > > > > > > > > > > sridhar.samudr...@intel.com wrote:
>> >> > > > > > > > > [...]
>> >> > > > > > > > > 
>> >> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct 
>> >> > > > > > > > > > > > net_device *bypass_netdev,
>> >> > > > > > > > > > > > +struct net_device 
>> >> > > > > > > > > > > > *child_netdev)
>> >> > > > > > > > > > > > +{
>> >> > > > > > > > > > > > +   struct virtnet_bypass_info *vbi;
>> >> > > > > > > > > > > > +   bool backup;
>> >> > > > > > > > > > > > +
>> >> > > > > > > > > > > > +   vbi = netdev_priv(bypass_netdev);
>> >> > > > > > > > > > > > +   backup = (child_netdev->dev.parent == 
>> >> > > > > > > > > > > > bypass_netdev->dev.parent);
>> >> > > > > > > > > > > > +   if (backup ? 
>> >> > > > > > > > > > > > rtnl_dereference(vbi->backup_netdev) :
>> >> > > > > > > > > > > > +   
>> >> > > > > > > > > > > > rtnl_dereference(vbi->active_netdev)) {
>> >> > > > > > > > > > > > +   netdev_info(bypass_netdev,
>> >> > > > > > > > > > > > +   "%s attempting to join 
>> >> > > > > > > > > > > > bypass dev when %s already present\n",
>> >> > > > > > > > > > > > +   child_netdev->name, backup 
>> >> > > > > > > > > > > > ? "backup" : "active");
>> >> > > > > > > > > > > Bypass module should check if there is already some 
>> >> > > > > > > > > > > other netdev
>> >> > > > > > > > > > > enslaved and refuse right there.
>> >> > > > > > > > > > This will work for virtio-net with 3 netdev model, but 
>> >> > > > > > > > > > this check has to be done by netvsc
>> >> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> >> > > > > > > > > > Will add a flag while registering with the bypass 
>> >> > > > > > > > > > module to indicate if the driver is doing
>> >> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag 
>> >> > > > > > > > > > this check can be done in bypass module
>> >> >

Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-11 Thread Jiri Pirko
Wed, Apr 11, 2018 at 08:24:43AM CEST, sridhar.samudr...@intel.com wrote:
>On 4/10/2018 11:03 PM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:
>> > > > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> > > > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com 
>> > > > > wrote:
>> > > > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, 
>> > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, 
>> > > > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>> > > > > > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > > > > > [...]
>> > > > > > > > > 
>> > > > > > > > > > > > +static int virtnet_bypass_join_child(struct 
>> > > > > > > > > > > > net_device *bypass_netdev,
>> > > > > > > > > > > > +   struct net_device 
>> > > > > > > > > > > > *child_netdev)
>> > > > > > > > > > > > +{
>> > > > > > > > > > > > +  struct virtnet_bypass_info *vbi;
>> > > > > > > > > > > > +  bool backup;
>> > > > > > > > > > > > +
>> > > > > > > > > > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > > > > > > > > > +  backup = (child_netdev->dev.parent == 
>> > > > > > > > > > > > bypass_netdev->dev.parent);
>> > > > > > > > > > > > +  if (backup ? 
>> > > > > > > > > > > > rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > > > > > +  
>> > > > > > > > > > > > rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > > > > > +  netdev_info(bypass_netdev,
>> > > > > > > > > > > > +  "%s attempting to join 
>> > > > > > > > > > > > bypass dev when %s already present\n",
>> > > > > > > > > > > > +  child_netdev->name, backup 
>> > > > > > > > > > > > ? "backup" : "active");
>> > > > > > > > > > > Bypass module should check if there is already some 
>> > > > > > > > > > > other netdev
>> > > > > > > > > > > enslaved and refuse right there.
>> > > > > > > > > > This will work for virtio-net with 3 netdev model, but 
>> > > > > > > > > > this check has to be done by netvsc
>> > > > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > > > > > Will add a flag while registering with the bypass module 
>> > > > > > > > > > to indicate if the driver is doing
>> > > > > > > > > > a 2 netdev or 3 netdev model and based on that flag this 
>> > > > > > > > > > check can be done in bypass module
>> > > > > > > > > > for 3 netdev scenario.
>> > > > > > > > > Just let me undestand it clearly. What I expect the 
>> > > > > > > > > difference would be
>> > > > > > > > > between 2netdev and3 netdev model is this:
>> > > > > > > > > 2netdev:
>> > > > > > > > >bypass_master
>> > > > > > > > >   /
>> > > > > > > > >  /
>> > 

Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-11 Thread Jiri Pirko
Tue, Apr 10, 2018 at 11:26:08PM CEST, step...@networkplumber.org wrote:
>On Tue, 10 Apr 2018 11:59:50 -0700
>Sridhar Samudrala  wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>> 
>> Signed-off-by: Sridhar Samudrala 
>> ---
>
>Thanks for doing this.  Your current version has couple show stopper
>issues.
>
>First, the slave device is instantly taking over the slave.
>This doesn't allow udev/systemd to do its device rename of the slave
>device. Netvsc uses a delayed work to workaround this.

Wait. Why the fact a device is enslaved has to affect the udev in any
way? If it does, smells like a bug in udev.


>
>Secondly, the select queue needs to call queue selection in VF.
>The bonding/teaming logic doesn't work well for UDP flows.
>Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>fixed this performance problem.
>
>Lastly, more indirection is bad in current climate.
>
>I am not completely adverse to this but it needs to be fast, simple
>and completely transparent.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-11 Thread Jiri Pirko
Wed, Apr 11, 2018 at 01:28:51AM CEST, m...@redhat.com wrote:
>On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala  wrote:
>> 
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> > 
>> > Signed-off-by: Sridhar Samudrala 
>> > ---
>> 
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>> 
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
>Interesting. Does this mean udev must act within a specific time window
>then?

Yeah. That is scarry. Also, wrong.


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>> 
>> Lastly, more indirection is bad in current climate.
>> 
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-11 Thread Jiri Pirko
Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>> > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com 
>> > > > > wrote:
>> > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, 
>> > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>> > > > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > > > [...]
>> > > > > > > 
>> > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device 
>> > > > > > > > > > *bypass_netdev,
>> > > > > > > > > > +   struct net_device 
>> > > > > > > > > > *child_netdev)
>> > > > > > > > > > +{
>> > > > > > > > > > +  struct virtnet_bypass_info *vbi;
>> > > > > > > > > > +  bool backup;
>> > > > > > > > > > +
>> > > > > > > > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > > > > > > > +  backup = (child_netdev->dev.parent == 
>> > > > > > > > > > bypass_netdev->dev.parent);
>> > > > > > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > > > +  rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > > > +  netdev_info(bypass_netdev,
>> > > > > > > > > > +  "%s attempting to join bypass dev 
>> > > > > > > > > > when %s already present\n",
>> > > > > > > > > > +  child_netdev->name, backup ? 
>> > > > > > > > > > "backup" : "active");
>> > > > > > > > > Bypass module should check if there is already some other 
>> > > > > > > > > netdev
>> > > > > > > > > enslaved and refuse right there.
>> > > > > > > > This will work for virtio-net with 3 netdev model, but this 
>> > > > > > > > check has to be done by netvsc
>> > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > > > Will add a flag while registering with the bypass module to 
>> > > > > > > > indicate if the driver is doing
>> > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check 
>> > > > > > > > can be done in bypass module
>> > > > > > > > for 3 netdev scenario.
>> > > > > > > Just let me undestand it clearly. What I expect the difference 
>> > > > > > > would be
>> > > > > > > between 2netdev and3 netdev model is this:
>> > > > > > > 2netdev:
>> > > > > > >   bypass_master
>> > > > > > >  /
>> > > > > > > /
>> > > > > > > VF_slave
>> > > > > > > 
>> > > > > > > 3netdev:
>> > > > > > >   bypass_master
>> > > > > > >  / \
>> > > > > > > /   \
>> > > > > > > VF_slave   backup_slave
>> > > > > > > 
>> > > > > > > Is that correct? If not, how does it look like?
>> > > > > > > 
>> > > > > > > 
>> > > > > > Looks correct.
>> > > > > > VF_slave and backup_slave are the original netdevs and are present 
>> > > > > > in both the models.
>> > > > > > In the 3 netdev model, bypass_master netdev is crea

Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Jiri Pirko
Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com 
>> > > > > wrote:
>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>> > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > [...]
>> > > > > 
>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device 
>> > > > > > > > *bypass_netdev,
>> > > > > > > > +   struct net_device 
>> > > > > > > > *child_netdev)
>> > > > > > > > +{
>> > > > > > > > +  struct virtnet_bypass_info *vbi;
>> > > > > > > > +  bool backup;
>> > > > > > > > +
>> > > > > > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > > > > > +  backup = (child_netdev->dev.parent == 
>> > > > > > > > bypass_netdev->dev.parent);
>> > > > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > +  rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > +  netdev_info(bypass_netdev,
>> > > > > > > > +  "%s attempting to join bypass dev 
>> > > > > > > > when %s already present\n",
>> > > > > > > > +  child_netdev->name, backup ? 
>> > > > > > > > "backup" : "active");
>> > > > > > > Bypass module should check if there is already some other netdev
>> > > > > > > enslaved and refuse right there.
>> > > > > > This will work for virtio-net with 3 netdev model, but this check 
>> > > > > > has to be done by netvsc
>> > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > Will add a flag while registering with the bypass module to 
>> > > > > > indicate if the driver is doing
>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check can 
>> > > > > > be done in bypass module
>> > > > > > for 3 netdev scenario.
>> > > > > Just let me undestand it clearly. What I expect the difference would 
>> > > > > be
>> > > > > between 2netdev and3 netdev model is this:
>> > > > > 2netdev:
>> > > > >  bypass_master
>> > > > > /
>> > > > >/
>> > > > > VF_slave
>> > > > > 
>> > > > > 3netdev:
>> > > > >  bypass_master
>> > > > > / \
>> > > > >/   \
>> > > > > VF_slave   backup_slave
>> > > > > 
>> > > > > Is that correct? If not, how does it look like?
>> > > > > 
>> > > > > 
>> > > > Looks correct.
>> > > > VF_slave and backup_slave are the original netdevs and are present in 
>> > > > both the models.
>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave 
>> > > > and backup_slave are
>> > > > marked as the 2 slaves of this new netdev.
>> > > You say it looks correct and in another sentence you provide completely
>> > > different description. Could you please look again?
>> > > 
>> > To be exact, 2 netdev model with netvsc looks like this.
>> > 
>> > netvsc_netdev
>> >   /
>> >  /
>> > VF_slave
>> > 
>> > With virtio_net, 3 netdev model
>> > 
>> >   bypass_netdev
>> >   / \
>> >  /   \
>> > VF_slave   virtio_net netdev
>> Could you also mark the original netdev which is there now? is it
>> bypass_netdev or virtio_net_netdev ?
>
> bypass_netdev
> / \
>/   \
>VF_slave   virtio_net netdev (original)

That does not make sense.
1) You diverge from the behaviour of the netvsc, where the original
   netdev is a master of the VF
2) If the original netdev is a slave, you cannot have any IP address
   configured on it (well you could, but the rx_handler would eat every
   incoming packet). So you will break the user bacause he would have to
   move the configuration to the new master device.
This only makes sense that the original netdev becomes the master for both
netvsc and virtio_net.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Jiri Pirko
Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:
>> > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com 
>> > > > > wrote:
>> > > [...]
>> > > 
>> > > > > > +static int virtnet_bypass_join_child(struct net_device 
>> > > > > > *bypass_netdev,
>> > > > > > +   struct net_device *child_netdev)
>> > > > > > +{
>> > > > > > +  struct virtnet_bypass_info *vbi;
>> > > > > > +  bool backup;
>> > > > > > +
>> > > > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > > > +  backup = (child_netdev->dev.parent == 
>> > > > > > bypass_netdev->dev.parent);
>> > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > +  rtnl_dereference(vbi->active_netdev)) {
>> > > > > > +  netdev_info(bypass_netdev,
>> > > > > > +  "%s attempting to join bypass dev when %s 
>> > > > > > already present\n",
>> > > > > > +  child_netdev->name, backup ? "backup" : 
>> > > > > > "active");
>> > > > > Bypass module should check if there is already some other netdev
>> > > > > enslaved and refuse right there.
>> > > > This will work for virtio-net with 3 netdev model, but this check has 
>> > > > to be done by netvsc
>> > > > as its bypass_netdev is same as the backup_netdev.
>> > > > Will add a flag while registering with the bypass module to indicate 
>> > > > if the driver is doing
>> > > > a 2 netdev or 3 netdev model and based on that flag this check can be 
>> > > > done in bypass module
>> > > > for 3 netdev scenario.
>> > > Just let me undestand it clearly. What I expect the difference would be
>> > > between 2netdev and3 netdev model is this:
>> > > 2netdev:
>> > > bypass_master
>> > >/
>> > >   /
>> > > VF_slave
>> > > 
>> > > 3netdev:
>> > > bypass_master
>> > >/ \
>> > >   /   \
>> > > VF_slave   backup_slave
>> > > 
>> > > Is that correct? If not, how does it look like?
>> > > 
>> > > 
>> > Looks correct.
>> > VF_slave and backup_slave are the original netdevs and are present in both 
>> > the models.
>> > In the 3 netdev model, bypass_master netdev is created and VF_slave and 
>> > backup_slave are
>> > marked as the 2 slaves of this new netdev.
>> You say it looks correct and in another sentence you provide completely
>> different description. Could you please look again?
>> 
>To be exact, 2 netdev model with netvsc looks like this.
>
>netvsc_netdev
>  /
> /
> VF_slave
>
>With virtio_net, 3 netdev model
>
>  bypass_netdev
>  / \
> /   \
>VF_slave   virtio_net netdev

Could you also mark the original netdev which is there now? is it
bypass_netdev or virtio_net_netdev ?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Jiri Pirko
Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:
>> [...]
>> 
>> > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > +   struct net_device *child_netdev)
>> > > > +{
>> > > > +  struct virtnet_bypass_info *vbi;
>> > > > +  bool backup;
>> > > > +
>> > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > +  backup = (child_netdev->dev.parent == 
>> > > > bypass_netdev->dev.parent);
>> > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > +  rtnl_dereference(vbi->active_netdev)) {
>> > > > +  netdev_info(bypass_netdev,
>> > > > +  "%s attempting to join bypass dev when %s 
>> > > > already present\n",
>> > > > +  child_netdev->name, backup ? "backup" : 
>> > > > "active");
>> > > Bypass module should check if there is already some other netdev
>> > > enslaved and refuse right there.
>> > This will work for virtio-net with 3 netdev model, but this check has to 
>> > be done by netvsc
>> > as its bypass_netdev is same as the backup_netdev.
>> > Will add a flag while registering with the bypass module to indicate if 
>> > the driver is doing
>> > a 2 netdev or 3 netdev model and based on that flag this check can be done 
>> > in bypass module
>> > for 3 netdev scenario.
>> Just let me undestand it clearly. What I expect the difference would be
>> between 2netdev and3 netdev model is this:
>> 2netdev:
>>bypass_master
>>   /
>>  /
>> VF_slave
>> 
>> 3netdev:
>>bypass_master
>>   / \
>>  /   \
>> VF_slave   backup_slave
>> 
>> Is that correct? If not, how does it look like?
>> 
>> 
>Looks correct.
>VF_slave and backup_slave are the original netdevs and are present in both the 
>models.
>In the 3 netdev model, bypass_master netdev is created and VF_slave and 
>backup_slave are
>marked as the 2 slaves of this new netdev.

You say it looks correct and in another sentence you provide completely
different description. Could you please look again?

[...]
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-09 Thread Jiri Pirko
Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:
>On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:

[...]

>> > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > +   struct net_device *child_netdev)
>> > +{
>> > +  struct virtnet_bypass_info *vbi;
>> > +  bool backup;
>> > +
>> > +  vbi = netdev_priv(bypass_netdev);
>> > +  backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>> > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > +  rtnl_dereference(vbi->active_netdev)) {
>> > +  netdev_info(bypass_netdev,
>> > +  "%s attempting to join bypass dev when %s already 
>> > present\n",
>> > +  child_netdev->name, backup ? "backup" : "active");
>> Bypass module should check if there is already some other netdev
>> enslaved and refuse right there.
>
>This will work for virtio-net with 3 netdev model, but this check has to be 
>done by netvsc
>as its bypass_netdev is same as the backup_netdev.
>Will add a flag while registering with the bypass module to indicate if the 
>driver is doing
>a 2 netdev or 3 netdev model and based on that flag this check can be done in 
>bypass module
>for 3 netdev scenario.

Just let me undestand it clearly. What I expect the difference would be
between 2netdev and3 netdev model is this:
2netdev:
  bypass_master
 /
/
VF_slave

3netdev:
  bypass_master
 / \
/   \
VF_slave   backup_slave

Is that correct? If not, how does it look like?

Thanks!
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 2/4] net: Introduce generic bypass module

2018-04-06 Thread Jiri Pirko
Thu, Apr 05, 2018 at 11:08:21PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation. A
>paravirtual driver can use this module by registering a set of ops and
>each instance of the device when it is probed.
>
>Signed-off-by: Sridhar Samudrala 
>---
> include/net/bypass.h |  80 ++
> net/Kconfig  |  18 +++
> net/core/Makefile|   1 +
> net/core/bypass.c| 406 +++
> 4 files changed, 505 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index ..e2dd122f951a
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,80 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include 
>+
>+struct bypass_ops {

Perhaps "net_bypass_" would be better prefix for this module structs
and functions. No strong opinion though.


>+  int (*register_child)(struct net_device *bypass_netdev,
>+struct net_device *child_netdev);

We have master/slave upper/lower netdevices. This adds "child". Consider
using some existing names. Not sure if possible without loss of meaning.


>+  int (*join_child)(struct net_device *bypass_netdev,
>+struct net_device *child_netdev);
>+  int (*unregister_child)(struct net_device *bypass_netdev,
>+  struct net_device *child_netdev);
>+  int (*release_child)(struct net_device *bypass_netdev,
>+   struct net_device *child_netdev);
>+  int (*update_link)(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev);
>+  rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_instance {
>+  struct list_head list;
>+  struct net_device __rcu *bypass_netdev;
>+  struct bypass *bypass;
>+};
>+
>+struct bypass {
>+  struct list_head list;
>+  const struct bypass_ops *ops;
>+  const struct net_device_ops *netdev_ops;
>+  struct list_head instance_list;
>+  struct mutex lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>+const struct net_device_ops *netdev_ops);
>+void bypass_unregister_driver(struct bypass *bypass);
>+
>+int bypass_register_instance(struct bypass *bypass, struct net_device *dev);
>+int bypass_unregister_instance(struct bypass *bypass, struct net_device   
>*dev);
>+
>+int bypass_unregister_child(struct net_device *child_netdev);
>+
>+#else
>+
>+static inline
>+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>+const struct net_device_ops *netdev_ops)
>+{
>+  return NULL;
>+}
>+
>+static inline void bypass_unregister_driver(struct bypass *bypass)
>+{
>+}
>+
>+static inline int bypass_register_instance(struct bypass *bypass,
>+ struct net_device *dev)
>+{
>+  return 0;
>+}
>+
>+static inline int bypass_unregister_instance(struct bypass *bypass,
>+   struct net_device *dev)
>+{
>+  return 0;
>+}
>+
>+static inline int bypass_unregister_child(struct net_device *child_netdev)
>+{
>+  return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> on MAY_USE_DEVLINK to ensure they do not cause link errors when
> devlink is a loadable module and the driver using it is built-in.
> 
>+config NET_BYPASS
>+  tristate "Bypass interface"
>+  ---help---
>+This provides a generic interface for paravirtual drivers to listen
>+for netdev register/unregister/link change events from pci ethernet
>+devices with the same MAC and takeover their datapath. This also
>+enables live migration of a VM with direct attached VF by failing
>+over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+  tristate
>+  default m if NET_BYPASS=m
>+  default y if NET_BYPASS=y || NET_BYPASS=n
>+  help
>+Drivers using the bypass infrastructure should have a dependency
>+on MAY_USE_BYPASS to ensure they do not cause link errors when
>+bypass is a loadable module and the driver using it is built-in.
>+
> endif   # if NET
> 
> # Used by archs to tell that they support BPF JIT compiler 

Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-06 Thread Jiri Pirko
Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF
>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to
>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>created that acts as a master device and tracks the state of the 2 lower
>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>passthru device with the same MAC is registered as 'active' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization=151189725224231=2
>
>Signed-off-by: Sridhar Samudrala 
>---
> drivers/net/Kconfig  |   1 +
> drivers/net/virtio_net.c | 612 ++-
> 2 files changed, 612 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>index 891846655000..9e2cf61fd1c1 100644
>--- a/drivers/net/Kconfig
>+++ b/drivers/net/Kconfig
>@@ -331,6 +331,7 @@ config VETH
> config VIRTIO_NET
>   tristate "Virtio network driver"
>   depends on VIRTIO
>+  depends on MAY_USE_BYPASS
>   ---help---
> This is the virtual network driver for virtio.  It can be used with
> QEMU based VMMs (like KVM or Xen).  Say Y or M.
>diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>index befb5944f3fd..86b2f8f2947d 100644
>--- a/drivers/net/virtio_net.c
>+++ b/drivers/net/virtio_net.c
>@@ -30,8 +30,11 @@
> #include 
> #include 
> #include 
>+#include 
>+#include 
> #include 
> #include 
>+#include 
> 
> static int napi_weight = NAPI_POLL_WEIGHT;
> module_param(napi_weight, int, 0444);
>@@ -206,6 +209,9 @@ struct virtnet_info {
>   u32 speed;
> 
>   unsigned long guest_offloads;
>+
>+  /* upper netdev created when BACKUP feature enabled */
>+  struct net_device __rcu *bypass_netdev;
> };
> 
> struct padded_vnet_hdr {
>@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct 
>netdev_bpf *xdp)
>   }
> }
> 
>+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>+size_t len)
>+{
>+  struct virtnet_info *vi = netdev_priv(dev);
>+  int ret;
>+
>+  if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>+  return -EOPNOTSUPP;
>+
>+  ret = snprintf(buf, len, "_bkup");
>+  if (ret >= len)
>+  return -EOPNOTSUPP;
>+
>+  return 0;
>+}
>+
> static const struct net_device_ops virtnet_netdev = {
>   .ndo_open= virtnet_open,
>   .ndo_stop= virtnet_close,
>@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
>   .ndo_xdp_xmit   = virtnet_xdp_xmit,
>   .ndo_xdp_flush  = virtnet_xdp_flush,
>   .ndo_features_check = passthru_features_check,
>+  .ndo_get_phys_port_name = virtnet_get_phys_port_name,
> };
> 
> static void virtnet_config_changed_work(struct work_struct *work)
>@@ -2689,6 +2712,576 @@ static int virtnet_validate(struct virtio_device *vdev)
>   return 0;
> }
> 
>+/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>+ * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>+ * is created that acts as a master device and tracks the state of the
>+ * 2 lower netdevs. The original virtio_net netdev is registered as
>+ * 'backup' netdev and a passthru device with the same MAC is registered
>+ * as 'active' netdev.
>+ */
>+
>+/* bypass state maintained when BACKUP feature is enabled */
>+struct virtnet_bypass_info {
>+  /* passthru netdev with same MAC */
>+  struct net_device __rcu *active_netdev;
>+
>+  /* virtio_net netdev */
>+  struct net_device __rcu *backup_netdev;
>+
>+  /* active netdev stats */
>+  struct rtnl_link_stats64 active_stats;
>+
>+  /* backup netdev stats */
>+  struct rtnl_link_stats64 backup_stats;
>+
>+  /* aggregated stats */
>+  struct rtnl_link_stats64 bypass_stats;
>+
>+  /* spinlock while updating stats */
>+  spinlock_t stats_lock;
>+};
>+
>+static int virtnet_bypass_open(struct net_device *dev)
>+{
>+  struct virtnet_bypass_info *vbi = netdev_priv(dev);
>+  struct net_device *active_netdev, 

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Jiri Pirko
Wed, Apr 04, 2018 at 07:37:49PM CEST, da...@davemloft.net wrote:
>From: David Ahern 
>Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
>I agree on this.
>
>What I really don't understand still is the use case... really.
>
>So there are control netdevs, what exactly is the problem with that?
>
>Are we not exporting enough information for applications to handle
>these devices sanely?  If so, then's let add that information.
>
>We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
>Another alternative is to add an interface flag like IFF_CONTROL or
>similar, and that probably is much nicer.
>
>Hiding the devices means that we acknowledge that applications are
>currently broken with control netdevs... and we want them to stay
>broken!
>
>That doesn't sound like a good plan to me.
>
>So let's fix handling of control netdevs instead of hiding them.

Exactly. Don't workaround userspace issues by kernel patches.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-04 Thread Jiri Pirko
Wed, Apr 04, 2018 at 03:04:26AM CEST, dsah...@gmail.com wrote:
>On 4/3/18 9:42 AM, Jiri Pirko wrote:
>>>
>>> There are other use cases that want to hide a device from userspace. I
>> 
>> What usecases do you have in mind?
>
>As mentioned in a previous response some kernel drivers create control
>netdevs. Just as in this case users should not be mucking with it, and

virtio_net. Any other drivers?


>S/W like lldpd should ignore it.

It's just a matter of identification of the netdevs, so the user knows
what to do.


>
>> 
>>> would prefer a better solution than playing games with name prefixes and
>>> one that includes an API for users to list all devices -- even ones
>>> hidden by default.
>> 
>> Netdevice hiding feels a bit scarry for me. This smells like a workaround
>> for userspace issues. Why can't the netdevice be visible always and
>> userspace would know what is it and what should it do with it?
>> 
>> Once we start with hiding, there are other things related to that which
>> appear. Like who can see what, levels of visibility etc...
>> 
>
>I would not advocate for any API that does not allow users to have full
>introspection. The intent is to hide the netdev by default but have an
>option to see it.

As an administrator, I want to see all by default. I think it is
reasonable requirements. Again, this awfully smells like a workaround...
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-03 Thread Jiri Pirko
Sun, Apr 01, 2018 at 06:11:29PM CEST, dsah...@gmail.com wrote:
>On 4/1/18 3:13 AM, Si-Wei Liu wrote:
>> Hidden netdevice is not visible to userspace such that
>> typical network utilites e.g. ip, ifconfig and et al,
>> cannot sense its existence or configure it. Internally
>> hidden netdev may associate with an upper level netdev
>> that userspace has access to. Although userspace cannot
>> manipulate the lower netdev directly, user may control
>> or configure the underlying hidden device through the
>> upper-level netdev. For identification purpose, the
>> kobject for hidden netdev still presents in the sysfs
>> hierarchy, however, no uevent message will be generated
>> when the sysfs entry is created, modified or destroyed.
>> 
>> For that end, a separate namescope needs to be carved
>> out for IFF_HIDDEN netdevs. As of now netdev name that
>> starts with colon i.e. ':' is invalid in userspace,
>> since socket ioctls such as SIOCGIFCONF use ':' as the
>> separator for ifname. The absence of namescope started
>> with ':' can rightly be used as the namescope for
>> the kernel-only IFF_HIDDEN netdevs.
>> 
>> Signed-off-by: Si-Wei Liu 
>> ---
>>  include/linux/netdevice.h   |  12 ++
>>  include/net/net_namespace.h |   2 +
>>  net/core/dev.c  | 281 
>> ++--
>>  net/core/net_namespace.c|   1 +
>>  4 files changed, 263 insertions(+), 33 deletions(-)
>> 
>
>There are other use cases that want to hide a device from userspace. I

What usecases do you have in mind?

>would prefer a better solution than playing games with name prefixes and
>one that includes an API for users to list all devices -- even ones
>hidden by default.

Netdevice hiding feels a bit scarry for me. This smells like a workaround
for userspace issues. Why can't the netdevice be visible always and
userspace would know what is it and what should it do with it?

Once we start with hiding, there are other things related to that which
appear. Like who can see what, levels of visibility etc...


>
>https://github.com/dsahern/linux/commit/48a80a00eac284e58bae04af10a5a932dd7aee00
>
>https://github.com/dsahern/iproute2/commit/7563f5b26f5539960e99066e34a995d22ea908ed
>
>Also, why are you suggesting that the device should still be visible via
>/sysfs? That leads to inconsistent views of networking state - /sys
>shows a device but a link dump does not.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-28 Thread Jiri Pirko
Wed, Feb 28, 2018 at 04:45:39PM CET, m...@redhat.com wrote:
>On Wed, Feb 28, 2018 at 04:11:31PM +0100, Jiri Pirko wrote:
>> Wed, Feb 28, 2018 at 03:32:44PM CET, m...@redhat.com wrote:
>> >On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote:
>> >> Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote:
>> >> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
>> >> >> Basically we need some sort of PCI or PCIe topology mapping for the
>> >> >> devices that can be translated into something we can communicate over
>> >> >> the communication channel. 
>> >> >
>> >> >Hm.  This is probably a completely stupid idea, but if we need to
>> >> >start marshalling configuration requests/hints maybe the entire problem
>> >> >could be solved by opening a netlink socket from hypervisor?  Even make
>> >> >teamd run on the hypervisor side...
>> >> 
>> >> Interesting. That would be more trickier then just to fwd 1 genetlink
>> >> socket to the hypervisor.
>> >> 
>> >> Also, I think that the solution should handle multiple guest oses. What
>> >> I'm thinking about is some generic bonding description passed over some
>> >> communication channel into vm. The vm either use it for configuration,
>> >> or ignores it if it is not smart enough/updated enough.
>> >
>> >For sure, we could build virtio-bond to pass that info to guests.
>> 
>> What do you mean by "virtio-bond". virtio_net extension?
>
>I mean a new device supplying topology information to guests,
>with updates whenever VMs are started, stopped or migrated.

Good. Any idea how that device would look like? Also, any idea how to
handle in in kernel and how to pass along this info to userspace?
Is there anything similar out there?

Thanks!

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-28 Thread Jiri Pirko
Wed, Feb 28, 2018 at 03:32:44PM CET, m...@redhat.com wrote:
>On Wed, Feb 28, 2018 at 08:08:39AM +0100, Jiri Pirko wrote:
>> Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote:
>> >On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
>> >> Basically we need some sort of PCI or PCIe topology mapping for the
>> >> devices that can be translated into something we can communicate over
>> >> the communication channel. 
>> >
>> >Hm.  This is probably a completely stupid idea, but if we need to
>> >start marshalling configuration requests/hints maybe the entire problem
>> >could be solved by opening a netlink socket from hypervisor?  Even make
>> >teamd run on the hypervisor side...
>> 
>> Interesting. That would be more trickier then just to fwd 1 genetlink
>> socket to the hypervisor.
>> 
>> Also, I think that the solution should handle multiple guest oses. What
>> I'm thinking about is some generic bonding description passed over some
>> communication channel into vm. The vm either use it for configuration,
>> or ignores it if it is not smart enough/updated enough.
>
>For sure, we could build virtio-bond to pass that info to guests.

What do you mean by "virtio-bond". virtio_net extension?

>
>Such an advisory mechanism would not be a replacement for the mandatory
>passthrough fallback flag proposed, but OTOH it's much more flexible.
>
>-- 
>MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-27 Thread Jiri Pirko
Tue, Feb 27, 2018 at 10:41:49PM CET, kubak...@wp.pl wrote:
>On Tue, 27 Feb 2018 13:16:21 -0800, Alexander Duyck wrote:
>> Basically we need some sort of PCI or PCIe topology mapping for the
>> devices that can be translated into something we can communicate over
>> the communication channel. 
>
>Hm.  This is probably a completely stupid idea, but if we need to
>start marshalling configuration requests/hints maybe the entire problem
>could be solved by opening a netlink socket from hypervisor?  Even make
>teamd run on the hypervisor side...

Interesting. That would be more trickier then just to fwd 1 genetlink
socket to the hypervisor.

Also, I think that the solution should handle multiple guest oses. What
I'm thinking about is some generic bonding description passed over some
communication channel into vm. The vm either use it for configuration,
or ignores it if it is not smart enough/updated enough.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-27 Thread Jiri Pirko
Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>used by hypervisor to indicate that virtio_net interface should act as
>>>a backup for another device with the same MAC address.
>>>
>>>Ppatch 2 is in response to the community request for a 3 netdev
>>>solution.  However, it creates some issues we'll get into in a moment.
>>>It extends virtio_net to use alternate datapath when available and
>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>an additional 'bypass' netdev that acts as a master device and controls
>>>2 slave devices.  The original virtio_net netdev is registered as
>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>associated with the same 'pci' device.  The user accesses the network
>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>as default for transmits when it is available with link up and running.
>>
>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>> of bonding driver as a part of NIC driver. Bond and team drivers
>> are mature solutions, well tested, broadly used, with lots of issues
>> resolved in the past. What you try to introduce is a weird shortcut
>> that already has couple of issues as you mentioned and will certanly
>> have many more. Also, I'm pretty sure that in future, someone comes up
>> with ideas like multiple VFs, LACP and similar bonding things.
>
>The problem with the bond and team drivers is they are too large and
>have too many interfaces available for configuration so as a result
>they can really screw this interface up.
>
>Essentially this is meant to be a bond that is more-or-less managed by
>the host, not the guest. We want the host to be able to configure it
>and have it automatically kick in on the guest. For now we want to
>avoid adding too much complexity as this is meant to be just the first
>step. Trying to go in and implement the whole solution right from the
>start based on existing drivers is going to be a massive time sink and
>will likely never get completed due to the fact that there is always
>going to be some other thing that will interfere.
>
>My personal hope is that we can look at doing a virtio-bond sort of
>device that will handle all this as well as providing a communication
>channel, but that is much further down the road. For now we only have
>a single bit so the goal for now is trying to keep this as simple as
>possible.

I have another usecase that would require the solution to be different
then what you suggest. Consider following scenario:
- baremetal has 2 sr-iov nics
- there is a vm, has 1 VF from each nics: vf0, vf1. No virtio_net
- baremetal would like to somehow tell the VM to bond vf0 and vf1
  together and how this bonding should be configured, according to how
  the VF representors are configured on the baremetal (LACP for example)

The baremetal could decide to remove any VF during the VM runtime, it
can add another VF there. For migration, it can add virtio_net. The VM
should be inctructed to bond all interfaces together according to how
baremetal decided - as it knows better.

For this we need a separate communication channel from baremetal to VM
(perhaps something re-usable already exists), we need something to
listen to the events coming from this channel (kernel/userspace) and to
react accordingly (create bond/team, enslave, etc).

Now the question is: is it possible to merge the demands you have and
the generic needs I described into a single solution? From what I see,
that would be quite hard/impossible. So at the end, I think that we have
to end-up with 2 solutions:
1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config
   solution that works for all (no matter what OS you use in VM)
2) team/bond solution with assistance of preferably userspace daemon
   getting info from baremetal. This is not 0config, but minimal config
   - user just have to define this "magic bonding" should be on.
   This covers all possible usecases, including multiple VFs, RDMA, etc.

Thoughts?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-27 Thread Jiri Pirko
Tue, Feb 27, 2018 at 02:18:12AM CET, m...@redhat.com wrote:
>On Mon, Feb 26, 2018 at 05:02:18PM -0800, Stephen Hemminger wrote:
>> On Mon, 26 Feb 2018 08:19:24 +0100
>> Jiri Pirko <j...@resnulli.us> wrote:
>> 
>> > Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote:
>> > >On Thu, 22 Feb 2018 13:30:12 -0800
>> > >Alexander Duyck <alexander.du...@gmail.com> wrote:
>> > >  
>> > >> > Again, I undertand your motivation. Yet I don't like your solution.
>> > >> > But if the decision is made to do this in-driver bonding. I would like
>> > >> > to see it baing done some generic way:
>> > >> > 1) share the same "in-driver bonding core" code with netvsc
>> > >> >put to net/core.
>> > >> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> > >> >like active-backup mode only, one vf, one backup, vf netdev type
>> > >> >check (so noone could enslave a tap or anything else)
>> > >> > If user would need something more, he should employ team/bond.
>> > >
>> > >Sharing would be good, but netvsc world would really like to only have
>> > >one visible network device.  
>> > 
>> > Why do you mind? All would be the same, there would be just another
>> > netdevice unused by the vm user (same as the vf netdev).
>> > 
>> 
>> I mind because our requirement is no changes to userspace.
>> No special udev rules, no bonding script, no setup.
>
>Agreed. It is mostly fine from this point of view, except that you need
>to know to skip the slaves.  Maybe we could look at some kind of
>trick e.g. pretending link is down for slaves?

:O Another hack. Please, don't.


>
>> Things like cloudinit running on current distro's expect to see a single
>> eth0.  The VF device show up can also be an issue because distro's have
>> stupid rules like Network Manager trying to start DHCP on every interface.
>> We deal with that now by doing stuff like udev rules to get it to stop
>> but that is still causing user errors.

So that means that with an extra netdev for "virtio_net bypass" you will
face exactly the same problems. Should not be an issue for you then.


>
>So the ideal of a single net device isn't achieved by netvsc.
>
>Since you have scripts to skip the PT device, can't they
>hind the PV slave too? How do they identify the device to skip?
>
>I agree it would be nice to have a way to hide the extra netdev
>from userspace.

"A hidden netdevice", hmm. I believe that instead of doing hacks like
this, we should fix userspace to treat particular netdevices correctly.


>
>The benefit of the separation is that each slave device can
>be configured with e.g. its own native ethtool commands for
>optimum performance.
>
>-- 
>MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-25 Thread Jiri Pirko
Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote:
>On Thu, 22 Feb 2018 13:30:12 -0800
>Alexander Duyck  wrote:
>
>> > Again, I undertand your motivation. Yet I don't like your solution.
>> > But if the decision is made to do this in-driver bonding. I would like
>> > to see it baing done some generic way:
>> > 1) share the same "in-driver bonding core" code with netvsc
>> >put to net/core.
>> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> >like active-backup mode only, one vf, one backup, vf netdev type
>> >check (so noone could enslave a tap or anything else)
>> > If user would need something more, he should employ team/bond.  
>
>Sharing would be good, but netvsc world would really like to only have
>one visible network device.

Why do you mind? All would be the same, there would be just another
netdevice unused by the vm user (same as the vf netdev).

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Jiri Pirko
Fri, Feb 23, 2018 at 11:22:36PM CET, losewe...@gmail.com wrote:

[...]

>>>
>>> No, that's not what I was talking about of course. I thought you
>>> mentioned the upgrade scenario this patch would like to address is to
>>> use the bypass interface "to take the place of the original virtio,
>>> and get udev to rename the bypass to what the original virtio_net
>>> was". That is one of the possible upgrade paths for sure. However the
>>> upgrade path I was seeking is to use the bypass interface to take the
>>> place of original VF interface while retaining the name and network
>>> configs, which generally can be done simply with kernel upgrade. It
>>> would become limiting as this patch makes the bypass interface share
>>> the same virtio pci device with virito backup. Can this bypass
>>> interface be made general to take place of any pci device other than
>>> virtio-net? This will be more helpful as the cloud users who has
>>> existing setup on VF interface don't have to recreate it on virtio-net
>>> and VF separately again.

How that could work? If you have the VF netdev with all configuration
including IPs and routes and whatever - now you want to do migration
so you add virtio_net and do some weird in-driver bonding with it. But
then, VF disappears and the VF netdev with that and also all
configuration it had.
I don't think this scenario is valid.


>>
>>
>> Yes. This sounds interesting. Looks like you want an existing VM image with
>> VF only configuration to get transparent live migration support by adding
>> virtio_net with BACKUP feature.  We may need another feature bit to switch
>> between these 2 options.
>
>Yes, that's what I was thinking about. I have been building something
>like this before, and would like to get back after merging with your
>patch.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-22 Thread Jiri Pirko
Thu, Feb 22, 2018 at 12:54:45PM CET, gerlitz...@gmail.com wrote:
>On Thu, Feb 22, 2018 at 10:11 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote:
>
>>>The signaling isn't too much of an issue since we can just tweak the
>>>link state of the VF or virtio manually to report the link up or down
>>>prior to the hot-plug. Now that we are on the same page with the team0
>
>> Oh, so you just do "ip link set vfrepresentor down" in the host.
>> That makes sense. I'm pretty sure that this is not implemented for all
>> drivers now.
>
>mlx5 supports that, on the representor close ndo we take the VF link
>operational v-link down
>
>We should probably also put into the picture some/more aspects
>from the host side of things. The provisioning of the v-switch now
>have to deal with two channels going into the VM, the PV (virtio)
>one and the PT (VF) one.
>
>This should probably boil down to apply teaming/bonding between
>the VF representor and a PV backend device, e.g TAP.

Yes, that is correct.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-22 Thread Jiri Pirko
Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote:
>On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote:
>>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote:
>>>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
>>>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>>>
>>>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>>>> netvsc and virtio_net so this is handled in exacly the same way in 
>>>>>>>>>> both?
>>>>>>>>>
>>>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>>>
>>>>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinition
>>>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge 
>>>>>>>> it
>>>>>>>> and make the solution based on team/bond.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody 
>>>>>>>>>is
>>>>>>>>>using NM.
>>>>>>>>
>>>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>>>
>>>>>>>> Actually, I took a stab to implement this in teamd. Took me like an 
>>>>>>>> hour
>>>>>>>> and half.
>>>>>>>>
>>>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>>>
>>>>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>>>>> it enslaves it.
>>>>>>>>
>>>>>>>> Here's the patch (quick and dirty):
>>>>>>>>
>>>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>>>
>>>>>>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>>>>>>
>>>>>>>So this doesn't really address the original problem we were trying to
>>>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>>>has to do with configuration. Specifically what our patch is
>>>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>>>up and be enslaved by t

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jiri Pirko
Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote:
>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote:
>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>
>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>> netvsc and virtio_net so this is handled in exacly the same way in 
>>>>>>>> both?
>>>>>>>
>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>
>>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinition
>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>>> and make the solution based on team/bond.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>>using NM.
>>>>>>
>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>
>>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>>> and half.
>>>>>>
>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>
>>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>>> it enslaves it.
>>>>>>
>>>>>> Here's the patch (quick and dirty):
>>>>>>
>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>
>>>>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>>>>
>>>>>So this doesn't really address the original problem we were trying to
>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>has to do with configuration. Specifically what our patch is
>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>>that was originally configured for virtio only.
>>>>>
>>>>>The problem with your solution is we already have teaming and bonding
>>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>>That is all well and good as long as you are willing to keep around
>>>>>two VM images, one for virtio, and one for SR-IOV with live

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jiri Pirko
Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote:
>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>> stuck with this ugly thing forever...
>>>>>>
>>>>>> Could you at least make some common code that is shared in between
>>>>>> netvsc and virtio_net so this is handled in exacly the same way in both?
>>>>>
>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>(virtio) bug-compatible with netvsc.
>>>>
>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinition
>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>> and make the solution based on team/bond.
>>>>
>>>>
>>>>>
>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>using NM.
>>>>
>>>> Can be done in NM, networkd or other network management tools.
>>>> Even easier to do this in teamd and let them all benefit.
>>>>
>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>> and half.
>>>>
>>>> You can just run teamd with config option "kidnap" like this:
>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>
>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>> or whenever teamd sees another netdev to change mac to his,
>>>> it enslaves it.
>>>>
>>>> Here's the patch (quick and dirty):
>>>>
>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>
>>>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>>>
>>>So this doesn't really address the original problem we were trying to
>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>has to do with configuration. Specifically what our patch is
>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>upgrade their customer to SR-IOV support and live migration without
>>>requiring them to reconfigure their guest. So the general idea with
>>>our patch is to take a VM that is running with virtio_net only and
>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>name as the original virtio, and then have the virtio_net and VF come
>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>that was originally configured for virtio only.
>>>
>>>The problem with your solution is we already have teaming and bonding
>>>as you said. There is already a write-up from Red Hat on how to do it
>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>That is all well and good as long as you are willing to keep around
>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>
>> You don't need 2 images. You need only one. The one with the team setup.
>> That's it. If another netdev with the same mac appears, teamd will
>> enslave it and run traffic on it. If not, ok, you'll go only through
>> virtio_net.
>
>Isn't that going to cause the routing table to get messed up when we
>rearrange the netdevs? We don't want to have an significant disruption
> in traffic when we are adding/removing the VF. It seems like we would
>need to invalidate any entries that were configured for the virtio_net
>and reestablish them on the new team interface. Part of the criteria
>we have been working with is that we should be able t

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jiri Pirko
Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>> stuck with this ugly thing forever...
>>>>
>>>> Could you at least make some common code that is shared in between
>>>> netvsc and virtio_net so this is handled in exacly the same way in both?
>>>
>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>Let's not make a far, far more commonly deployed and important driver
>>>(virtio) bug-compatible with netvsc.
>>
>> Yeah. netvsc solution is a dangerous precedent here and in my opinition
>> it was a huge mistake to merge it. I personally would vote to unmerge it
>> and make the solution based on team/bond.
>>
>>
>>>
>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>user space.  I think it may very well get done in next versions of NM,
>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>using NM.
>>
>> Can be done in NM, networkd or other network management tools.
>> Even easier to do this in teamd and let them all benefit.
>>
>> Actually, I took a stab to implement this in teamd. Took me like an hour
>> and half.
>>
>> You can just run teamd with config option "kidnap" like this:
>> # teamd/teamd -c '{"kidnap": true }'
>>
>> Whenever teamd sees another netdev to appear with the same mac as his,
>> or whenever teamd sees another netdev to change mac to his,
>> it enslaves it.
>>
>> Here's the patch (quick and dirty):
>>
>> Subject: [patch teamd] teamd: introduce kidnap feature
>>
>> Signed-off-by: Jiri Pirko <j...@mellanox.com>
>
>So this doesn't really address the original problem we were trying to
>solve. You asked earlier why the netdev name mattered and it mostly
>has to do with configuration. Specifically what our patch is
>attempting to resolve is the issue of how to allow a cloud provider to
>upgrade their customer to SR-IOV support and live migration without
>requiring them to reconfigure their guest. So the general idea with
>our patch is to take a VM that is running with virtio_net only and
>allow it to instead spawn a virtio_bypass master using the same netdev
>name as the original virtio, and then have the virtio_net and VF come
>up and be enslaved by the bypass interface. Doing it this way we can
>allow for multi-vendor SR-IOV live migration support using a guest
>that was originally configured for virtio only.
>
>The problem with your solution is we already have teaming and bonding
>as you said. There is already a write-up from Red Hat on how to do it
>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>That is all well and good as long as you are willing to keep around
>two VM images, one for virtio, and one for SR-IOV with live migration.

You don't need 2 images. You need only one. The one with the team setup.
That's it. If another netdev with the same mac appears, teamd will
enslave it and run traffic on it. If not, ok, you'll go only through
virtio_net.


>The problem is nobody wants to do that. What they want is to maintain
>one guest image and if they decide to upgrade to SR-IOV they still
>want their live migration and they don't want to have to reconfigure
>the guest.
>
>That said it does seem to make the existing Red Hat solution easier to
>manage since you wouldn't be guessing at ifname so I have provided
>some feedback below.
>
>> ---
>>  include/team.h |  7 +++
>>  libteam/ifinfo.c   | 20 
>>  teamd/teamd.c  | 17 +
>>  teamd/teamd.h  |  5 +
>>  teamd/teamd_events.c   | 17 +
>>  teamd/teamd_ifinfo_watch.c |  9 +
>>  teamd/teamd_per_port.c |  7 ++-
>>  7 files changed, 81 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/team.h b/include/team.h
>> index 9ae517d..b0c19c8 100644
>> --- a/include/team.h
>> +++ b/include/team.h
>> @@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct 
>> team_handle *th,
>

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jiri Pirko
Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>> stuck with this ugly thing forever...
>> 
>> Could you at least make some common code that is shared in between
>> netvsc and virtio_net so this is handled in exacly the same way in both?
>
>IMHO netvsc is a vendor specific driver which made a mistake on what
>behaviour it provides (or tried to align itself with Windows SR-IOV).
>Let's not make a far, far more commonly deployed and important driver
>(virtio) bug-compatible with netvsc.

Yeah. netvsc solution is a dangerous precedent here and in my opinition
it was a huge mistake to merge it. I personally would vote to unmerge it
and make the solution based on team/bond.


>
>To Jiri's initial comments, I feel the same way, in fact I've talked to
>the NetworkManager guys to get auto-bonding based on MACs handled in
>user space.  I think it may very well get done in next versions of NM,
>but isn't done yet.  Stephen also raised the point that not everybody is
>using NM.

Can be done in NM, networkd or other network management tools.
Even easier to do this in teamd and let them all benefit.

Actually, I took a stab to implement this in teamd. Took me like an hour
and half.

You can just run teamd with config option "kidnap" like this:
# teamd/teamd -c '{"kidnap": true }'

Whenever teamd sees another netdev to appear with the same mac as his,
or whenever teamd sees another netdev to change mac to his,
it enslaves it.

Here's the patch (quick and dirty):

Subject: [patch teamd] teamd: introduce kidnap feature

Signed-off-by: Jiri Pirko <j...@mellanox.com>
---
 include/team.h |  7 +++
 libteam/ifinfo.c   | 20 
 teamd/teamd.c  | 17 +
 teamd/teamd.h  |  5 +
 teamd/teamd_events.c   | 17 +
 teamd/teamd_ifinfo_watch.c |  9 +
 teamd/teamd_per_port.c |  7 ++-
 7 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/include/team.h b/include/team.h
index 9ae517d..b0c19c8 100644
--- a/include/team.h
+++ b/include/team.h
@@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct 
team_handle *th,
 #define team_for_each_ifinfo(ifinfo, th)   \
for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;   \
 ifinfo = team_get_next_ifinfo(th, ifinfo))
+
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+ struct team_ifinfo *ifinfo);
+#define team_for_each_unlinked_ifinfo(ifinfo, th)  \
+   for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;  \
+ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
+
 /* ifinfo getters */
 bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
 uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
index 5c32a9c..8f9548e 100644
--- a/libteam/ifinfo.c
+++ b/libteam/ifinfo.c
@@ -494,6 +494,26 @@ struct team_ifinfo *team_get_next_ifinfo(struct 
team_handle *th,
return NULL;
 }
 
+/**
+ * @param th   libteam library context
+ * @param ifinfo   ifinfo structure
+ *
+ * @details Get next unlinked ifinfo in list.
+ *
+ * @return Ifinfo next to ifinfo passed.
+ **/
+TEAM_EXPORT
+struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
+ struct team_ifinfo *ifinfo)
+{
+   do {
+   ifinfo = list_get_next_node_entry(>ifinfo_list, ifinfo, 
list);
+   if (ifinfo && !ifinfo->linked)
+   return ifinfo;
+   } while (ifinfo);
+   return NULL;
+}
+
 /**
  * @param ifinfo   ifinfo structure
  *
diff --git a/teamd/teamd.c b/teamd/teamd.c
index aac2511..069c7f0 100644
--- a/teamd/teamd.c
+++ b/teamd/teamd.c
@@ -926,8 +926,25 @@ static int teamd_event_watch_port_added(struct 
teamd_context *ctx,
return 0;
 }
 
+static int teamd_event_watch_unlinked_hwaddr_changed(struct teamd_context *ctx,
+struct team_ifinfo *ifinfo,
+void *priv)
+{
+   int err;
+   bool kidnap;
+
+   err = teamd_config_bool_get(ctx, , "$.kidnap");
+   if (err || !kidnap ||
+   ctx->hwaddr_len != team_get_ifinfo_hwaddr_len(ifinfo) ||
+   memcmp(team_get_ifinfo_hwaddr(ifinfo),
+  ctx->hwaddr, ctx->hwaddr_len))
+   return 0;
+   return teamd_port_add(ctx, team_get_ifinfo_ifindex(ifinfo));
+}
+
 static const struct teamd_event_watch_ops teamd_port_watch_ops = {
.port_added = teamd_event_watch_port_added,
+   .unlinked_

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Jiri Pirko
Tue, Feb 20, 2018 at 06:14:32PM CET, sridhar.samudr...@intel.com wrote:
>On 2/20/2018 8:29 AM, Jiri Pirko wrote:
>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>> > On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> > > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>> > > > Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>> > > > used by hypervisor to indicate that virtio_net interface should act as
>> > > > a backup for another device with the same MAC address.
>> > > > 
>> > > > Ppatch 2 is in response to the community request for a 3 netdev
>> > > > solution.  However, it creates some issues we'll get into in a moment.
>> > > > It extends virtio_net to use alternate datapath when available and
>> > > > registered. When BACKUP feature is enabled, virtio_net driver creates
>> > > > an additional 'bypass' netdev that acts as a master device and controls
>> > > > 2 slave devices.  The original virtio_net netdev is registered as
>> > > > 'backup' netdev and a passthru/vf device with the same MAC gets
>> > > > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>> > > > associated with the same 'pci' device.  The user accesses the network
>> > > > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' 
>> > > > netdev
>> > > > as default for transmits when it is available with link up and running.
>> > > Sorry, but this is ridiculous. You are apparently re-implemeting part
>> > > of bonding driver as a part of NIC driver. Bond and team drivers
>> > > are mature solutions, well tested, broadly used, with lots of issues
>> > > resolved in the past. What you try to introduce is a weird shortcut
>> > > that already has couple of issues as you mentioned and will certanly
>> > > have many more. Also, I'm pretty sure that in future, someone comes up
>> > > with ideas like multiple VFs, LACP and similar bonding things.
>> > The problem with the bond and team drivers is they are too large and
>> > have too many interfaces available for configuration so as a result
>> > they can really screw this interface up.
>> What? Too large is which sense? Why "too many interfaces" is a problem?
>> Also, team has only one interface to userspace team-generic-netlink.
>> 
>> 
>> > Essentially this is meant to be a bond that is more-or-less managed by
>> > the host, not the guest. We want the host to be able to configure it
>> How is it managed by the host? In your usecase the guest has 2 netdevs:
>> virtio_net, pci vf.
>> I don't see how host can do any managing of that, other than the
>> obvious. But still, the active/backup decision is done in guest. This is
>> a simple bond/team usecase. As I said, there is something needed to be
>> implemented in userspace in order to handle re-appear of vf netdev.
>> But that should be fairly easy to do in teamd.
>
>The host manages the active/backup decision by
>- assigning the same MAC address to both VF and virtio interfaces
>- setting a BACKUP feature bit on virtio that enables virtio to transparently
>take
>  over the VFs datapath.
>- only enable one datapath at anytime so that packets don't get looped back
>- during live migration enable virtio datapth, unplug vf on the source and
>replug
>  vf on the destination.
>
>The VM is not expected and doesn't have any control of setting the MAC
>address
>or bringing up/down the links.
>
>This is the model that is currently supported with netvsc driver on Azure.

Yeah, I can see it now :( I guess that the ship has sailed and we are
stuck with this ugly thing forever...

Could you at least make some common code that is shared in between
netvsc and virtio_net so this is handled in exacly the same way in both?

The fact that the netvsc/virtio_net kidnaps a netdev only because it
has the same mac is going to give me some serious nighmares...
I think we need to introduce some more strict checks.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Jiri Pirko
Tue, Feb 20, 2018 at 06:23:49PM CET, alexander.du...@gmail.com wrote:
>On Tue, Feb 20, 2018 at 8:29 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>>>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>>>used by hypervisor to indicate that virtio_net interface should act as
>>>>>a backup for another device with the same MAC address.
>>>>>
>>>>>Ppatch 2 is in response to the community request for a 3 netdev
>>>>>solution.  However, it creates some issues we'll get into in a moment.
>>>>>It extends virtio_net to use alternate datapath when available and
>>>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>>an additional 'bypass' netdev that acts as a master device and controls
>>>>>2 slave devices.  The original virtio_net netdev is registered as
>>>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>>associated with the same 'pci' device.  The user accesses the network
>>>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>>as default for transmits when it is available with link up and running.
>>>>
>>>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>>> are mature solutions, well tested, broadly used, with lots of issues
>>>> resolved in the past. What you try to introduce is a weird shortcut
>>>> that already has couple of issues as you mentioned and will certanly
>>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>>> with ideas like multiple VFs, LACP and similar bonding things.
>>>
>>>The problem with the bond and team drivers is they are too large and
>>>have too many interfaces available for configuration so as a result
>>>they can really screw this interface up.
>>
>> What? Too large is which sense? Why "too many interfaces" is a problem?
>> Also, team has only one interface to userspace team-generic-netlink.
>
>Specifically I was working with bond. I had overlooked team for the
>most part since it required an additional userspace daemon which
>basically broke our requirement of no user-space intervention.

Why? That sound artificial. Why the userspace cannot be part of the
solution?


>
>I was trying to focus on just doing an active/backup setup. The
>problem is there are debugfs, sysfs, and procfs interfaces exposed
>that we don't need and/or want. Adding any sort of interface to
>exclude these would just bloat up the bonding driver, and leaving them
>in would just be confusing since they would all need to be ignored. In
>addition the steps needed to get the name to come out the same as the
>original virtio interface would just bloat up bonding.

Why to you care about "name"? it's a netdev, isn't it all that matters?

The viewpoint of the user inside vm boils down to:
1) I have 2 netdevs
2) One is preferred
3) I setup team on top of them

That's should be it. It is the users responsibility to do it this way.


>
>>>
>>>Essentially this is meant to be a bond that is more-or-less managed by
>>>the host, not the guest. We want the host to be able to configure it
>>
>> How is it managed by the host? In your usecase the guest has 2 netdevs:
>> virtio_net, pci vf.
>> I don't see how host can do any managing of that, other than the
>> obvious. But still, the active/backup decision is done in guest. This is
>> a simple bond/team usecase. As I said, there is something needed to be
>> implemented in userspace in order to handle re-appear of vf netdev.
>> But that should be fairly easy to do in teamd.
>>
>>
>>>and have it automatically kick in on the guest. For now we want to
>>>avoid adding too much complexity as this is meant to be just the first
>>
>> That's what I fear, "for now"..
>
>I used the expression "for now" as I see this being the first stage of
>a multi-stage process.

That is what I fear...


>
>Step 1 is to get a basic virtio-bypass driver added to virtio so that
>it is at least comparable to netvsc in terms of feature set and
>enables basic network live migrat

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Jiri Pirko
Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko <j...@resnulli.us> wrote:
>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>>used by hypervisor to indicate that virtio_net interface should act as
>>>a backup for another device with the same MAC address.
>>>
>>>Ppatch 2 is in response to the community request for a 3 netdev
>>>solution.  However, it creates some issues we'll get into in a moment.
>>>It extends virtio_net to use alternate datapath when available and
>>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>>an additional 'bypass' netdev that acts as a master device and controls
>>>2 slave devices.  The original virtio_net netdev is registered as
>>>'backup' netdev and a passthru/vf device with the same MAC gets
>>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>associated with the same 'pci' device.  The user accesses the network
>>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>as default for transmits when it is available with link up and running.
>>
>> Sorry, but this is ridiculous. You are apparently re-implemeting part
>> of bonding driver as a part of NIC driver. Bond and team drivers
>> are mature solutions, well tested, broadly used, with lots of issues
>> resolved in the past. What you try to introduce is a weird shortcut
>> that already has couple of issues as you mentioned and will certanly
>> have many more. Also, I'm pretty sure that in future, someone comes up
>> with ideas like multiple VFs, LACP and similar bonding things.
>
>The problem with the bond and team drivers is they are too large and
>have too many interfaces available for configuration so as a result
>they can really screw this interface up.

What? Too large is which sense? Why "too many interfaces" is a problem?
Also, team has only one interface to userspace team-generic-netlink.


>
>Essentially this is meant to be a bond that is more-or-less managed by
>the host, not the guest. We want the host to be able to configure it

How is it managed by the host? In your usecase the guest has 2 netdevs:
virtio_net, pci vf.
I don't see how host can do any managing of that, other than the
obvious. But still, the active/backup decision is done in guest. This is
a simple bond/team usecase. As I said, there is something needed to be
implemented in userspace in order to handle re-appear of vf netdev.
But that should be fairly easy to do in teamd.


>and have it automatically kick in on the guest. For now we want to
>avoid adding too much complexity as this is meant to be just the first

That's what I fear, "for now"..


>step. Trying to go in and implement the whole solution right from the
>start based on existing drivers is going to be a massive time sink and
>will likely never get completed due to the fact that there is always
>going to be some other thing that will interfere.

"implement the whole solution right from the start based on existing
drivers" - what solution are you talking about? I don't understand this
para.


>
>My personal hope is that we can look at doing a virtio-bond sort of
>device that will handle all this as well as providing a communication
>channel, but that is much further down the road. For now we only have
>a single bit so the goal for now is trying to keep this as simple as
>possible.

Oh. So there is really intention to do re-implementation of bonding
in virtio. That is plain-wrong in my opinion.

Could you just use bond/team, please, and don't reinvent the wheel with
this abomination?


>
>> What is the reason for this abomination? According to:
>> https://marc.info/?l=linux-virtualization=151189725224231=2
>> The reason is quite weak.
>> User in the vm sees 2 (or more) netdevices, he puts them in bond/team
>> and that's it. This works now! If the vm lacks some userspace features,
>> let's fix it there! For example the MAC changes is something that could
>> be easily handled in teamd userspace deamon.
>
>I think you might have missed the point of this. This is meant to be a
>simple interface so the guest should not be able to change the MAC
>address, and it shouldn't require any userspace daemon to setup or
>tear down. Ideally with this solution the virtio bypass will come up
>and be assigned the name of the original virtio, and the "backup"
>interface will come up and be assigned the name of the original virtio
>with an additional "nbackup" tacked on via the phys_port_name, and
>then whene

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Jiri Pirko
Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>used by hypervisor to indicate that virtio_net interface should act as
>a backup for another device with the same MAC address.
>
>Ppatch 2 is in response to the community request for a 3 netdev
>solution.  However, it creates some issues we'll get into in a moment.
>It extends virtio_net to use alternate datapath when available and
>registered. When BACKUP feature is enabled, virtio_net driver creates
>an additional 'bypass' netdev that acts as a master device and controls
>2 slave devices.  The original virtio_net netdev is registered as
>'backup' netdev and a passthru/vf device with the same MAC gets
>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>associated with the same 'pci' device.  The user accesses the network
>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>as default for transmits when it is available with link up and running.

Sorry, but this is ridiculous. You are apparently re-implemeting part
of bonding driver as a part of NIC driver. Bond and team drivers
are mature solutions, well tested, broadly used, with lots of issues
resolved in the past. What you try to introduce is a weird shortcut
that already has couple of issues as you mentioned and will certanly
have many more. Also, I'm pretty sure that in future, someone comes up
with ideas like multiple VFs, LACP and similar bonding things.

What is the reason for this abomination? According to:
https://marc.info/?l=linux-virtualization=151189725224231=2
The reason is quite weak.
User in the vm sees 2 (or more) netdevices, he puts them in bond/team
and that's it. This works now! If the vm lacks some userspace features,
let's fix it there! For example the MAC changes is something that could
be easily handled in teamd userspace deamon.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: bonding (IEEE 802.3ad) not working with qemu/virtio

2016-01-30 Thread Jiri Pirko
Sat, Jan 30, 2016 at 07:59:24AM CET, da...@davemloft.net wrote:
>From: Nikolay Aleksandrov 
>Date: Fri, 29 Jan 2016 22:48:26 +0100
>
>> On 01/29/2016 10:45 PM, Jay Vosburgh wrote:
>>> Nikolay Aleksandrov  wrote:
>>> 
 On 01/25/2016 05:24 PM, Bjørnar Ness wrote:
> As subject says, 802.3ad bonding is not working with virtio network model.
>
> The only errors I see is:
>
> No 802.3ad response from the link partner for any adapters in the bond.
>
> Dumping the network traffic shows that no LACP packets are sent from the
> host running with virtio driver, changing to for example e1000 solves
> this problem
> with no configuration changes.
>
> Is this a known problem?
>
 [Including bonding maintainers for comments]

 Hi,
 Here's a workaround patch for virtio_net devices that "cheats" the
 duplex test (which is the actual problem). I've tested this locally
 and it works for me.
 I'd let the others comment on the implementation, there're other signs
 that can be used to distinguish a virtio_net device so I'm open to 
 suggestions.
 Also feedback if this is at all acceptable would be appreciated.
>>> 
>>> Should virtio instead provide an arbitrary speed and full duplex
>>> to ethtool, as veth does?
>>> 
>>> Creating a magic whitelist of devices deep inside the 802.3ad
>>> implementation seems less desirable.
>>> 
>> TBH, I absolutely agree. In fact here's what we've been doing:
>> add set_settings which allows the user to set any speed/duplex
>> and get_settings of course to retrieve that. This is also useful
>> for testing other stuff that requires speed and duplex, not only
>> for the bonding case.
>
>I also agree.  Having a whitelist is just rediculous.
>
>There should be a default speed/duplex setting for such devices as well.
>We can pick one that will be use universally for these kinds of devices.

Exposing made up speed for veth and virtio_net sounds odd to me. User
see 1Mb/s but it makes no sense at all. It is just confusing.

I believe this is bonding bug and should be fixed in there. Team works
fine with virtio_net device and lacp runner.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next 1/4] net: introduce new priv_flag indicating iface capable of change mac when running

2012-06-29 Thread Jiri Pirko
Fri, Jun 29, 2012 at 12:17:34AM CEST, bhutchi...@solarflare.com wrote:
On Thu, 2012-06-28 at 16:10 +0200, Jiri Pirko wrote:
 Introduce IFF_LIFE_ADDR_CHANGE priv_flag and use it to disable
 netif_running() check in eth_mac_addr()

 Signed-off-by: Jiri Pirko jpi...@redhat.com
 ---
  include/linux/if.h |2 ++
  net/ethernet/eth.c |2 +-
  2 files changed, 3 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/if.h b/include/linux/if.h
 index f995c66..fd9ee7c 100644
 --- a/include/linux/if.h
 +++ b/include/linux/if.h
 @@ -81,6 +81,8 @@
  #define IFF_UNICAST_FLT 0x2 /* Supports unicast filtering   
 */
  #define IFF_TEAM_PORT   0x4 /* device used as team port */
  #define IFF_SUPP_NOFCS  0x8 /* device supports sending 
 custom FCS */
 +#define IFF_LIFE_ADDR_CHANGE 0x10   /* device supports hardware 
 address
 + * change when it's running */
[...]

Any device that has IFF_UNICAST_FLT can update the unicast MAC filter
while it's running; doesn't that go hand-in-hand with being able to
handle changes to the primary MAC address?  Is the new flag really
necessary at all?

Hmm, this makes sense. But, can you guarantee that all devices behave like this?

Also, there are many devices that does not support unicast filtering
and yet they support updating mac adress while running.

Jirka


Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[patch net-next v2 0/4] net: introduce and use IFF_LIFE_ADDR_CHANGE

2012-06-29 Thread Jiri Pirko
three drivers updated, but this can be used in many others.

v1-v2:
%s/LIFE/LIVE

Jiri Pirko (4):
  net: introduce new priv_flag indicating iface capable of change mac
when running
  virtio_net: use IFF_LIVE_ADDR_CHANGE priv_flag
  team: use IFF_LIVE_ADDR_CHANGE priv_flag
  dummy: use IFF_LIVE_ADDR_CHANGE priv_flag

 drivers/net/dummy.c  |   15 ++-
 drivers/net/team/team.c  |9 +
 drivers/net/virtio_net.c |   11 +--
 include/linux/if.h   |2 ++
 net/ethernet/eth.c   |2 +-
 5 files changed, 15 insertions(+), 24 deletions(-)

-- 
1.7.10.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[patch net-next v2 1/4] net: introduce new priv_flag indicating iface capable of change mac when running

2012-06-29 Thread Jiri Pirko
Introduce IFF_LIVE_ADDR_CHANGE priv_flag and use it to disable
netif_running() check in eth_mac_addr()

Signed-off-by: Jiri Pirko jpi...@redhat.com
---
 include/linux/if.h |2 ++
 net/ethernet/eth.c |2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/if.h b/include/linux/if.h
index f995c66..1ec407b 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -81,6 +81,8 @@
 #define IFF_UNICAST_FLT0x2 /* Supports unicast filtering   
*/
 #define IFF_TEAM_PORT  0x4 /* device used as team port */
 #define IFF_SUPP_NOFCS 0x8 /* device supports sending custom FCS */
+#define IFF_LIVE_ADDR_CHANGE 0x10  /* device supports hardware address
+* change when it's running */
 
 
 #define IF_GET_IFACE   0x0001  /* for querying only */
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 36e5880..db6a6c1 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -283,7 +283,7 @@ int eth_mac_addr(struct net_device *dev, void *p)
 {
struct sockaddr *addr = p;
 
-   if (netif_running(dev))
+   if (!(dev-priv_flags  IFF_LIVE_ADDR_CHANGE)  netif_running(dev))
return -EBUSY;
if (!is_valid_ether_addr(addr-sa_data))
return -EADDRNOTAVAIL;
-- 
1.7.10.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[patch net-next v2 3/4] team: use IFF_LIVE_ADDR_CHANGE priv_flag

2012-06-29 Thread Jiri Pirko
Signed-off-by: Jiri Pirko jpi...@redhat.com
---
 drivers/net/team/team.c |9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 89853c3..9b94f53 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1188,10 +1188,11 @@ static int team_set_mac_address(struct net_device *dev, 
void *p)
 {
struct team *team = netdev_priv(dev);
struct team_port *port;
-   struct sockaddr *addr = p;
+   int err;
 
-   dev-addr_assign_type = ~NET_ADDR_RANDOM;
-   memcpy(dev-dev_addr, addr-sa_data, ETH_ALEN);
+   err = eth_mac_addr(dev, p);
+   if (err)
+   return err;
rcu_read_lock();
list_for_each_entry_rcu(port, team-port_list, list)
if (team-ops.port_change_mac)
@@ -1393,7 +1394,7 @@ static void team_setup(struct net_device *dev)
 * bring us to promisc mode in case a unicast addr is added.
 * Let this up to underlay drivers.
 */
-   dev-priv_flags |= IFF_UNICAST_FLT;
+   dev-priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
 
dev-features |= NETIF_F_LLTX;
dev-features |= NETIF_F_GRO;
-- 
1.7.10.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[patch net-next v2 4/4] dummy: use IFF_LIVE_ADDR_CHANGE priv_flag

2012-06-29 Thread Jiri Pirko
Signed-off-by: Jiri Pirko jpi...@redhat.com
---
 drivers/net/dummy.c |   15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index bab0158..9d6a067 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -40,18 +40,6 @@
 
 static int numdummies = 1;
 
-static int dummy_set_address(struct net_device *dev, void *p)
-{
-   struct sockaddr *sa = p;
-
-   if (!is_valid_ether_addr(sa-sa_data))
-   return -EADDRNOTAVAIL;
-
-   dev-addr_assign_type = ~NET_ADDR_RANDOM;
-   memcpy(dev-dev_addr, sa-sa_data, ETH_ALEN);
-   return 0;
-}
-
 /* fake multicast ability */
 static void set_multicast_list(struct net_device *dev)
 {
@@ -118,7 +106,7 @@ static const struct net_device_ops dummy_netdev_ops = {
.ndo_start_xmit = dummy_xmit,
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_rx_mode= set_multicast_list,
-   .ndo_set_mac_address= dummy_set_address,
+   .ndo_set_mac_address= eth_mac_addr,
.ndo_get_stats64= dummy_get_stats64,
 };
 
@@ -134,6 +122,7 @@ static void dummy_setup(struct net_device *dev)
dev-tx_queue_len = 0;
dev-flags |= IFF_NOARP;
dev-flags = ~IFF_MULTICAST;
+   dev-priv_flags |= IFF_LIVE_ADDR_CHANGE;
dev-features   |= NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO;
dev-features   |= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
eth_hw_addr_random(dev);
-- 
1.7.10.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch net-next] virtio_net: allow to change mac when iface is running

2012-06-28 Thread Jiri Pirko
Thu, Jun 28, 2012 at 06:30:46AM CEST, da...@davemloft.net wrote:
From: Jiri Pirko jpi...@redhat.com
Date: Wed, 27 Jun 2012 17:27:46 +0200

 Signed-off-by: Jiri Pirko jpi...@redhat.com

Applied, but this seriously makes eth_mac_addr() completely useless.

Technically, every eth_mac_addr() user in a software/virtual device
should behave the way virtio_net does now.

I guess to. But for some HW devices eth_mac_addr() is needed (when they
does not support life mac change)


It therefore probably makes sense to add a boolean arg which when true
elides the netif_running() check then fixup and audit every caller.

I was thinking about this. Maybe probably __eth_mac_addr() which does
not have netif_running() check and eth_mac_addr() calling
netif_running() check and __eth_mac_addr() after that.

What do you think?

Jirka
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[patch net-next 1/4] net: introduce new priv_flag indicating iface capable of change mac when running

2012-06-28 Thread Jiri Pirko
Introduce IFF_LIFE_ADDR_CHANGE priv_flag and use it to disable
netif_running() check in eth_mac_addr()

Signed-off-by: Jiri Pirko jpi...@redhat.com
---
 include/linux/if.h |2 ++
 net/ethernet/eth.c |2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/if.h b/include/linux/if.h
index f995c66..fd9ee7c 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -81,6 +81,8 @@
 #define IFF_UNICAST_FLT0x2 /* Supports unicast filtering   
*/
 #define IFF_TEAM_PORT  0x4 /* device used as team port */
 #define IFF_SUPP_NOFCS 0x8 /* device supports sending custom FCS */
+#define IFF_LIFE_ADDR_CHANGE 0x10  /* device supports hardware address
+* change when it's running */
 
 
 #define IF_GET_IFACE   0x0001  /* for querying only */
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 36e5880..8f8ded4 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -283,7 +283,7 @@ int eth_mac_addr(struct net_device *dev, void *p)
 {
struct sockaddr *addr = p;
 
-   if (netif_running(dev))
+   if (!(dev-priv_flags  IFF_LIFE_ADDR_CHANGE)  netif_running(dev))
return -EBUSY;
if (!is_valid_ether_addr(addr-sa_data))
return -EADDRNOTAVAIL;
-- 
1.7.10.4

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


  1   2   >