Re: [RFC 0/3] Adding config get/set to devlink

2017-10-13 Thread Roopa Prabhu
On Fri, Oct 13, 2017 at 12:11 AM, Jiri Pirko  wrote:
> Thu, Oct 12, 2017 at 11:53:56PM CEST, ro...@cumulusnetworks.com wrote:
>>On Thu, Oct 12, 2017 at 12:20 PM, Florian Fainelli wrote:
>>> On 10/12/2017 12:06 PM, David Miller wrote:
>>>> From: Florian Fainelli
>>>> Date: Thu, 12 Oct 2017 08:43:59 -0700
>>>>
>>>>> Once we move ethtool (or however we name its successor) over to
>>>>> netlink there is an opportunity for accessing objects that do and do
>>>>> not have a netdevice representor today (e.g: management ports on
>>>>> switches) with the same interface, and devlink could be used for
>>>>> that.
>>>>
>>>> That is an interesting angle for including this in devlink.
>>>>
>>>> I'm not so sure what to do about this.
>>>>
>>>> One suggestion is that devlink is used for getting ethtool stats for
>>>> objects lacking netdev representor's, and a new genetlink family is
>>>> used for netdev based ethtool.
>>>
>>> Right, I was also thinking along those lines that we would have a new
>>> generic netlink family for ethtool to support ethtool over netlink.
>>
>>new api is fine by me. The reason for suggesting devlink was because
>>some of the devlink
>>port_* ops are close to ethtool ops that can operate on a port/netdev.
>>eg split_port could be a netdev operation
>>unless you want to split before the netdev is created.
>
> Let me correct you. The split is always devlink_port operation. In some
> cases however when there is a mapping between devlink_port and netdev,
> userspace part could translate netdev->devlink_port.

Yes, that's what I was trying to hint at: in some cases a devlink_port
can already be mapped to a netdev.

>
>
>>
>>There are some ops in devlink which are global hw parameters and not
>>specific to a port, those fit perfectly with
>>devlinks original goal.
>
> There are 2 handles from the very beginning:
> 1) devlink - asic-wide handle
> 2) devlink_port - port handle

Yep, I know that, and I was not trying to say that is a bad thing.

I think we will end up with devlink_port operations that could also be
done on a netdev down the line, and we may then have to argue about
where an attribute should go. Hence my suggestion to classify the API by
its target (the driver in this case, versus kernel networking for
rtnetlink). If you take netdev out of the picture, the port attributes
that devlink tries to set are similar to the ethtool port attributes
today. Also, the new port attribute set API proposed in this thread
seemed close to the ethtool attribute set. Having all link hw attributes
in the same tool/API has advantages. I have no plans to move anything
yet, so if the general preference is to keep devlink netdev-free for
now, that's fine.


[PATCH net-next 1/2] net: move memcpy_to[from]_msg() from skbuff.h to socket.h

2017-10-13 Thread yuan linyu
From: yuan linyu 

These two functions are used by skb code and other places;
move them to socket.h, where struct msghdr is defined.

Signed-off-by: yuan linyu 
---
 include/linux/skbuff.h | 10 --
 include/linux/socket.h | 12 +++-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03634ec2..90868d1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3294,16 +3294,6 @@ int skb_vlan_push(struct sk_buff *skb, __be16 
vlan_proto, u16 vlan_tci);
 struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
 gfp_t gfp);
 
-static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
-{
-   return copy_from_iter_full(data, len, &msg->msg_iter) ? 0 : -EFAULT;
-}
-
-static inline int memcpy_to_msg(struct msghdr *msg, void *data, int len)
-{
-   return copy_to_iter(data, len, &msg->msg_iter) == len ? 0 : -EFAULT;
-}
-
 struct skb_checksum_ops {
__wsum (*update)(const void *mem, int len, __wsum wsum);
__wsum (*combine)(__wsum csum, __wsum csum2, int offset, int len);
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8ad963c..c414f1f 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -53,7 +53,17 @@ struct msghdr {
unsigned intmsg_flags;  /* flags on received message */
struct kiocb*msg_iocb;  /* ptr to iocb for async requests */
 };
- 
+
+static inline int memcpy_from_msg(void *data, struct msghdr *msg, int len)
+{
+   return copy_from_iter_full(data, len, &msg->msg_iter) ? 0 : -EFAULT;
+}
+
+static inline int memcpy_to_msg(struct msghdr *msg, void *data, int len)
+{
+   return copy_to_iter(data, len, &msg->msg_iter) == len ? 0 : -EFAULT;
+}
+
 struct user_msghdr {
void__user *msg_name;   /* ptr to socket address 
structure */
int msg_namelen;/* size of socket address 
structure */
-- 
2.7.4
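
For readers who want to see the return convention of the moved helpers in isolation, here is a minimal userspace sketch. It is not kernel code: `toy_iter`, `toy_msghdr`, `toy_copy_from_iter_full` and `toy_memcpy_from_msg` are hypothetical stand-ins, and the iterator is assumed to be a single flat buffer. It only illustrates the "0 on a full copy, -EFAULT otherwise" contract visible in the diff.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the kernel's iov_iter: one flat source buffer. */
struct toy_iter {
	const unsigned char *pos;   /* next unread byte */
	size_t remaining;           /* bytes still available */
};

/* Toy stand-in for struct msghdr; only the iterator is modelled. */
struct toy_msghdr {
	struct toy_iter msg_iter;
};

/* All-or-nothing copy: consume len bytes, or consume nothing and fail. */
static bool toy_copy_from_iter_full(void *dst, size_t len, struct toy_iter *it)
{
	if (len > it->remaining)
		return false;
	memcpy(dst, it->pos, len);
	it->pos += len;
	it->remaining -= len;
	return true;
}

/* Same shape as the moved helper: 0 on a full copy, -EFAULT otherwise. */
static int toy_memcpy_from_msg(void *data, struct toy_msghdr *msg, int len)
{
	return toy_copy_from_iter_full(data, (size_t)len, &msg->msg_iter) ? 0 : -EFAULT;
}
```

In this model a short source leaves the iterator untouched, so a caller can tell a failed copy from a partial one.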




[PATCH net-next 2/2] net: add skb_memcpy_to[from]_msg() to optimize skb code

2017-10-13 Thread yuan linyu
From: yuan linyu 

Add these two wrappers to skbuff.h; they are better named
than the previous helpers and are used only for skb.

Signed-off-by: yuan linyu 
---
 drivers/isdn/mISDN/socket.c|  2 +-
 drivers/staging/irda/net/af_irda.c |  2 +-
 include/linux/skbuff.h | 10 ++
 net/appletalk/ddp.c|  2 +-
 net/ax25/af_ax25.c |  2 +-
 net/bluetooth/hci_sock.c   |  4 ++--
 net/bluetooth/rfcomm/sock.c|  2 +-
 net/bluetooth/sco.c|  2 +-
 net/caif/caif_socket.c |  6 +++---
 net/can/bcm.c  |  4 ++--
 net/can/raw.c  |  4 ++--
 net/dccp/proto.c   |  2 +-
 net/decnet/af_decnet.c |  4 ++--
 net/ieee802154/socket.c|  4 ++--
 net/ipx/ipx_route.c|  2 +-
 net/key/af_key.c   |  2 +-
 net/l2tp/l2tp_ip.c |  2 +-
 net/l2tp/l2tp_ppp.c|  2 +-
 net/llc/af_llc.c   |  2 +-
 net/netlink/af_netlink.c   |  2 +-
 net/nfc/rawsock.c  |  2 +-
 net/packet/af_packet.c |  2 +-
 net/phonet/datagram.c  |  2 +-
 net/phonet/pep.c   |  2 +-
 24 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/drivers/isdn/mISDN/socket.c b/drivers/isdn/mISDN/socket.c
index c5603d1..19ecf62 100644
--- a/drivers/isdn/mISDN/socket.c
+++ b/drivers/isdn/mISDN/socket.c
@@ -202,7 +202,7 @@ mISDN_sock_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t len)
if (!skb)
goto done;
 
-   if (memcpy_from_msg(skb_put(skb, len), msg, len)) {
+   if (skb_memcpy_from_msg(skb, msg, len)) {
err = -EFAULT;
goto done;
}
diff --git a/drivers/staging/irda/net/af_irda.c 
b/drivers/staging/irda/net/af_irda.c
index 23fa7c8..159fc1a 100644
--- a/drivers/staging/irda/net/af_irda.c
+++ b/drivers/staging/irda/net/af_irda.c
@@ -1469,7 +1469,7 @@ static int irda_recvmsg_stream(struct socket *sock, 
struct msghdr *msg,
}
 
chunk = min_t(unsigned int, skb->len, size);
-   if (memcpy_to_msg(msg, skb->data, chunk)) {
+   if (skb_memcpy_to_msg(msg, skb, chunk)) {
skb_queue_head(&sk->sk_receive_queue, skb);
if (copied == 0)
copied = -EFAULT;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 90868d1..901fa60 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3294,6 +3294,16 @@ int skb_vlan_push(struct sk_buff *skb, __be16 
vlan_proto, u16 vlan_tci);
 struct sk_buff *pskb_extract(struct sk_buff *skb, int off, int to_copy,
 gfp_t gfp);
 
+static inline int skb_memcpy_from_msg(struct sk_buff *skb, struct msghdr *msg, 
int len)
+{
+   return memcpy_from_msg(skb_put(skb, len), msg, len);
+}
+
+static inline int skb_memcpy_to_msg(struct msghdr *msg, struct sk_buff *skb, 
int len)
+{
+   return memcpy_to_msg(msg, skb->data, len);
+}
+
 struct skb_checksum_ops {
__wsum (*update)(const void *mem, int len, __wsum wsum);
__wsum (*combine)(__wsum csum, __wsum csum2, int offset, int len);
diff --git a/net/appletalk/ddp.c b/net/appletalk/ddp.c
index 5d035c1..c7846c3 100644
--- a/net/appletalk/ddp.c
+++ b/net/appletalk/ddp.c
@@ -1658,7 +1658,7 @@ static int atalk_sendmsg(struct socket *sock, struct 
msghdr *msg, size_t len)
 
SOCK_DEBUG(sk, "SK %p: Copy user data (%zd bytes).\n", sk, len);
 
-   err = memcpy_from_msg(skb_put(skb, len), msg, len);
+   err = skb_memcpy_from_msg(skb, msg, len);
if (err) {
kfree_skb(skb);
err = -EFAULT;
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index f3f9d18..442763e 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1552,7 +1552,7 @@ static int ax25_sendmsg(struct socket *sock, struct 
msghdr *msg, size_t len)
skb_reserve(skb, size - len);
 
/* User data follows immediately after the AX.25 data */
-   if (memcpy_from_msg(skb_put(skb, len), msg, len)) {
+   if (skb_memcpy_from_msg(skb, msg, len)) {
err = -EFAULT;
kfree_skb(skb);
goto out;
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 65d734c..349c79a 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -1601,7 +1601,7 @@ static int hci_logging_frame(struct sock *sk, struct 
msghdr *msg, int len)
if (!skb)
return err;
 
-   if (memcpy_from_msg(skb_put(skb, len), msg, len)) {
+   if (skb_memcpy_from_msg(skb, msg, len)) {
err = -EFAULT;
goto drop;
}
@@ -1726,7 +1726,7 @@ static int hci_sock_sendmsg(struct socket *sock, struct 
msghdr *msg,
if (!skb)
goto done;
 
-   if 

[PATCH net-next 0/2] net: add skb_memcpy_to[from]_msg()

2017-10-13 Thread yuan linyu
From: yuan linyu 



yuan linyu (2):
  net: move memcpy_to[from]_msg() from skbuff.h to socket.h
  net: add skb_memcpy_to[from]_msg() to optimize skb code

 drivers/isdn/mISDN/socket.c|  2 +-
 drivers/staging/irda/net/af_irda.c |  2 +-
 include/linux/skbuff.h |  8 
 include/linux/socket.h | 12 +++-
 net/appletalk/ddp.c|  2 +-
 net/ax25/af_ax25.c |  2 +-
 net/bluetooth/hci_sock.c   |  4 ++--
 net/bluetooth/rfcomm/sock.c|  2 +-
 net/bluetooth/sco.c|  2 +-
 net/caif/caif_socket.c |  6 +++---
 net/can/bcm.c  |  4 ++--
 net/can/raw.c  |  4 ++--
 net/dccp/proto.c   |  2 +-
 net/decnet/af_decnet.c |  4 ++--
 net/ieee802154/socket.c|  4 ++--
 net/ipx/ipx_route.c|  2 +-
 net/key/af_key.c   |  2 +-
 net/l2tp/l2tp_ip.c |  2 +-
 net/l2tp/l2tp_ppp.c|  2 +-
 net/llc/af_llc.c   |  2 +-
 net/netlink/af_netlink.c   |  2 +-
 net/nfc/rawsock.c  |  2 +-
 net/packet/af_packet.c |  2 +-
 net/phonet/datagram.c  |  2 +-
 net/phonet/pep.c   |  2 +-
 25 files changed, 45 insertions(+), 35 deletions(-)

-- 
2.7.4
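
As a rough illustration of what the proposed skb wrapper does for callers, the following userspace sketch models it with toy types. All names (`toy_skb`, `toy_msg`, `toy_skb_put`, `toy_msg_read`, `toy_skb_memcpy_from_msg`) are hypothetical, and unlike the real skb_put(), which panics on overflow, the toy version returns NULL so the example stays testable.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_SKB_CAP 64

/* Toy stand-in for struct sk_buff: a fixed buffer plus a used length. */
struct toy_skb {
	unsigned char data[TOY_SKB_CAP];
	size_t len;
};

/* Toy stand-in for a message payload that is consumed as it is read. */
struct toy_msg {
	const unsigned char *pos;
	size_t remaining;
};

/* Like skb_put(): grow the used area by len and return the old tail.
 * Assumption: returns NULL on overflow instead of panicking. */
static void *toy_skb_put(struct toy_skb *skb, size_t len)
{
	void *tail;

	if (len > TOY_SKB_CAP - skb->len)
		return NULL;
	tail = skb->data + skb->len;
	skb->len += len;
	return tail;
}

/* Full copy or -EFAULT, mirroring the memcpy_from_msg() convention. */
static int toy_msg_read(void *dst, struct toy_msg *msg, size_t len)
{
	if (len > msg->remaining)
		return -EFAULT;
	memcpy(dst, msg->pos, len);
	msg->pos += len;
	msg->remaining -= len;
	return 0;
}

/* The wrapper: callers pass the skb and no longer spell out skb_put(). */
static int toy_skb_memcpy_from_msg(struct toy_skb *skb, struct toy_msg *msg, size_t len)
{
	void *tail = toy_skb_put(skb, len);

	if (!tail)
		return -ENOMEM;
	return toy_msg_read(tail, msg, len);
}
```

Note that, as at the original call sites, the length is committed by the put before the copy runs, so a failed copy still leaves the skb extended; callers in the patch free the skb on error.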




[PATCH net v2 2/2] net: fec: Let fec_ptp have its own interrupt routine

2017-10-13 Thread Troy Kisky
This is better for code locality and should slightly
speed up normal interrupts.

This also allows PPS clock output to start working for
i.mx7. This is because i.mx7 was already using the limit
of 3 interrupts, and needed another.

Signed-off-by: Troy Kisky 

---

v2: made this change independent of any devicetree change
so that old dtbs continue to work.

Continue to register ptp clock if interrupt is not found.
---
 drivers/net/ethernet/freescale/fec.h  |  3 +-
 drivers/net/ethernet/freescale/fec_main.c | 25 ++
 drivers/net/ethernet/freescale/fec_ptp.c  | 82 ++-
 3 files changed, 65 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h 
b/drivers/net/ethernet/freescale/fec.h
index ede1876a9a19..be56ac1f1ac4 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -582,12 +582,11 @@ struct fec_enet_private {
u64 ethtool_stats[0];
 };
 
-void fec_ptp_init(struct platform_device *pdev);
+void fec_ptp_init(struct platform_device *pdev, int irq_index);
 void fec_ptp_stop(struct platform_device *pdev);
 void fec_ptp_start_cyclecounter(struct net_device *ndev);
 int fec_ptp_set(struct net_device *ndev, struct ifreq *ifr);
 int fec_ptp_get(struct net_device *ndev, struct ifreq *ifr);
-uint fec_ptp_check_pps_event(struct fec_enet_private *fep);
 
 //
 #endif /* FEC_H */
diff --git a/drivers/net/ethernet/freescale/fec_main.c 
b/drivers/net/ethernet/freescale/fec_main.c
index 3dc2d771a222..21afabbc560f 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1602,10 +1602,6 @@ fec_enet_interrupt(int irq, void *dev_id)
ret = IRQ_HANDLED;
complete(&fep->mdio_done);
}
-
-   if (fep->ptp_clock)
-   if (fec_ptp_check_pps_event(fep))
-   ret = IRQ_HANDLED;
return ret;
 }
 
@@ -3325,6 +3321,8 @@ fec_probe(struct platform_device *pdev)
struct device_node *np = pdev->dev.of_node, *phy_node;
int num_tx_qs;
int num_rx_qs;
+   char irq_name[8];
+   int irq_cnt;
 
fec_enet_get_queue_num(pdev, &num_tx_qs, &num_rx_qs);
 
@@ -3465,18 +3463,27 @@ fec_probe(struct platform_device *pdev)
if (ret)
goto failed_reset;
 
+   irq_cnt = platform_irq_count(pdev);
+   if (irq_cnt > FEC_IRQ_NUM)
+   irq_cnt = FEC_IRQ_NUM;  /* last for ptp */
+   else if (irq_cnt == 2)
+   irq_cnt = 1;/* last for ptp */
+   else if (irq_cnt <= 0)
+   irq_cnt = 1;/* Let the for loop fail */
+
if (fep->bufdesc_ex)
-   fec_ptp_init(pdev);
+   fec_ptp_init(pdev, irq_cnt);
 
ret = fec_enet_init(ndev);
if (ret)
goto failed_init;
 
-   for (i = 0; i < FEC_IRQ_NUM; i++) {
-   irq = platform_get_irq(pdev, i);
+   for (i = 0; i < irq_cnt; i++) {
+   sprintf(irq_name, "int%d", i);
+   irq = platform_get_irq_byname(pdev, irq_name);
+   if (irq < 0)
+   irq = platform_get_irq(pdev, i);
if (irq < 0) {
-   if (i)
-   break;
ret = irq;
goto failed_irq;
}
diff --git a/drivers/net/ethernet/freescale/fec_ptp.c 
b/drivers/net/ethernet/freescale/fec_ptp.c
index 6ebad3fac81d..3abeee0d16dd 100644
--- a/drivers/net/ethernet/freescale/fec_ptp.c
+++ b/drivers/net/ethernet/freescale/fec_ptp.c
@@ -549,6 +549,37 @@ static void fec_time_keep(struct work_struct *work)
schedule_delayed_work(&fep->time_keep, HZ);
 }
 
+/* This function checks the pps event and reloads the timer compare counter. */
+static irqreturn_t fec_ptp_interrupt(int irq, void *dev_id)
+{
+   struct net_device *ndev = dev_id;
+   struct fec_enet_private *fep = netdev_priv(ndev);
+   u32 val;
+   u8 channel = fep->pps_channel;
+   struct ptp_clock_event event;
+
+   val = readl(fep->hwp + FEC_TCSR(channel));
+   if (val & FEC_T_TF_MASK) {
+   /* Write the next next compare(not the next according the spec)
+* value to the register
+*/
+   writel(fep->next_counter, fep->hwp + FEC_TCCR(channel));
+   do {
+   writel(val, fep->hwp + FEC_TCSR(channel));
+   } while (readl(fep->hwp + FEC_TCSR(channel)) & FEC_T_TF_MASK);
+
+   /* Update the counter; */
+   fep->next_counter = (fep->next_counter + fep->reload_period) &
+   fep->cc.mask;
+
+   event.type = PTP_CLOCK_PPS;
+   ptp_clock_event(fep->ptp_clock, &event);
+   return IRQ_HANDLED;
+   }
+
+   return IRQ_NONE;
+}
+
 /**
  * 

[PATCH net v2 1/2] ARM: dts: imx: name the interrupts for the fec ethernet driver

2017-10-13 Thread Troy Kisky
imx7s/imx7d has the ptp interrupt newly added as well.

For imx7, "int0" is the interrupt for queue 0 and ENET_MII
"int1" is for queue 1
"int2" is for queue 2

For imx6sx, "int0" handles all 3 queues and ENET_MII

And of course, the "ptp" interrupt is for the PTP_CLOCK_PPS interrupts.
This will help document what each interrupt does.

Signed-off-by: Troy Kisky 

---
v2: replaced empty names with "int0","int1", or "int2"

reordered imx7 interrupts so that "int0", for queue 0, is first.
---
 arch/arm/boot/dts/imx6qdl.dtsi | 1 +
 arch/arm/boot/dts/imx6sx.dtsi  | 2 ++
 arch/arm/boot/dts/imx6ul.dtsi  | 2 ++
 arch/arm/boot/dts/imx7d.dtsi   | 6 --
 arch/arm/boot/dts/imx7s.dtsi   | 6 --
 5 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/arm/boot/dts/imx6qdl.dtsi b/arch/arm/boot/dts/imx6qdl.dtsi
index 8884b4a3cafb..b78f7e7a0869 100644
--- a/arch/arm/boot/dts/imx6qdl.dtsi
+++ b/arch/arm/boot/dts/imx6qdl.dtsi
@@ -1017,6 +1017,7 @@
fec: ethernet@02188000 {
compatible = "fsl,imx6q-fec";
reg = <0x02188000 0x4000>;
+   interrupt-names = "int0","ptp";
interrupts-extended =
< 0 118 IRQ_TYPE_LEVEL_HIGH>,
< 0 119 IRQ_TYPE_LEVEL_HIGH>;
diff --git a/arch/arm/boot/dts/imx6sx.dtsi b/arch/arm/boot/dts/imx6sx.dtsi
index 6c7eb54be9e2..ed148775f991 100644
--- a/arch/arm/boot/dts/imx6sx.dtsi
+++ b/arch/arm/boot/dts/imx6sx.dtsi
@@ -861,6 +861,7 @@
fec1: ethernet@02188000 {
compatible = "fsl,imx6sx-fec", "fsl,imx6q-fec";
reg = <0x02188000 0x4000>;
+   interrupt-names = "int0","ptp";
interrupts = ,
 ;
clocks = < IMX6SX_CLK_ENET>,
@@ -970,6 +971,7 @@
fec2: ethernet@021b4000 {
compatible = "fsl,imx6sx-fec", "fsl,imx6q-fec";
reg = <0x021b4000 0x4000>;
+   interrupt-names = "int0","ptp";
interrupts = ,
 ;
clocks = < IMX6SX_CLK_ENET>,
diff --git a/arch/arm/boot/dts/imx6ul.dtsi b/arch/arm/boot/dts/imx6ul.dtsi
index f11a241a340d..b624b5fc2d99 100644
--- a/arch/arm/boot/dts/imx6ul.dtsi
+++ b/arch/arm/boot/dts/imx6ul.dtsi
@@ -476,6 +476,7 @@
fec2: ethernet@020b4000 {
compatible = "fsl,imx6ul-fec", "fsl,imx6q-fec";
reg = <0x020b4000 0x4000>;
+   interrupt-names = "int0","ptp";
interrupts = ,
 ;
clocks = < IMX6UL_CLK_ENET>,
@@ -775,6 +776,7 @@
fec1: ethernet@02188000 {
compatible = "fsl,imx6ul-fec", "fsl,imx6q-fec";
reg = <0x02188000 0x4000>;
+   interrupt-names = "int0","ptp";
interrupts = ,
 ;
clocks = < IMX6UL_CLK_ENET>,
diff --git a/arch/arm/boot/dts/imx7d.dtsi b/arch/arm/boot/dts/imx7d.dtsi
index f46814a7ea44..312d24ff106e 100644
--- a/arch/arm/boot/dts/imx7d.dtsi
+++ b/arch/arm/boot/dts/imx7d.dtsi
@@ -114,9 +114,11 @@
fec2: ethernet@30bf {
compatible = "fsl,imx7d-fec", "fsl,imx6sx-fec";
reg = <0x30bf 0x1>;
-   interrupts = ,
+   interrupt-names = "int0","int1","int2","ptp";
+   interrupts = ,
+   ,
,
-   ;
+   ;
clocks = < IMX7D_ENET_AXI_ROOT_CLK>,
< IMX7D_ENET_AXI_ROOT_CLK>,
< IMX7D_ENET2_TIME_ROOT_CLK>,
diff --git a/arch/arm/boot/dts/imx7s.dtsi b/arch/arm/boot/dts/imx7s.dtsi
index 82ad26e766eb..b00a31a50771 100644
--- a/arch/arm/boot/dts/imx7s.dtsi
+++ b/arch/arm/boot/dts/imx7s.dtsi
@@ -1007,9 +1007,11 @@
fec1: ethernet@30be {
compatible = "fsl,imx7d-fec", "fsl,imx6sx-fec";
reg = <0x30be 0x1>;
-   interrupts = ,
+   interrupt-names = "int0","int1","int2","ptp";
+   interrupts = ,
+   ,
,
-   ;
+   ;
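
The interrupt handling described in these two fec patches (clamp the number of queue interrupts, look each one up by name, fall back to the positional index for old dtbs) can be sketched in plain C roughly as follows. This is a userspace model, not driver code: `TOY_FEC_IRQ_NUM`, `toy_queue_irq_count`, `toy_irq` and `toy_get_irq` are hypothetical names, the limit of 3 is taken from the commit text, and the "ptp" lookup is an assumption since that part of the driver diff is cut off above.

```c
#include <stddef.h>
#include <string.h>

#define TOY_FEC_IRQ_NUM 3   /* assumed queue-interrupt limit, per the commit text */

/* Number of queue interrupts to request, given the platform's total count.
 * Mirrors the clamp in the fec_probe() hunk. */
static int toy_queue_irq_count(int platform_irqs)
{
	if (platform_irqs > TOY_FEC_IRQ_NUM)
		return TOY_FEC_IRQ_NUM;     /* the one after these is for ptp */
	if (platform_irqs == 2)
		return 1;                   /* the second one is for ptp */
	if (platform_irqs <= 0)
		return 1;                   /* the lookup of index 0 will then fail */
	return platform_irqs;
}

struct toy_irq {
	const char *name;   /* NULL models an old dtb without interrupt-names */
	int number;
};

/* Prefer a lookup by name; fall back to the positional index. */
static int toy_get_irq(const struct toy_irq *tbl, int n, const char *name, int index)
{
	for (int i = 0; i < n; i++)
		if (tbl[i].name && strcmp(tbl[i].name, name) == 0)
			return tbl[i].number;
	if (index >= 0 && index < n)
		return tbl[index].number;
	return -1;   /* stands in for a negative errno */
}
```

With a named table the order of entries no longer matters; with an unnamed one the index fallback keeps the old positional behaviour.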

RE: [Intel-wired-lan] [PATCH][V3] e1000: avoid null pointer dereference on invalid stat type

2017-10-13 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf
> Of Colin King
> Sent: Friday, September 22, 2017 10:14 AM
> To: Kirsher, Jeffrey T ; intel-wired-
> l...@lists.osuosl.org; netdev@vger.kernel.org
> Cc: kernel-janit...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH][V3] e1000: avoid null pointer dereference
> on invalid stat type
> 
> From: Colin Ian King 
> 
> Currently if the stat type is invalid then data[i] is being set
> either by dereferencing a null pointer p, or it is reading from
> an incorrect previous location if we had a valid stat type
> previously.  Fix this by skipping over the read of p on an invalid
> stat type.
> 
> Detected by CoverityScan, CID#113385 ("Explicit null dereferenced")
> 
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/ethernet/intel/e1000/e1000_ethtool.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)

Tested-by: Aaron Brown 
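
The fix described in the quoted commit message (skip the read when the stat type is invalid) can be illustrated with a small userspace sketch. The names here (`toy_stat_type`, `toy_stat`, `toy_collect_stats`) are hypothetical and the layout is simplified; it is meant only to show the skip-on-default pattern, not the actual e1000 code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum toy_stat_type { TOY_NETDEV_STATS, TOY_ADAPTER_STATS };

struct toy_stat {
	int type;        /* normally an enum toy_stat_type value */
	size_t offset;   /* byte offset of a uint64_t counter in its base struct */
};

/* Fill out[i] for each valid entry. Entries with an unknown type are
 * skipped, so no pointer is ever computed or read for them.
 * Returns the number of entries skipped. */
static int toy_collect_stats(const struct toy_stat *stats, int n,
			     const void *netdev, const void *adapter,
			     uint64_t *out)
{
	int skipped = 0;

	for (int i = 0; i < n; i++) {
		const char *p;

		switch (stats[i].type) {
		case TOY_NETDEV_STATS:
			p = (const char *)netdev + stats[i].offset;
			break;
		case TOY_ADAPTER_STATS:
			p = (const char *)adapter + stats[i].offset;
			break;
		default:
			skipped++;
			continue;   /* leave out[i] untouched */
		}
		memcpy(&out[i], p, sizeof(out[i]));
	}
	return skipped;
}
```

The key point is that `p` is assigned on every path that reaches the copy, so neither a null pointer nor a stale pointer from the previous iteration can be dereferenced.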


[PATCH net 4/6] bnxt_en: Fix VF resource checking.

2017-10-13 Thread Michael Chan
In bnxt_sriov_enable(), we calculate to see if we have enough hardware
resources to enable the requested number of VFs.  The logic to check
for minimum completion rings and statistics contexts is missing.  Add
the required checks so that VF configuration won't fail.

Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
index d37925a..5ee1866 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
@@ -502,6 +502,7 @@ static int bnxt_sriov_enable(struct bnxt *bp, int *num_vfs)
int rc = 0, vfs_supported;
int min_rx_rings, min_tx_rings, min_rss_ctxs;
int tx_ok = 0, rx_ok = 0, rss_ok = 0;
+   int avail_cp, avail_stat;
 
/* Check if we can enable requested num of vf's. At a mininum
 * we require 1 RX 1 TX rings for each VF. In this minimum conf
@@ -509,6 +510,10 @@ static int bnxt_sriov_enable(struct bnxt *bp, int *num_vfs)
 */
vfs_supported = *num_vfs;
 
+   avail_cp = bp->pf.max_cp_rings - bp->cp_nr_rings;
+   avail_stat = bp->pf.max_stat_ctxs - bp->num_stat_ctxs;
+   avail_cp = min_t(int, avail_cp, avail_stat);
+
while (vfs_supported) {
min_rx_rings = vfs_supported;
min_tx_rings = vfs_supported;
@@ -523,10 +528,12 @@ static int bnxt_sriov_enable(struct bnxt *bp, int 
*num_vfs)
min_rx_rings)
rx_ok = 1;
}
-   if (bp->pf.max_vnics - bp->nr_vnics < min_rx_rings)
+   if (bp->pf.max_vnics - bp->nr_vnics < min_rx_rings ||
+   avail_cp < min_rx_rings)
rx_ok = 0;
 
-   if (bp->pf.max_tx_rings - bp->tx_nr_rings >= min_tx_rings)
+   if (bp->pf.max_tx_rings - bp->tx_nr_rings >= min_tx_rings &&
+   avail_cp >= min_tx_rings)
tx_ok = 1;
 
if (bp->pf.max_rsscos_ctxs - bp->rsscos_nr_ctxs >= min_rss_ctxs)
-- 
1.8.3.1
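
A simplified model of the resource check may help when reading the hunk above. The sketch below assumes one unit of each resource per VF and folds completion rings and statistics contexts into a single bound, as the patch does with avail_cp; the struct and function names (`toy_free_res`, `toy_vfs_supported`) are hypothetical, and the real function has more cases than are modelled here.

```c
/* Free (unallocated) resources on the PF, one counter per resource type. */
struct toy_free_res {
	int rx_rings, tx_rings, vnics, rss_ctxs;
	int cp_rings, stat_ctxs;
};

static int toy_min(int a, int b)
{
	return a < b ? a : b;
}

/* Largest VF count <= requested for which every resource has at least
 * one unit per VF. Completion rings and stat contexts share one bound. */
static int toy_vfs_supported(const struct toy_free_res *r, int requested)
{
	int avail_cp = toy_min(r->cp_rings, r->stat_ctxs);
	int vfs = requested;

	while (vfs > 0) {
		int rx_ok = r->rx_rings >= vfs && r->vnics >= vfs && avail_cp >= vfs;
		int tx_ok = r->tx_rings >= vfs && avail_cp >= vfs;
		int rss_ok = r->rss_ctxs >= vfs;

		if (rx_ok && tx_ok && rss_ok)
			break;
		vfs--;   /* try again with one VF fewer */
	}
	return vfs;
}
```

Without the avail_cp terms, a PF short on completion rings or stat contexts would report more VFs than it can actually configure, which is the failure the patch describes.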



[PATCH net 0/6] bnxt_en: bug fixes.

2017-10-13 Thread Michael Chan
Various bug fixes for the VF/PF link change logic, VF resource checking,
potential firmware response corruption on NVRAM and DCB parameters,
and reading the wrong register for PCIe link speed on the VF.

Michael Chan (4):
  bnxt_en: Improve VF/PF link change logic.
  bnxt_en: Don't use rtnl lock to protect link change logic in
workqueue.
  bnxt_en: Fix VF resource checking.
  bnxt_en: Fix possible corrupted NVRAM parameters from firmware
response.

Sankar Patchineelam (1):
  bnxt_en: Fix possible corruption in DCB parameters from firmware.

Vasundhara Volam (1):
  bnxt_en: Fix VF PCIe link speed and width logic.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 99 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  5 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c | 23 --
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  8 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c   | 11 ++-
 5 files changed, 112 insertions(+), 34 deletions(-)

-- 
1.8.3.1



[PATCH net 6/6] bnxt_en: Fix possible corruption in DCB parameters from firmware.

2017-10-13 Thread Michael Chan
From: Sankar Patchineelam 

hwrm_send_message() is replaced with _hwrm_send_message(), and
hwrm_cmd_lock mutex lock is grabbed for the whole period of
firmware call until the firmware DCB parameters have been copied.
This will prevent possible corruption of the firmware data.

Fixes: 7df4ae9fe855 ("bnxt_en: Implement DCBNL to support host-based DCBX.")
Signed-off-by: Sankar Patchineelam 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c | 23 ++-
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c
index aa1f3a2..fed37cd 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c
@@ -50,7 +50,9 @@ static int bnxt_hwrm_queue_pri2cos_qcfg(struct bnxt *bp, 
struct ieee_ets *ets)
 
bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_QUEUE_PRI2COS_QCFG, -1, -1);
req.flags = cpu_to_le32(QUEUE_PRI2COS_QCFG_REQ_FLAGS_IVLAN);
-   rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+
+   mutex_lock(&bp->hwrm_cmd_lock);
+   rc = _hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
if (!rc) {
u8 *pri2cos = &resp->pri0_cos_queue_id;
int i, j;
@@ -66,6 +68,7 @@ static int bnxt_hwrm_queue_pri2cos_qcfg(struct bnxt *bp, 
struct ieee_ets *ets)
}
}
}
+   mutex_unlock(&bp->hwrm_cmd_lock);
return rc;
 }
 
@@ -119,9 +122,13 @@ static int bnxt_hwrm_queue_cos2bw_qcfg(struct bnxt *bp, 
struct ieee_ets *ets)
int rc, i;
 
bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_QUEUE_COS2BW_QCFG, -1, -1);
-   rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
-   if (rc)
+
+   mutex_lock(&bp->hwrm_cmd_lock);
+   rc = _hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+   if (rc) {
+   mutex_unlock(&bp->hwrm_cmd_lock);
return rc;
+   }
 
data = &resp->queue_id0 + offsetof(struct bnxt_cos2bw_cfg, queue_id);
for (i = 0; i < bp->max_tc; i++, data += sizeof(cos2bw) - 4) {
@@ -143,6 +150,7 @@ static int bnxt_hwrm_queue_cos2bw_qcfg(struct bnxt *bp, 
struct ieee_ets *ets)
}
}
}
+   mutex_unlock(&bp->hwrm_cmd_lock);
return 0;
 }
 
@@ -240,12 +248,17 @@ static int bnxt_hwrm_queue_pfc_qcfg(struct bnxt *bp, 
struct ieee_pfc *pfc)
int rc;
 
bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_QUEUE_PFCENABLE_QCFG, -1, -1);
-   rc = hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
-   if (rc)
+
+   mutex_lock(&bp->hwrm_cmd_lock);
+   rc = _hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+   if (rc) {
+   mutex_unlock(&bp->hwrm_cmd_lock);
return rc;
+   }
 
pri_mask = le32_to_cpu(resp->flags);
pfc->pfc_en = pri_mask;
+   mutex_unlock(&bp->hwrm_cmd_lock);
return 0;
 }
 
-- 
1.8.3.1
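
The locking pattern in this patch (take the command lock, send, copy the response out, then unlock) can be shown in a few lines of userspace C. This is only a sketch under stated assumptions: a pthread mutex stands in for the kernel mutex, the "firmware" is a function that fills a shared buffer, and `toy_fw`, `toy_fw_init`, `toy_send_locked` and `toy_query` are hypothetical names.

```c
#include <pthread.h>
#include <string.h>

#define TOY_RESP_LEN 16

/* One response buffer shared by every command, as with a firmware mailbox. */
struct toy_fw {
	pthread_mutex_t cmd_lock;
	unsigned char resp[TOY_RESP_LEN];
};

static void toy_fw_init(struct toy_fw *fw)
{
	pthread_mutex_init(&fw->cmd_lock, NULL);
	memset(fw->resp, 0, sizeof(fw->resp));
}

/* Unlocked variant: the caller must already hold cmd_lock.
 * Here the "firmware" just fills the shared buffer with the command byte. */
static int toy_send_locked(struct toy_fw *fw, unsigned char cmd)
{
	memset(fw->resp, cmd, sizeof(fw->resp));
	return 0;
}

/* Query that keeps the lock until the response has been copied out, so a
 * concurrent command cannot overwrite resp between the send and the copy. */
static int toy_query(struct toy_fw *fw, unsigned char cmd, unsigned char *out)
{
	int rc;

	pthread_mutex_lock(&fw->cmd_lock);
	rc = toy_send_locked(fw, cmd);
	if (!rc)
		memcpy(out, fw->resp, TOY_RESP_LEN);
	pthread_mutex_unlock(&fw->cmd_lock);
	return rc;
}
```

The design choice is the same as in the diff: the locked convenience wrapper is replaced by the unlocked one so that the caller controls how long the lock is held.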



[PATCH net 2/6] bnxt_en: Don't use rtnl lock to protect link change logic in workqueue.

2017-10-13 Thread Michael Chan
As a further improvement to the PF/VF link change logic, use a private
mutex instead of the rtnl lock to protect link change logic.  With the
new mutex, we don't have to take the rtnl lock in the workqueue when
we have to handle link related functions.  If the VF and PF drivers
are running on the same host and both take the rtnl lock and one is
waiting for the other, it will cause a timeout.  This patch fixes these
timeouts.

Fixes: 90c694bb7181 ("bnxt_en: Fix RTNL lock usage on bnxt_update_link().")
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 25 ---
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  4 
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  4 
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 7906153..3f596de 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6345,7 +6345,9 @@ static int __bnxt_open_nic(struct bnxt *bp, bool 
irq_re_init, bool link_re_init)
}
 
if (link_re_init) {
+   mutex_lock(&bp->link_lock);
rc = bnxt_update_phy_setting(bp);
+   mutex_unlock(&bp->link_lock);
if (rc)
netdev_warn(bp->dev, "failed to update phy settings\n");
}
@@ -7043,30 +7045,28 @@ static void bnxt_sp_task(struct work_struct *work)
if (test_and_clear_bit(BNXT_PERIODIC_STATS_SP_EVENT, &bp->sp_event))
bnxt_hwrm_port_qstats(bp);
 
-   /* These functions below will clear BNXT_STATE_IN_SP_TASK.  They
-* must be the last functions to be called before exiting.
-*/
if (test_and_clear_bit(BNXT_LINK_CHNG_SP_EVENT, &bp->sp_event)) {
-   int rc = 0;
+   int rc;
 
+   mutex_lock(&bp->link_lock);
if (test_and_clear_bit(BNXT_LINK_SPEED_CHNG_SP_EVENT,
   &bp->sp_event))
bnxt_hwrm_phy_qcaps(bp);
 
-   bnxt_rtnl_lock_sp(bp);
-   if (test_bit(BNXT_STATE_OPEN, &bp->state))
-   rc = bnxt_update_link(bp, true);
-   bnxt_rtnl_unlock_sp(bp);
+   rc = bnxt_update_link(bp, true);
+   mutex_unlock(&bp->link_lock);
if (rc)
netdev_err(bp->dev, "SP task can't update link (rc: %x)\n",
   rc);
}
if (test_and_clear_bit(BNXT_HWRM_PORT_MODULE_SP_EVENT, &bp->sp_event)) {
-   bnxt_rtnl_lock_sp(bp);
-   if (test_bit(BNXT_STATE_OPEN, &bp->state))
-   bnxt_get_port_module_status(bp);
-   bnxt_rtnl_unlock_sp(bp);
+   mutex_lock(&bp->link_lock);
+   bnxt_get_port_module_status(bp);
+   mutex_unlock(&bp->link_lock);
}
+   /* These functions below will clear BNXT_STATE_IN_SP_TASK.  They
+* must be the last functions to be called before exiting.
+*/
if (test_and_clear_bit(BNXT_RESET_TASK_SP_EVENT, &bp->sp_event))
bnxt_reset(bp, false);
 
@@ -7766,6 +7766,7 @@ static int bnxt_probe_phy(struct bnxt *bp)
   rc);
return rc;
}
+   mutex_init(&bp->link_lock);
 
rc = bnxt_update_link(bp, false);
if (rc) {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 7b888d4..d2925c0 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1290,6 +1290,10 @@ struct bnxt {
unsigned long   *ntp_fltr_bmap;
int ntp_fltr_count;
 
+   /* To protect link related settings during link changes and
+* ethtool settings changes.
+*/
+   struct mutexlink_lock;
struct bnxt_link_info   link_info;
struct ethtool_eee  eee;
u32 lpi_tmr_lo;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 8eff05a..b2cbc97 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -1052,6 +1052,7 @@ static int bnxt_get_link_ksettings(struct net_device *dev,
u32 ethtool_speed;
 
ethtool_link_ksettings_zero_link_mode(lk_ksettings, supported);
+   mutex_lock(&bp->link_lock);
bnxt_fw_to_ethtool_support_spds(link_info, lk_ksettings);
 
ethtool_link_ksettings_zero_link_mode(lk_ksettings, advertising);
@@ -1099,6 +1100,7 @@ static int bnxt_get_link_ksettings(struct net_device *dev,
base->port = PORT_FIBRE;
}
base->phy_address = link_info->phy_addr;
+   mutex_unlock(&bp->link_lock);
 
return 0;
 }
@@ -1190,6 +1192,7 @@ static int 

[PATCH net 3/6] bnxt_en: Fix VF PCIe link speed and width logic.

2017-10-13 Thread Michael Chan
From: Vasundhara Volam 

The PCIe PCIE_EP_REG_LINK_STATUS_CONTROL register is defined only in PF
config space, so we must read it from the PF.

Fixes: 90c4f788f6c0 ("bnxt_en: Report PCIe link speed and width during driver load")
Signed-off-by: Vasundhara Volam 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 3f596de..4ffa0b1 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7965,7 +7965,7 @@ static void bnxt_parse_log_pcie_link(struct bnxt *bp)
enum pcie_link_width width = PCIE_LNK_WIDTH_UNKNOWN;
enum pci_bus_speed speed = PCI_SPEED_UNKNOWN;
 
-   if (pcie_get_minimum_link(bp->pdev, &speed, &width) ||
+   if (pcie_get_minimum_link(pci_physfn(bp->pdev), &speed, &width) ||
speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN)
netdev_info(bp->dev, "Failed to determine PCIe Link Info\n");
else
-- 
1.8.3.1
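
The idea behind the one-line change (resolve a VF to its parent PF before reading a PF-only register) can be sketched in userspace as below. The types and helpers (`toy_pci_dev`, `toy_physfn`, `toy_read_link_status`) are hypothetical stand-ins, and `link_status` is an invented field used only to make the example testable.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct pci_dev; link_status models the PF-only register. */
struct toy_pci_dev {
	bool is_virtfn;
	struct toy_pci_dev *physfn;   /* parent PF, valid only when is_virtfn */
	int link_status;              /* meaningful only on a PF in this toy */
};

/* Same idea as pci_physfn(): a VF resolves to its PF, a PF to itself. */
static struct toy_pci_dev *toy_physfn(struct toy_pci_dev *dev)
{
	return dev->is_virtfn ? dev->physfn : dev;
}

/* Always read the link register through the PF. */
static int toy_read_link_status(struct toy_pci_dev *dev)
{
	return toy_physfn(dev)->link_status;
}
```

Because a PF resolves to itself, the same call works unchanged for both PF and VF drivers, which is why the patch can apply it unconditionally.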



[PATCH net 5/6] bnxt_en: Fix possible corrupted NVRAM parameters from firmware response.

2017-10-13 Thread Michael Chan
bnxt_find_nvram_item() copies the firmware response data after
releasing the mutex.  The response data can therefore be corrupted
if the next firmware response overwrites the response buffer first.
This rare problem shows up when running ethtool -i repeatedly.

Fix it by calling the new variant _hwrm_send_message_silent(), which
requires the caller to take the mutex and to release it only after the
response data has been copied.

Fixes: 3ebf6f0a09a2 ("bnxt_en: Add installed-package version reporting via Ethtool GDRVINFO")
Reported-by: Sarveswara Rao Mygapula 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 4 +++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 4ffa0b1..dc5de27 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3466,6 +3466,12 @@ int _hwrm_send_message(struct bnxt *bp, void *msg, u32 
msg_len, int timeout)
return bnxt_hwrm_do_send_msg(bp, msg, msg_len, timeout, false);
 }
 
+int _hwrm_send_message_silent(struct bnxt *bp, void *msg, u32 msg_len,
+ int timeout)
+{
+   return bnxt_hwrm_do_send_msg(bp, msg, msg_len, timeout, true);
+}
+
 int hwrm_send_message(struct bnxt *bp, void *msg, u32 msg_len, int timeout)
 {
int rc;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index d2925c0..c911e69 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1362,6 +1362,7 @@ int bnxt_alloc_rx_data(struct bnxt *bp, struct 
bnxt_rx_ring_info *rxr,
 int bnxt_set_rx_skb_mode(struct bnxt *bp, bool page_mode);
 void bnxt_hwrm_cmd_hdr_init(struct bnxt *, void *, u16, u16, u16);
 int _hwrm_send_message(struct bnxt *, void *, u32, int);
+int _hwrm_send_message_silent(struct bnxt *bp, void *msg, u32 len, int 
timeout);
 int hwrm_send_message(struct bnxt *, void *, u32, int);
 int hwrm_send_message_silent(struct bnxt *, void *, u32, int);
 int bnxt_hwrm_func_rgtr_async_events(struct bnxt *bp, unsigned long *bmap,
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index b2cbc97..3cbe771 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -1809,7 +1809,8 @@ static int bnxt_find_nvram_item(struct net_device *dev, 
u16 type, u16 ordinal,
req.dir_ordinal = cpu_to_le16(ordinal);
req.dir_ext = cpu_to_le16(ext);
req.opt_ordinal = NVM_FIND_DIR_ENTRY_REQ_OPT_ORDINAL_EQ;
-   rc = hwrm_send_message_silent(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+   mutex_lock(&bp->hwrm_cmd_lock);
+   rc = _hwrm_send_message_silent(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
if (rc == 0) {
if (index)
*index = le16_to_cpu(output->dir_idx);
@@ -1818,6 +1819,7 @@ static int bnxt_find_nvram_item(struct net_device *dev, 
u16 type, u16 ordinal,
if (data_length)
*data_length = le32_to_cpu(output->dir_data_length);
}
+   mutex_unlock(&bp->hwrm_cmd_lock);
return rc;
 }
 
-- 
1.8.3.1



[PATCH net 1/6] bnxt_en: Improve VF/PF link change logic.

2017-10-13 Thread Michael Chan
Link status query firmware messages originating from the VFs are forwarded
to the PF.  The driver handles these interactions in a workqueue for the
VF and PF.  The VF driver waits for the response from the PF in the
workqueue.  If the PF and VF driver are running on the same host and the
work items for both PF and VF are queued on the same workqueue, the VF driver
may not get the response if the PF work item is queued behind it on the
same workqueue.  This will lead to the VF link query message timing out.

To prevent this, we create a private workqueue for PFs instead of using
the common workqueue.  The VF query and PF response will never be on
the same workqueue.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 66 +--
 1 file changed, 53 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index aacec8b..7906153 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -214,6 +214,8 @@ enum board_idx {
ASYNC_EVENT_CMPL_EVENT_ID_LINK_SPEED_CFG_CHANGE,
 };
 
+static struct workqueue_struct *bnxt_pf_wq;
+
 static bool bnxt_vf_pciid(enum board_idx idx)
 {
return (idx == NETXTREME_C_VF || idx == NETXTREME_E_VF);
@@ -1024,12 +1026,28 @@ static int bnxt_discard_rx(struct bnxt *bp, struct 
bnxt_napi *bnapi,
return 0;
 }
 
+static void bnxt_queue_sp_work(struct bnxt *bp)
+{
+   if (BNXT_PF(bp))
+   queue_work(bnxt_pf_wq, &bp->sp_task);
+   else
+   schedule_work(&bp->sp_task);
+}
+
+static void bnxt_cancel_sp_work(struct bnxt *bp)
+{
+   if (BNXT_PF(bp))
+   flush_workqueue(bnxt_pf_wq);
+   else
+   cancel_work_sync(&bp->sp_task);
+}
+
 static void bnxt_sched_reset(struct bnxt *bp, struct bnxt_rx_ring_info *rxr)
 {
if (!rxr->bnapi->in_reset) {
rxr->bnapi->in_reset = true;
set_bit(BNXT_RESET_TASK_SP_EVENT, &bp->sp_event);
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
}
rxr->rx_next_cons = 0x;
 }
@@ -1717,7 +1735,7 @@ static int bnxt_async_event_process(struct bnxt *bp,
default:
goto async_event_process_exit;
}
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
 async_event_process_exit:
bnxt_ulp_async_events(bp, cmpl);
return 0;
@@ -1751,7 +1769,7 @@ static int bnxt_hwrm_handler(struct bnxt *bp, struct 
tx_cmp *txcmp)
 
set_bit(vf_id - bp->pf.first_vf_id, bp->pf.vf_event_bmap);
set_bit(BNXT_HWRM_EXEC_FWD_REQ_SP_EVENT, &bp->sp_event);
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
break;
 
case CMPL_BASE_TYPE_HWRM_ASYNC_EVENT:
@@ -6647,7 +6665,7 @@ static void bnxt_set_rx_mode(struct net_device *dev)
vnic->rx_mask = mask;
 
set_bit(BNXT_RX_MASK_SP_EVENT, &bp->sp_event);
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
}
 }
 
@@ -6920,7 +6938,7 @@ static void bnxt_tx_timeout(struct net_device *dev)
 
netdev_err(bp->dev,  "TX timeout detected, starting reset task!\n");
set_bit(BNXT_RESET_TASK_SP_EVENT, &bp->sp_event);
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -6952,7 +6970,7 @@ static void bnxt_timer(unsigned long data)
if (bp->link_info.link_up && (bp->flags & BNXT_FLAG_PORT_STATS) &&
bp->stats_coal_ticks) {
set_bit(BNXT_PERIODIC_STATS_SP_EVENT, &bp->sp_event);
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
}
 bnxt_restart_timer:
mod_timer(&bp->timer, jiffies + bp->current_interval);
@@ -7433,7 +7451,7 @@ static int bnxt_rx_flow_steer(struct net_device *dev, 
const struct sk_buff *skb,
spin_unlock_bh(&bp->ntp_fltr_lock);
 
set_bit(BNXT_RX_NTP_FLTR_SP_EVENT, &bp->sp_event);
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
 
return new_fltr->sw_id;
 
@@ -7516,7 +7534,7 @@ static void bnxt_udp_tunnel_add(struct net_device *dev,
if (bp->vxlan_port_cnt == 1) {
bp->vxlan_port = ti->port;
set_bit(BNXT_VXLAN_ADD_PORT_SP_EVENT, &bp->sp_event);
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
}
break;
case UDP_TUNNEL_TYPE_GENEVE:
@@ -7533,7 +7551,7 @@ static void bnxt_udp_tunnel_add(struct net_device *dev,
return;
}
 
-   schedule_work(&bp->sp_task);
+   bnxt_queue_sp_work(bp);
 }
 
 static void bnxt_udp_tunnel_del(struct net_device *dev,
@@ -7572,7 +7590,7 @@ static void bnxt_udp_tunnel_del(struct net_device *dev,
  

Re: [net-next V7 PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP

2017-10-13 Thread kbuild test robot
Hi Jesper,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Jesper-Dangaard-Brouer/New-bpf-cpumap-type-for-XDP_REDIRECT/20171014-061849
config: blackfin-allyesconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=blackfin 

Note: the 
linux-review/Jesper-Dangaard-Brouer/New-bpf-cpumap-type-for-XDP_REDIRECT/20171014-061849
 HEAD 80e4de4026e29bcabecba678bc91f4a9688c builds fine.
  It only hurts bisectibility.

All errors (new ones prefixed by >>):

   kernel//bpf/cpumap.c: In function '__cpu_map_entry_alloc':
>> kernel//bpf/cpumap.c:242:2: error: expected ';' before 'rcpu'
 rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
 ^~~~
   At top level:
   kernel//bpf/cpumap.c:184:12: warning: 'cpu_map_kthread_run' defined but not 
used [-Wunused-function]
static int cpu_map_kthread_run(void *data)
   ^~~

vim +242 kernel//bpf/cpumap.c

   211  
   212  struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int 
map_id)
   213  {
   214  gfp_t gfp = GFP_ATOMIC|__GFP_NOWARN;
   215  struct bpf_cpu_map_entry *rcpu;
   216  int numa, err;
   217  
   218  /* Have map->numa_node, but choose node of redirect target CPU 
*/
   219  numa = cpu_to_node(cpu);
   220  
   221  rcpu = kzalloc_node(sizeof(*rcpu), gfp, numa);
   222  if (!rcpu)
   223  return NULL;
   224  
   225  /* Alloc percpu bulkq */
   226  rcpu->bulkq = __alloc_percpu_gfp(sizeof(*rcpu->bulkq),
   227   sizeof(void *), gfp);
   228  if (!rcpu->bulkq)
   229  goto free_rcu;
   230  
   231  /* Alloc queue */
   232  rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa);
   233  if (!rcpu->queue)
   234  goto free_bulkq;
   235  
   236  err = ptr_ring_init(rcpu->queue, qsize, gfp);
   237  if (err)
   238  goto free_queue;
   239  rcpu->qsize = qsize
   240  
   241  /* Setup kthread */
 > 242  rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, 
 > rcpu, numa,
   243 "cpumap/%d/map:%d", cpu, 
map_id);
   244  if (IS_ERR(rcpu->kthread))
   245  goto free_ptr_ring;
   246  
   247  get_cpu_map_entry(rcpu); /* 1-refcnt for being in 
cmap->cpu_map[] */
   248  get_cpu_map_entry(rcpu); /* 1-refcnt for kthread */
   249  
   250  /* Make sure kthread runs on a single CPU */
   251  kthread_bind(rcpu->kthread, cpu);
   252  wake_up_process(rcpu->kthread);
   253  
   254  return rcpu;
   255  
   256  free_ptr_ring:
   257  ptr_ring_cleanup(rcpu->queue, NULL);
   258  free_queue:
   259  kfree(rcpu->queue);
   260  free_bulkq:
   261  free_percpu(rcpu->bulkq);
   262  free_rcu:
   263  kfree(rcpu);
   264  return NULL;
   265  }
   266  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[next-queue PATCH v8 2/6] net/sched: Change behavior of mq select_queue()

2017-10-13 Thread Vinicius Costa Gomes
From: Jesus Sanchez-Palencia 

Currently, the class_ops select_queue() implementation on sch_mq
returns a pointer to netdev_queue #0 when it receives an invalid
qdisc id. That can be misleading since all of mq's inner qdiscs are
attached to a valid netdev_queue.

Here we fix that by returning NULL when a qdisc id is invalid. This is
aligned with how select_queue() is implemented for sch_mqprio in the
next patch in this series, keeping a consistent behavior between these
two qdiscs.
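
A minimal userspace sketch of the behaviour change, for illustration
only. It assumes a simplified model in which a device has a fixed array
of Tx queues and class minor IDs are 1-based; queue_get_sketch() and the
*_sketch types are hypothetical stand-ins, not kernel APIs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct netdev_queue / struct net_device. */
struct queue_sketch { int index; };
struct dev_sketch {
	struct queue_sketch queues[4];
	unsigned int num_tx_queues;
};

/* Assumption: class minor N maps to Tx queue N - 1, anything else is invalid. */
static struct queue_sketch *queue_get_sketch(struct dev_sketch *dev,
					     unsigned long minor)
{
	if (minor < 1 || minor > dev->num_tx_queues)
		return NULL;
	return &dev->queues[minor - 1];
}

/* Old behaviour: silently fall back to queue #0 on an invalid ID. */
static struct queue_sketch *select_queue_old(struct dev_sketch *dev,
					     unsigned long minor)
{
	struct queue_sketch *q = queue_get_sketch(dev, minor);

	return q ? q : &dev->queues[0];
}

/* New behaviour: hand the invalid ID back to the caller as NULL. */
static struct queue_sketch *select_queue_new(struct dev_sketch *dev,
					     unsigned long minor)
{
	return queue_get_sketch(dev, minor);
}
```

With the old variant a caller cannot tell "queue #0 was requested" from
"the ID was bogus"; with the new one it has to handle NULL explicitly.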

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/sched/sch_mq.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index f3a3e507422b..213b586a06a0 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -130,15 +130,7 @@ static struct netdev_queue *mq_queue_get(struct Qdisc 
*sch, unsigned long cl)
 static struct netdev_queue *mq_select_queue(struct Qdisc *sch,
struct tcmsg *tcm)
 {
-   unsigned int ntx = TC_H_MIN(tcm->tcm_parent);
-   struct netdev_queue *dev_queue = mq_queue_get(sch, ntx);
-
-   if (!dev_queue) {
-   struct net_device *dev = qdisc_dev(sch);
-
-   return netdev_get_tx_queue(dev, 0);
-   }
-   return dev_queue;
+   return mq_queue_get(sch, TC_H_MIN(tcm->tcm_parent));
 }
 
 static int mq_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
-- 
2.14.2



[next-queue PATCH v8 0/6] TSN: Add qdisc based config interface for CBS

2017-10-13 Thread Vinicius Costa Gomes
Hi,

Changes since v7:
 - Fixed comments from Eric Dumazet and Ivan Khoronzhuk;

Changes since v6:
 - Fixed compilation for 32bit arches;
 - Aligned the behaviour of .select_queue() of the mq qdisc to be the
   same as mqprio;

Changes since v5:
 - Fixed comments from Jiri Pirko;

Changes since v4:
 - Added a software implementation of the CBS algorithm;

Changes since v3:
 - None, only a clean patchset without old patches;

Changes since v2:
 - squashed the patch introducing the userspace API into the patch
   implementing CBS;

Changes since v1:
 - Solved the mqprio dependency;
 - Fixed a mqprio bug, that caused the inner qdisc to have a wrong
   dev_queue associated with it;

Changes from the RFC:
 - Fixed comments from Henrik Austad;
 - Simplified the Qdisc, using the generic implementation of callbacks
   where possible;
 - Small refactor on the driver (igb) code;

This patchset is a proposal of how the Traffic Control subsystem can
be used to offload the configuration of the Credit Based Shaper
(defined in the IEEE 802.1Q-2014 Section 8.6.8.2) into supported
network devices.

As part of this work, we've assessed previous public discussions
related to TSN enabling: patches from Henrik Austad (Cisco), the
presentation from Eric Mann at Linux Plumbers 2012, patches from
Gangfeng Huang (National Instruments) and the current state of the
OpenAVNU project (https://github.com/AVnu/OpenAvnu/).

Overview
========

Time-sensitive Networking (TSN) is a set of standards that aim to
address resource availability for providing bandwidth reservation and
bounded latency on Ethernet based LANs. The proposal described here
aims to cover mainly what is needed to enable the following standards:
802.1Qat and 802.1Qav.

The initial target of this work is the Intel i210 NIC, but other
controllers' datasheets were also taken into account, like the Renesas
RZ/A1H RZ/A1M group and the Synopsys DesignWare Ethernet QoS
controller.


Proposal
========

Feature-wise, what is covered here is the configuration interfaces for
HW implementations of the Credit-Based shaper (CBS, 802.1Qav). CBS is
a per-queue shaper. Given that this feature is related to traffic
shaping, and that the traffic control subsystem already provides a
queueing discipline that offloads config into the device driver (i.e.
mqprio), designing a new qdisc for the specific purpose of offloading
the config for the CBS shaper seemed like a good fit.

For steering traffic into the correct queues, we use the socket option
SO_PRIORITY and then a mechanism to map priority to traffic classes /
Tx queues. The qdisc mqprio is currently used in our tests.

As for the CBS config interface, this patchset is proposing a new
qdisc called 'cbs'. Its 'tc' cmd line is:

$ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
 idleslope I

   Note that the parameters for this qdisc are the ones defined by the
   802.1Q-2014 spec, so no hardware specific functionality is exposed here.

Per-stream shaping, as defined by IEEE 802.1Q-2014 Section 34.6.1, is
not yet covered by this proposal.
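
To make the relationship between the four parameters concrete, here is
a small userspace sketch; it is not part of the patchset. It uses the
sendslope and hicredit formulas from IEEE 802.1Q-2014 (Section 8.6.8.2
and Annex L) and additionally assumes that
locredit = max_frame_size * (sendslope / port_transmit_rate). The struct
and function names are made up for illustration; slopes are in kbit/s
and credits in bytes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four 'cbs' qdisc parameters. */
struct cbs_params_sketch {
	int64_t idleslope;	/* kbit/s */
	int64_t sendslope;	/* kbit/s, negative */
	int64_t hicredit;	/* bytes */
	int64_t locredit;	/* bytes, negative */
};

static struct cbs_params_sketch
cbs_calc_sketch(int64_t idleslope_kbps, int64_t port_rate_kbps,
		int64_t max_frame_size, int64_t max_interference_size)
{
	struct cbs_params_sketch p;

	p.idleslope = idleslope_kbps;
	/* sendslope = idleslope - port_transmit_rate */
	p.sendslope = idleslope_kbps - port_rate_kbps;
	/* hicredit = max_interference_size * idleslope / port_transmit_rate */
	p.hicredit = max_interference_size * idleslope_kbps / port_rate_kbps;
	/* locredit = max_frame_size * sendslope / port_transmit_rate (assumed) */
	p.locredit = max_frame_size * p.sendslope / port_rate_kbps;
	return p;
}
```

For a 20 Mbit/s reservation on a 1 Gbit/s port with 1500-byte frames,
cbs_calc_sketch(20000, 1000000, 1500, 1500) yields sendslope -980000,
hicredit 30 and locredit -1470.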


Testing this RFC
================

Attached to this cover letter are:
 - calculate_cbs_params.py: A Python script to calculate the
   parameters to the CBS queueing discipline;
 - tsn-talker.c: A sample C implementation of the talker side of a stream;
 - tsn-listener.c: A sample C implementation of the listener side of a
   stream;

For testing the patches of this series, you may want to use the
samples attached to this cover letter and use the 'mqprio' qdisc to
set up the priorities to Tx queues mapping, together with the 'cbs'
qdisc to configure the HW shaper of the i210 controller:

1) Setup priorities to traffic classes to hardware queues mapping
$ tc qdisc replace dev ens4 handle 100: parent root mqprio num_tc 3 \
 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

For a more detailed explanation, see mqprio(8). In short, this command
will map traffic with priority 3 to the hardware queue 0, traffic with
priority 2 to hardware queue 1, and the rest will be mapped to
hardware queues 2 and 3.

2) Check scheme. You want to get the inner qdiscs ID from the bottom up
$ tc -g class show dev ens4

Ex.:
+---(100:3) mqprio
|+---(100:6) mqprio
|+---(100:7) mqprio
|
+---(100:2) mqprio
|+---(100:5) mqprio
|
+---(100:1) mqprio
 +---(100:4) mqprio

* Here '100:4' is Tx Queue #0 and '100:5' is Tx Queue #1.

3) Calculate CBS parameters for classes A and B. i.e. BW for A is 20Mbps and
   for B is 10Mbps:
$ calc_cbs_params.py -A 20000 -a 1500 -B 10000 -b 1500

4) Configure CBS for traffic class A (priority 3) as provided by the script:
$ tc qdisc replace dev ens4 parent 100:4 cbs locredit -1470 \
 hicredit 30 sendslope -980000 idleslope 20000

5) Configure CBS for traffic class B (priority 2):
$ tc qdisc replace dev ens4 parent 100:5 cbs \
 locredit -1485 hicredit 31 sendslope -990000 idleslope 10000

6) Run Listener:
$ ./tsn-listener -d 

[next-queue PATCH v8 5/6] net/sched: Add support for HW offloading for CBS

2017-10-13 Thread Vinicius Costa Gomes
This adds support for offloading the CBS algorithm to the controller,
if supported. Drivers wanting to support CBS offload must implement
the .ndo_setup_tc callback and handle the TC_SETUP_CBS (introduced
here) type.

Signed-off-by: Vinicius Costa Gomes 
---
 include/linux/netdevice.h |   1 +
 include/net/pkt_sched.h   |   9 
 net/sched/sch_cbs.c   | 104 --
 3 files changed, 102 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 31bb3010c69b..1f6c44ef5b21 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -775,6 +775,7 @@ enum tc_setup_type {
TC_SETUP_CLSFLOWER,
TC_SETUP_CLSMATCHALL,
TC_SETUP_CLSBPF,
+   TC_SETUP_CBS,
 };
 
 /* These structures hold the attributes of xdp state that are being passed
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 259bc191ba59..7c597b050b36 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -146,4 +146,13 @@ static inline bool is_classid_clsact_egress(u32 classid)
   TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_EGRESS);
 }
 
+struct tc_cbs_qopt_offload {
+   u8 enable;
+   s32 queue;
+   s32 hicredit;
+   s32 locredit;
+   s32 idleslope;
+   s32 sendslope;
+};
+
 #endif
diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
index c0102b589494..cae021c642e5 100644
--- a/net/sched/sch_cbs.c
+++ b/net/sched/sch_cbs.c
@@ -68,6 +68,8 @@
 #define BYTES_PER_KBIT (1000LL / 8)
 
 struct cbs_sched_data {
+   bool offload;
+   int queue;
s64 port_rate; /* in bytes/s */
s64 last; /* timestamp in ns */
s64 credits; /* in bytes */
@@ -80,6 +82,11 @@ struct cbs_sched_data {
struct sk_buff *(*dequeue)(struct Qdisc *sch);
 };
 
+static int cbs_enqueue_offload(struct sk_buff *skb, struct Qdisc *sch)
+{
+   return qdisc_enqueue_tail(skb, sch);
+}
+
 static int cbs_enqueue_soft(struct sk_buff *skb, struct Qdisc *sch)
 {
struct cbs_sched_data *q = qdisc_priv(sch);
@@ -169,6 +176,11 @@ static struct sk_buff *cbs_dequeue_soft(struct Qdisc *sch)
return skb;
 }
 
+static struct sk_buff *cbs_dequeue_offload(struct Qdisc *sch)
+{
+   return qdisc_dequeue_head(sch);
+}
+
 static struct sk_buff *cbs_dequeue(struct Qdisc *sch)
 {
struct cbs_sched_data *q = qdisc_priv(sch);
@@ -180,14 +192,66 @@ static const struct nla_policy cbs_policy[TCA_CBS_MAX + 
1] = {
[TCA_CBS_PARMS] = { .len = sizeof(struct tc_cbs_qopt) },
 };
 
+static void cbs_disable_offload(struct net_device *dev,
+   struct cbs_sched_data *q)
+{
+   struct tc_cbs_qopt_offload cbs = { };
+   const struct net_device_ops *ops;
+   int err;
+
+   if (!q->offload)
+   return;
+
+   q->enqueue = cbs_enqueue_soft;
+   q->dequeue = cbs_dequeue_soft;
+
+   ops = dev->netdev_ops;
+   if (!ops->ndo_setup_tc)
+   return;
+
+   cbs.queue = q->queue;
+   cbs.enable = 0;
+
+   err = ops->ndo_setup_tc(dev, TC_SETUP_CBS, &cbs);
+   if (err < 0)
+   pr_warn("Couldn't disable CBS offload for queue %d\n",
+   cbs.queue);
+}
+
+static int cbs_enable_offload(struct net_device *dev, struct cbs_sched_data *q,
+ const struct tc_cbs_qopt *opt)
+{
+   const struct net_device_ops *ops = dev->netdev_ops;
+   struct tc_cbs_qopt_offload cbs = { };
+   int err;
+
+   if (!ops->ndo_setup_tc)
+   return -EOPNOTSUPP;
+
+   cbs.queue = q->queue;
+
+   cbs.enable = 1;
+   cbs.hicredit = opt->hicredit;
+   cbs.locredit = opt->locredit;
+   cbs.idleslope = opt->idleslope;
+   cbs.sendslope = opt->sendslope;
+
+   err = ops->ndo_setup_tc(dev, TC_SETUP_CBS, &cbs);
+   if (err < 0)
+   return err;
+
+   q->enqueue = cbs_enqueue_offload;
+   q->dequeue = cbs_dequeue_offload;
+
+   return 0;
+}
+
 static int cbs_change(struct Qdisc *sch, struct nlattr *opt)
 {
struct cbs_sched_data *q = qdisc_priv(sch);
struct net_device *dev = qdisc_dev(sch);
struct nlattr *tb[TCA_CBS_MAX + 1];
-   struct ethtool_link_ksettings ecmd;
struct tc_cbs_qopt *qopt;
-   s64 link_speed;
int err;
 
err = nla_parse_nested(tb, TCA_CBS_MAX, opt, cbs_policy, NULL);
@@ -199,23 +263,30 @@ static int cbs_change(struct Qdisc *sch, struct nlattr 
*opt)
 
qopt = nla_data(tb[TCA_CBS_PARMS]);
 
-   if (qopt->offload)
-   return -EOPNOTSUPP;
+   if (!qopt->offload) {
+   struct ethtool_link_ksettings ecmd;
+   s64 link_speed;
 
-   if (!__ethtool_get_link_ksettings(dev, &ecmd))
-   link_speed = ecmd.base.speed;
-   else
-   link_speed = SPEED_1000;
+   if (!__ethtool_get_link_ksettings(dev, &ecmd))
+   

Re: [PATCH net-next] ipv6: only update __use and lastusetime once per jiffy at most

2017-10-13 Thread Eric Dumazet
On Fri, Oct 13, 2017 at 5:09 PM, Martin KaFai Lau  wrote:
> On Fri, Oct 13, 2017 at 10:08:07PM +, Wei Wang wrote:
>> From: Wei Wang 
>>
>> In order to not dirty the cacheline too often, we try to only update
>> dst->__use and dst->lastusetime at most once per jiffy.
>
>
>> As dst->lastusetime is only used by ipv6 garbage collector, it should
>> be good enough time resolution.
> Make sense.
>
>> And __use is only used in ipv6_route_seq_show() to show how many times a
>> dst has been used. And as __use is not atomic_t right now, it does not
>> show the precise number of usage times anyway. So we think it should be
>> OK to only update it at most once per jiffy.
> If __use is only bumped HZ number of times per second and we can do ~3Mpps 
> now,
> would __use be way off?

It is not used in the kernel, and is not even reported by user space
(iproute2) currently.

With the percpu stuff, we never did the sum anyway.

I believe we should be fine by being very lazy on this field.

If someone really complains, we will see, but ensuring ~one update per
HZ seems fine.
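
For readers skimming the thread, the idea under discussion can be
sketched in userspace roughly as follows. This is an illustration under
assumptions, not the patch itself: struct dst_sketch and
dst_use_sketch() are hypothetical names, and 'now' stands in for the
current jiffies value:

```c
#include <assert.h>

/* Hypothetical stand-in for the two dst fields being discussed. */
struct dst_sketch {
	unsigned long lastuse;	/* jiffy of the last recorded use */
	unsigned int use;	/* approximate use counter */
};

/* Write to the structure only when the jiffy value has changed, so
 * repeated lookups within one jiffy do not dirty the cacheline.
 */
static void dst_use_sketch(struct dst_sketch *dst, unsigned long now)
{
	if (now != dst->lastuse) {
		dst->use++;
		dst->lastuse = now;
	}
}
```

The counter then advances at most once per jiffy, which is why it stops
being an exact usage count at high packet rates.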


[next-queue PATCH v8 4/6] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-10-13 Thread Vinicius Costa Gomes
This queueing discipline implements the shaper algorithm defined by
the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.

Its primary usage is to apply some bandwidth reservation to user
defined traffic classes, which are mapped to different queues via the
mqprio qdisc.

Only a simple software implementation is added for now.

Signed-off-by: Vinicius Costa Gomes 
Signed-off-by: Jesus Sanchez-Palencia 
---
 include/uapi/linux/pkt_sched.h |  18 +++
 net/sched/Kconfig  |  11 ++
 net/sched/Makefile |   1 +
 net/sched/sch_cbs.c| 293 +
 4 files changed, 323 insertions(+)
 create mode 100644 net/sched/sch_cbs.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 099bf5528fed..41e349df4bf4 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -871,4 +871,22 @@ struct tc_pie_xstats {
__u32 maxq; /* maximum queue size */
__u32 ecn_mark; /* packets marked with ecn*/
 };
+
+/* CBS */
+struct tc_cbs_qopt {
+   __u8 offload;
+   __s32 hicredit;
+   __s32 locredit;
+   __s32 idleslope;
+   __s32 sendslope;
+};
+
+enum {
+   TCA_CBS_UNSPEC,
+   TCA_CBS_PARMS,
+   __TCA_CBS_MAX,
+};
+
+#define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e70ed26485a2..c03d86a7775e 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -172,6 +172,17 @@ config NET_SCH_TBF
  To compile this code as a module, choose M here: the
  module will be called sch_tbf.
 
+config NET_SCH_CBS
+   tristate "Credit Based Shaper (CBS)"
+   ---help---
+ Say Y here if you want to use the Credit Based Shaper (CBS) packet
+ scheduling algorithm.
+
+ See the top of <file:net/sched/sch_cbs.c> for more details.
+
+ To compile this code as a module, choose M here: the
+ module will be called sch_cbs.
+
 config NET_SCH_GRED
tristate "Generic Random Early Detection (GRED)"
---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 7b915d226de7..80c8f92d162d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_NET_SCH_FQ_CODEL)+= sch_fq_codel.o
 obj-$(CONFIG_NET_SCH_FQ)   += sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)  += sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)  += sch_pie.o
+obj-$(CONFIG_NET_SCH_CBS)  += sch_cbs.o
 
 obj-$(CONFIG_NET_CLS_U32)  += cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
new file mode 100644
index ..c0102b589494
--- /dev/null
+++ b/net/sched/sch_cbs.c
@@ -0,0 +1,293 @@
+/*
+ * net/sched/sch_cbs.c Credit Based Shaper
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Vinicius Costa Gomes 
+ *
+ */
+
+/* Credit Based Shaper (CBS)
+ * =========================
+ *
+ * This is a simple rate-limiting shaper aimed at TSN applications on
+ * systems with known traffic workloads.
+ *
+ * Its algorithm is defined by the IEEE 802.1Q-2014 Specification,
+ * Section 8.6.8.2, and explained in more detail in the Annex L of the
+ * same specification.
+ *
+ * There are four tunables to be considered:
+ *
+ * 'idleslope': Idleslope is the rate of credits that is
+ * accumulated (in kilobits per second) when there is at least
+ * one packet waiting for transmission. Packets are transmitted
+ * when the current value of credits is equal to or greater than
+ * zero. When there is no packet to be transmitted the amount of
+ * credits is set to zero. This is the main tunable of the CBS
+ * algorithm.
+ *
+ * 'sendslope':
+ * Sendslope is the rate of credits that is depleted (it should be a
+ * negative number of kilobits per second) when a transmission is
+ * occurring. It can be calculated as follows, (IEEE 802.1Q-2014 Section
+ * 8.6.8.2 item g):
+ *
+ * sendslope = idleslope - port_transmit_rate
+ *
+ * 'hicredit': Hicredit defines the maximum amount of credits (in
+ * bytes) that can be accumulated. Hicredit depends on the
+ * characteristics of interfering traffic,
+ * 'max_interference_size' is the maximum size of any burst of
+ * traffic that can delay the transmission of a frame that is
+ * available for transmission for this traffic class, (IEEE
+ * 802.1Q-2014 Annex L, Equation L-3):
+ *
+ * hicredit = max_interference_size * (idleslope / port_transmit_rate)
+ *
+ * 'locredit': Locredit is the minimum amount of credits that can
+ * be reached. It is a 

[next-queue PATCH v8 6/6] igb: Add support for CBS offload

2017-10-13 Thread Vinicius Costa Gomes
From: Andre Guedes 

This patch adds support for Credit-Based Shaper (CBS) qdisc offload
from the Traffic Control system. This support enables us to leverage the
Forwarding and Queuing for Time-Sensitive Streams (FQTSS) features
of the Intel i210 Ethernet Controller. FQTSS is the former 802.1Qav
standard, which was merged into 802.1Q in 2014. It enables traffic
prioritization and bandwidth reservation via the Credit-Based Shaper,
which is implemented in hardware by the i210 controller.

The patch introduces the igb_setup_tc() function which implements the
support for CBS qdisc hardware offload in the IGB driver. CBS offload
is the only traffic control offload supported by the driver at the
moment.

The FQTSS transmission mode of the i210 controller is automatically
enabled by the IGB driver when CBS is enabled for the first hardware
queue. Likewise, FQTSS mode is automatically disabled when CBS is
disabled for the last hardware queue. Changing the FQTSS mode requires
a NIC reset.

FQTSS feature is supported by i210 controller only.
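
The enable/disable rule described above can be sketched in userspace as
follows. This is only an illustration under assumptions: the
struct and helper names are hypothetical and do not correspond to the
driver's actual functions, and the 'resets' counter merely models the
fact that each mode change needs a NIC reset:

```c
#include <assert.h>
#include <stdbool.h>

#define SR_QUEUES_NUM_SKETCH 2	/* mirrors I210_SR_QUEUES_NUM in the patch */

/* Hypothetical adapter state; not the real struct igb_adapter. */
struct adapter_sketch {
	bool cbs_enable[SR_QUEUES_NUM_SKETCH];
	bool fqtss;		/* current transmission mode */
	unsigned int resets;	/* number of mode changes so far */
};

static bool any_cbs_enabled(const struct adapter_sketch *a)
{
	for (int i = 0; i < SR_QUEUES_NUM_SKETCH; i++)
		if (a->cbs_enable[i])
			return true;
	return false;
}

/* Returns 0 on success, -1 for a queue index outside the CBS-capable range. */
static int set_cbs_sketch(struct adapter_sketch *a, int queue, bool enable)
{
	bool want;

	if (queue < 0 || queue >= SR_QUEUES_NUM_SKETCH)
		return -1;

	a->cbs_enable[queue] = enable;

	/* FQTSS mode is on exactly while at least one queue has CBS enabled. */
	want = any_cbs_enabled(a);
	if (want != a->fqtss) {
		a->fqtss = want;
		a->resets++;
	}
	return 0;
}
```

Enabling CBS on a second queue, or disabling it on one of two enabled
queues, leaves the mode untouched, so no reset is needed in those cases.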

Signed-off-by: Andre Guedes 
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  23 ++
 drivers/net/ethernet/intel/igb/e1000_regs.h|   8 +
 drivers/net/ethernet/intel/igb/igb.h   |   6 +
 drivers/net/ethernet/intel/igb/igb_main.c  | 347 +
 4 files changed, 384 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h 
b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 1de82f247312..83cabff1e0ab 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -353,7 +353,18 @@
 #define E1000_RXPBS_CFG_TS_EN   0x8000
 
 #define I210_RXPBSIZE_DEFAULT  0x00A2 /* RXPBSIZE default */
+#define I210_RXPBSIZE_MASK 0x003F
+#define I210_RXPBSIZE_PB_32KB  0x0020
 #define I210_TXPBSIZE_DEFAULT  0x0414 /* TXPBSIZE default */
+#define I210_TXPBSIZE_MASK 0xC0FF
+#define I210_TXPBSIZE_PB0_8KB  (8 << 0)
+#define I210_TXPBSIZE_PB1_8KB  (8 << 6)
+#define I210_TXPBSIZE_PB2_4KB  (4 << 12)
+#define I210_TXPBSIZE_PB3_4KB  (4 << 18)
+
+#define I210_DTXMXPKTSZ_DEFAULT0x0098
+
+#define I210_SR_QUEUES_NUM 2
 
 /* SerDes Control */
 #define E1000_SCTL_DISABLE_SERDES_LOOPBACK 0x0400
@@ -1051,4 +1062,16 @@
 #define E1000_VLAPQF_P_VALID(_n)   (0x1 << (3 + (_n) * 4))
 #define E1000_VLAPQF_QUEUE_MASK0x03
 
+/* TX Qav Control fields */
+#define E1000_TQAVCTRL_XMIT_MODE   BIT(0)
+#define E1000_TQAVCTRL_DATAFETCHARBBIT(4)
+#define E1000_TQAVCTRL_DATATRANARB BIT(8)
+
+/* TX Qav Credit Control fields */
+#define E1000_TQAVCC_IDLESLOPE_MASK0x
+#define E1000_TQAVCC_QUEUEMODE BIT(31)
+
+/* Transmit Descriptor Control fields */
+#define E1000_TXDCTL_PRIORITY  BIT(27)
+
 #endif
diff --git a/drivers/net/ethernet/intel/igb/e1000_regs.h 
b/drivers/net/ethernet/intel/igb/e1000_regs.h
index 58adbf234e07..8eee081d395f 100644
--- a/drivers/net/ethernet/intel/igb/e1000_regs.h
+++ b/drivers/net/ethernet/intel/igb/e1000_regs.h
@@ -421,6 +421,14 @@ do { \
 
 #define E1000_I210_FLA 0x1201C
 
+#define E1000_I210_DTXMXPKTSZ  0x355C
+
+#define E1000_I210_TXDCTL(_n)  (0x0E028 + ((_n) * 0x40))
+
+#define E1000_I210_TQAVCTRL0x3570
+#define E1000_I210_TQAVCC(_n)  (0x3004 + ((_n) * 0x40))
+#define E1000_I210_TQAVHC(_n)  (0x300C + ((_n) * 0x40))
+
 #define E1000_INVM_DATA_REG(_n)(0x12120 + 4*(_n))
 #define E1000_INVM_SIZE64 /* Number of INVM Data Registers */
 
diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 06ffb2bc713e..92845692087a 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -281,6 +281,11 @@ struct igb_ring {
u16 count;  /* number of desc. in the ring */
u8 queue_index; /* logical index of the ring*/
u8 reg_idx; /* physical index of the ring */
+   bool cbs_enable;/* indicates if CBS is enabled */
+   s32 idleslope;  /* idleSlope in kbps */
+   s32 sendslope;  /* sendSlope in kbps */
+   s32 hicredit;   /* hiCredit in bytes */
+   s32 locredit;   /* loCredit in bytes */
 
/* everything past this point are written often */
u16 next_to_clean;
@@ -621,6 +626,7 @@ struct igb_adapter {
 #define IGB_FLAG_EEE   BIT(14)
 #define IGB_FLAG_VLAN_PROMISC  BIT(15)
 #define IGB_FLAG_RX_LEGACY BIT(16)
+#define IGB_FLAG_FQTSS BIT(17)
 
 /* Media Auto Sense */
 #define IGB_MAS_ENABLE_0   0X0001
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 837d9b46a390..be2cf263efa9 100644
--- 

[next-queue PATCH v8 1/6] net/sched: Check for null dev_queue on create flow

2017-10-13 Thread Vinicius Costa Gomes
From: Jesus Sanchez-Palencia 

In qdisc_alloc() the dev_queue pointer was used without any checks
being performed. If qdisc_create() gets a null dev_queue pointer, it
just passes it along to qdisc_alloc(), leading to a crash. That
happens if a root qdisc implements select_queue() and returns a null
dev_queue pointer for an "invalid handle", for example, or if the
dev_queue associated with the parent qdisc is null.

This patch is in preparation for the next in this series, where
select_queue() is added to mqprio and may return a null
dev_queue.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/sched/sch_generic.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a0a198768aad..de2408f1ccd3 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -603,8 +603,14 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
struct Qdisc *sch;
unsigned int size = QDISC_ALIGN(sizeof(*sch)) + ops->priv_size;
int err = -ENOBUFS;
-   struct net_device *dev = dev_queue->dev;
+   struct net_device *dev;
+
+   if (!dev_queue) {
+   err = -EINVAL;
+   goto errout;
+   }
 
+   dev = dev_queue->dev;
p = kzalloc_node(size, GFP_KERNEL,
 netdev_queue_numa_node_read(dev_queue));
 
-- 
2.14.2



[next-queue PATCH v8 3/6] net/sched: Add select_queue() class_ops for mqprio

2017-10-13 Thread Vinicius Costa Gomes
From: Jesus Sanchez-Palencia 

When replacing a child qdisc from mqprio, tc_modify_qdisc() must fetch
the netdev_queue pointer that the current child qdisc is associated
with before creating the new qdisc.

Currently, when using mqprio as root qdisc, the kernel will end up
getting the queue #0 pointer from the mqprio (root qdisc), which leaves
any new child qdisc with a possibly wrong netdev_queue pointer.

Implementing the Qdisc_class_ops select_queue() on mqprio fixes this
issue and avoids an inconsistent state when child qdiscs are replaced.

Signed-off-by: Jesus Sanchez-Palencia 
---
 net/sched/sch_mqprio.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 6bcdfe6e7b63..8c042ae323e3 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -396,6 +396,12 @@ static void mqprio_walk(struct Qdisc *sch, struct 
qdisc_walker *arg)
}
 }
 
+static struct netdev_queue *mqprio_select_queue(struct Qdisc *sch,
+   struct tcmsg *tcm)
+{
+   return mqprio_queue_get(sch, TC_H_MIN(tcm->tcm_parent));
+}
+
 static const struct Qdisc_class_ops mqprio_class_ops = {
.graft  = mqprio_graft,
.leaf   = mqprio_leaf,
@@ -403,6 +409,7 @@ static const struct Qdisc_class_ops mqprio_class_ops = {
.walk   = mqprio_walk,
.dump   = mqprio_dump_class,
.dump_stats = mqprio_dump_class_stats,
+   .select_queue   = mqprio_select_queue,
 };
 
 static struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
-- 
2.14.2
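
A rough userspace sketch of the lookup this patch wires up: the minor part of the parent handle selects a tx queue, and an out-of-range minor yields NULL (which is why the previous patch teaches qdisc_alloc() to cope with a null dev_queue). The mask matches TC_H_MIN() from the uapi headers; everything named model_* or MODEL_* is hypothetical, and the "minor - 1 - num_tc" offset is an assumption about how mqprio_queue_get() numbers its classes, not something shown in the diff.

```c
#include <stddef.h>
#include <stdint.h>

/* Same mask as TC_H_MIN() in the uapi headers: low 16 bits of a handle. */
#define MODEL_TC_H_MIN(h) ((h) & 0x0000FFFFU)

#define MODEL_NUM_TX_QUEUES 4

struct model_txq { unsigned int index; };

struct model_dev {
	unsigned int num_tc;                        /* traffic classes */
	struct model_txq txq[MODEL_NUM_TX_QUEUES];  /* per-queue state */
};

/*
 * Assumed layout: minors 1..num_tc name traffic classes, and the minors
 * after that name individual tx queues. Returns NULL when out of range.
 */
struct model_txq *model_select_queue(struct model_dev *dev, uint32_t parent)
{
	unsigned long cl = MODEL_TC_H_MIN(parent);
	unsigned long ntx = cl - 1 - dev->num_tc;   /* wraps for small cl */

	if (ntx >= MODEL_NUM_TX_QUEUES)
		return NULL;
	return &dev->txq[ntx];
}
```

With num_tc set to 2, a parent of 0x00010003 selects queue 0, while 0x00010007 falls outside the four queues and returns NULL.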



Re: [net-next RFC 4/4] openvswitch: Add meter action support

2017-10-13 Thread Pravin Shelar
On Thu, Oct 12, 2017 at 3:38 PM, Andy Zhou  wrote:
> Implements OVS kernel meter action support.
>
> Signed-off-by: Andy Zhou 
> ---
>  include/uapi/linux/openvswitch.h |  1 +
>  net/openvswitch/actions.c| 12 
>  net/openvswitch/datapath.h   |  1 +
>  net/openvswitch/flow_netlink.c   |  6 ++
>  4 files changed, 20 insertions(+)
>
> diff --git a/include/uapi/linux/openvswitch.h 
> b/include/uapi/linux/openvswitch.h
> index 325049a129e4..11fe1a06cdd6 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -835,6 +835,7 @@ enum ovs_action_attr {
> OVS_ACTION_ATTR_TRUNC,/* u32 struct ovs_action_trunc. */
> OVS_ACTION_ATTR_PUSH_ETH, /* struct ovs_action_push_eth. */
> OVS_ACTION_ATTR_POP_ETH,  /* No argument. */
> +   OVS_ACTION_ATTR_METER,/* u32 meter ID. */
>
> __OVS_ACTION_ATTR_MAX,/* Nothing past this will be accepted
>* from userspace. */
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index a54a556fcdb5..4eb160ac5a27 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -1210,6 +1210,12 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
> case OVS_ACTION_ATTR_POP_ETH:
> err = pop_eth(skb, key);
> break;
> +
> +   case OVS_ACTION_ATTR_METER:
> +   if (ovs_meter_execute(dp, skb, key, nla_get_u32(a))) {
> +   consume_skb(skb);
> +   return 0;
> +   }
> }
>
> if (unlikely(err)) {
> @@ -1341,6 +1347,12 @@ int ovs_execute_actions(struct datapath *dp, struct 
> sk_buff *skb,
> err = do_execute_actions(dp, skb, key,
>  acts->actions, acts->actions_len);
>
> +   /* OVS action has dropped the packet, do not expose it
> +* to the user.
> +*/
> +   if (err == -ENODATA)
> +   err = 0;
> +
I am not sure who is returning this error code?


Re: [net-next RFC 3/4] openvswitch: Add meter infrastructure

2017-10-13 Thread Pravin Shelar
On Thu, Oct 12, 2017 at 3:38 PM, Andy Zhou  wrote:
> OVS kernel datapath so far does not support Openflow meter action.
> This is the first stab at adding kernel datapath meter support.
> This implementation supports only drop band type.
>
> Signed-off-by: Andy Zhou 
> ---
>  net/openvswitch/Makefile   |   1 +
>  net/openvswitch/datapath.c |  14 +-
>  net/openvswitch/datapath.h |   3 +
>  net/openvswitch/meter.c| 611 
> +
>  net/openvswitch/meter.h|  54 
>  5 files changed, 681 insertions(+), 2 deletions(-)
>  create mode 100644 net/openvswitch/meter.c
>  create mode 100644 net/openvswitch/meter.h
>
...

> diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
> new file mode 100644
> index ..f24ebb5f7af4
> --- /dev/null
> +++ b/net/openvswitch/meter.c



> +static int ovs_meter_cmd_features(struct sk_buff *skb, struct genl_info 
> *info)
> +{
> +   struct datapath *dp;
> +   struct ovs_header *ovs_header = info->userhdr;
> +   struct sk_buff *reply;
> +   struct ovs_header *ovs_reply_header;
> +   struct nlattr *nla, *band_nla;
> +   int err;
> +
> +   /* Check that the datapath exists */
> +   ovs_lock();
> +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
> +   ovs_unlock();
> +   if (!dp)
> +   return -ENODEV;
> +
Why is the dp check required for this API?

> +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_FEATURES,
> +                                         &ovs_reply_header);
> +   if (!reply)
> +   return PTR_ERR(reply);
> +
> +   if (nla_put_u32(reply, OVS_METER_ATTR_MAX_METERS, U32_MAX) ||
> +   nla_put_u32(reply, OVS_METER_ATTR_MAX_BANDS, DP_MAX_BANDS))
> +   goto nla_put_failure;
> +
> +   nla = nla_nest_start(reply, OVS_METER_ATTR_BANDS);
> +   if (!nla)
> +   goto nla_put_failure;
> +
> +   band_nla = nla_nest_start(reply, OVS_BAND_ATTR_UNSPEC);
> +   if (!band_nla)
> +   goto nla_put_failure;
> +   /* Currently only DROP band type is supported. */
> +   if (nla_put_u32(reply, OVS_BAND_ATTR_TYPE, OVS_METER_BAND_TYPE_DROP))
> +   goto nla_put_failure;
> +   nla_nest_end(reply, band_nla);
> +   nla_nest_end(reply, nla);
> +
> +   genlmsg_end(reply, ovs_reply_header);
> +   return genlmsg_reply(reply, info);
> +
> +nla_put_failure:
> +   nlmsg_free(reply);
> +   err = -EMSGSIZE;
> +   return err;
> +}
> +


> +static int ovs_meter_cmd_set(struct sk_buff *skb, struct genl_info *info)
> +{
> +   struct nlattr **a = info->attrs;
> +   struct dp_meter *meter, *old_meter;
> +   struct sk_buff *reply;
> +   struct ovs_header *ovs_reply_header;
> +   struct ovs_header *ovs_header = info->userhdr;
> +   struct datapath *dp;
> +   int err;
> +   u32 meter_id;
> +   bool failed;
> +
> +   meter = dp_meter_create(a);
> +   if (IS_ERR_OR_NULL(meter))
> +   return PTR_ERR(meter);
> +
> +   reply = ovs_meter_cmd_reply_start(info, OVS_METER_CMD_SET,
> +                                         &ovs_reply_header);
> +   if (IS_ERR(reply)) {
> +   err = PTR_ERR(reply);
> +   goto exit_free_meter;
> +   }
> +
> +   ovs_lock();
> +   dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
> +   if (!dp) {
> +   err = -ENODEV;
> +   goto exit_unlock;
> +   }
> +
> +   if (!a[OVS_METER_ATTR_ID]) {
> +   err = -ENODEV;
> +   goto exit_unlock;
> +   }
> +
> +   meter_id = nla_get_u32(a[OVS_METER_ATTR_ID]);
> +
> +   /* Cannot fail after this. */
> +   old_meter = lookup_meter(dp, meter_id);
> +   attach_meter(dp, meter);
> +   ovs_unlock();
> +
After the unlock, it is not safe to keep the reference to old_meter. It is
better to release the lock at the end; we could optimize it later if required.

> +   /* Build response with the meter_id and stats from
> +* the old meter, if any.
> +*/
> +   failed = nla_put_u32(reply, OVS_METER_ATTR_ID, meter_id);
> +   WARN_ON(failed);
> +   if (old_meter) {
> +   spin_lock_bh(&old_meter->lock);
> +   if (old_meter->keep_stats) {
> +   err = ovs_meter_cmd_reply_stats(reply, meter_id,
> +   old_meter);
> +   WARN_ON(err);
> +   }
> +   spin_unlock_bh(&old_meter->lock);
> +   ovs_meter_free(old_meter);
> +   }
> +
> +   genlmsg_end(reply, ovs_reply_header);
> +   return genlmsg_reply(reply, info);
> +
> +exit_unlock:
> +   ovs_unlock();
> +   nlmsg_free(reply);
> +exit_free_meter:
> +   kfree(meter);
> +   return err;
> +}
> +


> +bool ovs_meter_execute(struct datapath *dp, struct sk_buff *skb,
> +  

Re: [net-next RFC 0/4] Openvswitch meter action

2017-10-13 Thread Pravin Shelar
On Thu, Oct 12, 2017 at 3:38 PM, Andy Zhou  wrote:
> This patch series is the first attempt to add openvswitch
> meter support. We have previously experimented with adding
> metering support in nftables. However 1) It was not clear
> how to expose a named nftables object cleanly, and 2)
> the logic that implements metering is quite small, < 100 lines
> of code.
>
> With those two observations, it seems cleaner to add meter
> support in the openvswitch module directly.
>
>
Thanks for working on this feature. It looks good to me. I have a couple
of comments inline.

> Andy Zhou (4):
>   openvswitch: Add meter netlink definitions
>   openvswitch: export get_dp() API.
>   openvswitch: Add meter infrastructure
>   openvswitch: Add meter action support
>
>  include/uapi/linux/openvswitch.h |  52 
>  net/openvswitch/Makefile |   1 +
>  net/openvswitch/actions.c|  12 +
>  net/openvswitch/datapath.c   |  43 +--
>  net/openvswitch/datapath.h   |  35 +++
>  net/openvswitch/flow_netlink.c   |   6 +
>  net/openvswitch/meter.c  | 611 
> +++
>  net/openvswitch/meter.h  |  54 
>  8 files changed, 783 insertions(+), 31 deletions(-)
>  create mode 100644 net/openvswitch/meter.c
>  create mode 100644 net/openvswitch/meter.h
>
> --
> 1.8.3.1
>


Re: [PATCH net-next] ipv6: only update __use and lastusetime once per jiffy at most

2017-10-13 Thread Martin KaFai Lau
On Fri, Oct 13, 2017 at 10:08:07PM +, Wei Wang wrote:
> From: Wei Wang 
> 
> In order to not dirty the cacheline too often, we try to only update
> dst->__use and dst->lastusetime at most once per jiffy.


> As dst->lastusetime is only used by ipv6 garbage collector, it should
> be good enough time resolution.
Makes sense.

> And __use is only used in ipv6_route_seq_show() to show how many times a
> dst has been used. And as __use is not atomic_t right now, it does not
> show the precise number of usage times anyway. So we think it should be
> OK to only update it at most once per jiffy.
If __use is only bumped HZ number of times per second and we can do ~3Mpps now,
would __use be way off?
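
To put a number on the question above, here is a small userspace model of the once-per-jiffy update. It is only a sketch: struct model_dst and the model_* functions are hypothetical, and HZ=1000 is an assumed configuration. At roughly 3 Mpps that means about 3000 packets share each jiffy, so the counter advances by 1000 per second while 3,000,000 packets go by.

```c
#include <stdint.h>

struct model_dst {
	uint64_t use;       /* models dst->__use */
	uint64_t lastuse;   /* models the last-use timestamp, in jiffies */
};

/* Touch the entry only when the jiffy counter has advanced. */
void model_dst_use(struct model_dst *dst, uint64_t now)
{
	if (now != dst->lastuse) {
		dst->use++;
		dst->lastuse = now;
	}
}

/* Feed pkts_per_jiffy packets during each of `jiffies` ticks. */
uint64_t model_run(struct model_dst *dst, uint64_t jiffies,
		   uint64_t pkts_per_jiffy)
{
	uint64_t sent = 0;
	uint64_t j, p;

	for (j = 1; j <= jiffies; j++) {
		for (p = 0; p < pkts_per_jiffy; p++) {
			model_dst_use(dst, j);
			sent++;
		}
	}
	return sent;
}
```

Running one simulated second, model_run(&dst, 1000, 3000), sends 3,000,000 packets but leaves dst.use at 1000, so under these assumptions the counter reads as "jiffies in which the dst was used" rather than a packet count.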


Re: [Patch net-next v3] tcp: add a tracepoint for tcp retransmission

2017-10-13 Thread Brendan Gregg
On Fri, Oct 13, 2017 at 3:09 PM, Alexei Starovoitov
 wrote:
> On Fri, Oct 13, 2017 at 01:50:44PM -0700, Brendan Gregg wrote:
>> On Fri, Oct 13, 2017 at 1:03 PM, Cong Wang  wrote:
>> > We need a real-time notification for tcp retransmission
>> > for monitoring.
>> >
>> > Of course we could use ftrace to dynamically instrument this
>> > kernel function too, however we can't retrieve the connection
>> > information at the same time, for example perf-tools [1] reads
>> > /proc/net/tcp for socket details, which is slow when we have
>> > a lot of connections.
>> >
>> > Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
>> > and exposes src/dst IP addresses and ports of the connection.
>> > This also makes it easier to integrate into perf.
>> >
>> > Note, I expose both IPv4 and IPv6 addresses at the same time:
>> > for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
>> > for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
>> > Also, add sk and skb pointers as they are useful for BPF.
>>
>> Thanks, a TCP retransmit tracepoint would be great. (tcp_set_state
>> would be highly useful too, which Alexei already has in his list).
>>
>> Should skp->__sk_common.skc_state be included in the format string, so
>> we don't have to always dig it out of the skaddr? For retransmits I
>> always want to know the TCP state, to determine if it is ESTABLISHED
>> (packet drop) or SYN_SENT (backlog full) or something else.
>
> let's not expose internal socket fields into tp fields.
> Few people still believe that tp fields are abi, so to be safe
> no such fields should be exposed.
> It's trivial enough to read sk_state from bpf program
> with bpf_probe_read().

Ah, right, the number mapping for TCP_ESTABLISHED and friends is a
Linux implementation detail, and not from the RFCs. Ok, I can dig it
from the skp instead.

>
>> We probably need a tracepoint for tcp_send_loss_probe() (TLP) as well,
>> for tracing at the same time as retransmits (like my tools do), but
>> that can be added later.
>
> hmm. why?
> This single tracepoint will cover both cases of retransmits.

I don't think tcp_send_loss_probe() TLP goes through
__tcp_retransmit_skb(): look at the path to bumping
LINUX_MIB_TCPLOSSPROBES. I was thinking that later on we might want to
add a tcp:tcp_send_tlp tracepoint, in addition to this
tcp:tcp_retransmit_skb tracepoint, for investigating the same kind of
issues: packet loss. This existing tcp:tcp_retransmit_skb tracepoint
patch is ok.

Acked-by: Brendan Gregg 

(with or without %pI6c)

Brendan


Re: [Patch net-next v3] tcp: add a tracepoint for tcp retransmission

2017-10-13 Thread David Ahern
On 10/13/17 2:03 PM, Cong Wang wrote:
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> new file mode 100644
> index ..3d1cbd072b7e
> --- /dev/null
> +++ b/include/trace/events/tcp.h
> @@ -0,0 +1,68 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM tcp
> +
> +#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_TCP_H
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +TRACE_EVENT(tcp_retransmit_skb,
> +
> + TP_PROTO(struct sock *sk, struct sk_buff *skb),
> +
> + TP_ARGS(sk, skb),
> +
> + TP_STRUCT__entry(
> + __field(void *, skbaddr)
> + __field(void *, skaddr)
> + __field(__u16, sport)
> + __field(__u16, dport)
> + __array(__u8, saddr, 4)
> + __array(__u8, daddr, 4)
> + __array(__u8, saddr_v6, 16)
> + __array(__u8, daddr_v6, 16)
> + ),
> +
> + TP_fast_assign(
> + struct ipv6_pinfo *np = inet6_sk(sk);
> + struct inet_sock *inet = inet_sk(sk);
> + struct in6_addr *pin6;
> + __be32 *p32;
> +
> + __entry->skbaddr = skb;
> + __entry->skaddr = sk;
> +
> + __entry->sport = ntohs(inet->inet_sport);
> + __entry->dport = ntohs(inet->inet_dport);
> +
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = inet->inet_saddr;
> +
> + p32 = (__be32 *) __entry->daddr;
> + *p32 =  inet->inet_daddr;
> +
> + if (np) {
> + pin6 = (struct in6_addr *)__entry->saddr_v6;
> + *pin6 = np->saddr;
> + pin6 = (struct in6_addr *)__entry->daddr_v6;
> + *pin6 = *(np->daddr_cache);
> + } else {
> + pin6 = (struct in6_addr *)__entry->saddr_v6;
> + ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
> + pin6 = (struct in6_addr *)__entry->daddr_v6;
> + ipv6_addr_set_v4mapped(inet->inet_daddr, pin6);
> + }
> + ),
> +
> + TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6 
> daddrv6=%pI6",

%pI6c is more user friendly for IPv6 addresses.
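
As background for the formatting remark, a userspace sketch of what the tracepoint stores for an IPv4 socket: the IPv4 address embedded in a v4-mapped IPv6 address (ten zero bytes, two 0xff bytes, then the four IPv4 bytes). The model_* helpers are hypothetical; inet_ntop() is used here only as a userspace analogue of the compressed output that %pI6c gives, whereas %pI6 prints all eight groups in full.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Build ::ffff:a.b.c.d from an IPv4 address in network byte order. */
void model_set_v4mapped(struct in_addr v4, struct in6_addr *v6)
{
	memset(v6, 0, sizeof(*v6));
	v6->s6_addr[10] = 0xff;
	v6->s6_addr[11] = 0xff;
	memcpy(&v6->s6_addr[12], &v4.s_addr, sizeof(v4.s_addr));
}

/* Format the address in compressed form; returns NULL on failure. */
const char *model_format_v6(const struct in6_addr *v6, char *buf,
			    socklen_t len)
{
	return inet_ntop(AF_INET6, v6, buf, len);
}
```

A buffer of INET6_ADDRSTRLEN bytes is large enough for any result of model_format_v6().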


[PATCH net-next 5/5] mlxsw: spectrum_router: Add extack message for RIF and VRF overflow

2017-10-13 Thread David Ahern
Add extack argument down to mlxsw_sp_rif_create and mlxsw_sp_vr_create
to set an error message on RIF or VR overflow. Now on overflow of
either resource the user gets an informative message as opposed to
failing with EBUSY.

Signed-off-by: David Ahern 
Reviewed-by: Ido Schimmel 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 114 +
 1 file changed, 69 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 2a7f066dfab5..9e0b46513ca7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -731,14 +731,17 @@ static struct mlxsw_sp_fib *mlxsw_sp_vr_fib(const struct 
mlxsw_sp_vr *vr,
 }
 
 static struct mlxsw_sp_vr *mlxsw_sp_vr_create(struct mlxsw_sp *mlxsw_sp,
- u32 tb_id)
+ u32 tb_id,
+ struct netlink_ext_ack *extack)
 {
struct mlxsw_sp_vr *vr;
int err;
 
vr = mlxsw_sp_vr_find_unused(mlxsw_sp);
-   if (!vr)
+   if (!vr) {
+   NL_SET_ERR_MSG(extack, "spectrum: Exceeded number of supported 
virtual routers");
return ERR_PTR(-EBUSY);
+   }
vr->fib4 = mlxsw_sp_fib_create(vr, MLXSW_SP_L3_PROTO_IPV4);
if (IS_ERR(vr->fib4))
return ERR_CAST(vr->fib4);
@@ -775,14 +778,15 @@ static void mlxsw_sp_vr_destroy(struct mlxsw_sp_vr *vr)
vr->fib4 = NULL;
 }
 
-static struct mlxsw_sp_vr *mlxsw_sp_vr_get(struct mlxsw_sp *mlxsw_sp, u32 
tb_id)
+static struct mlxsw_sp_vr *mlxsw_sp_vr_get(struct mlxsw_sp *mlxsw_sp, u32 
tb_id,
+  struct netlink_ext_ack *extack)
 {
struct mlxsw_sp_vr *vr;
 
tb_id = mlxsw_sp_fix_tb_id(tb_id);
vr = mlxsw_sp_vr_find(mlxsw_sp, tb_id);
if (!vr)
-   vr = mlxsw_sp_vr_create(mlxsw_sp, tb_id);
+   vr = mlxsw_sp_vr_create(mlxsw_sp, tb_id, extack);
return vr;
 }
 
@@ -948,7 +952,8 @@ static u32 mlxsw_sp_ipip_dev_ul_tb_id(const struct 
net_device *ol_dev)
 
 static struct mlxsw_sp_rif *
 mlxsw_sp_rif_create(struct mlxsw_sp *mlxsw_sp,
-   const struct mlxsw_sp_rif_params *params);
+   const struct mlxsw_sp_rif_params *params,
+   struct netlink_ext_ack *extack);
 
 static struct mlxsw_sp_rif_ipip_lb *
 mlxsw_sp_ipip_ol_ipip_lb_create(struct mlxsw_sp *mlxsw_sp,
@@ -966,7 +971,7 @@ mlxsw_sp_ipip_ol_ipip_lb_create(struct mlxsw_sp *mlxsw_sp,
.lb_config = ipip_ops->ol_loopback_config(mlxsw_sp, ol_dev),
};
 
-   rif = mlxsw_sp_rif_create(mlxsw_sp, _params.common);
+   rif = mlxsw_sp_rif_create(mlxsw_sp, _params.common, NULL);
if (IS_ERR(rif))
return ERR_CAST(rif);
return container_of(rif, struct mlxsw_sp_rif_ipip_lb, common);
@@ -3711,7 +3716,7 @@ mlxsw_sp_fib_node_get(struct mlxsw_sp *mlxsw_sp, u32 
tb_id, const void *addr,
struct mlxsw_sp_vr *vr;
int err;
 
-   vr = mlxsw_sp_vr_get(mlxsw_sp, tb_id);
+   vr = mlxsw_sp_vr_get(mlxsw_sp, tb_id, NULL);
if (IS_ERR(vr))
return ERR_CAST(vr);
fib = mlxsw_sp_vr_fib(vr, proto);
@@ -4750,7 +4755,7 @@ static int mlxsw_sp_router_fibmr_add(struct mlxsw_sp 
*mlxsw_sp,
if (mlxsw_sp->router->aborted)
return 0;
 
-   vr = mlxsw_sp_vr_get(mlxsw_sp, men_info->tb_id);
+   vr = mlxsw_sp_vr_get(mlxsw_sp, men_info->tb_id, NULL);
if (IS_ERR(vr))
return PTR_ERR(vr);
 
@@ -4783,7 +4788,7 @@ mlxsw_sp_router_fibmr_vif_add(struct mlxsw_sp *mlxsw_sp,
if (mlxsw_sp->router->aborted)
return 0;
 
-   vr = mlxsw_sp_vr_get(mlxsw_sp, ven_info->tb_id);
+   vr = mlxsw_sp_vr_get(mlxsw_sp, ven_info->tb_id, NULL);
if (IS_ERR(vr))
return PTR_ERR(vr);
 
@@ -5346,7 +5351,8 @@ const struct net_device *mlxsw_sp_rif_dev(const struct 
mlxsw_sp_rif *rif)
 
 static struct mlxsw_sp_rif *
 mlxsw_sp_rif_create(struct mlxsw_sp *mlxsw_sp,
-   const struct mlxsw_sp_rif_params *params)
+   const struct mlxsw_sp_rif_params *params,
+   struct netlink_ext_ack *extack)
 {
u32 tb_id = l3mdev_fib_table(params->dev);
const struct mlxsw_sp_rif_ops *ops;
@@ -5360,14 +5366,16 @@ mlxsw_sp_rif_create(struct mlxsw_sp *mlxsw_sp,
type = mlxsw_sp_dev_rif_type(mlxsw_sp, params->dev);
ops = mlxsw_sp->router->rif_ops_arr[type];
 
-   vr = mlxsw_sp_vr_get(mlxsw_sp, tb_id ? : RT_TABLE_MAIN);
+   vr = mlxsw_sp_vr_get(mlxsw_sp, tb_id ? : RT_TABLE_MAIN, extack);
if (IS_ERR(vr))
return ERR_CAST(vr);
vr->rif_count++;
 
err = mlxsw_sp_rif_index_alloc(mlxsw_sp, _index);
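
The pattern in this patch, where some callers pass an extack and others pass NULL, can be sketched in userspace as follows. This is a simplified model, not the driver code: struct model_extack, MODEL_SET_ERR_MSG() and model_vr_create() are hypothetical, and the limit of two virtual routers is made up for the example. The assumption being illustrated is that the message setter quietly does nothing when no extack is supplied.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model of struct netlink_ext_ack. */
struct model_extack { const char *msg; };

/* Record a message only when the caller supplied somewhere to put it. */
#define MODEL_SET_ERR_MSG(extack, text) do {		\
	struct model_extack *e_ = (extack);		\
	if (e_)						\
		e_->msg = (text);			\
} while (0)

#define MODEL_MAX_VRS 2

struct model_router { unsigned int vrs_in_use; };

/* Returns 0 on success or -EBUSY, with a reason, when the table is full. */
int model_vr_create(struct model_router *r, struct model_extack *extack)
{
	if (r->vrs_in_use >= MODEL_MAX_VRS) {
		MODEL_SET_ERR_MSG(extack,
				  "Exceeded number of supported virtual routers");
		return -EBUSY;
	}
	r->vrs_in_use++;
	return 0;
}
```

Internal callers that have no user request to answer can pass NULL and still get the same errno; only the netlink-driven path gains the readable message.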

[PATCH net-next 2/5] net: ipv6: Make inet6addr_validator a blocking notifier

2017-10-13 Thread David Ahern
inet6addr_validator chain was added by commit 3ad7d2468f79f ("Ipvlan
should return an error when an address is already in use") to allow
address validation before changes are committed and to be able to
fail the address change with an error back to the user. The address
validation is not done for addresses received from router
advertisements.

Handling RAs in softirq context is the only reason for the notifier
chain to be atomic versus blocking. Since the only current user, ipvlan,
of the validator chain ignores softirq context, the notifier can be made
blocking and simply not invoked for softirq path.

The blocking option is needed by spectrum, for example, to validate
resources when adding an address to an interface.

Signed-off-by: David Ahern 
---
 drivers/net/ipvlan/ipvlan_main.c |  4 
 net/ipv6/addrconf.c  | 21 ++---
 net/ipv6/addrconf_core.c |  9 +
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 3cf67db513e2..6842739b6679 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -808,10 +808,6 @@ static int ipvlan_addr6_event(struct notifier_block 
*unused,
struct net_device *dev = (struct net_device *)if6->idev->dev;
struct ipvl_dev *ipvlan = netdev_priv(dev);
 
-   /* FIXME IPv6 autoconf calls us from bh without RTNL */
-   if (in_softirq())
-   return NOTIFY_DONE;
-
if (!netif_is_ipvlan(dev))
return NOTIFY_DONE;
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 80f5fc74f0c4..31ff12277bcf 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -993,7 +993,6 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr 
*addr,
struct net *net = dev_net(idev->dev);
struct inet6_ifaddr *ifa = NULL;
struct rt6_info *rt = NULL;
-   struct in6_validator_info i6vi;
int err = 0;
int addr_type = ipv6_addr_type(addr);
 
@@ -1013,12 +1012,20 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
goto out;
}
 
-   i6vi.i6vi_addr = *addr;
-   i6vi.i6vi_dev = idev;
-   err = inet6addr_validator_notifier_call_chain(NETDEV_UP, &i6vi);
-   err = notifier_to_errno(err);
-   if (err < 0)
-   goto out;
+   /* validator notifier needs to be blocking;
+* do not call in atomic context
+*/
+   if (can_block) {
+   struct in6_validator_info i6vi = {
+   .i6vi_addr = *addr,
+   .i6vi_dev = idev,
+   };
+
+   err = inet6addr_validator_notifier_call_chain(NETDEV_UP, &i6vi);
+   err = notifier_to_errno(err);
+   if (err < 0)
+   goto out;
+   }
 
ifa = kzalloc(sizeof(*ifa), gfp_flags);
if (!ifa) {
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 9e3488d50b15..32b564dfd02a 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -88,7 +88,7 @@ int __ipv6_addr_type(const struct in6_addr *addr)
 EXPORT_SYMBOL(__ipv6_addr_type);
 
 static ATOMIC_NOTIFIER_HEAD(inet6addr_chain);
-static ATOMIC_NOTIFIER_HEAD(inet6addr_validator_chain);
+static BLOCKING_NOTIFIER_HEAD(inet6addr_validator_chain);
 
 int register_inet6addr_notifier(struct notifier_block *nb)
 {
@@ -110,19 +110,20 @@ EXPORT_SYMBOL(inet6addr_notifier_call_chain);
 
 int register_inet6addr_validator_notifier(struct notifier_block *nb)
 {
-   return atomic_notifier_chain_register(&inet6addr_validator_chain, nb);
+   return blocking_notifier_chain_register(&inet6addr_validator_chain, nb);
 }
 EXPORT_SYMBOL(register_inet6addr_validator_notifier);
 
 int unregister_inet6addr_validator_notifier(struct notifier_block *nb)
 {
-   return atomic_notifier_chain_unregister(&inet6addr_validator_chain, nb);
+   return blocking_notifier_chain_unregister(&inet6addr_validator_chain,
+ nb);
 }
 EXPORT_SYMBOL(unregister_inet6addr_validator_notifier);
 
 int inet6addr_validator_notifier_call_chain(unsigned long val, void *v)
 {
-   return atomic_notifier_call_chain(&inet6addr_validator_chain, val, v);
+   return blocking_notifier_call_chain(&inet6addr_validator_chain, val, v);
 }
 EXPORT_SYMBOL(inet6addr_validator_notifier_call_chain);
 
-- 
2.1.4
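
The control flow this patch introduces can be sketched in userspace like this. It is a simplified model under stated assumptions: struct model_chain, model_chain_call(), model_add_addr() and model_reject_all() are hypothetical, and a plain array of callbacks stands in for the kernel's notifier chain. What it shows is only the gating: validators run when the caller may block, and are skipped on the atomic (softirq) path.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*model_validator_fn)(const char *addr);

struct model_chain {
	model_validator_fn fn[4];
	size_t count;
};

/* Example validator that refuses every address. */
int model_reject_all(const char *addr)
{
	(void)addr;
	return -EADDRINUSE;
}

/* Run validators in order; the first failure stops the chain. */
int model_chain_call(const struct model_chain *c, const char *addr)
{
	size_t i;

	for (i = 0; i < c->count; i++) {
		int err = c->fn[i](addr);

		if (err < 0)
			return err;
	}
	return 0;
}

/* Validators may sleep, so they are consulted only when blocking is OK. */
int model_add_addr(const struct model_chain *c, const char *addr,
		   bool can_block)
{
	if (can_block) {
		int err = model_chain_call(c, addr);

		if (err < 0)
			return err;
	}
	return 0;   /* the address would be installed here */
}
```

The trade-off is visible in the model: an address arriving on the atomic path is accepted without validation, which matches the commit message's note that validation is not done for addresses received from router advertisements.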



[PATCH net-next 4/5] mlxsw: spectrum: router: Add support for address validator notifier

2017-10-13 Thread David Ahern
Add support for inetaddr_validator and inet6addr_validator. The
notifiers provide a means for validating ipv4 and ipv6 addresses
before the addresses are installed and on failure the error
is propagated back to the user.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 15 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  4 ++
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 52 ++
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 321988ac57cc..d51402f98f97 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -4505,9 +4505,16 @@ static struct notifier_block mlxsw_sp_netdevice_nb 
__read_mostly = {
.notifier_call = mlxsw_sp_netdevice_event,
 };
 
+static struct notifier_block mlxsw_sp_inetaddr_valid_nb __read_mostly = {
+   .notifier_call = mlxsw_sp_inetaddr_valid_event,
+};
+
 static struct notifier_block mlxsw_sp_inetaddr_nb __read_mostly = {
.notifier_call = mlxsw_sp_inetaddr_event,
-   .priority = 10, /* Must be called before FIB notifier block */
+};
+
+static struct notifier_block mlxsw_sp_inet6addr_valid_nb __read_mostly = {
+   .notifier_call = mlxsw_sp_inet6addr_valid_event,
 };
 
 static struct notifier_block mlxsw_sp_inet6addr_nb __read_mostly = {
@@ -4533,7 +4540,9 @@ static int __init mlxsw_sp_module_init(void)
int err;
 
register_netdevice_notifier(&mlxsw_sp_netdevice_nb);
+   register_inetaddr_validator_notifier(&mlxsw_sp_inetaddr_valid_nb);
register_inetaddr_notifier(&mlxsw_sp_inetaddr_nb);
+   register_inet6addr_validator_notifier(&mlxsw_sp_inet6addr_valid_nb);
register_inet6addr_notifier(&mlxsw_sp_inet6addr_nb);
register_netevent_notifier(&mlxsw_sp_router_netevent_nb);
 
@@ -4552,7 +4561,9 @@ static int __init mlxsw_sp_module_init(void)
 err_core_driver_register:
unregister_netevent_notifier(&mlxsw_sp_router_netevent_nb);
unregister_inet6addr_notifier(&mlxsw_sp_inet6addr_nb);
+   unregister_inet6addr_validator_notifier(&mlxsw_sp_inet6addr_valid_nb);
unregister_inetaddr_notifier(&mlxsw_sp_inetaddr_nb);
+   unregister_inetaddr_validator_notifier(&mlxsw_sp_inetaddr_valid_nb);
unregister_netdevice_notifier(&mlxsw_sp_netdevice_nb);
return err;
 }
@@ -4563,7 +4574,9 @@ static void __exit mlxsw_sp_module_exit(void)
mlxsw_core_driver_unregister(&mlxsw_sp_driver);
unregister_netevent_notifier(&mlxsw_sp_router_netevent_nb);
unregister_inet6addr_notifier(&mlxsw_sp_inet6addr_nb);
+   unregister_inet6addr_validator_notifier(&mlxsw_sp_inet6addr_valid_nb);
unregister_inetaddr_notifier(&mlxsw_sp_inetaddr_nb);
+   unregister_inetaddr_validator_notifier(&mlxsw_sp_inetaddr_valid_nb);
unregister_netdevice_notifier(&mlxsw_sp_netdevice_nb);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 8e45183dc9bb..4865a6f58c83 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -390,8 +390,12 @@ int mlxsw_sp_router_netevent_event(struct notifier_block 
*unused,
 int mlxsw_sp_netdevice_router_port_event(struct net_device *dev);
 int mlxsw_sp_inetaddr_event(struct notifier_block *unused,
unsigned long event, void *ptr);
+int mlxsw_sp_inetaddr_valid_event(struct notifier_block *unused,
+ unsigned long event, void *ptr);
 int mlxsw_sp_inet6addr_event(struct notifier_block *unused,
 unsigned long event, void *ptr);
+int mlxsw_sp_inet6addr_valid_event(struct notifier_block *unused,
+  unsigned long event, void *ptr);
 int mlxsw_sp_netdevice_vrf_event(struct net_device *l3_dev, unsigned long 
event,
 struct netdev_notifier_changeupper_info *info);
 void
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 6a356f4b99a3..2a7f066dfab5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -5656,6 +5656,32 @@ int mlxsw_sp_inetaddr_event(struct notifier_block 
*unused,
struct mlxsw_sp_rif *rif;
int err = 0;
 
+   /* NETDEV_UP event is handled by mlxsw_sp_inetaddr_valid_event */
+   if (event == NETDEV_UP)
+   goto out;
+
+   mlxsw_sp = mlxsw_sp_lower_get(dev);
+   if (!mlxsw_sp)
+   goto out;
+
+   rif = mlxsw_sp_rif_find_by_dev(mlxsw_sp, dev);
+   if (!mlxsw_sp_rif_should_config(rif, dev, event))
+   goto out;
+
+   err = __mlxsw_sp_inetaddr_event(dev, event);
+out:
+   return notifier_from_errno(err);
+}
+
+int mlxsw_sp_inetaddr_valid_event(struct notifier_block *unused,
+  
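
The handlers in this patch return notifier_from_errno(err), and the caller of the validator chain recovers the errno with notifier_to_errno(). A userspace sketch of that round trip follows. The model_* functions and MODEL_* constants are hypothetical names; their values and logic are modelled on my recollection of include/linux/notifier.h and should be treated as an assumption rather than a quotation of the header.

```c
#include <errno.h>

#define MODEL_NOTIFY_OK        0x0001
#define MODEL_NOTIFY_STOP_MASK 0x8000

/* Encode a negative errno as a "stop the chain" notifier return value. */
int model_notifier_from_errno(int err)
{
	if (err)
		return MODEL_NOTIFY_STOP_MASK | (MODEL_NOTIFY_OK - err);
	return MODEL_NOTIFY_OK;
}

/* Recover the negative errno, or 0 when the chain reported no error. */
int model_notifier_to_errno(int ret)
{
	ret &= ~MODEL_NOTIFY_STOP_MASK;
	return ret > MODEL_NOTIFY_OK ? MODEL_NOTIFY_OK - ret : 0;
}
```

Because a failing validator sets the stop bit, later handlers on the chain are not consulted, and the first error is the one propagated back to the user.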

[PATCH net-next 1/5] ipv6: addrconf: cleanup locking in ipv6_add_addr

2017-10-13 Thread David Ahern
ipv6_add_addr is called in process context with rtnl lock held
(e.g., manual config of an address) or during softirq processing
(e.g., autoconf and address from a router advertisement).

Currently, ipv6_add_addr calls rcu_read_lock_bh shortly after entry
and does not call unlock until exit, minus the call around the address
validator notifier. Similarly, addrconf_hash_lock is taken after the
validator notifier and held until exit. This forces the allocation of
inet6_ifaddr to always be atomic.

Refactor ipv6_add_addr as follows:
1. add an input boolean to discriminate the call path (process context
   or softirq). This new flag controls whether the alloc can be done
   with GFP_KERNEL or GFP_ATOMIC.

2. Move the rcu_read_lock_bh and unlock calls only around functions that
   do rcu updates.

3. Remove the in6_dev_hold and put added by 3ad7d2468f79f ("Ipvlan should
   return an error when an address is already in use."). This was done
   presumably because rcu_read_unlock_bh needs to be called before calling
   the validator. Since rcu_read_lock is not needed before the validator
   runs, revert the hold and put added by 3ad7d2468f79f and only do the
   hold when setting ifp->idev.

4. move duplicate address check and insertion of new address in the global
   address hash into a helper. The helper is called after an ifa is
   allocated and filled in.

This allows the ifa for manually configured addresses to be done with
GFP_KERNEL and reduces the overall amount of time with rcu_read_lock held
and hash table spinlock held.

Signed-off-by: David Ahern 
---
 net/ipv6/addrconf.c | 97 +
 1 file changed, 54 insertions(+), 43 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4603aa488f4f..80f5fc74f0c4 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -957,18 +957,43 @@ static u32 inet6_addr_hash(const struct in6_addr *addr)
return hash_32(ipv6_addr_hash(addr), IN6_ADDR_HSIZE_SHIFT);
 }
 
+static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa)
+{
+   unsigned int hash;
+   int err = 0;
+
+   spin_lock(&addrconf_hash_lock);
+
+   /* Ignore adding duplicate addresses on an interface */
+   if (ipv6_chk_same_addr(dev_net(dev), &ifa->addr, dev)) {
+   ADBG("ipv6_add_addr: already assigned\n");
+   err = -EEXIST;
+   goto out;
+   }
+
+   /* Add to big hash table */
+   hash = inet6_addr_hash(&ifa->addr);
+   hlist_add_head_rcu(&ifa->addr_lst, &inet6_addr_lst[hash]);
+
+out:
+   spin_unlock(&addrconf_hash_lock);
+
+   return err;
+}
+
 /* On success it returns ifp with increased reference count */
 
 static struct inet6_ifaddr *
 ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
  const struct in6_addr *peer_addr, int pfxlen,
- int scope, u32 flags, u32 valid_lft, u32 prefered_lft)
+ int scope, u32 flags, u32 valid_lft, u32 prefered_lft,
+ bool can_block)
 {
+   gfp_t gfp_flags = can_block ? GFP_KERNEL : GFP_ATOMIC;
struct net *net = dev_net(idev->dev);
struct inet6_ifaddr *ifa = NULL;
-   struct rt6_info *rt;
+   struct rt6_info *rt = NULL;
struct in6_validator_info i6vi;
-   unsigned int hash;
int err = 0;
int addr_type = ipv6_addr_type(addr);
 
@@ -978,42 +1003,24 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
 addr_type & IPV6_ADDR_LOOPBACK))
return ERR_PTR(-EADDRNOTAVAIL);
 
-   rcu_read_lock_bh();
-
-   in6_dev_hold(idev);
-
if (idev->dead) {
err = -ENODEV;  /*XXX*/
-   goto out2;
+   goto out;
}
 
if (idev->cnf.disable_ipv6) {
err = -EACCES;
-   goto out2;
+   goto out;
}
 
i6vi.i6vi_addr = *addr;
i6vi.i6vi_dev = idev;
-   rcu_read_unlock_bh();
-
err = inet6addr_validator_notifier_call_chain(NETDEV_UP, &i6vi);
-
-   rcu_read_lock_bh();
err = notifier_to_errno(err);
-   if (err)
-   goto out2;
-
-   spin_lock(&addrconf_hash_lock);
-
-   /* Ignore adding duplicate addresses on an interface */
-   if (ipv6_chk_same_addr(dev_net(idev->dev), addr, idev->dev)) {
-   ADBG("ipv6_add_addr: already assigned\n");
-   err = -EEXIST;
+   if (err < 0)
goto out;
-   }
-
-   ifa = kzalloc(sizeof(struct inet6_ifaddr), GFP_ATOMIC);
 
+   ifa = kzalloc(sizeof(*ifa), gfp_flags);
if (!ifa) {
ADBG("ipv6_add_addr: malloc failed\n");
err = -ENOBUFS;
@@ -1053,16 +1060,21 @@ ipv6_add_addr(struct inet6_dev *idev, const struct 
in6_addr *addr,
ifa->rt = rt;
 
ifa->idev = idev;
+   in6_dev_hold(idev);
+
/* For caller */
refcount_set(&ifa->refcnt, 1);
 
-   /* Add to big hash 
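
Item 4 of the commit message (duplicate check and hash insertion moved into one helper) can be sketched in userspace as below. This is a toy model, not the kernel helper: struct model_ifa, model_hash() and model_add_addr_hash() are hypothetical, addresses are reduced to 32-bit integers, a singly linked list replaces the RCU hlist, and the spinlock is only mentioned in a comment because the model is single-threaded.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_HSIZE 16

struct model_ifa {
	uint32_t addr;            /* stand-in for the IPv6 address */
	int ifindex;              /* stand-in for the owning device */
	struct model_ifa *next;   /* bucket chain */
};

static struct model_ifa *model_table[MODEL_HSIZE];

/* Multiplicative hash folded down to 4 bits (16 buckets). */
unsigned int model_hash(uint32_t addr)
{
	return (uint32_t)(addr * 2654435761u) >> 28;
}

/*
 * Check for a duplicate and insert in one step. The kernel helper holds
 * addrconf_hash_lock across both, so no other writer can slip in between.
 */
int model_add_addr_hash(struct model_ifa *ifa)
{
	unsigned int h = model_hash(ifa->addr);
	struct model_ifa *cur;

	for (cur = model_table[h]; cur; cur = cur->next) {
		if (cur->addr == ifa->addr && cur->ifindex == ifa->ifindex)
			return -EEXIST;
	}

	ifa->next = model_table[h];
	model_table[h] = ifa;
	return 0;
}
```

Doing the check after the entry is allocated and filled in, as the commit message describes, is what lets the allocation itself happen outside the lock and with GFP_KERNEL when the caller can block.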

[PATCH net-next 3/5] net: Add extack to validator_info structs used for address notifier

2017-10-13 Thread David Ahern
Add extack to in_validator_info and in6_validator_info. Update the one
user of each, ipvlan, to return an error message for failures.

Only manual configuration of an address is plumbed in the IPv6 code path.

Signed-off-by: David Ahern 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ipvlan/ipvlan_main.c | 10 --
 include/linux/inetdevice.h   |  1 +
 include/net/addrconf.h   |  1 +
 net/ipv4/devinet.c   |  8 +---
 net/ipv6/addrconf.c  | 22 --
 5 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 6842739b6679..f0ab55df57f1 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -847,8 +847,11 @@ static int ipvlan_addr6_validator_event(struct 
notifier_block *unused,
 
switch (event) {
case NETDEV_UP:
-   if (ipvlan_addr_busy(ipvlan->port, &i6vi->i6vi_addr, true))
+   if (ipvlan_addr_busy(ipvlan->port, &i6vi->i6vi_addr, true)) {
+   NL_SET_ERR_MSG(i6vi->extack,
+  "Address already assigned to an ipvlan device");
return notifier_from_errno(-EADDRINUSE);
+   }
break;
}
 
@@ -917,8 +920,11 @@ static int ipvlan_addr4_validator_event(struct notifier_block *unused,
 
switch (event) {
case NETDEV_UP:
-   if (ipvlan_addr_busy(ipvlan->port, >ivi_addr, false))
+   if (ipvlan_addr_busy(ipvlan->port, >ivi_addr, false)) {
+   NL_SET_ERR_MSG(ivi->extack,
+  "Address already assigned to an ipvlan device");
return notifier_from_errno(-EADDRINUSE);
+   }
break;
}
 
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 751d051f0bc7..681dff30940b 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -154,6 +154,7 @@ struct in_ifaddr {
 struct in_validator_info {
__be32  ivi_addr;
struct in_device*ivi_dev;
+   struct netlink_ext_ack  *extack;
 };
 
 int register_inetaddr_notifier(struct notifier_block *nb);
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 87981cd63180..b8b16437c6d5 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -55,6 +55,7 @@ struct prefix_info {
 struct in6_validator_info {
struct in6_addr i6vi_addr;
struct inet6_dev*i6vi_dev;
+   struct netlink_ext_ack  *extack;
 };
 
 #define IN6_ADDR_HSIZE_SHIFT   4
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 7ce22a2c07ce..93773e5a80c7 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -444,7 +444,7 @@ static void check_lifetime(struct work_struct *work);
 static DECLARE_DELAYED_WORK(check_lifetime_work, check_lifetime);
 
 static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
-u32 portid)
+u32 portid, struct netlink_ext_ack *extack)
 {
struct in_device *in_dev = ifa->ifa_dev;
struct in_ifaddr *ifa1, **ifap, **last_primary;
@@ -489,6 +489,7 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 */
ivi.ivi_addr = ifa->ifa_address;
ivi.ivi_dev = ifa->ifa_dev;
+   ivi.extack = extack;
ret = blocking_notifier_call_chain(_validator_chain,
   NETDEV_UP, );
ret = notifier_to_errno(ret);
@@ -521,7 +522,7 @@ static int __inet_insert_ifa(struct in_ifaddr *ifa, struct nlmsghdr *nlh,
 
 static int inet_insert_ifa(struct in_ifaddr *ifa)
 {
-   return __inet_insert_ifa(ifa, NULL, 0);
+   return __inet_insert_ifa(ifa, NULL, 0, NULL);
 }
 
 static int inet_set_ifa(struct net_device *dev, struct in_ifaddr *ifa)
@@ -902,7 +903,8 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh,
return ret;
}
}
-   return __inet_insert_ifa(ifa, nlh, NETLINK_CB(skb).portid);
+   return __inet_insert_ifa(ifa, nlh, NETLINK_CB(skb).portid,
+extack);
} else {
inet_free_ifa(ifa);
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 31ff12277bcf..0075dd3fdc57 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -987,7 +987,7 @@ static struct inet6_ifaddr *
 ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
  const struct in6_addr *peer_addr, int pfxlen,
  int scope, u32 flags, u32 valid_lft, u32 prefered_lft,
- bool can_block)
+ bool can_block, struct netlink_ext_ack *extack)
 {
gfp_t gfp_flags = can_block ? GFP_KERNEL : GFP_ATOMIC;

[PATCH net-next 0/5] mlxsw: spectrum_router: Add extack messages for RIF and VRF overflow

2017-10-13 Thread David Ahern
Currently, exceeding the number of VRF instances or the number of router
interfaces either fails with a non-intuitive EBUSY:
$ ip li set swp1s1.6 vrf vrf-1s1-6 up
RTNETLINK answers: Device or resource busy

or fails silently (IPv6) since the checks are done in a work queue. This
set adds support for the address validator notifier to spectrum which
allows ext-ack based messages to be returned on failure.

To make that happen the IPv6 version needs to be converted from atomic
to blocking (patch 1), and then support for extack needs to be added
to the notifier (patch 2). Patches 3 and 4 add the validator notifier
to spectrum and then plumb the extack argument.

With this set, VRF overflows fail with:
   $ ip li set swp1s1.6 vrf vrf-1s1-6 up
   Error: spectrum: Exceeded number of supported VRF.

and RIF overflows fail with:
   $ ip addr add dev swp1s2.191 10.12.191.1/24
   Error: spectrum: Exceeded number of supported router interfaces.
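
For readers unfamiliar with the pattern, here is a minimal userspace
sketch of a validator chain that can veto an address and attach a
message for the caller. Every name in it (struct ext_ack, struct
validator_info, demo_rif_validator, call_validators, MAX_RIFS) is a
hypothetical stand-in, not the kernel or mlxsw definition, and the
errno returned is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel definitions. */
struct ext_ack {
	const char *msg;
};

struct validator_info {
	unsigned int addr;       /* address being added */
	struct ext_ack *extack;  /* may be NULL when nobody wants a message */
};

typedef int (*validator_fn)(const struct validator_info *info);

#define MAX_RIFS 2           /* illustrative limit, not a hardware value */
static int rifs_in_use;

/* Driver-side validator: veto the address and say why. */
int demo_rif_validator(const struct validator_info *info)
{
	assert(info != NULL);
	if (rifs_in_use >= MAX_RIFS) {
		if (info->extack)
			info->extack->msg =
				"Exceeded number of supported router interfaces";
		return -ENOBUFS;
	}
	rifs_in_use++;
	return 0;
}

/* Core-side: run every validator and stop at the first veto. */
int call_validators(validator_fn *chain, size_t n,
		    const struct validator_info *info)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int err = chain[i](info);

		if (err)
			return err;
	}
	return 0;
}
```

The point of the design is that the veto happens synchronously, before
the address is committed, so the message can travel back with the error
instead of being lost in deferred work.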

Changes since RFC
- addressed various comments from Ido
- refactored ipv6_add_addr to allow ifa's to be allocated with
  GFP_KERNEL as requested by DaveM

Ido: given the changes in patch 1 and the impact to what is now
 patch 2 I dropped your Reviewed-by tag from patch 2.

David Ahern (5):
  ipv6: addrconf: cleanup locking in ipv6_add_addr
  net: ipv6: Make inet6addr_validator a blocking notifier
  net: Add extack to validator_info structs used for address notifier
  mlxsw: spectrum: router: Add support for address validator notifier
  mlxsw: spectrum_router: Add extack message for RIF and VRF overflow

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  15 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   4 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 162 +++--
 drivers/net/ipvlan/ipvlan_main.c   |  14 +-
 include/linux/inetdevice.h |   1 +
 include/net/addrconf.h |   1 +
 net/ipv4/devinet.c |   8 +-
 net/ipv6/addrconf.c| 122 +---
 net/ipv6/addrconf_core.c   |   9 +-
 9 files changed, 228 insertions(+), 108 deletions(-)

-- 
2.1.4



Re: [next-queue PATCH v7 4/6] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-10-13 Thread Vinicius Costa Gomes
Hi,

Ivan Khoronzhuk  writes:

[...]

>> +
>> +static int cbs_enqueue_soft(struct sk_buff *skb, struct Qdisc *sch)
>> +{
>> +struct cbs_sched_data *q = qdisc_priv(sch);
>> +
>> +if (sch->q.qlen == 0 && q->credits > 0) {
>> +/* We need to stop accumulating credits when there's
>> + * no packet enqueued packets and q->credits is
> no packet -> no

Ugh. Fixed.

>
>> + * positive.
>> + */
>> +q->credits = 0;
>> +q->last = ktime_get_ns();
>> +}
>> +
>> +return qdisc_enqueue_tail(skb, sch);
>> +}
>> +

[...]

>> +static struct sk_buff *cbs_dequeue_soft(struct Qdisc *sch)
>> +{
>> +struct cbs_sched_data *q = qdisc_priv(sch);
>> +s64 now = ktime_get_ns();
>> +struct sk_buff *skb;
>> +s64 credits;
>> +int len;
>> +
>> +if (q->credits < 0) {
>> +credits = timediff_to_credits(now - q->last, q->idleslope);
> Maybe be better to add small optimization by moving some calculations from
> data path, I mean, save idle_slope in bytes instead of kbit and converting it for
> every packet. Both delay_from_credits() and timediff_to_credits() is used only
> once and with idle_slope only...and both of them converting it.
>
> Same for credits_from_len() and send slope, save it in units of port_rate.
>

Done. Thanks.


Cheers,
--
Vinicius


Re: [PATCH net-next] ipv6: check fn before doing FIB6_SUBTREE(fn)

2017-10-13 Thread Martin KaFai Lau
On Fri, Oct 13, 2017 at 10:01:08PM +, Wei Wang wrote:
> From: Wei Wang 
> 
> In fib6_locate(), we need to first make sure fn is not NULL before doing
> FIB6_SUBTREE(fn) to avoid crash.
Acked-by: Martin KaFai Lau 


Re: [PATCH] Add -target to clang switch while cross compiling.

2017-10-13 Thread Abhijit Ayarekar
On Fri, Oct 13, 2017 at 03:10:38PM -0700, Alexei Starovoitov wrote:
> On Fri, Oct 13, 2017 at 12:24:06PM -0700, Abhijit Ayarekar wrote:
> > Update to llvm excludes assembly instructions.
> > llvm git revision is below
> > 
> > commit 65fad7c26569 ("bpf: add inline-asm support")
> > 
> > This change will be part of llvm  release 6.0
> > 
> > __ASM_SYSREG_H define is not required for native compile.
> > -target switch includes appropriate target specific files
> > while cross compiling
> > 
> > Tested on x86 and arm64.
> > 
> > Signed-off-by: Abhijit Ayarekar 
> 
> Thanks
> Acked-by: Alexei Starovoitov 
>
Glad i could help :)
 


Re: [PATCH] Add -target to clang switch while cross compiling.

2017-10-13 Thread Alexei Starovoitov
On Fri, Oct 13, 2017 at 12:24:06PM -0700, Abhijit Ayarekar wrote:
> Update to llvm excludes assembly instructions.
> llvm git revision is below
> 
> commit 65fad7c26569 ("bpf: add inline-asm support")
> 
> This change will be part of llvm  release 6.0
> 
> __ASM_SYSREG_H define is not required for native compile.
> -target switch includes appropriate target specific files
> while cross compiling
> 
> Tested on x86 and arm64.
> 
> Signed-off-by: Abhijit Ayarekar 

Thanks
Acked-by: Alexei Starovoitov 



Re: [Patch net-next v3] tcp: add a tracepoint for tcp retransmission

2017-10-13 Thread Alexei Starovoitov
On Fri, Oct 13, 2017 at 01:50:44PM -0700, Brendan Gregg wrote:
> On Fri, Oct 13, 2017 at 1:03 PM, Cong Wang  wrote:
> > We need a real-time notification for tcp retransmission
> > for monitoring.
> >
> > Of course we could use ftrace to dynamically instrument this
> > kernel function too, however we can't retrieve the connection
> > information at the same time, for example perf-tools [1] reads
> > /proc/net/tcp for socket details, which is slow when we have
> > a lots of connections.
> >
> > Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
> > and exposes src/dst IP addresses and ports of the connection.
> > This also makes it easier to integrate into perf.
> >
> > Note, I expose both IPv4 and IPv6 addresses at the same time:
> > for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
> > for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
> > Also, add sk and skb pointers as they are useful for BPF.
> 
> Thanks, a TCP retransmit tracepoint would be great. (tcp_set_state
> would be highly useful too, which Alexei already has in his list).
> 
> Should skp->__sk_common.skc_state be included in the format string, so
> we don't have to always dig it out of the skaddr? For retransmits I
> always want to know the TCP state, to determine if it is ESTABLISHED
> (packet drop) or SYN_SENT (backlog full) or something else.

let's not expose internal socket fields into tp fields.
Few people still believe that tp fields are abi, so to be safe
no such fields should be exposed.
It's trivial enough to read sk_state from bpf program
with bpf_probe_read().

> We probably need a tracepoint for tcp_send_loss_probe() (TLP) as well,
> for tracing at the same time as retransmits (like my tools do), but
> that can be added later.

hmm. why?
This single tracepoint will cover both cases of retransmits.

For the patch:
Acked-by: Alexei Starovoitov 

I believe it will fit our use case perfectly.



[PATCH net-next] ipv6: only update __use and lastusetime once per jiffy at most

2017-10-13 Thread Wei Wang
From: Wei Wang 

In order to not dirty the cacheline too often, we try to only update
dst->__use and dst->lastusetime at most once per jiffy.
As dst->lastusetime is only used by the ipv6 garbage collector, a jiffy
should be good enough time resolution.
__use is only used in ipv6_route_seq_show() to show how many times a
dst has been used. As __use is not atomic_t right now, it does not
show the precise number of usage times anyway, so we think it should be
OK to only update it at most once per jiffy.

In my latest SYN flood test, on a machine with an Intel Xeon 6th gen
processor and two 10G mlx NICs bonded together, each with 8 rx queues,
on 2 NUMA nodes, this patch raises the packet processing rate from
~3.49Mpps to ~3.75Mpps, an increase of about 7%.

Note: dst_use() is being renamed to dst_hold_and_use() to better specify
the purpose of the function.

Signed-off-by: Wei Wang 
Acked-by: Eric Dumazet 
---
 include/net/dst.h | 15 ---
 net/decnet/dn_route.c |  8 
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 204c19e25456..5047e8053d6c 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -255,17 +255,18 @@ static inline void dst_hold(struct dst_entry *dst)
WARN_ON(atomic_inc_not_zero(>__refcnt) == 0);
 }
 
-static inline void dst_use(struct dst_entry *dst, unsigned long time)
+static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 {
-   dst_hold(dst);
-   dst->__use++;
-   dst->lastuse = time;
+   if (time != dst->lastuse) {
+   dst->__use++;
+   dst->lastuse = time;
+   }
 }
 
-static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
+static inline void dst_hold_and_use(struct dst_entry *dst, unsigned long time)
 {
-   dst->__use++;
-   dst->lastuse = time;
+   dst_hold(dst);
+   dst_use_noref(dst, time);
 }
 
 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 0bd3afd01dd2..bff5ab88cdbb 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -338,7 +338,7 @@ static int dn_insert_route(struct dn_route *rt, unsigned int hash, struct dn_rou
   dn_rt_hash_table[hash].chain);
rcu_assign_pointer(dn_rt_hash_table[hash].chain, rth);
 
-   dst_use(>dst, now);
+   dst_hold_and_use(>dst, now);
spin_unlock_bh(_rt_hash_table[hash].lock);
 
dst_release_immediate(>dst);
@@ -351,7 +351,7 @@ static int dn_insert_route(struct dn_route *rt, unsigned int hash, struct dn_rou
rcu_assign_pointer(rt->dst.dn_next, dn_rt_hash_table[hash].chain);
rcu_assign_pointer(dn_rt_hash_table[hash].chain, rt);
 
-   dst_use(>dst, now);
+   dst_hold_and_use(>dst, now);
spin_unlock_bh(_rt_hash_table[hash].lock);
*rp = rt;
return 0;
@@ -1258,7 +1258,7 @@ static int __dn_route_output_key(struct dst_entry **pprt, const struct flowidn *
(flp->flowidn_mark == rt->fld.flowidn_mark) &&
dn_is_output_route(rt) &&
(rt->fld.flowidn_oif == flp->flowidn_oif)) {
-   dst_use(>dst, jiffies);
+   dst_hold_and_use(>dst, jiffies);
rcu_read_unlock_bh();
*pprt = >dst;
return 0;
@@ -1535,7 +1535,7 @@ static int dn_route_input(struct sk_buff *skb)
(rt->fld.flowidn_oif == 0) &&
(rt->fld.flowidn_mark == skb->mark) &&
(rt->fld.flowidn_iif == cb->iif)) {
-   dst_use(>dst, jiffies);
+   dst_hold_and_use(>dst, jiffies);
rcu_read_unlock();
skb_dst_set(skb, (struct dst_entry *)rt);
return 0;
-- 
2.15.0.rc0.271.g36b669edcc-goog



[PATCH net-next] ipv6: check fn before doing FIB6_SUBTREE(fn)

2017-10-13 Thread Wei Wang
From: Wei Wang 

In fib6_locate(), we need to first make sure fn is not NULL before doing
FIB6_SUBTREE(fn) to avoid crash.

This fixes the following static checker warning:
net/ipv6/ip6_fib.c:1462 fib6_locate()
 warn: variable dereferenced before check 'fn' (see line 1459)

net/ipv6/ip6_fib.c
  1458  if (src_len) {
  1459  struct fib6_node *subtree = FIB6_SUBTREE(fn);

We shifted this dereference

  1460
  1461  WARN_ON(saddr == NULL);
  1462  if (fn && subtree)
^^
before the check for NULL.

  1463  fn = fib6_locate_1(subtree, saddr, src_len,
  1464 offsetof(struct rt6_info, rt6i_src)
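
The shape of the fix can be sketched in isolation as follows. struct
demo_node and demo_locate are hypothetical names, and the assignment
stands in for the fib6_locate_1() call; this is only an illustration of
checking the pointer before reading through it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type; the real struct fib6_node is not reproduced here. */
struct demo_node {
	struct demo_node *subtree;
	int id;
};

/* Descend into the subtree of fn when there is one.
 * fn is only dereferenced after it has been checked for NULL.
 */
struct demo_node *demo_locate(struct demo_node *fn)
{
	if (fn) {
		struct demo_node *subtree = fn->subtree;

		if (subtree) {
			fn = subtree;  /* stands in for the subtree lookup */
			assert(fn != NULL);
		}
	}
	return fn;
}
```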

Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Dan Carpenter 
Signed-off-by: Wei Wang 
Acked-by: Eric Dumazet 
---
 net/ipv6/ip6_fib.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index c2ecd5ec638a..548af48212fc 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1456,13 +1456,16 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
 
 #ifdef CONFIG_IPV6_SUBTREES
if (src_len) {
-   struct fib6_node *subtree = FIB6_SUBTREE(fn);
-
WARN_ON(saddr == NULL);
-   if (fn && subtree)
-   fn = fib6_locate_1(subtree, saddr, src_len,
+   if (fn) {
+   struct fib6_node *subtree = FIB6_SUBTREE(fn);
+
+   if (subtree) {
+   fn = fib6_locate_1(subtree, saddr, src_len,
   offsetof(struct rt6_info, rt6i_src),
   exact_match);
+   }
+   }
}
 #endif
 
-- 
2.15.0.rc0.271.g36b669edcc-goog



[net-next 7/9] i40e: make const array patterns static, reduces object code size

2017-10-13 Thread Jeff Kirsher
From: Colin Ian King 

Don't populate the const array 'patterns' on the stack; make it static
instead. This makes the object code smaller by over 60 bytes:

Before:
   text    data     bss     dec     hex filename
   1953     496       0    2449     991 i40e_diag.o

After:
   text    data     bss     dec     hex filename
   1798     584       0    2382     94e i40e_diag.o

(gcc 6.3.0, x86-64)
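
A small standalone sketch of the same technique, for illustration only.
demo_count_masked is a hypothetical function and the last two table
values are made up for the example, not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count how many test patterns have any bit set under the mask.
 * Because the table is static const it lives in read-only data, so the
 * function does not have to store each element onto the stack on entry.
 */
size_t demo_count_masked(uint32_t mask)
{
	static const uint32_t patterns[] = {
		0x5A5A5A5A, 0xA5A5A5A5, 0x0F0F0F0F, 0xF0F0F0F0
	};
	const size_t n = sizeof(patterns) / sizeof(patterns[0]);
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if ((patterns[i] & mask) != 0)
			hits++;

	assert(hits <= n);
	return hits;
}
```

The trade-off matches the size output above: text shrinks because the
initialising stores go away, while data grows by the size of the table.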

Signed-off-by: Colin Ian King 
Acked-by: Jesse Brandeburg 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_diag.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_diag.c b/drivers/net/ethernet/intel/i40e/i40e_diag.c
index f141e78d409e..76ed56641864 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_diag.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_diag.c
@@ -36,7 +36,9 @@
 static i40e_status i40e_diag_reg_pattern_test(struct i40e_hw *hw,
u32 reg, u32 mask)
 {
-   const u32 patterns[] = {0x5A5A5A5A, 0xA5A5A5A5, 0x, 0x};
+   static const u32 patterns[] = {
+   0x5A5A5A5A, 0xA5A5A5A5, 0x, 0x
+   };
u32 pat, val, orig_val;
int i;
 
-- 
2.14.2



[net-next 1/9] mqprio: Introduce new hardware offload mode and shaper in mqprio

2017-10-13 Thread Jeff Kirsher
From: Amritha Nambiar 

The offload types currently supported in mqprio are 0 (no offload) and
1 (offload only TCs) by setting these values for the 'hw' option. If
offloads are supported by setting the 'hw' option to 1, the default
offload mode is 'dcb' where only the TC values are offloaded to the
device. This patch introduces a new hardware offload mode called
'channel' with 'hw' set to 1 in mqprio which makes full use of the
mqprio options, the TCs, the queue configurations and the QoS parameters
for the TCs. This is achieved through a new netlink attribute for the
'mode' option which takes values such as 'dcb' (default) and 'channel'.
The 'channel' mode also supports QoS attributes for traffic class such as
minimum and maximum values for bandwidth rate limits.

This patch enables configuring additional HW shaper attributes associated
with a traffic class. Currently the shaper for bandwidth rate limiting is
supported; it takes options such as minimum and maximum bandwidth rates,
which are offloaded to the hardware in the 'channel' mode. The min and max
limits for bandwidth rates are provided by the user along with the TCs
and the queue configurations when creating the mqprio qdisc. The interface
can be extended to support new HW shapers in future through the 'shaper'
attribute.

This patch introduces a new data structure, 'tc_mqprio_qopt_offload', for
offloading mqprio queue options; it is shared between the kernel and the
device driver. It contains a copy of the existing data structure
for mqprio queue options and can be extended when adding new attributes
for a traffic class, such as mode, shaper and shaper parameters
(bandwidth rate limits). The existing data structure for mqprio
queue options will be shared between the kernel and userspace.

Example:
  queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
  min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit

To dump the bandwidth rates:

qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
 queues:(0:3) (4:7)
 mode:channel
 shaper:bw_rlimit   min_rate:1Gbit 2Gbit   max_rate:4Gbit 5Gbit

Signed-off-by: Amritha Nambiar 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 include/net/pkt_cls.h  |   9 ++
 include/uapi/linux/pkt_sched.h |  32 +++
 net/sched/sch_mqprio.c | 183 +++--
 3 files changed, 215 insertions(+), 9 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index f5263743076b..60d39789e4f0 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -546,6 +546,15 @@ struct tc_cls_bpf_offload {
u32 gen_flags;
 };
 
+struct tc_mqprio_qopt_offload {
+   /* struct tc_mqprio_qopt must always be the first element */
+   struct tc_mqprio_qopt qopt;
+   u16 mode;
+   u16 shaper;
+   u32 flags;
+   u64 min_rate[TC_QOPT_MAX_QUEUE];
+   u64 max_rate[TC_QOPT_MAX_QUEUE];
+};
 
 /* This structure holds cookie structure that is passed from user
  * to the kernel for actions and classifiers
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 099bf5528fed..e95b5c9b9fad 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -625,6 +625,22 @@ enum {
 
 #define TC_MQPRIO_HW_OFFLOAD_MAX (__TC_MQPRIO_HW_OFFLOAD_MAX - 1)
 
+enum {
+   TC_MQPRIO_MODE_DCB,
+   TC_MQPRIO_MODE_CHANNEL,
+   __TC_MQPRIO_MODE_MAX
+};
+
+#define TC_MQPRIO_MODE_MAX (__TC_MQPRIO_MODE_MAX - 1)
+
+enum {
+   TC_MQPRIO_SHAPER_DCB,
+   TC_MQPRIO_SHAPER_BW_RATE,   /* Add new shapers below */
+   __TC_MQPRIO_SHAPER_MAX
+};
+
+#define TC_MQPRIO_SHAPER_MAX (__TC_MQPRIO_SHAPER_MAX - 1)
+
 struct tc_mqprio_qopt {
__u8num_tc;
__u8prio_tc_map[TC_QOPT_BITMASK + 1];
@@ -633,6 +649,22 @@ struct tc_mqprio_qopt {
__u16   offset[TC_QOPT_MAX_QUEUE];
 };
 
+#define TC_MQPRIO_F_MODE   0x1
+#define TC_MQPRIO_F_SHAPER 0x2
+#define TC_MQPRIO_F_MIN_RATE   0x4
+#define TC_MQPRIO_F_MAX_RATE   0x8
+
+enum {
+   TCA_MQPRIO_UNSPEC,
+   TCA_MQPRIO_MODE,
+   TCA_MQPRIO_SHAPER,
+   TCA_MQPRIO_MIN_RATE64,
+   TCA_MQPRIO_MAX_RATE64,
+   __TCA_MQPRIO_MAX,
+};
+
+#define TCA_MQPRIO_MAX (__TCA_MQPRIO_MAX - 1)
+
 /* SFB */
 
 enum {
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 6bcdfe6e7b63..f1ae9be83934 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -18,10 +18,16 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mqprio_sched {
struct Qdisc**qdiscs;
+   u16 mode;
+   u16 shaper;
int hw_offload;
+   u32 flags;
+   u64 min_rate[TC_QOPT_MAX_QUEUE];
+   u64 max_rate[TC_QOPT_MAX_QUEUE];
 };
 
 static void mqprio_destroy(struct Qdisc *sch)
@@ -39,9 +45,17 @@ static void 

[net-next 9/9] i40e/i40evf: don't trust VF to reset itself

2017-10-13 Thread Jeff Kirsher
From: Alan Brady 

When using 'ethtool -L' on a VF to change number of requested queues
from PF, we shouldn't trust the VF to reset itself after making the
request.  Doing it that way opens the door for a potentially malicious
VF to do nasty things to the PF which should never be the case.

This makes it such that after VF makes a successful request, PF will
then reset the VF to institute required changes.  Only if the request
fails will PF send a message back to VF letting it know the request was
unsuccessful.

Testing-hints:
There should be no real functional changes.  This is simply hardening
against a potentially malicious VF.
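
The intended flow on the PF side can be sketched roughly as below. This
is a userspace model under stated assumptions: struct demo_vf and
demo_handle_queue_request are hypothetical names, the "not enough
queues left" test is an assumed form of the check, and the reset is
reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical PF-side view of one VF; not the driver's struct i40e_vf. */
struct demo_vf {
	int cur_pairs;       /* queue pairs the VF currently owns */
	int num_req_queues;  /* accepted request, applied by the reset */
	int resets;          /* how many times the PF has reset this VF */
};

/* Handle a queue-count request from a VF.
 * Returns 0 when the PF accepted the request and reset the VF itself.
 * Returns 1 when the request was refused; *reply_pairs then holds the
 * number of pairs the PF could offer, which is what gets sent back.
 */
int demo_handle_queue_request(struct demo_vf *vf, int req_pairs,
			      int queues_left, int *reply_pairs)
{
	assert(vf != NULL && reply_pairs != NULL);

	if (req_pairs - vf->cur_pairs > queues_left) {
		*reply_pairs = queues_left + vf->cur_pairs;
		return 1;
	}

	vf->num_req_queues = req_pairs;
	vf->resets++;  /* PF-initiated reset, no VF cooperation needed */
	return 0;
}
```

Only the failure path produces a reply; on success the reset itself is
what tells the VF that the request went through.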

Signed-off-by: Alan Brady 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c  | 9 +++--
 drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c | 7 +++
 include/linux/avf/virtchnl.h| 4 ++--
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index ce0981e2f605..f8a794b72462 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2045,8 +2045,9 @@ static int i40e_vc_disable_queues_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
  * @msglen: msg length
  *
  * VFs get a default number of queues but can use this message to request a
- * different number.  Will respond with either the number requested or the
- * maximum we can support.
+ * different number.  If the request is successful, PF will reset the VF and
+ * return 0.  If unsuccessful, PF will send message informing VF of number of
+ * available queues and return result of sending VF a message.
  **/
 static int i40e_vc_request_queues_msg(struct i40e_vf *vf, u8 *msg, int msglen)
 {
@@ -2077,7 +2078,11 @@ static int i40e_vc_request_queues_msg(struct i40e_vf *vf, u8 *msg, int msglen)
 pf->queues_left);
vfres->num_queue_pairs = pf->queues_left + cur_pairs;
} else {
+   /* successful request */
vf->num_req_queues = req_pairs;
+   i40e_vc_notify_vf_reset(vf);
+   i40e_reset_vf(vf, false);
+   return 0;
}
 
return i40e_vc_send_msg_to_vf(vf, VIRTCHNL_OP_REQUEST_QUEUES, 0,
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
index 2bb81c39d85f..46c8b8a3907c 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c
@@ -407,6 +407,7 @@ int i40evf_request_queues(struct i40evf_adapter *adapter, int num)
vfres.num_queue_pairs = num;
 
adapter->current_op = VIRTCHNL_OP_REQUEST_QUEUES;
+   adapter->flags |= I40EVF_FLAG_REINIT_ITR_NEEDED;
return i40evf_send_pf_msg(adapter, VIRTCHNL_OP_REQUEST_QUEUES,
  (u8 *), sizeof(vfres));
 }
@@ -1098,15 +1099,13 @@ void i40evf_virtchnl_completion(struct i40evf_adapter *adapter,
case VIRTCHNL_OP_REQUEST_QUEUES: {
struct virtchnl_vf_res_request *vfres =
(struct virtchnl_vf_res_request *)msg;
-   if (vfres->num_queue_pairs == adapter->num_req_queues) {
-   adapter->flags |= I40EVF_FLAG_REINIT_ITR_NEEDED;
-   i40evf_schedule_reset(adapter);
-   } else {
+   if (vfres->num_queue_pairs != adapter->num_req_queues) {
dev_info(>pdev->dev,
 "Requested %d queues, PF can support %d\n",
 adapter->num_req_queues,
 vfres->num_queue_pairs);
adapter->num_req_queues = 0;
+   adapter->flags &= ~I40EVF_FLAG_REINIT_ITR_NEEDED;
}
}
break;
diff --git a/include/linux/avf/virtchnl.h b/include/linux/avf/virtchnl.h
index 60e5d90cb18a..3ce61342fa31 100644
--- a/include/linux/avf/virtchnl.h
+++ b/include/linux/avf/virtchnl.h
@@ -333,8 +333,8 @@ struct virtchnl_vsi_queue_config_info {
  * additional queues must be negotiated.  This is a best effort request as it
  * is possible the PF does not have enough queues left to support the request.
  * If the PF cannot support the number requested it will respond with the
- * maximum number it is able to support; otherwise it will respond with the
- * number requested.
+ * maximum number it is able to support.  If the request is successful, PF will
+ * then reset the VF to institute required changes.
  */
 
 /* VF resource request */
-- 
2.14.2



[net-next 5/9] i40e: Refactor VF BW rate limiting

2017-10-13 Thread Jeff Kirsher
From: Amritha Nambiar 

This patch refactors the BW rate limiting for Tx traffic
on the VF to be reused in the next patch for rate limiting Tx
traffic for the VSIs on the PF as well.
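
The rate-to-credit arithmetic in the helper can be sketched on its own
as follows. demo_rate_to_credits and DEMO_BW_CREDIT_DIVISOR are
hypothetical names for this userspace sketch; rates are in Mbps, and
the explicit rejection of a negative link speed is an assumption of the
sketch rather than something the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_BW_CREDIT_DIVISOR 50  /* 50Mbps per credit, as in the patch */

/* Convert a max Tx rate in Mbps into BW credits.
 * Returns 0 and stores the credit count, or -EINVAL for a bad request.
 */
int demo_rate_to_credits(unsigned long long max_tx_rate, int link_speed,
			 unsigned long long *credits)
{
	assert(credits != NULL);

	/* Assumption of this sketch: an unknown link speed is an error. */
	if (link_speed < 0)
		return -EINVAL;
	if (max_tx_rate > (unsigned long long)link_speed)
		return -EINVAL;

	/* Nonzero rates below one credit are raised to the 50Mbps minimum. */
	if (max_tx_rate && max_tx_rate < DEMO_BW_CREDIT_DIVISOR)
		max_tx_rate = DEMO_BW_CREDIT_DIVISOR;

	*credits = max_tx_rate / DEMO_BW_CREDIT_DIVISOR;  /* 0 means disabled */
	return 0;
}
```

Because the division truncates, a request of 75Mbps yields one credit,
the same as 50Mbps.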

Signed-off-by: Amritha Nambiar 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |  5 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c| 64 ++
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 45 +--
 3 files changed, 71 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 024c88474951..524aa06a9e0e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -128,6 +128,10 @@
 /* default to trying for four seconds */
 #define I40E_TRY_LINK_TIMEOUT  (4 * HZ)
 
+/* BW rate limiting */
+#define I40E_BW_CREDIT_DIVISOR 50 /* 50Mbps per BW credit */
+#define I40E_MAX_BW_INACTIVE_ACCUM 4  /* accumulate 4 credits max */
+
 /* driver state flags */
 enum i40e_state_t {
__I40E_TESTING,
@@ -1039,4 +1043,5 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
 }
 
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
+int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index e803aa1552c6..fc6eaf44d87c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5399,6 +5399,70 @@ static int i40e_vsi_config_tc(struct i40e_vsi *vsi, u8 enabled_tc)
return ret;
 }
 
+/**
+ * i40e_get_link_speed - Returns link speed for the interface
+ * @vsi: VSI to be configured
+ *
+ **/
+int i40e_get_link_speed(struct i40e_vsi *vsi)
+{
+   struct i40e_pf *pf = vsi->back;
+
+   switch (pf->hw.phy.link_info.link_speed) {
+   case I40E_LINK_SPEED_40GB:
+   return 40000;
+   case I40E_LINK_SPEED_25GB:
+   return 25000;
+   case I40E_LINK_SPEED_20GB:
+   return 20000;
+   case I40E_LINK_SPEED_10GB:
+   return 10000;
+   case I40E_LINK_SPEED_1GB:
+   return 1000;
+   default:
+   return -EINVAL;
+   }
+}
+
+/**
+ * i40e_set_bw_limit - setup BW limit for Tx traffic based on max_tx_rate
+ * @vsi: VSI to be configured
+ * @seid: seid of the channel/VSI
+ * @max_tx_rate: max TX rate to be configured as BW limit
+ *
+ * Helper function to set BW limit for a given VSI
+ **/
+int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate)
+{
+   struct i40e_pf *pf = vsi->back;
+   int speed = 0;
+   int ret = 0;
+
+   speed = i40e_get_link_speed(vsi);
+   if (max_tx_rate > speed) {
+   dev_err(>pdev->dev,
+   "Invalid max tx rate %llu specified for VSI seid %d.",
+   max_tx_rate, seid);
+   return -EINVAL;
+   }
+   if (max_tx_rate && max_tx_rate < 50) {
+   dev_warn(>pdev->dev,
+ "Setting max tx rate to minimum usable value of 50Mbps.\n");
+   max_tx_rate = 50;
+   }
+
+   /* Tx rate credits are in values of 50Mbps, 0 is disabled */
+   ret = i40e_aq_config_vsi_bw_limit(>hw, seid,
+ max_tx_rate / I40E_BW_CREDIT_DIVISOR,
+ I40E_MAX_BW_INACTIVE_ACCUM, NULL);
+   if (ret)
+   dev_err(>pdev->dev,
+   "Failed set tx rate (%llu Mbps) for vsi->seid %u, err %s aq_err %s\n",
+   max_tx_rate, seid, i40e_stat_str(>hw, ret),
+   i40e_aq_str(>hw, pf->hw.aq.asq_last_status));
+   return ret;
+}
+
 /**
  * i40e_remove_queue_channels - Remove queue channels for the TCs
  * @vsi: VSI to be configured
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index e7f98e306554..ce0981e2f605 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -3117,8 +3117,6 @@ int i40e_ndo_set_vf_port_vlan(struct net_device *netdev, int vf_id,
return ret;
 }
 
-#define I40E_BW_CREDIT_DIVISOR 50 /* 50Mbps per BW credit */
-#define I40E_MAX_BW_INACTIVE_ACCUM 4  /* device can accumulate 4 credits max */
 /**
  * i40e_ndo_set_vf_bw
  * @netdev: network interface device structure
@@ -3134,7 +3132,6 @@ int i40e_ndo_set_vf_bw(struct net_device *netdev, int vf_id, int min_tx_rate,
struct i40e_pf *pf = np->vsi->back;
struct i40e_vsi *vsi;
struct i40e_vf *vf;
-   int speed = 0;
int ret = 0;
 
/* validate the request */

[net-next 2/9] i40e: Add macro for PF reset bit

2017-10-13 Thread Jeff Kirsher
From: Amritha Nambiar 

Introduce a macro for the bit setting the PF reset flag and
update its usages. This makes it easier to use this flag
in functions to be introduced in future without encountering
checkpatch issues related to alignment and line over 80
characters.

Signed-off-by: Amritha Nambiar 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h | 2 ++
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 3 +--
 drivers/net/ethernet/intel/i40e/i40e_main.c| 9 -
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 5 ++---
 4 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 8139b4ee1dc3..e7c7a853cf7f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -157,6 +157,8 @@ enum i40e_state_t {
__I40E_STATE_SIZE__,
 };
 
+#define I40E_PF_RESET_FLAG BIT_ULL(__I40E_PF_RESET_REQUESTED)
+
 /* VSI state flags */
 enum i40e_vsi_state_t {
__I40E_VSI_DOWN,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
index 6f2725fc50a1..2b8bbc84e34f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
@@ -798,8 +798,7 @@ static ssize_t i40e_dbg_command_write(struct file *filp,
 */
if (!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) {
pf->flags |= I40E_FLAG_VEB_MODE_ENABLED;
-   i40e_do_reset_safe(pf,
-  BIT_ULL(__I40E_PF_RESET_REQUESTED));
+   i40e_do_reset_safe(pf, I40E_PF_RESET_FLAG);
}
 
vsi = i40e_vsi_setup(pf, I40E_VSI_VMDQ2, vsi_seid, 0);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 4de52001a2b9..6190257eecfe 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5747,7 +5747,7 @@ int i40e_vsi_open(struct i40e_vsi *vsi)
 err_setup_tx:
i40e_vsi_free_tx_resources(vsi);
if (vsi == pf->vsi[pf->lan_vsi])
-   i40e_do_reset(pf, BIT_ULL(__I40E_PF_RESET_REQUESTED), true);
+   i40e_do_reset(pf, I40E_PF_RESET_FLAG, true);
 
return err;
 }
@@ -5875,7 +5875,7 @@ void i40e_do_reset(struct i40e_pf *pf, u32 reset_flags, 
bool lock_acquired)
wr32(&pf->hw, I40E_GLGEN_RTRIG, val);
i40e_flush(&pf->hw);
 
-   } else if (reset_flags & BIT_ULL(__I40E_PF_RESET_REQUESTED)) {
+   } else if (reset_flags & I40E_PF_RESET_FLAG) {
 
/* Request a PF Reset
 *
@@ -9223,7 +9223,7 @@ static int i40e_set_features(struct net_device *netdev,
need_reset = i40e_set_ntuple(pf, features);
 
if (need_reset)
-   i40e_do_reset(pf, BIT_ULL(__I40E_PF_RESET_REQUESTED), true);
+   i40e_do_reset(pf, I40E_PF_RESET_FLAG, true);
 
return 0;
 }
@@ -9475,8 +9475,7 @@ static int i40e_ndo_bridge_setlink(struct net_device *dev,
pf->flags |= I40E_FLAG_VEB_MODE_ENABLED;
else
pf->flags &= ~I40E_FLAG_VEB_MODE_ENABLED;
-   i40e_do_reset(pf, BIT_ULL(__I40E_PF_RESET_REQUESTED),
- true);
+   i40e_do_reset(pf, I40E_PF_RESET_FLAG, true);
break;
}
}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 0c4fa225c7be..e7f98e306554 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -1425,8 +1425,7 @@ int i40e_pci_sriov_configure(struct pci_dev *pdev, int 
num_vfs)
if (num_vfs) {
if (!(pf->flags & I40E_FLAG_VEB_MODE_ENABLED)) {
pf->flags |= I40E_FLAG_VEB_MODE_ENABLED;
-   i40e_do_reset_safe(pf,
-  BIT_ULL(__I40E_PF_RESET_REQUESTED));
+   i40e_do_reset_safe(pf, I40E_PF_RESET_FLAG);
}
return i40e_pci_sriov_enable(pdev, num_vfs);
}
@@ -1434,7 +1433,7 @@ int i40e_pci_sriov_configure(struct pci_dev *pdev, int 
num_vfs)
if (!pci_vfs_assigned(pf->pdev)) {
i40e_free_vfs(pf);
pf->flags &= ~I40E_FLAG_VEB_MODE_ENABLED;
-   i40e_do_reset_safe(pf, BIT_ULL(__I40E_PF_RESET_REQUESTED));
+   i40e_do_reset_safe(pf, I40E_PF_RESET_FLAG);
} else {
dev_warn(&pdev->dev, "Unable to free VFs because some are 
assigned to 

[net-next 8/9] i40e: fix link reporting

2017-10-13 Thread Jeff Kirsher
From: Alan Brady 

When querying the NVM for supported phy_types, on some firmware
versions, we were failing to actually fill out the phy_types which means
ethtool wouldn't report any link types.

Testing-hints:
Check 'ethtool ' if you have the right (wrong?) firmware.
Without this patch, no link modes will be reported.

Signed-off-by: Alan Brady 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_common.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 53aad378d49c..aeb497258f20 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -1611,8 +1611,13 @@ i40e_status i40e_aq_get_phy_capabilities(struct i40e_hw 
*hw,
if (report_init) {
if (hw->mac.type ==  I40E_MAC_XL710 &&
hw->aq.api_maj_ver == I40E_FW_API_VERSION_MAJOR &&
-   hw->aq.api_min_ver >= I40E_MINOR_VER_GET_LINK_INFO_XL710)
+   hw->aq.api_min_ver >= I40E_MINOR_VER_GET_LINK_INFO_XL710) {
status = i40e_aq_get_link_info(hw, true, NULL, NULL);
+   } else {
+   hw->phy.phy_types = le32_to_cpu(abilities->phy_type);
+   hw->phy.phy_types |=
+   ((u64)abilities->phy_type_ext << 32);
+   }
}
 
return status;
-- 
2.14.2
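
The fallback branch added above builds one 64-bit phy_types mask from two firmware fields. As a rough illustration only, the Python sketch below mirrors that arithmetic. The function name and the byte-string input are assumptions made for the example, not driver API: the low word arrives as a little-endian 32-bit value, and the extension field lands in the bits above it.

```python
def combine_phy_types(phy_type_le, phy_type_ext):
    """Merge a little-endian 32-bit phy_type with an extension field.

    Hypothetical helper for illustration; it mirrors
    le32_to_cpu(phy_type) | ((u64)phy_type_ext << 32).
    """
    if len(phy_type_le) != 4:
        raise ValueError("phy_type must be exactly 4 bytes")
    if not 0 <= phy_type_ext <= 0xFF:
        raise ValueError("phy_type_ext is assumed to fit in one byte")
    low = int.from_bytes(phy_type_le, "little")  # le32_to_cpu equivalent
    return low | (phy_type_ext << 32)            # extension bits sit above bit 31


# Low word 0x8000000D stored little-endian, extension bits 0x03.
mask = combine_phy_types(bytes([0x0D, 0x00, 0x00, 0x80]), 0x03)
print(hex(mask))
```

Widening before the shift matters in C, which is why the patch casts to u64 first. Python integers are unbounded, so the sketch needs no cast.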



[net-next 4/9] i40e: Enable 'channel' mode in mqprio for TC configs

2017-10-13 Thread Jeff Kirsher
From: Amritha Nambiar 

The i40e driver is modified to enable the new mqprio hardware
offload mode and factor the TCs and queue configuration by
creating channel VSIs. In this mode, the priority to traffic
class mapping and the user specified queue ranges are used
to configure the traffic classes by setting the mode option to
'channel'.

Example:
  map 0 0 0 0 1 2 2 3 queues 2@0 2@2 1@4 1@5\
  hw 1 mode channel

qdisc mqprio 8038: root  tc 4 map 0 0 0 0 1 2 2 3 0 0 0 0 0 0 0 0
 queues:(0:1) (2:3) (4:4) (5:5)
 mode:channel
 shaper:dcb
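
To make the relation between the 'queues' arguments and the ranges in the dump easier to follow, here is a small Python sketch. The helper names are hypothetical and this is not how tc or the driver implements it. It only expands each count@offset pair into the inclusive (first:last) range shown in the qdisc output.

```python
def queue_ranges(spec):
    """Expand mqprio-style 'count@offset' pairs into inclusive (first, last) ranges."""
    ranges = []
    for pair in spec.split():
        count, offset = (int(x) for x in pair.split("@"))
        if count < 1:
            raise ValueError("queue count must be positive: " + pair)
        ranges.append((offset, offset + count - 1))
    return ranges


def format_ranges(ranges):
    """Render ranges in the (first:last) style used by the qdisc dump."""
    return " ".join("(%d:%d)" % r for r in ranges)


print(format_ranges(queue_ranges("2@0 2@2 1@4 1@5")))
# prints "(0:1) (2:3) (4:4) (5:5)"
```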

The HW channels created are removed and all the queue configuration
is set to default when the qdisc is detached from the root of the
device.

This patch also disables setting up channels via ethtool (ethtool -L)
when the TCs are configured using mqprio scheduler.

The patch also limits setting ethtool Rx flow hash indirection
(ethtool -X eth0 equal N) to max queues configured via mqprio.
The Rx flow hash indirection input through ethtool should be
validated so that it is within the queue range configured via
tc/mqprio. The bound checking is achieved by reporting the current
rss size to the kernel when queues are configured via mqprio.

Example:
  map 0 0 0 1 0 2 3 0 queues 2@0 4@2 8@6 11@14\
  hw 1 mode channel

Cannot set RX flow hash configuration: Invalid argument

Signed-off-by: Amritha Nambiar 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h |   3 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   8 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c| 457 +++--
 3 files changed, 362 insertions(+), 106 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index bde982541772..024c88474951 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "i40e_type.h"
 #include "i40e_prototype.h"
 #include "i40e_client.h"
@@ -700,6 +701,7 @@ struct i40e_vsi {
enum i40e_vsi_type type;  /* VSI type, e.g., LAN, FCoE, etc */
s16 vf_id;  /* Virtual function ID for SRIOV VSIs */
 
+   struct tc_mqprio_qopt_offload mqprio_qopt; /* queue parameters */
struct i40e_tc_configuration tc_config;
struct i40e_aqc_vsi_properties_data info;
 
@@ -725,6 +727,7 @@ struct i40e_vsi {
u16 cnt_q_avail;/* num of queues available for channel usage */
u16 orig_rss_size;
u16 current_rss_size;
+   bool reconfig_rss;
 
u16 next_base_queue;/* next queue to be used for channel setup */
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index afd3ca8d9851..72d5f2cdf419 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2652,7 +2652,7 @@ static int i40e_get_rxnfc(struct net_device *netdev, 
struct ethtool_rxnfc *cmd,
 
switch (cmd->cmd) {
case ETHTOOL_GRXRINGS:
-   cmd->data = vsi->num_queue_pairs;
+   cmd->data = vsi->rss_size;
ret = 0;
break;
case ETHTOOL_GRXFH:
@@ -3897,6 +3897,12 @@ static int i40e_set_channels(struct net_device *dev,
if (vsi->type != I40E_VSI_MAIN)
return -EINVAL;
 
+   /* We do not support setting channels via ethtool when TCs are
+* configured through mqprio
+*/
+   if (pf->flags & I40E_FLAG_TC_MQPRIO)
+   return -EINVAL;
+
/* verify they are not requesting separate vectors */
if (!count || ch->rx_count || ch->tx_count)
return -EINVAL;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index e23105bee6d1..e803aa1552c6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1588,6 +1588,170 @@ static int i40e_set_mac(struct net_device *netdev, void 
*p)
return 0;
 }
 
+/**
+ * i40e_config_rss_aq - Prepare for RSS using AQ commands
+ * @vsi: vsi structure
+ * @seed: RSS hash seed
+ **/
+static int i40e_config_rss_aq(struct i40e_vsi *vsi, const u8 *seed,
+ u8 *lut, u16 lut_size)
+{
+   struct i40e_pf *pf = vsi->back;
+   struct i40e_hw *hw = &pf->hw;
+   int ret = 0;
+
+   if (seed) {
+   struct i40e_aqc_get_set_rss_key_data *seed_dw =
+   (struct i40e_aqc_get_set_rss_key_data *)seed;
+   ret = i40e_aq_set_rss_key(hw, vsi->id, seed_dw);
+   if (ret) {
+   dev_info(&pf->pdev->dev,
+"Cannot set RSS key, err %s aq_err %s\n",
+

[net-next 0/9][pull request] 40GbE Intel Wired LAN Driver Updates 2017-10-13

2017-10-13 Thread Jeff Kirsher
This series contains updates to mqprio and i40e.

Amritha introduces a new hardware offload mode in tc/mqprio where the TCs,
the queue configurations and bandwidth rate limits are offloaded to the
hardware. The existing mqprio framework is extended to configure the queue
counts and layout and also added support for rate limiting. This is
achieved through new netlink attributes for the 'mode' option which takes
values such as 'dcb' (default) and 'channel' and a 'shaper' option for
QoS attributes such as bandwidth rate limits in hw mode 1.  Legacy devices
can fall back to the existing setup supporting hw mode 1 without these
additional options where only the TCs are offloaded and then the 'mode'
and 'shaper' options default to DCB support.  The i40e driver enables the
new mqprio hardware offload mechanism factoring the TCs, queue
configuration and bandwidth rates by creating HW channel VSIs.
In this new mode, the priority to traffic class mapping and the user
specified queue ranges are used to configure the traffic class when the
'mode' option is set to 'channel'. This is achieved by creating HW
channels (VSIs). A new channel is created for each traffic class
configuration offloaded via the mqprio framework, except for the first TC
(TC0), which belongs to the main VSI. TC0 for the main VSI is also reconfigured
as per user-provided queue parameters. Finally, bandwidth rate limits are set
on these traffic classes through the shaper attribute by sending these
rates in addition to the number of TCs and the queue configurations.

Colin Ian King makes a const array static, which reduces object code size.

Alan fixes an issue where, on some firmware versions, we were failing to
actually fill out the phy_types, which caused ethtool to not report any
link types.  He also hardens the driver against a potentially malicious VF:
the VF no longer resets itself after requesting a change in the number of
queues (via ethtool); instead, the PF resets the VF to institute the
requested changes.

The following are changes since commit a00344bd1bbea2ba40719ae0eb3b6da7fae08cf2:
  Merge branch 'tipc-comm-groups'
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Alan Brady (2):
  i40e: fix link reporting
  i40e/i40evf: don't trust VF to reset itself

Amritha Nambiar (6):
  mqprio: Introduce new hardware offload mode and shaper in mqprio
  i40e: Add macro for PF reset bit
  i40e: Add infrastructure for queue channel support
  i40e: Enable 'channel' mode in mqprio for TC configs
  i40e: Refactor VF BW rate limiting
  i40e: Add support setting TC max bandwidth rates

Colin Ian King (1):
  i40e: make const array patterns static, reduces object code size

 drivers/net/ethernet/intel/i40e/i40e.h |   44 +
 drivers/net/ethernet/intel/i40e/i40e_common.c  |7 +-
 drivers/net/ethernet/intel/i40e/i40e_debugfs.c |3 +-
 drivers/net/ethernet/intel/i40e/i40e_diag.c|4 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |8 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c| 1462 +---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|2 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |   59 +-
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|7 +-
 include/linux/avf/virtchnl.h   |4 +-
 include/net/pkt_cls.h  |9 +
 include/uapi/linux/pkt_sched.h |   32 +
 net/sched/sch_mqprio.c |  183 ++-
 13 files changed, 1571 insertions(+), 253 deletions(-)

-- 
2.14.2



[net-next 3/9] i40e: Add infrastructure for queue channel support

2017-10-13 Thread Jeff Kirsher
From: Amritha Nambiar 

This patch sets up the infrastructure for offloading TCs and
queue configurations to the hardware by creating HW channels (VSIs).
A new channel is created for each traffic class configuration
offloaded via the mqprio framework, except for the first TC
(TC0). TC0 for the main VSI is also reconfigured as per user-provided
queue parameters. Queue counts that are not a power of 2 are handled by
reconfiguring RSS, reprogramming the LUTs using the queue count value.
This patch also handles configuring the TX rings for the channels and
setting up the RX queue map for each channel.

Also, the channels so created are removed and all the queue
configuration is set to default when the qdisc is detached from the
root of the device.

Signed-off-by: Amritha Nambiar 
Signed-off-by: Kiran Patil 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |  32 ++
 drivers/net/ethernet/intel/i40e/i40e_main.c | 718 +++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   2 +
 3 files changed, 743 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index e7c7a853cf7f..bde982541772 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -87,6 +87,7 @@
 #define I40E_AQ_LEN256
 #define I40E_AQ_WORK_LIMIT 66 /* max number of VFs + a little */
 #define I40E_MAX_USER_PRIORITY 8
+#define I40E_MAX_QUEUES_PER_CH 64
 #define I40E_DEFAULT_TRAFFIC_CLASS BIT(0)
 #define I40E_DEFAULT_MSG_ENABLE4
 #define I40E_QUEUE_WAIT_RETRY_LIMIT10
@@ -340,6 +341,23 @@ struct i40e_flex_pit {
u8 pit_index;
 };
 
+struct i40e_channel {
+   struct list_head list;
+   bool initialized;
+   u8 type;
+   u16 vsi_number; /* Assigned VSI number from AQ 'Add VSI' response */
+   u16 stat_counter_idx;
+   u16 base_queue;
+   u16 num_queue_pairs; /* Requested by user */
+   u16 seid;
+
+   u8 enabled_tc;
+   struct i40e_aqc_vsi_properties_data info;
+
+   /* track this channel belongs to which VSI */
+   struct i40e_vsi *parent_vsi;
+};
+
 /* struct that defines the Ethernet device */
 struct i40e_pf {
struct pci_dev *pdev;
@@ -456,6 +474,7 @@ struct i40e_pf {
 #define I40E_FLAG_CLIENT_RESET BIT(26)
 #define I40E_FLAG_LINK_DOWN_ON_CLOSE_ENABLED   BIT(27)
 #define I40E_FLAG_SOURCE_PRUNING_DISABLED  BIT(28)
+#define I40E_FLAG_TC_MQPRIOBIT(29)
 
struct i40e_client_instance *cinst;
bool stat_offsets_loaded;
@@ -536,6 +555,8 @@ struct i40e_pf {
u32 ioremap_len;
u32 fd_inv;
u16 phy_led_val;
+
+   u16 override_q_count;
 };
 
 /**
@@ -700,6 +721,15 @@ struct i40e_vsi {
bool current_isup;  /* Sync 'link up' logging */
enum i40e_aq_link_speed current_speed;  /* Sync link speed logging */
 
+   /* channel specific fields */
+   u16 cnt_q_avail;/* num of queues available for channel usage */
+   u16 orig_rss_size;
+   u16 current_rss_size;
+
+   u16 next_base_queue;/* next queue to be used for channel setup */
+
+   struct list_head ch_list;
+
void *priv; /* client driver data reference. */
 
/* VSI specific handlers */
@@ -1004,4 +1034,6 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi 
*vsi)
 {
return !!vsi->xdp_prog;
 }
+
+int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6190257eecfe..e23105bee6d1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -2881,7 +2881,7 @@ static void i40e_config_xps_tx_ring(struct i40e_ring 
*ring)
 {
int cpu;
 
-   if (!ring->q_vector || !ring->netdev)
+   if (!ring->q_vector || !ring->netdev || ring->ch)
return;
 
/* We only initialize XPS once, so as not to overwrite user settings */
@@ -2944,7 +2944,14 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 * initialization. This has to be done regardless of
 * DCB as by default everything is mapped to TC0.
 */
-   tx_ctx.rdylist = le16_to_cpu(vsi->info.qs_handle[ring->dcb_tc]);
+
+   if (ring->ch)
+   tx_ctx.rdylist =
+   le16_to_cpu(ring->ch->info.qs_handle[ring->dcb_tc]);
+
+   else
+   tx_ctx.rdylist = le16_to_cpu(vsi->info.qs_handle[ring->dcb_tc]);
+
tx_ctx.rdylist_act = 0;
 
/* clear the context in the HMC */
@@ -2966,12 +2973,23 @@ static int i40e_configure_tx_ring(struct i40e_ring 
*ring)
}
 

[net-next 6/9] i40e: Add support setting TC max bandwidth rates

2017-10-13 Thread Jeff Kirsher
From: Amritha Nambiar 

This patch enables setting up maximum Tx rates for the traffic
classes in i40e. The maximum rate is offloaded to the hardware through
the mqprio framework by specifying the mode option as 'channel' and
shaper option as 'bw_rlimit' and is configured for the VSI. Configuring
minimum Tx rate limit is not supported in the device. The minimum
usable value for Tx rate is 50Mbps.

Example:
# tc qdisc add dev eth0 root mqprio num_tc 2  map 0 0 0 0 1 1 1 1\
  queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
  max_rate 4Gbit 5Gbit

To dump the bandwidth rates:
# tc qdisc show dev eth0

qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
 queues:(0:3) (4:7)
 mode:channel
 shaper:bw_rlimit   max_rate:4Gbit 5Gbit
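
As a worked example of the units involved, the Python sketch below converts the rates from the command above. It assumes, as the comment in this patch states, that the tc interface hands the driver rates in bytes per second. The 50 Mbps credit size is taken from the "count of 50Mbps" debug text. The constant and function names are made up for illustration.

```python
BYTES_PER_SEC_PER_MBIT = 1000000 // 8  # one Mbit/s expressed in bytes/s
BW_CREDIT_MBPS = 50                    # assumed credit granularity (50 Mbps)


def tc_rate_to_mbps(bytes_per_sec):
    """Convert a tc-style rate in bytes/s to whole Mbit/s."""
    return bytes_per_sec // BYTES_PER_SEC_PER_MBIT


def mbps_to_credits(mbps):
    """Number of 50 Mbps credits needed to express a rate."""
    return mbps // BW_CREDIT_MBPS


# 'max_rate 4Gbit 5Gbit' expressed in bytes per second.
rates = [4 * 10**9 // 8, 5 * 10**9 // 8]
for rate in rates:
    mbps = tc_rate_to_mbps(rate)
    print(rate, "bytes/s ->", mbps, "Mbps ->", mbps_to_credits(mbps), "credits")
```

Under these assumptions 4Gbit works out to 4000 Mbps, or 80 credits, and 5Gbit to 5000 Mbps, or 100 credits.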

Signed-off-by: Amritha Nambiar 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e.h  |   2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c | 100 +---
 2 files changed, 93 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
b/drivers/net/ethernet/intel/i40e/i40e.h
index 524aa06a9e0e..266e1dc5e786 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -359,6 +359,8 @@ struct i40e_channel {
u8 enabled_tc;
struct i40e_aqc_vsi_properties_data info;
 
+   u64 max_tx_rate;
+
/* track this channel belongs to which VSI */
struct i40e_vsi *parent_vsi;
 };
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index fc6eaf44d87c..bb31d53c4923 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5196,9 +5196,16 @@ static int i40e_vsi_configure_bw_alloc(struct i40e_vsi 
*vsi, u8 enabled_tc,
i40e_status ret;
int i;
 
-   if ((vsi->back->flags & I40E_FLAG_TC_MQPRIO) ||
-   !vsi->mqprio_qopt.qopt.hw)
+   if (vsi->back->flags & I40E_FLAG_TC_MQPRIO)
return 0;
+   if (!vsi->mqprio_qopt.qopt.hw) {
+   ret = i40e_set_bw_limit(vsi, vsi->seid, 0);
+   if (ret)
+   dev_info(&vsi->back->pdev->dev,
+"Failed to reset tx rate for vsi->seid %u\n",
+vsi->seid);
+   return ret;
+   }
bw_data.tc_valid_bits = enabled_tc;
for (i = 0; i < I40E_MAX_TRAFFIC_CLASS; i++)
bw_data.tc_bw_credits[i] = bw_share[i];
@@ -5505,6 +5512,13 @@ static void i40e_remove_queue_channels(struct i40e_vsi 
*vsi)
rx_ring->ch = NULL;
}
 
+   /* Reset BW configured for this VSI via mqprio */
+   ret = i40e_set_bw_limit(vsi, ch->seid, 0);
+   if (ret)
+   dev_info(&vsi->back->pdev->dev,
+"Failed to reset tx rate for ch->seid %u\n",
+ch->seid);
+
/* delete VSI from FW */
ret = i40e_aq_delete_element(&vsi->back->hw, ch->seid,
 NULL);
@@ -6047,6 +6061,17 @@ int i40e_create_queue_channel(struct i40e_vsi *vsi,
 "Setup channel (id:%u) utilizing num_queues %d\n",
 ch->seid, ch->num_queue_pairs);
 
+   /* configure VSI for BW limit */
+   if (ch->max_tx_rate) {
+   if (i40e_set_bw_limit(vsi, ch->seid, ch->max_tx_rate))
+   return -EINVAL;
+
+   dev_dbg(&pf->pdev->dev,
+   "Set tx rate of %llu Mbps (count of 50Mbps %llu) for vsi->seid %u\n",
+   ch->max_tx_rate,
+   ch->max_tx_rate / I40E_BW_CREDIT_DIVISOR, ch->seid);
+   }
+
/* in case of VF, this will be main SRIOV VSI */
ch->parent_vsi = vsi;
 
@@ -6082,6 +6107,12 @@ static int i40e_configure_queue_channels(struct i40e_vsi 
*vsi)
ch->base_queue =
vsi->tc_config.tc_info[i].qoffset;
 
+   /* Bandwidth limit through tc interface is in bytes/s,
+* change to Mbit/s
+*/
+   ch->max_tx_rate =
+   vsi->mqprio_qopt.max_rate[i] / (1000000 / 8);
+
list_add_tail(&ch->list, &vsi->ch_list);
 
ret = i40e_create_queue_channel(vsi, ch);
@@ -6508,6 +6539,7 @@ void i40e_down(struct i40e_vsi *vsi)
 static int i40e_validate_mqprio_qopt(struct i40e_vsi *vsi,
 struct tc_mqprio_qopt_offload *mqprio_qopt)
 {
+   u64 sum_max_rate = 0;
int i;
 
if (mqprio_qopt->qopt.offset[0] != 0 ||
@@ -6517,8 +6549,13 @@ static int 

[PATCH net-next v2 0/4] tc-testing: Test suite updates

2017-10-13 Thread Lucas Bates
This patch series is a roundup of changes to the tc-testing
suite:

 - Add test cases for police and mirred modules and some coverage
   in already-submitted test categories
 - Break the test case files down into more user-friendly sizes
 - Bug fix to the tdc.py script's handling of the -l argument

v2: fix the lack of final newlines in two new files (thanks David)

Lucas Bates (4):
  tc-testing: Add test cases for flushing actions
  tc-testing: Split test case files into smaller chunks
  tc-testing: Add test cases for police and skbmod
  tc-testing: fix the -l argument bug in tdc.py

 .../tc-testing/tc-tests/actions/gact.json  |  469 
 .../selftests/tc-testing/tc-tests/actions/ife.json |   52 +
 .../tc-testing/tc-tests/actions/mirred.json|  223 
 .../tc-testing/tc-tests/actions/police.json|  527 +
 .../tc-testing/tc-tests/actions/simple.json|  130 +++
 .../tc-testing/tc-tests/actions/skbedit.json   |  320 ++
 .../tc-testing/tc-tests/actions/skbmod.json|  372 +++
 .../tc-testing/tc-tests/actions/tests.json | 1165 
 tools/testing/selftests/tc-testing/tdc.py  |8 +-
 9 files changed, 2097 insertions(+), 1169 deletions(-)
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/actions/ife.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/police.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/simple.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/skbedit.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json
 delete mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/tests.json

--
2.7.4



[PATCH net-next v2 4/4] tc-testing: fix the -l argument bug in tdc.py

2017-10-13 Thread Lucas Bates
This patch fixes a bug in the tdc script, where executing tdc
with the -l argument would cause the tests to start running
as opposed to listing all the known test cases.

Signed-off-by: Lucas Bates 
Acked-by: Jamal Hadi Salim 
---
 tools/testing/selftests/tc-testing/tdc.py | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tdc.py 
b/tools/testing/selftests/tc-testing/tdc.py
index cd61b78..d2391df 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -49,7 +49,7 @@ def exec_cmd(command, nsonly=True):
 stderr=subprocess.PIPE)
 (rawout, serr) = proc.communicate()

-if proc.returncode != 0:
+if proc.returncode != 0 and len(serr) > 0:
 foutput = serr.decode("utf-8")
 else:
 foutput = rawout.decode("utf-8")
@@ -203,7 +203,7 @@ def set_args(parser):
 help='Run tests only from the specified category, or 
if no category is specified, list known categories.')
 parser.add_argument('-f', '--file', type=str,
 help='Run tests from the specified file')
-parser.add_argument('-l', '--list', type=str, nargs='?', const="", 
metavar='CATEGORY',
+parser.add_argument('-l', '--list', type=str, nargs='?', const="++", 
metavar='CATEGORY',
 help='List all test cases, or those only within the 
specified category')
 parser.add_argument('-s', '--show', type=str, nargs=1, metavar='ID', 
dest='showID',
 help='Display the test case with specified id')
@@ -357,10 +357,10 @@ def set_operation_mode(args):
 testcases = get_categorized_testlist(alltests, ucat)

 if args.list:
-if (len(args.list) == 0):
+if (args.list == "++"):
 list_test_cases(alltests)
 exit(0)
-elif(len(args.list > 0)):
+elif(len(args.list) > 0):
 if (args.list not in ucat):
 print("Unknown category " + args.list)
 print("Available categories:")
--
2.7.4
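
The reason for the "++" sentinel may not be obvious from the diff alone. With nargs='?', argparse stores the const value when the option is given without an argument. An empty-string const is falsy, so the 'if args.list:' check never fires for a bare -l. The sketch below reproduces only the -l option. The parse() helper is an assumption for the example, not part of tdc.py.

```python
import argparse


def parse(argv, const):
    # Hypothetical helper: a parser carrying only the -l option from tdc.py,
    # parameterised on the 'const' value under discussion.
    parser = argparse.ArgumentParser()
    parser.add_argument('-l', '--list', type=str, nargs='?', const=const,
                        metavar='CATEGORY')
    return parser.parse_args(argv).list


# Old behaviour: a bare -l yields "", which is falsy, so 'if args.list:' is skipped.
old_bare = parse(['-l'], const="")
# New behaviour: a bare -l yields the "++" sentinel, which is truthy.
new_bare = parse(['-l'], const="++")
# A category still comes through unchanged, and omitting -l gives None.
with_category = parse(['-l', 'gact'], const="++")
absent = parse([], const="++")

print(repr(old_bare), bool(old_bare))
print(repr(new_bare), bool(new_bare))
print(repr(with_category), repr(absent))
```

Any truthy value that cannot collide with a real category name would serve as the sentinel. "++" is simply the one the patch picked.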



[PATCH net-next v2 2/4] tc-testing: Split test case files into smaller chunks

2017-10-13 Thread Lucas Bates
The original submission had the test cases stored in one
monolithic file. This can be unwieldy to edit, especially as more
test cases are added. This patch removes the original tests.json
file in favour of individual ones broken down by category.

Signed-off-by: Lucas Bates 
Acked-by: Jamal Hadi Salim 
---
 .../tc-testing/tc-tests/actions/gact.json  |  469 
 .../selftests/tc-testing/tc-tests/actions/ife.json |   52 +
 .../tc-testing/tc-tests/actions/mirred.json|  223 
 .../tc-testing/tc-tests/actions/simple.json|  130 +++
 .../tc-testing/tc-tests/actions/skbedit.json   |  320 ++
 .../tc-testing/tc-tests/actions/tests.json | 1212 
 6 files changed, 1194 insertions(+), 1212 deletions(-)
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/actions/ife.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/simple.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/skbedit.json
 delete mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/tests.json

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json 
b/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
new file mode 100644
index 0000000..e2187b6
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/gact.json
@@ -0,0 +1,469 @@
+[
+{
+"id": "e89a",
+"name": "Add valid pass action",
+"category": [
+"actions",
+"gact"
+],
+"setup": [
+[
+"$TC actions flush action gact",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action pass index 8",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action gact",
+"matchPattern": "action order [0-9]*: gact action pass.*index 8 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action gact"
+]
+},
+{
+"id": "a02c",
+"name": "Add valid pipe action",
+"category": [
+"actions",
+"gact"
+],
+"setup": [
+[
+"$TC actions flush action gact",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action pipe index 6",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action gact",
+"matchPattern": "action order [0-9]*: gact action pipe.*index 6 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action gact"
+]
+},
+{
+"id": "feef",
+"name": "Add valid reclassify action",
+"category": [
+"actions",
+"gact"
+],
+"setup": [
+[
+"$TC actions flush action gact",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action reclassify index 5",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action gact",
+"matchPattern": "action order [0-9]*: gact action reclassify.*index 5 
ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action gact"
+]
+},
+{
+"id": "8a7a",
+"name": "Add valid drop action",
+"category": [
+"actions",
+"gact"
+],
+"setup": [
+[
+"$TC actions flush action gact",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action drop index 30",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action gact",
+"matchPattern": "action order [0-9]*: gact action drop.*index 30 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action gact"
+]
+},
+{
+"id": "9a52",
+"name": "Add valid continue action",
+"category": [
+"actions",
+"gact"
+],
+"setup": [
+[
+"$TC actions flush action gact",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action continue index 432",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action gact",
+"matchPattern": "action order [0-9]*: gact action continue.*index 432 
ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action gact"
+]
+},
+{
+"id": "d700",
+   

[PATCH net-next v2 3/4] tc-testing: Add test cases for police and skbmod

2017-10-13 Thread Lucas Bates
Add basic unit tests for police and skbmod actions in tc.

Signed-off-by: Lucas Bates 
Acked-by: Jamal Hadi Salim 
---
 .../tc-testing/tc-tests/actions/police.json| 527 +
 .../tc-testing/tc-tests/actions/skbmod.json| 372 +++
 2 files changed, 899 insertions(+)
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/police.json
 create mode 100644 
tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/police.json 
b/tools/testing/selftests/tc-testing/tc-tests/actions/police.json
new file mode 100644
index 0000000..0e602a3
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/police.json
@@ -0,0 +1,527 @@
+[
+{
+"id": "49aa",
+"name": "Add valid basic police action",
+"category": [
+"actions",
+"police"
+],
+"setup": [
+[
+"$TC actions flush action police",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action police rate 1kbit burst 10k 
index 1",
+"expExitCode": "0",
+"verifyCmd": "$TC actions ls action police",
+"matchPattern": "action order [0-9]*:  police 0x1 rate 1Kbit burst 
10Kb",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action police"
+]
+},
+{
+"id": "3abe",
+"name": "Add police action with duplicate index",
+"category": [
+"actions",
+"police"
+],
+"setup": [
+[
+"$TC actions flush action police",
+0,
+1,
+255
+],
+"$TC actions add action police rate 4Mbit burst 120k index 9"
+],
+"cmdUnderTest": "$TC actions add action police rate 8kbit burst 24k 
index 9",
+"expExitCode": "255",
+"verifyCmd": "$TC actions ls action police",
+"matchPattern": "action order [0-9]*:  police 0x9",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action police"
+]
+},
+{
+"id": "49fa",
+"name": "Add valid police action with mtu",
+"category": [
+"actions",
+"police"
+],
+"setup": [
+[
+"$TC actions flush action police",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action police rate 90kbit burst 10k 
mtu 1k index 98",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action police index 98",
+"matchPattern": "action order [0-9]*:  police 0x62 rate 90Kbit burst 
10Kb mtu 1Kb",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action police"
+]
+},
+{
+"id": "7943",
+"name": "Add valid police action with peakrate",
+"category": [
+"actions",
+"police"
+],
+"setup": [
+[
+"$TC actions flush action police",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action police rate 90kbit burst 10k 
mtu 2kb peakrate 100kbit index 3",
+"expExitCode": "0",
+"verifyCmd": "$TC actions ls action police",
+"matchPattern": "action order [0-9]*:  police 0x3 rate 90Kbit burst 
10Kb mtu 2Kb peakrate 100Kbit",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action police"
+]
+},
+{
+"id": "055e",
+"name": "Add police action with peakrate and no mtu",
+"category": [
+"actions",
+"police"
+],
+"setup": [
+[
+"$TC actions flush action police",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action police rate 5kbit burst 6kb 
peakrate 10kbit index 9",
+"expExitCode": "255",
+"verifyCmd": "$TC actions ls action police",
+"matchPattern": "action order [0-9]*:  police 0x9 rate 5Kb burst 10Kb",
+"matchCount": "0",
+"teardown": [
+"$TC actions flush action police"
+]
+},
+{
+"id": "f057",
+"name": "Add police action with valid overhead",
+"category": [
+"actions",
+"police"
+],
+"setup": [
+[
+"$TC actions flush action police",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action police rate 1mbit burst 100k 
overhead 64 index 64",
+

[PATCH net-next v2 1/4] tc-testing: Add test cases for flushing actions

2017-10-13 Thread Lucas Bates
Tests for flushing gact and mirred were missing. This patch
adds test cases to explicitly test the flush of any installed
gact/mirred actions.

Signed-off-by: Lucas Bates 
Acked-by: Jamal Hadi Salim 
---
 .../tc-testing/tc-tests/actions/tests.json | 49 +-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/tests.json 
b/tools/testing/selftests/tc-testing/tc-tests/actions/tests.json
index 6973bdc..2ea0065 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/tests.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/tests.json
@@ -246,6 +246,27 @@
 ]
 },
 {
+"id": "3edf",
+"name": "Flush gact actions",
+"category": [
+"actions",
+"gact"
+],
+"setup": [
+"$TC actions add action reclassify index 101",
+"$TC actions add action reclassify index 102",
+"$TC actions add action reclassify index 103",
+"$TC actions add action reclassify index 104",
+"$TC actions add action reclassify index 105"
+],
+"cmdUnderTest": "$TC actions flush action gact",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action gact",
+"matchPattern": "action order [0-9]*: gact action reclassify",
+"matchCount": "0",
+"teardown": []
+},
+{
 "id": "63ec",
 "name": "Delete pass action",
 "category": [
@@ -469,6 +490,32 @@
 ]
 },
 {
+"id": "58c3",
+"name": "Flush mirred actions",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+],
+"$TC actions add action mirred egress mirror index 1 dev lo",
+"$TC actions add action mirred egress redirect index 2 dev lo"
+],
+"cmdUnderTest": "$TC actions show action mirred",
+"expExitCode": "0",
+"verifyCmd": "$TC actions list action mirred",
+"matchPattern": "[Mirror|Redirect] to device lo",
+"matchCount": "0",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
 "id": "d7c0",
 "name": "Add invalid mirred direction",
 "category": [
@@ -1162,4 +1209,4 @@
 "$TC actions flush action ife"
 ]
 }
-]
\ No newline at end of file
+]
--
2.7.4



[RFC] SIOCGSTAMP semantics

2017-10-13 Thread Al Viro
ioctl() in question
1) fails with EOPNOTSUPP on
AF_ALG, AF_CAIF, AF_IUCV, AF_KEY, AF_NFC, AF_RXRPC, AF_VSOCK
2) fails with ENOTTY on
AF_DECnet, AF_KCM, AF_LLC, AF_NETLINK, AF_PHONET, AF_PPPOX, AF_RDS,
AF_TIPC, AF_UNIX
3) fails with EINVAL on
AF_ISDN
4) sock_get_timestamp(sock->sk, arg)
AF_INET, AF_INET6, AF_CAN, AF_ROSE, AF_PACKET, AF_IEEE802154,
AF_ATMSVC, AF_ATMPVC, AF_APPLETALK
5) sock_get_timestamp(sock->sk, arg) under lock_sock(sock->sk)
AF_AX25, AF_NETROM, AF_QRTR, AF_IPX
6) sock_get_timestamp(sock->sk, arg) after checking that sock->sk != NULL
AF_X25, AF_IRDA

AF_BLUETOOTH is sometimes (1), sometimes (2), sometimes (4).  Not sure about
AF_SMC - sometimes it's (1), sometimes might be (4).

To make things even less consistent, AF_CAN, AF_IPX and AF_QRTR lack
SIOCGSTAMPNS; everything else has it parallel to SIOCGSTAMP with
s/timestamp/&ns/.

Am I right assuming that (5) and (6) should be like (4)?  IOW, that
lock_sock() is not needed for anyone and that sock->sk cannot be NULL on
anything that could be fed to ioctl()?  If the last assumption is not true,
we have plenty of triggerable oopsen - other ioctls (handled on the top
level) do _not_ check that and dereference sock->sk directly.  I've grepped
around, and AFAICS NULL sock->sk on an opened socket should be impossible,
but confirmation would be nice.

Another question, of course, is whether anyone gives a damn about distinctions
between (1), (2) and (3) *and* if anything bad would've happened from having
sock_get_timestamp() simply done to all sockets, right in net/socket.c.

Comments?


Re: [Patch net-next v3] tcp: add a tracepoint for tcp retransmission

2017-10-13 Thread Brendan Gregg
On Fri, Oct 13, 2017 at 1:03 PM, Cong Wang  wrote:
> We need a real-time notification for tcp retransmission
> for monitoring.
>
> Of course we could use ftrace to dynamically instrument this
> kernel function too, however we can't retrieve the connection
> information at the same time, for example perf-tools [1] reads
> /proc/net/tcp for socket details, which is slow when we have
> a lot of connections.
>
> Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
> and exposes src/dst IP addresses and ports of the connection.
> This also makes it easier to integrate into perf.
>
> Note, I expose both IPv4 and IPv6 addresses at the same time:
> for an IPv4 socket, the v4-mapped address is used as the IPv6 address;
> for an IPv6 socket, LOOPBACK4_IPV6 is already filled in by the kernel.
> Also, add sk and skb pointers as they are useful for BPF.

Thanks, a TCP retransmit tracepoint would be great. (tcp_set_state
would be highly useful too, which Alexei already has in his list).

Should skp->__sk_common.skc_state be included in the format string, so
we don't have to always dig it out of the skaddr? For retransmits I
always want to know the TCP state, to determine if it is ESTABLISHED
(packet drop) or SYN_SENT (backlog full) or something else.

We probably need a tracepoint for tcp_send_loss_probe() (TLP) as well,
for tracing at the same time as retransmits (like my tools do), but
that can be added later.

Brendan
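
To illustrate the state-based triage described above, here is a minimal
userspace sketch. It assumes the numeric state values from the kernel's
include/net/tcp_states.h (TCP_ESTABLISHED = 1, TCP_SYN_SENT = 2);
retrans_hint() and the SK_* names are hypothetical and not part of the patch.

```c
/* Values mirror include/net/tcp_states.h; they are redefined here under
 * hypothetical names so the sketch builds in userspace. */
enum {
	SK_TCP_ESTABLISHED = 1,
	SK_TCP_SYN_SENT    = 2,
};

/* Hypothetical helper: map the socket state seen at retransmit time to
 * the likely cause named in the mail above. */
const char *retrans_hint(int skc_state)
{
	switch (skc_state) {
	case SK_TCP_ESTABLISHED:
		return "packet drop";
	case SK_TCP_SYN_SENT:
		return "backlog full";
	default:
		return "other";
	}
}
```

With the state in the format string, a tracing tool could apply this kind of
mapping directly to each event instead of dereferencing skaddr.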


[net-next PATCH 0/2] Minor macvlan source mode cleanups

2017-10-13 Thread Alexander Duyck
So this patch series is just a few minor cleanups for macvlan source mode.
The first patch addresses double receives when a packet is being routed to
the macvlan destination address, and the other addresses the pkt_type being
updated in cases where it most likely should not be.

---

Alexander Duyck (2):
  macvlan: Only deliver one copy of the frame to the macvlan interface
  macvlan: Only update pkt_type if destination MAC address matches


 drivers/net/macvlan.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

--


[net-next PATCH 1/2] macvlan: Only deliver one copy of the frame to the macvlan interface

2017-10-13 Thread Alexander Duyck
From: Alexander Duyck 

This patch introduces a slight adjustment for macvlan to address the fact
that in source mode I was seeing two copies of any packet addressed to the
macvlan interface being delivered where there should have been only one.

The issue appears to be that one copy was delivered based on the source MAC
address and then the second copy was being delivered based on the
destination MAC address. To fix it I am just treating a unicast address
match as though it is not a match since source based macvlan isn't supposed
to be matching based on the destination MAC anyway.

Fixes: 79cf79abce71 ("macvlan: add source mode")
Signed-off-by: Alexander Duyck 
---
 drivers/net/macvlan.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index d2aea961e0f4..fb1c9e095d0c 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -480,7 +480,7 @@ static rx_handler_result_t macvlan_handle_frame(struct 
sk_buff **pskb)
  struct macvlan_dev, list);
else
vlan = macvlan_hash_lookup(port, eth->h_dest);
-   if (vlan == NULL)
+   if (!vlan || vlan->mode == MACVLAN_MODE_SOURCE)
return RX_HANDLER_PASS;
 
dev = vlan->dev;
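
The changed condition can be read as a small predicate. The sketch below is a
userspace illustration only: the struct, the enum and dest_lookup_passes() are
hypothetical stand-ins for the kernel's macvlan_dev, MACVLAN_MODE_* and the
inline check in macvlan_handle_frame().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types used in the patch. */
enum vlan_mode_sketch { MODE_PRIVATE, MODE_VEPA, MODE_BRIDGE, MODE_SOURCE };

struct vlan_sketch {
	enum vlan_mode_sketch mode;
};

/* Returns true when the frame should be passed on untouched
 * (RX_HANDLER_PASS in the kernel): either no macvlan owns the
 * destination MAC, or the owner is a source-mode macvlan, which has
 * already received its copy through the source MAC match. */
bool dest_lookup_passes(const struct vlan_sketch *vlan)
{
	return !vlan || vlan->mode == MODE_SOURCE;
}
```

Treating a source-mode match as "no match" is what avoids the second delivery.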



[net-next PATCH 2/2] macvlan: Only update pkt_type if destination MAC address matches

2017-10-13 Thread Alexander Duyck
From: Alexander Duyck 

This patch updates the pkt_type to PACKET_HOST only if the destination MAC
address matches on the source-based macvlan. It didn't make sense to
be updating broadcast, multicast, and non-local destined frames with
PACKET_HOST.

Signed-off-by: Alexander Duyck 
---
 drivers/net/macvlan.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fb1c9e095d0c..b0cd866915d7 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -413,7 +413,9 @@ static void macvlan_forward_source_one(struct sk_buff *skb,
 
len = nskb->len + ETH_HLEN;
nskb->dev = dev;
-   nskb->pkt_type = PACKET_HOST;
+
+   if (ether_addr_equal_64bits(eth_hdr(skb)->h_dest, dev->dev_addr))
+   nskb->pkt_type = PACKET_HOST;
 
ret = netif_rx(nskb);
macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, false);
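
As a rough userspace sketch of the rule this patch applies: the packet type is
only forced to "host" when the destination MAC equals the device address. The
PKT_* values mirror PACKET_HOST and friends from linux/if_packet.h;
source_mode_pkt_type() is a hypothetical helper, and plain memcmp() stands in
for ether_addr_equal_64bits().

```c
#include <stdint.h>
#include <string.h>

/* Mirrors PACKET_HOST .. PACKET_OTHERHOST in linux/if_packet.h. */
enum { PKT_HOST = 0, PKT_BROADCAST = 1, PKT_MULTICAST = 2, PKT_OTHERHOST = 3 };

/* Hypothetical helper: return the packet type a source-mode copy should
 * carry. Only a destination MAC equal to the device address becomes
 * PKT_HOST; every other frame keeps the type it already had. */
int source_mode_pkt_type(const uint8_t dest[6], const uint8_t dev_addr[6],
			 int cur_type)
{
	return memcmp(dest, dev_addr, 6) == 0 ? PKT_HOST : cur_type;
}
```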



[PATCH] Staging: irda: Remove trailing whitespace errors

2017-10-13 Thread Shreeya Patel
Remove trailing whitespace checkpatch errors.

Signed-off-by: Shreeya Patel 
---
 drivers/staging/irda/drivers/esi-sir.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/staging/irda/drivers/esi-sir.c 
b/drivers/staging/irda/drivers/esi-sir.c
index 019a3e8..eb7aa64 100644
--- a/drivers/staging/irda/drivers/esi-sir.c
+++ b/drivers/staging/irda/drivers/esi-sir.c
@@ -1,5 +1,5 @@
 /*
- *
+ *
  * Filename:  esi.c
  * Version:   1.6
  * Description:   Driver for the Extended Systems JetEye PC dongle
@@ -8,25 +8,25 @@
  * Created at:Sat Feb 21 18:54:38 1998
  * Modified at:   Sun Oct 27 22:01:04 2002
  * Modified by:   Martin Diehl 
- * 
+ *
  * Copyright (c) 1999 Dag Brattli, ,
  * Copyright (c) 1998 Thomas Davis, ,
  * Copyright (c) 2002 Martin Diehl, ,
  * All Rights Reserved.
- * 
- * This program is free software; you can redistribute it and/or 
- * modify it under the terms of the GNU General Public License as 
- * published by the Free Software Foundation; either version 2 of 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
  * the License, or (at your option) any later version.
- * 
+ *
  * This program is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  * GNU General Public License for more details.
- * 
- * You should have received a copy of the GNU General Public License 
+ *
+ * You should have received a copy of the GNU General Public License
  * along with this program; if not, see .
- * 
+ *
  /
 
 #include 
@@ -97,7 +97,7 @@ static int esi_change_speed(struct sir_dev *dev, unsigned 
speed)
 {
int ret = 0;
int dtr, rts;
-   
+
switch (speed) {
case 19200:
dtr = TRUE;
-- 
2.7.4



Re: [Patch net-next] net_sched: fix a compile warning in act_ife

2017-10-13 Thread Roman Mashak
Cong Wang  writes:

> Apparently ife_meta_id2name() is only called when
> CONFIG_MODULES is defined.
>
> This fixes:
>
> net/sched/act_ife.c:251:20: warning: ‘ife_meta_id2name’ defined but not used 
> [-Wunused-function]
>  static const char *ife_meta_id2name(u32 metaid)
> ^~~~

Fair enough, thanks Cong!

> Fixes: d3f24ba895f0 ("net sched actions: fix module auto-loading")
> Cc: Roman Mashak 
> Signed-off-by: Cong Wang 
> ---
>  net/sched/act_ife.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
> index 252ee7d8c731..3007cb1310ea 100644
> --- a/net/sched/act_ife.c
> +++ b/net/sched/act_ife.c
> @@ -248,6 +248,7 @@ static int ife_validate_metatype(struct tcf_meta_ops 
> *ops, void *val, int len)
>   return ret;
>  }
>  
> +#ifdef CONFIG_MODULES
>  static const char *ife_meta_id2name(u32 metaid)
>  {
>   switch (metaid) {
> @@ -261,6 +262,7 @@ static const char *ife_meta_id2name(u32 metaid)
>   return "unknown";
>   }
>  }
> +#endif
>  
>  /* called when adding new meta information
>   * under ife->tcf_lock for existing action


[Patch net-next v3] tcp: add a tracepoint for tcp retransmission

2017-10-13 Thread Cong Wang
We need a real-time notification for tcp retransmission
for monitoring.

Of course we could use ftrace to dynamically instrument this
kernel function too, however we can't retrieve the connection
information at the same time, for example perf-tools [1] reads
/proc/net/tcp for socket details, which is slow when we have
a lot of connections.

Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
and exposes src/dst IP addresses and ports of the connection.
This also makes it easier to integrate into perf.

Note, I expose both IPv4 and IPv6 addresses at the same time:
for an IPv4 socket, the v4-mapped address is used as the IPv6 address;
for an IPv6 socket, LOOPBACK4_IPV6 is already filled in by the kernel.
Also, add sk and skb pointers as they are useful for BPF.

1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans

Cc: Eric Dumazet 
Cc: Alexei Starovoitov 
Cc: Hannes Frederic Sowa 
Cc: Brendan Gregg 
Cc: Neal Cardwell 
Signed-off-by: Cong Wang 
---
 include/trace/events/tcp.h | 68 ++
 net/core/net-traces.c  |  1 +
 net/ipv4/tcp_output.c  |  3 ++
 3 files changed, 72 insertions(+)
 create mode 100644 include/trace/events/tcp.h

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
new file mode 100644
index ..3d1cbd072b7e
--- /dev/null
+++ b/include/trace/events/tcp.h
@@ -0,0 +1,68 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM tcp
+
+#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TCP_H
+
+#include 
+#include 
+#include 
+#include 
+
+TRACE_EVENT(tcp_retransmit_skb,
+
+   TP_PROTO(struct sock *sk, struct sk_buff *skb),
+
+   TP_ARGS(sk, skb),
+
+   TP_STRUCT__entry(
+   __field(void *, skbaddr)
+   __field(void *, skaddr)
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __array(__u8, saddr, 4)
+   __array(__u8, daddr, 4)
+   __array(__u8, saddr_v6, 16)
+   __array(__u8, daddr_v6, 16)
+   ),
+
+   TP_fast_assign(
+   struct ipv6_pinfo *np = inet6_sk(sk);
+   struct inet_sock *inet = inet_sk(sk);
+   struct in6_addr *pin6;
+   __be32 *p32;
+
+   __entry->skbaddr = skb;
+   __entry->skaddr = sk;
+
+   __entry->sport = ntohs(inet->inet_sport);
+   __entry->dport = ntohs(inet->inet_dport);
+
+   p32 = (__be32 *) __entry->saddr;
+   *p32 = inet->inet_saddr;
+
+   p32 = (__be32 *) __entry->daddr;
+   *p32 =  inet->inet_daddr;
+
+   if (np) {
+   pin6 = (struct in6_addr *)__entry->saddr_v6;
+   *pin6 = np->saddr;
+   pin6 = (struct in6_addr *)__entry->daddr_v6;
+   *pin6 = *(np->daddr_cache);
+   } else {
+   pin6 = (struct in6_addr *)__entry->saddr_v6;
+   ipv6_addr_set_v4mapped(inet->inet_saddr, pin6);
+   pin6 = (struct in6_addr *)__entry->daddr_v6;
+   ipv6_addr_set_v4mapped(inet->inet_daddr, pin6);
+   }
+   ),
+
+   TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6 
daddrv6=%pI6",
+ __entry->sport, __entry->dport, __entry->saddr, 
__entry->daddr,
+ __entry->saddr_v6, __entry->daddr_v6)
+);
+
+#endif /* _TRACE_TCP_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index 1132820c8e62..f4e4fa2db505 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #if IS_ENABLED(CONFIG_IPV6)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 696b0a168f16..6c74f2a39778 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -42,6 +42,8 @@
 #include 
 #include 
 
+#include 
+
 /* People can turn this off for buggy TCP's found in printers etc. */
 int sysctl_tcp_retrans_collapse __read_mostly = 1;
 
@@ -2875,6 +2877,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
 
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
+   trace_tcp_retransmit_skb(sk, skb);
} else if (err != -EBUSY) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
}
-- 
2.13.0
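
For readers unfamiliar with v4-mapped addresses, the following userspace
sketch shows the layout that ipv6_addr_set_v4mapped() produces in the kernel:
80 zero bits, 16 one bits, then the IPv4 address (::ffff:a.b.c.d). The function
name v4mapped() is hypothetical; only standard socket headers are assumed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Build the v4-mapped IPv6 form of an IPv4 address that is already in
 * network byte order. */
void v4mapped(const struct in_addr *v4, struct in6_addr *v6)
{
	memset(v6, 0, sizeof(*v6));               /* bytes 0..9 stay zero */
	v6->s6_addr[10] = 0xff;                   /* bytes 10..11 = ffff  */
	v6->s6_addr[11] = 0xff;
	memcpy(&v6->s6_addr[12], &v4->s_addr, 4); /* bytes 12..15 = IPv4  */
}
```

A consumer of the tracepoint can therefore always read the v6 fields and
recover the IPv4 address from the last four bytes when the prefix matches.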



Re: [next-queue PATCH v7 4/6] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-10-13 Thread Ivan Khoronzhuk
On Thu, Oct 12, 2017 at 05:40:03PM -0700, Vinicius Costa Gomes wrote:
> This queueing discipline implements the shaper algorithm defined by
> the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.
> 
> Its primary usage is to apply some bandwidth reservation to user
> defined traffic classes, which are mapped to different queues via the
> mqprio qdisc.
> 
> Only a simple software implementation is added for now.
> 
> Signed-off-by: Vinicius Costa Gomes 
> Signed-off-by: Jesus Sanchez-Palencia 
> ---
>  include/uapi/linux/pkt_sched.h |  18 +++
>  net/sched/Kconfig  |  11 ++
>  net/sched/Makefile |   1 +
>  net/sched/sch_cbs.c| 314 
> +
>  4 files changed, 344 insertions(+)
>  create mode 100644 net/sched/sch_cbs.c
> 
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 099bf5528fed..41e349df4bf4 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -871,4 +871,22 @@ struct tc_pie_xstats {
>   __u32 maxq; /* maximum queue size */
>   __u32 ecn_mark; /* packets marked with ecn*/
>  };
> +
> +/* CBS */
> +struct tc_cbs_qopt {
> + __u8 offload;
> + __s32 hicredit;
> + __s32 locredit;
> + __s32 idleslope;
> + __s32 sendslope;
> +};
> +
> +enum {
> + TCA_CBS_UNSPEC,
> + TCA_CBS_PARMS,
> + __TCA_CBS_MAX,
> +};
> +
> +#define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
> +
>  #endif
> diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> index e70ed26485a2..c03d86a7775e 100644
> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -172,6 +172,17 @@ config NET_SCH_TBF
> To compile this code as a module, choose M here: the
> module will be called sch_tbf.
>  
> +config NET_SCH_CBS
> + tristate "Credit Based Shaper (CBS)"
> + ---help---
> +   Say Y here if you want to use the Credit Based Shaper (CBS) packet
> +   scheduling algorithm.
> +
> +   See the top of  for more details.
> +
> +   To compile this code as a module, choose M here: the
> +   module will be called sch_cbs.
> +
>  config NET_SCH_GRED
>   tristate "Generic Random Early Detection (GRED)"
>   ---help---
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index 7b915d226de7..80c8f92d162d 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
> @@ -52,6 +52,7 @@ obj-$(CONFIG_NET_SCH_FQ_CODEL)  += sch_fq_codel.o
>  obj-$(CONFIG_NET_SCH_FQ) += sch_fq.o
>  obj-$(CONFIG_NET_SCH_HHF)+= sch_hhf.o
>  obj-$(CONFIG_NET_SCH_PIE)+= sch_pie.o
> +obj-$(CONFIG_NET_SCH_CBS)+= sch_cbs.o
>  
>  obj-$(CONFIG_NET_CLS_U32)+= cls_u32.o
>  obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
> diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
> new file mode 100644
> index ..0643587e6dc8
> --- /dev/null
> +++ b/net/sched/sch_cbs.c
> @@ -0,0 +1,314 @@
> +/*
> + * net/sched/sch_cbs.c   Credit Based Shaper
> + *
> + *   This program is free software; you can redistribute it and/or
> + *   modify it under the terms of the GNU General Public License
> + *   as published by the Free Software Foundation; either version
> + *   2 of the License, or (at your option) any later version.
> + *
> + * Authors:  Vinicius Costa Gomes 
> + *
> + */
> +
> +/* Credit Based Shaper (CBS)
> + * =
> + *
> + * This is a simple rate-limiting shaper aimed at TSN applications on
> + * systems with known traffic workloads.
> + *
> + * Its algorithm is defined by the IEEE 802.1Q-2014 Specification,
> + * Section 8.6.8.2, and explained in more detail in the Annex L of the
> + * same specification.
> + *
> + * There are four tunables to be considered:
> + *
> + *   'idleslope': Idleslope is the rate of credits that is
> + *   accumulated (in kilobits per second) when there is at least
> + *   one packet waiting for transmission. Packets are transmitted
> + *   when the current value of credits is equal or greater than
> + *   zero. When there is no packet to be transmitted the amount of
> + *   credits is set to zero. This is the main tunable of the CBS
> + *   algorithm.
> + *
> + *   'sendslope':
> + *   Sendslope is the rate of credits that is depleted (it should be a
> + *   negative number of kilobits per second) when a transmission is
> + *   occurring. It can be calculated as follows, (IEEE 802.1Q-2014 Section
> + *   8.6.8.2 item g):
> + *
> + *   sendslope = idleslope - port_transmit_rate
> + *
> + *   'hicredit': Hicredit defines the maximum amount of credits (in
> + *   bytes) that can be accumulated. Hicredit depends on the
> + *   characteristics of interfering traffic,
> + *   'max_interference_size' is the maximum size of any burst of
> + *   traffic that can delay the transmission of a frame that is
> + *   available for transmission for this traffic class, 
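
The sendslope relation quoted above is simple enough to check with a worked
example. This is a minimal sketch; cbs_sendslope() is a hypothetical helper,
and the units (kilobits per second) follow the quoted comment.

```c
#include <stdint.h>

/* sendslope = idleslope - port_transmit_rate, all in kbit/s.
 * The result is negative whenever the reservation is below line rate. */
int32_t cbs_sendslope(int32_t idleslope_kbps, int32_t port_rate_kbps)
{
	return idleslope_kbps - port_rate_kbps;
}
```

For instance, reserving 20000 kbit/s on a 1000000 kbit/s (1 Gbit/s) port gives
a sendslope of -980000 kbit/s.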

[Patch net-next] net_sched: fix a compile warning in act_ife

2017-10-13 Thread Cong Wang
Apparently ife_meta_id2name() is only called when
CONFIG_MODULES is defined.

This fixes:

net/sched/act_ife.c:251:20: warning: ‘ife_meta_id2name’ defined but not used 
[-Wunused-function]
 static const char *ife_meta_id2name(u32 metaid)
^~~~

Fixes: d3f24ba895f0 ("net sched actions: fix module auto-loading")
Cc: Roman Mashak 
Signed-off-by: Cong Wang 
---
 net/sched/act_ife.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 252ee7d8c731..3007cb1310ea 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -248,6 +248,7 @@ static int ife_validate_metatype(struct tcf_meta_ops *ops, 
void *val, int len)
return ret;
 }
 
+#ifdef CONFIG_MODULES
 static const char *ife_meta_id2name(u32 metaid)
 {
switch (metaid) {
@@ -261,6 +262,7 @@ static const char *ife_meta_id2name(u32 metaid)
return "unknown";
}
 }
+#endif
 
 /* called when adding new meta information
  * under ife->tcf_lock for existing action
-- 
2.13.0



Re: [PATCH net-next v5 5/5] selinux: bpf: Add addtional check for bpf object file receive

2017-10-13 Thread Stephen Smalley
On Thu, 2017-10-12 at 13:55 -0700, Chenbo Feng wrote:
> From: Chenbo Feng 
> 
> Introduce a bpf object related check when sending and receiving files
> through unix domain socket as well as binder. It checks if the receiving
> process has the privilege to read/write the bpf map or use the bpf program.
> This check is necessary because bpf maps and programs use an anonymous
> inode as their shared inode, so the normal way of checking files and
> sockets when passing them between processes cannot work properly on
> eBPF objects. This check only works when BPF_SYSCALL is configured.
> 
> Signed-off-by: Chenbo Feng 

Acked-by: Stephen Smalley 

> ---
>  include/linux/bpf.h  |  3 +++
>  kernel/bpf/syscall.c |  4 ++--
>  security/selinux/hooks.c | 49
> 
>  3 files changed, 54 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 225740688ab7..81d6c01b8825 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -285,6 +285,9 @@ int bpf_prog_array_copy_to_user(struct
> bpf_prog_array __rcu *progs,
>  #ifdef CONFIG_BPF_SYSCALL
>  DECLARE_PER_CPU(int, bpf_prog_active);
>  
> +extern const struct file_operations bpf_map_fops;
> +extern const struct file_operations bpf_prog_fops;
> +
>  #define BPF_PROG_TYPE(_id, _ops) \
>   extern const struct bpf_verifier_ops _ops;
>  #define BPF_MAP_TYPE(_id, _ops) \
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d3e152e282d8..8bdb98aa7f34 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -313,7 +313,7 @@ static ssize_t bpf_dummy_write(struct file *filp,
> const char __user *buf,
>   return -EINVAL;
>  }
>  
> -static const struct file_operations bpf_map_fops = {
> +const struct file_operations bpf_map_fops = {
>  #ifdef CONFIG_PROC_FS
>   .show_fdinfo= bpf_map_show_fdinfo,
>  #endif
> @@ -967,7 +967,7 @@ static void bpf_prog_show_fdinfo(struct seq_file
> *m, struct file *filp)
>  }
>  #endif
>  
> -static const struct file_operations bpf_prog_fops = {
> +const struct file_operations bpf_prog_fops = {
>  #ifdef CONFIG_PROC_FS
>   .show_fdinfo= bpf_prog_show_fdinfo,
>  #endif
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 12cf7de8cbed..ef7e5c1de640 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -1815,6 +1815,10 @@ static inline int file_path_has_perm(const
> struct cred *cred,
>   return inode_has_perm(cred, file_inode(file), av, &ad);
>  }
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +static int bpf_fd_pass(struct file *file, u32 sid);
> +#endif
> +
>  /* Check whether a task can use an open file descriptor to
> access an inode in a given way.  Check access to the
> descriptor itself, and then use dentry_has_perm to
> @@ -1845,6 +1849,12 @@ static int file_has_perm(const struct cred
> *cred,
>   goto out;
>   }
>  
> +#ifdef CONFIG_BPF_SYSCALL
> + rc = bpf_fd_pass(file, cred_sid(cred));
> + if (rc)
> + return rc;
> +#endif
> +
>   /* av is zero if only checking access to the descriptor. */
>   rc = 0;
>   if (av)
> @@ -2165,6 +2175,12 @@ static int selinux_binder_transfer_file(struct
> task_struct *from,
>   return rc;
>   }
>  
> +#ifdef CONFIG_BPF_SYSCALL
> + rc = bpf_fd_pass(file, sid);
> + if (rc)
> + return rc;
> +#endif
> +
>   if (unlikely(IS_PRIVATE(d_backing_inode(dentry))))
>   return 0;
>  
> @@ -6288,6 +6304,39 @@ static u32 bpf_map_fmode_to_av(fmode_t fmode)
>   return av;
>  }
>  
> +/* This function will check the file pass through unix socket or binder to see
> + * if it is a bpf related object. And apply corresponding checks on the bpf
> + * object based on the type. The bpf maps and programs, not like other files and
> + * socket, are using a shared anonymous inode inside the kernel as their inode.
> + * So checking that inode cannot identify if the process has privilege to
> + * access the bpf object and that's why we have to add this additional check in
> + * selinux_file_receive and selinux_binder_transfer_files.
> + */
> +static int bpf_fd_pass(struct file *file, u32 sid)
> +{
> + struct bpf_security_struct *bpfsec;
> + struct bpf_prog *prog;
> + struct bpf_map *map;
> + int ret;
> +
> + if (file->f_op == &bpf_map_fops) {
> + map = file->private_data;
> + bpfsec = map->security;
> + ret = avc_has_perm(sid, bpfsec->sid,
> SECCLASS_BPF_MAP,
> +    bpf_map_fmode_to_av(file-
> >f_mode), NULL);
> + if (ret)
> + return ret;
> + } else if (file->f_op == &bpf_prog_fops) {
> + prog = file->private_data;
> + bpfsec = prog->aux->security;
> + ret = avc_has_perm(sid, bpfsec->sid,
> 

Re: [PATCH net-next v5 4/5] selinux: bpf: Add selinux check for eBPF syscall operations

2017-10-13 Thread Stephen Smalley
On Thu, 2017-10-12 at 13:55 -0700, Chenbo Feng wrote:
> From: Chenbo Feng 
> 
> Implement the actual checks introduced to eBPF related syscalls. This
> implementation uses the security field inside the bpf object to store a
> sid that identifies the bpf object. When processes try to access the
> object, selinux checks whether they have the right privileges. The
> creation of eBPF objects is also checked at the general bpf check hook,
> and new cmds introduced to the eBPF domain can also be checked there.
> 
> Signed-off-by: Chenbo Feng 
> Acked-by: Alexei Starovoitov 

Acked-by: Stephen Smalley 

> ---
>  security/selinux/hooks.c| 111
> 
>  security/selinux/include/classmap.h |   2 +
>  security/selinux/include/objsec.h   |   4 ++
>  3 files changed, 117 insertions(+)
> 
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index f5d304736852..12cf7de8cbed 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -85,6 +85,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "avc.h"
>  #include "objsec.h"
> @@ -6252,6 +6253,106 @@ static void selinux_ib_free_security(void
> *ib_sec)
>  }
>  #endif
>  
> +#ifdef CONFIG_BPF_SYSCALL
> +static int selinux_bpf(int cmd, union bpf_attr *attr,
> +  unsigned int size)
> +{
> + u32 sid = current_sid();
> + int ret;
> +
> + switch (cmd) {
> + case BPF_MAP_CREATE:
> + ret = avc_has_perm(sid, sid, SECCLASS_BPF,
> BPF__MAP_CREATE,
> +    NULL);
> + break;
> + case BPF_PROG_LOAD:
> + ret = avc_has_perm(sid, sid, SECCLASS_BPF,
> BPF__PROG_LOAD,
> +    NULL);
> + break;
> + default:
> + ret = 0;
> + break;
> + }
> +
> + return ret;
> +}
> +
> +static u32 bpf_map_fmode_to_av(fmode_t fmode)
> +{
> + u32 av = 0;
> +
> + if (fmode & FMODE_READ)
> + av |= BPF__MAP_READ;
> + if (fmode & FMODE_WRITE)
> + av |= BPF__MAP_WRITE;
> + return av;
> +}
> +
> +static int selinux_bpf_map(struct bpf_map *map, fmode_t fmode)
> +{
> + u32 sid = current_sid();
> + struct bpf_security_struct *bpfsec;
> +
> + bpfsec = map->security;
> + return avc_has_perm(sid, bpfsec->sid, SECCLASS_BPF,
> + bpf_map_fmode_to_av(fmode), NULL);
> +}
> +
> +static int selinux_bpf_prog(struct bpf_prog *prog)
> +{
> + u32 sid = current_sid();
> + struct bpf_security_struct *bpfsec;
> +
> + bpfsec = prog->aux->security;
> + return avc_has_perm(sid, bpfsec->sid, SECCLASS_BPF,
> + BPF__PROG_RUN, NULL);
> +}
> +
> +static int selinux_bpf_map_alloc(struct bpf_map *map)
> +{
> + struct bpf_security_struct *bpfsec;
> +
> + bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL);
> + if (!bpfsec)
> + return -ENOMEM;
> +
> + bpfsec->sid = current_sid();
> + map->security = bpfsec;
> +
> + return 0;
> +}
> +
> +static void selinux_bpf_map_free(struct bpf_map *map)
> +{
> + struct bpf_security_struct *bpfsec = map->security;
> +
> + map->security = NULL;
> + kfree(bpfsec);
> +}
> +
> +static int selinux_bpf_prog_alloc(struct bpf_prog_aux *aux)
> +{
> + struct bpf_security_struct *bpfsec;
> +
> + bpfsec = kzalloc(sizeof(*bpfsec), GFP_KERNEL);
> + if (!bpfsec)
> + return -ENOMEM;
> +
> + bpfsec->sid = current_sid();
> + aux->security = bpfsec;
> +
> + return 0;
> +}
> +
> +static void selinux_bpf_prog_free(struct bpf_prog_aux *aux)
> +{
> + struct bpf_security_struct *bpfsec = aux->security;
> +
> + aux->security = NULL;
> + kfree(bpfsec);
> +}
> +#endif
> +
>  static struct security_hook_list selinux_hooks[] __lsm_ro_after_init
> = {
>   LSM_HOOK_INIT(binder_set_context_mgr,
> selinux_binder_set_context_mgr),
>   LSM_HOOK_INIT(binder_transaction,
> selinux_binder_transaction),
> @@ -6471,6 +6572,16 @@ static struct security_hook_list
> selinux_hooks[] __lsm_ro_after_init = {
>   LSM_HOOK_INIT(audit_rule_match, selinux_audit_rule_match),
>   LSM_HOOK_INIT(audit_rule_free, selinux_audit_rule_free),
>  #endif
> +
> +#ifdef CONFIG_BPF_SYSCALL
> + LSM_HOOK_INIT(bpf, selinux_bpf),
> + LSM_HOOK_INIT(bpf_map, selinux_bpf_map),
> + LSM_HOOK_INIT(bpf_prog, selinux_bpf_prog),
> + LSM_HOOK_INIT(bpf_map_alloc_security,
> selinux_bpf_map_alloc),
> + LSM_HOOK_INIT(bpf_prog_alloc_security,
> selinux_bpf_prog_alloc),
> + LSM_HOOK_INIT(bpf_map_free_security, selinux_bpf_map_free),
> + LSM_HOOK_INIT(bpf_prog_free_security,
> selinux_bpf_prog_free),
> +#endif
>  };
>  
>  static __init int selinux_init(void)
> diff --git a/security/selinux/include/classmap.h
> b/security/selinux/include/classmap.h
> index 35ffb29a69cb..0a7023b5f000 100644
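
The file-mode to access-vector translation in bpf_map_fmode_to_av() above can
be sketched in userspace as follows. The FMODE values match
include/linux/fs.h (FMODE_READ = 0x1, FMODE_WRITE = 0x2); the SK_AV_* bits are
purely hypothetical placeholders, since the real BPF__MAP_READ and
BPF__MAP_WRITE values are generated from the SELinux class map.

```c
#include <stdint.h>

/* Same numeric values as FMODE_READ / FMODE_WRITE in include/linux/fs.h. */
#define SK_FMODE_READ   0x1u
#define SK_FMODE_WRITE  0x2u

/* Hypothetical permission bits, chosen only for this illustration. */
#define SK_AV_MAP_READ  0x10u
#define SK_AV_MAP_WRITE 0x20u

/* Translate the open mode of a bpf map fd into the set of permissions
 * the receiving task must hold on the map. */
uint32_t fmode_to_av(uint32_t fmode)
{
	uint32_t av = 0;

	if (fmode & SK_FMODE_READ)
		av |= SK_AV_MAP_READ;
	if (fmode & SK_FMODE_WRITE)
		av |= SK_AV_MAP_WRITE;
	return av;
}
```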
> 

[PATCH net-next,0/3] Add init of send table and var renames

2017-10-13 Thread Haiyang Zhang
From: Haiyang Zhang 

Add initialization of the send indirection table. Otherwise it may contain
stale info from a previous device with a different number of channels.

Also, did some variable renaming for easier reading.

Haiyang Zhang (3):
  hv_netvsc: Rename ind_table to rx_table
  hv_netvsc: Rename tx_send_table to tx_table
  hv_netvsc: Add initialization of tx_table in netvsc_device_add()

 drivers/net/hyperv/hyperv_net.h   | 4 ++--
 drivers/net/hyperv/netvsc.c   | 5 -
 drivers/net/hyperv/netvsc_drv.c   | 8 
 drivers/net/hyperv/rndis_filter.c | 6 +++---
 4 files changed, 13 insertions(+), 10 deletions(-)

-- 
2.14.1



[PATCH net-next,2/3] hv_netvsc: Rename tx_send_table to tx_table

2017-10-13 Thread Haiyang Zhang
From: Haiyang Zhang 

Simplify the variable name: tx_send_table

Signed-off-by: Haiyang Zhang 
---
 drivers/net/hyperv/hyperv_net.h | 2 +-
 drivers/net/hyperv/netvsc.c | 2 +-
 drivers/net/hyperv/netvsc_drv.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 65ceb3aec40e..4958bb6b7376 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -731,7 +731,7 @@ struct net_device_context {
 
u32 tx_checksum_mask;
 
-   u32 tx_send_table[VRSS_SEND_TAB_SIZE];
+   u32 tx_table[VRSS_SEND_TAB_SIZE];
 
/* Ethtool settings */
u8 duplex;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 6e5194916bbe..d34cf37e949d 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1110,7 +1110,7 @@ static void netvsc_send_table(struct hv_device *hdev,
  nvmsg->msg.v5_msg.send_table.offset);
 
for (i = 0; i < count; i++)
-   net_device_ctx->tx_send_table[i] = tab[i];
+   net_device_ctx->tx_table[i] = tab[i];
 }
 
 static void netvsc_send_vf(struct net_device_context *net_device_ctx,
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8fa964e733ad..da216ca4f2b2 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -252,8 +252,8 @@ static inline int netvsc_get_tx_queue(struct net_device *ndev,
struct sock *sk = skb->sk;
int q_idx;
 
-   q_idx = ndc->tx_send_table[netvsc_get_hash(skb, ndc) &
-  (VRSS_SEND_TAB_SIZE - 1)];
+   q_idx = ndc->tx_table[netvsc_get_hash(skb, ndc) &
+ (VRSS_SEND_TAB_SIZE - 1)];
 
/* If queue index changed record the new value */
if (q_idx != old_idx &&
-- 
2.14.1



[PATCH net-next,3/3] hv_netvsc: Add initialization of tx_table in netvsc_device_add()

2017-10-13 Thread Haiyang Zhang
From: Haiyang Zhang 

tx_table is part of the private data of kernel net_device. It is only
zero-ed out when allocating net_device.

We may recreate netvsc_device w/o recreating net_device, so the private
netdev data, including tx_table, are not zeroed. It may contain channel
numbers for the older netvsc_device.

This patch adds initialization of tx_table each time we recreate
netvsc_device.

Signed-off-by: Haiyang Zhang 
---
 drivers/net/hyperv/netvsc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index d34cf37e949d..5bb6a20072dd 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1255,6 +1255,9 @@ struct netvsc_device *netvsc_device_add(struct hv_device *device,
if (!net_device)
return ERR_PTR(-ENOMEM);
 
+   for (i = 0; i < VRSS_SEND_TAB_SIZE; i++)
+   net_device_ctx->tx_table[i] = 0;
+
net_device->ring_size = ring_size;
 
/* Because the device uses NAPI, all the interrupt batching and
-- 
2.14.1
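
Note on the tx_table patch above: the failure mode it describes can be sketched in a few lines of userspace C. This is only an illustration under stated assumptions: `struct ctx`, `pick_queue()` and `reset_tx_table()` are hypothetical names, and `SEND_TAB_SIZE` stands in for `VRSS_SEND_TAB_SIZE` (assumed here to be a power of two, as the mask in netvsc_get_tx_queue() implies).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for VRSS_SEND_TAB_SIZE; assumed to be a power of two. */
#define SEND_TAB_SIZE 16

/* Hypothetical stand-in for the long-lived net_device private data. */
struct ctx {
    uint32_t tx_table[SEND_TAB_SIZE];
};

/* Index the table the way the patched code does: mask the hash down
 * to a slot and return the channel number stored there. */
static uint32_t pick_queue(const struct ctx *c, uint32_t hash)
{
    assert(c != NULL);
    return c->tx_table[hash & (SEND_TAB_SIZE - 1)];
}

/* Run each time the lower device is recreated, so that channel numbers
 * left over from the previous device are forgotten. */
static void reset_tx_table(struct ctx *c)
{
    size_t i;

    assert(c != NULL);
    for (i = 0; i < SEND_TAB_SIZE; i++)
        c->tx_table[i] = 0;
}
```

Without the reset, a table filled for a device with, say, 8 channels would still hand out queue 7 after the device is recreated with fewer channels.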



[PATCH net-next,1/3] hv_netvsc: Rename ind_table to rx_table

2017-10-13 Thread Haiyang Zhang
From: Haiyang Zhang 

Rename this variable because it is the Receive indirection
table.

Signed-off-by: Haiyang Zhang 
---
 drivers/net/hyperv/hyperv_net.h   | 2 +-
 drivers/net/hyperv/netvsc_drv.c   | 4 ++--
 drivers/net/hyperv/rndis_filter.c | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index a81335e8ebe8..65ceb3aec40e 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -179,7 +179,7 @@ struct rndis_device {
 
u8 hw_mac_adr[ETH_ALEN];
u8 rss_key[NETVSC_HASH_KEYLEN];
-   u16 ind_table[ITAB_NUM];
+   u16 rx_table[ITAB_NUM];
 };
 
 
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 44746de3dd4c..8fa964e733ad 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1434,7 +1434,7 @@ static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
rndis_dev = ndev->extension;
if (indir) {
for (i = 0; i < ITAB_NUM; i++)
-   indir[i] = rndis_dev->ind_table[i];
+   indir[i] = rndis_dev->rx_table[i];
}
 
if (key)
@@ -1464,7 +1464,7 @@ static int netvsc_set_rxfh(struct net_device *dev, const u32 *indir,
return -EINVAL;
 
for (i = 0; i < ITAB_NUM; i++)
-   rndis_dev->ind_table[i] = indir[i];
+   rndis_dev->rx_table[i] = indir[i];
}
 
if (!key) {
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 065b204d8e17..addf9f69c58c 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -759,7 +759,7 @@ int rndis_filter_set_rss_param(struct rndis_device *rdev,
/* Set indirection table entries */
itab = (u32 *)(rssp + 1);
for (i = 0; i < ITAB_NUM; i++)
-   itab[i] = rdev->ind_table[i];
+   itab[i] = rdev->rx_table[i];
 
/* Set hask key values */
keyp = (u8 *)((unsigned long)rssp + rssp->kashkey_offset);
@@ -1284,8 +1284,8 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
net_device->num_chn = min(net_device->max_chn, device_info->num_chn);
 
for (i = 0; i < ITAB_NUM; i++)
-   rndis_device->ind_table[i] = ethtool_rxfh_indir_default(i,
-   net_device->num_chn);
+   rndis_device->rx_table[i] = ethtool_rxfh_indir_default(
+   i, net_device->num_chn);
 
atomic_set(&net_device->open_chn, 1);
vmbus_set_sc_create_callback(dev->channel, netvsc_sc_open);
-- 
2.14.1



[PATCH] Add -target to clang switch while cross compiling.

2017-10-13 Thread Abhijit Ayarekar
Update to llvm excludes assembly instructions.
llvm git revision is below

commit 65fad7c26569 ("bpf: add inline-asm support")

This change will be part of the llvm 6.0 release.

__ASM_SYSREG_H define is not required for native compile.
-target switch includes appropriate target specific files
while cross compiling

Tested on x86 and arm64.

Signed-off-by: Abhijit Ayarekar 
---
 samples/bpf/Makefile | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ebc2ad6..81f9fcd 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -180,6 +180,7 @@ CLANG ?= clang
 # Detect that we're cross compiling and use the cross compiler
 ifdef CROSS_COMPILE
 HOSTCC = $(CROSS_COMPILE)gcc
+CLANG_ARCH_ARGS = -target $(ARCH)
 endif
 
 # Trick to allow make to be run from this directory
@@ -229,9 +230,9 @@ $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
 $(obj)/%.o: $(src)/%.c
$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
-I$(srctree)/tools/testing/selftests/bpf/ \
-   -D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
+   -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-   -Wno-unknown-warning-option \
+   -Wno-unknown-warning-option $(CLANG_ARCH_ARGS) \
-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
-- 
2.7.4



Re: [net-next:master 836/858] bcmsysport.c:undefined reference to `unregister_dsa_notifier'

2017-10-13 Thread Florian Fainelli
On 10/13/2017 12:41 AM, kbuild test robot wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
> head:   2d0d21c12dfa3851620f1fa9fe2d444538f1fad4
> commit: d156576362c07e954dc36e07b0d7b0733a010f7d [836/858] net: systemport: Establish lower/upper queue mapping
> config: x86_64-randconfig-it0-10131242 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
> git checkout d156576362c07e954dc36e07b0d7b0733a010f7d
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_remove':
>>> bcmsysport.c:(.text+0x2020): undefined reference to `unregister_dsa_notifier'
>drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_probe':
>>> bcmsysport.c:(.text+0x3689): undefined reference to `register_dsa_notifier'
>bcmsysport.c:(.text+0x38d1): undefined reference to `unregister_dsa_notifier'

Ah, this configuration has CONFIG_SYSTEMPORT=y and CONFIG_NET_DSA=m, which
causes the link failures above.

I will submit a fix shortly after making sure that this won't cause
circular dependencies.

Thanks
-- 
Florian


Re: [PATCH 0/2] net: support bgmac with B50212E B1 PHY

2017-10-13 Thread Florian Fainelli
On 10/12/2017 01:21 AM, Rafał Miłecki wrote:
> From: Rafał Miłecki 
> 
> I got a report that a board with BCM47189 SoC and B50212E B1 PHY doesn't
> work well with some devices as there is massive ping loss. After analyzing
> PHY state it appeared that it runs in slave mode and doesn't auto
> switch to master properly when needed.
> 
> This patchset fixes this by:
> 1) Adding new flag support to the PHY driver for setting master mode
> 2) Modifying bgmac to request master mode for reported hardware

Sorry for the late reply, and it's been applied already, but this looks
good to me, thanks for implementing it this way!
-- 
Florian


RE: [PATCH net] net: dsa: mv88e6060: fix switch MAC address

2017-10-13 Thread Woojung.Huh
> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vivien Didelot
> Sent: Friday, October 13, 2017 1:39 PM
> To: netdev@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org; ker...@savoirfairelinux.com; David S.
> Miller; Florian Fainelli; Andrew Lunn; David Laight; Vivien Didelot
> Subject: [PATCH net] net: dsa: mv88e6060: fix switch MAC address
> 
> The 88E6060 Ethernet switch always transmits the multicast bit of the
> switch MAC address as a zero. It re-uses the corresponding bit 8 of the
> register "Switch MAC Address Register Bytes 0 & 1" for "DiffAddr".
> 
> If the "DiffAddr" bit is 0, then all ports transmit the same source
> address. If it is set to 1, then bit 2:0 are used for the port number.
> 
> The mv88e6060 driver is currently wrongly shifting the MAC address byte
> 0 by 9. To fix this, shift it by 8 as usual and clear its bit 0.
> 
> Signed-off-by: Vivien Didelot 
> ---

Reviewed-by: Woojung Huh 

- Woojung


[Patch net] tun: call dev_get_valid_name() before register_netdevice()

2017-10-13 Thread Cong Wang
register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.

We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().

Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.

Fixes: 96442e42429e ("tuntap: choose the txq based on rxq")
Reported-by: Dmitry Alexeev 
Cc: Jason Wang 
Cc: "Michael S. Tsirkin" 
Signed-off-by: Cong Wang 
---
 drivers/net/tun.c | 3 +++
 include/linux/netdevice.h | 3 +++
 net/core/dev.c| 6 +++---
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 5ce580f413b9..e21bf90b819f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2027,6 +2027,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
if (!dev)
return -ENOMEM;
+   err = dev_get_valid_name(net, dev, name);
+   if (err)
+   goto err_free_dev;
 
dev_net_set(dev, net);
dev->rtnl_link_ops = &tun_link_ops;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1..2eaac7d75af4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3694,6 +3694,9 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
unsigned char name_assign_type,
void (*setup)(struct net_device *),
unsigned int txqs, unsigned int rxqs);
+int dev_get_valid_name(struct net *net, struct net_device *dev,
+  const char *name);
+
 #define alloc_netdev(sizeof_priv, name, name_assign_type, setup) \
alloc_netdev_mqs(sizeof_priv, name, name_assign_type, setup, 1, 1)
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 588b473194a8..11596a302a26 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1147,9 +1147,8 @@ static int dev_alloc_name_ns(struct net *net,
return ret;
 }
 
-static int dev_get_valid_name(struct net *net,
- struct net_device *dev,
- const char *name)
+int dev_get_valid_name(struct net *net, struct net_device *dev,
+  const char *name)
 {
BUG_ON(!net);
 
@@ -1165,6 +1164,7 @@ static int dev_get_valid_name(struct net *net,
 
return 0;
 }
+EXPORT_SYMBOL(dev_get_valid_name);
 
 /**
  * dev_change_name - change name of a device
-- 
2.13.0



Re: [RFC] Support for UNARP (RFC 1868)

2017-10-13 Thread Girish Moodalbail

On 10/12/17 5:53 PM, Mahesh Bandewar (महेश बंडेवार) wrote:

On Thu, Oct 12, 2017 at 4:06 PM, Girish Moodalbail
 wrote:

Hello Eric,

The basic idea is to mark the ARP entry either FAILED or STALE as soon as we
can so that the subsequent packets that depend on that ARP entry will take
the slow path (neigh_resolve_output()).

Say, if base_reachable_time is 30 seconds, then an ARP entry will be in
reachable state somewhere between 15 to 45 seconds. Assuming the worst case,
the ARP entry will be in REACHABLE state for 45 seconds and the packets
continue to traverse the network towards the source machine and get dropped
there since the VM has moved to the destination machine.

Instead, based on the received UNARP packet if we mark the ARP entry

(a) FAILED
- we move to INCOMPLETE state and start the address resolution by sending
  out ARP packets (up to allowed maximum number) until we get ARP response
  back at which point we move the ARP entry state to reachable.

(b) STALE
- we move to DELAY state and set the next timer to DELAY_PROBE_TIME
  (1 second) and continue to send queued packets in arp_queue.
- After 1 sec we move to PROBE state and start the address resolution like
  in the case (a) above.

I was leaning towards (a).
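
The two options, and the 15 to 45 second window mentioned earlier, can be modelled with a toy state machine. This is a sketch of the behaviour as described in this mail, not kernel code: the `ST_*` names, `on_output()`, `on_delay_timer()` and `reachable_window()` are all hypothetical, and the window assumes the reachable time is picked between half and one and a half times the base value.

```c
#include <assert.h>
#include <stddef.h>

/* Toy neighbour states (hypothetical names, modelled on the mail). */
enum toy_state {
    ST_REACHABLE,
    ST_STALE,
    ST_DELAY,
    ST_PROBE,
    ST_INCOMPLETE,
    ST_FAILED
};

/* What happens to the entry when the next packet needs it. */
static enum toy_state on_output(enum toy_state s)
{
    switch (s) {
    case ST_FAILED:
        return ST_INCOMPLETE; /* option (a): resolve from scratch */
    case ST_STALE:
        return ST_DELAY;      /* option (b): keep sending, probe later */
    default:
        return s;
    }
}

/* What happens when the delay timer (about 1 second) expires. */
static enum toy_state on_delay_timer(enum toy_state s)
{
    return s == ST_DELAY ? ST_PROBE : s;
}

/* Bounds of the reachable window for a given base value, in seconds. */
static void reachable_window(unsigned base_s, unsigned *min_s, unsigned *max_s)
{
    assert(min_s != NULL && max_s != NULL);
    *min_s = base_s / 2;
    *max_s = base_s + base_s / 2;
}
```

In this model option (a) stops using the old mapping on the very next packet, while option (b) keeps sending to it for the duration of the delay timer before probing starts.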

One could arbitrarily increase the stale timeout (by changing the number of
probes), so the sender would continue sending traffic to something that has
already gone away. STALE doesn't mean bad, but here the sender is clearly
indicating it's going away, so FAILED seems to be the only logical option.


I agree.




Will TCP flows be terminated, instead
of being smoothly migrated (TCP_REPAIR)



The TCP flows will not be terminated. Upon receiving UNARP packet, the ARP
entry will be marked FAILED. The subsequent TCP packets from the client
(towards that IP) will be queued (the first 3 packets in arp_queue and then
other packets get dropped) until the IP address is resolved again through
the slow path neigh_resolve_output().

The slow path marks the entry as INCOMPLETE and will start sending several
ARP requests (ucast_solicit + app_solicit + mcast_solicit) to resolve the
IP. If the resolution is successful, then the TCP packets will be sent out.
If not, we will invalidate the ARP entry and call arp_error_report() on the
queued packets (which will end up sending ICMP_HOST_UNREACH error). This
behavior is same as what will occur if TCP server disappears in the middle
of a connection.



What about IPv6 ? Or maybe more abruptly, do we still need to add
features to IPv4 in 2017,  22 years after this RFC came ? ;)



Legit question :). Well one thing I have seen in Networking is that an old
idea circles back around later and turns out to be useful in new contexts
and use cases. Like I enumerated in my initial email there are certain use
cases in Cloud that might benefit from UNARP.


It doesn't make sense to have this implemented only for IPv4. At this time if
equivalent IPv6 feature is missing, I don't see it being useful / acceptable.


Obviously UNARP is IPv4 only. I am not aware of any standard way of doing the
same for IPv6. If such a standard doesn't exist, then we will have to go through
IETF to get one done? If that is the case, can we please do this in a phased
manner? This will at least help the clouds that are IPv4 only.


thanks,
~Girish


[PATCH net-next v3 5/5] net: dsa: remove .set_addr

2017-10-13 Thread Vivien Didelot
Now that there is no user for the .set_addr function, remove it from
DSA. If a switch supports this feature (like mv88e6xxx), the
implementation can be done in the driver setup.

Signed-off-by: Vivien Didelot 
---
 include/net/dsa.h | 1 -
 net/dsa/dsa2.c| 6 --
 net/dsa/legacy.c  | 6 --
 3 files changed, 13 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index ce1d622734d7..2746741f74cf 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -291,7 +291,6 @@ struct dsa_switch_ops {
enum dsa_tag_protocol (*get_tag_protocol)(struct dsa_switch *ds);
 
int (*setup)(struct dsa_switch *ds);
-   int (*set_addr)(struct dsa_switch *ds, u8 *addr);
u32 (*get_phy_flags)(struct dsa_switch *ds, int port);
 
/*
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 54ed054777bd..6ac9e11d385c 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -336,12 +336,6 @@ static int dsa_ds_apply(struct dsa_switch_tree *dst, struct dsa_switch *ds)
if (err)
return err;
 
-   if (ds->ops->set_addr) {
-   err = ds->ops->set_addr(ds, dst->cpu_dp->netdev->dev_addr);
-   if (err < 0)
-   return err;
-   }
-
if (!ds->slave_mii_bus && ds->ops->phy_read) {
ds->slave_mii_bus = devm_mdiobus_alloc(ds->dev);
if (!ds->slave_mii_bus)
diff --git a/net/dsa/legacy.c b/net/dsa/legacy.c
index 19ff6e0a21dc..b0fefbffe082 100644
--- a/net/dsa/legacy.c
+++ b/net/dsa/legacy.c
@@ -172,12 +172,6 @@ static int dsa_switch_setup_one(struct dsa_switch *ds,
if (ret)
return ret;
 
-   if (ops->set_addr) {
-   ret = ops->set_addr(ds, master->dev_addr);
-   if (ret < 0)
-   return ret;
-   }
-
if (!ds->slave_mii_bus && ops->phy_read) {
ds->slave_mii_bus = devm_mdiobus_alloc(ds->dev);
if (!ds->slave_mii_bus)
-- 
2.14.2



[PATCH net-next v3 0/5] net: dsa: remove .set_addr

2017-10-13 Thread Vivien Didelot
An Ethernet switch may support having a MAC address, which can be used
as the switch's source address in transmitted full-duplex Pause frames.

If a DSA switch supports the related .set_addr operation, the DSA core
sets the master's MAC address on the switch.

This won't make sense anymore in a multi-CPU ports system, because there
won't be a unique master device assigned to a switch tree.

Moreover this operation is confusing because it makes the user think
that it could be used to program the switch with the MAC address of the
CPU/management port such that MAC address learning can be disabled on
said port, but in fact, that's not how it is currently used.

To fix this, assign a random MAC address at setup time in the mv88e6060
and mv88e6xxx drivers before removing .set_addr completely from DSA.

Changes in v3:
  - include fix for mv88e6060 switch MAC address setter.

Changes in v2:
  - remove .set_addr implementation from drivers and use a random MAC.

Vivien Didelot (5):
  net: dsa: mv88e6xxx: setup random mac address
  net: dsa: mv88e6060: fix switch MAC address
  net: dsa: mv88e6060: setup random mac address
  net: dsa: dsa_loop: remove .set_addr
  net: dsa: remove .set_addr

 drivers/net/dsa/dsa_loop.c   |  8 
 drivers/net/dsa/mv88e6060.c  | 37 ++---
 drivers/net/dsa/mv88e6xxx/chip.c | 33 +
 include/net/dsa.h|  1 -
 net/dsa/dsa2.c   |  6 --
 net/dsa/legacy.c |  6 --
 6 files changed, 43 insertions(+), 48 deletions(-)

-- 
2.14.2



[PATCH net-next v3 4/5] net: dsa: dsa_loop: remove .set_addr

2017-10-13 Thread Vivien Didelot
The .set_addr function does nothing, remove the dsa_loop implementation
before getting rid of it completely in DSA.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/dsa_loop.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c
index d55051abf4ed..3a3f4f7ba364 100644
--- a/drivers/net/dsa/dsa_loop.c
+++ b/drivers/net/dsa/dsa_loop.c
@@ -110,13 +110,6 @@ static void dsa_loop_get_ethtool_stats(struct dsa_switch *ds, int port,
data[i] = ps->ports[port].mib[i].val;
 }
 
-static int dsa_loop_set_addr(struct dsa_switch *ds, u8 *addr)
-{
-   dev_dbg(ds->dev, "%s\n", __func__);
-
-   return 0;
-}
-
 static int dsa_loop_phy_read(struct dsa_switch *ds, int port, int regnum)
 {
struct dsa_loop_priv *ps = ds->priv;
@@ -263,7 +256,6 @@ static const struct dsa_switch_ops dsa_loop_driver = {
.get_strings= dsa_loop_get_strings,
.get_ethtool_stats  = dsa_loop_get_ethtool_stats,
.get_sset_count = dsa_loop_get_sset_count,
-   .set_addr   = dsa_loop_set_addr,
.phy_read   = dsa_loop_phy_read,
.phy_write  = dsa_loop_phy_write,
.port_bridge_join   = dsa_loop_port_bridge_join,
-- 
2.14.2



[PATCH net-next v3 1/5] net: dsa: mv88e6xxx: setup random mac address

2017-10-13 Thread Vivien Didelot
An Ethernet switch may support having a MAC address, which can be used
as the switch's source address in transmitted full-duplex Pause frames.

If a DSA switch supports the related .set_addr operation, the DSA core
sets the master's MAC address on the switch. This won't make sense
anymore in a multi-CPU ports system, because there won't be a unique
master device assigned to a switch tree.

Instead, setup the switch from within the Marvell driver with a random
MAC address, and remove the .set_addr implementation.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index d74c7335c512..76cf383dcf90 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -932,6 +932,19 @@ static int mv88e6xxx_irl_setup(struct mv88e6xxx_chip *chip)
return 0;
 }
 
+static int mv88e6xxx_mac_setup(struct mv88e6xxx_chip *chip)
+{
+   if (chip->info->ops->set_switch_mac) {
+   u8 addr[ETH_ALEN];
+
+   eth_random_addr(addr);
+
+   return chip->info->ops->set_switch_mac(chip, addr);
+   }
+
+   return 0;
+}
+
 static int mv88e6xxx_pvt_map(struct mv88e6xxx_chip *chip, int dev, int port)
 {
u16 pvlan = 0;
@@ -2013,6 +2026,10 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
if (err)
goto unlock;
 
+   err = mv88e6xxx_mac_setup(chip);
+   if (err)
+   goto unlock;
+
err = mv88e6xxx_phy_setup(chip);
if (err)
goto unlock;
@@ -2043,21 +2060,6 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
return err;
 }
 
-static int mv88e6xxx_set_addr(struct dsa_switch *ds, u8 *addr)
-{
-   struct mv88e6xxx_chip *chip = ds->priv;
-   int err;
-
-   if (!chip->info->ops->set_switch_mac)
-   return -EOPNOTSUPP;
-
-   mutex_lock(&chip->reg_lock);
-   err = chip->info->ops->set_switch_mac(chip, addr);
-   mutex_unlock(&chip->reg_lock);
-
-   return err;
-}
-
 static int mv88e6xxx_mdio_read(struct mii_bus *bus, int phy, int reg)
 {
struct mv88e6xxx_mdio_bus *mdio_bus = bus->priv;
@@ -3785,7 +3787,6 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
.probe  = mv88e6xxx_drv_probe,
.get_tag_protocol   = mv88e6xxx_get_tag_protocol,
.setup  = mv88e6xxx_setup,
-   .set_addr   = mv88e6xxx_set_addr,
.adjust_link= mv88e6xxx_adjust_link,
.get_strings= mv88e6xxx_get_strings,
.get_ethtool_stats  = mv88e6xxx_get_ethtool_stats,
-- 
2.14.2



[PATCH net-next v3 2/5] net: dsa: mv88e6060: fix switch MAC address

2017-10-13 Thread Vivien Didelot
The 88E6060 Ethernet switch always transmits the multicast bit of the
switch MAC address as a zero. It re-uses the corresponding bit 8 of the
register "Switch MAC Address Register Bytes 0 & 1" for "DiffAddr".

If the "DiffAddr" bit is 0, then all ports transmit the same source
address. If it is set to 1, then bit 2:0 are used for the port number.

The mv88e6060 driver is currently wrongly shifting the MAC address byte
0 by 9. To fix this, shift it by 8 as usual and clear its bit 0.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6060.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index 621cdc46ad81..d64be2b83d3c 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -214,8 +214,14 @@ static int mv88e6060_setup(struct dsa_switch *ds)
 
 static int mv88e6060_set_addr(struct dsa_switch *ds, u8 *addr)
 {
-   /* Use the same MAC Address as FD Pause frames for all ports */
-   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, (addr[0] << 9) | addr[1]);
+   u16 val = addr[0] << 8 | addr[1];
+
+   /* The multicast bit is always transmitted as a zero, so the switch uses
+* bit 8 for "DiffAddr", where 0 means all ports transmit the same SA.
+*/
+   val &= 0xfeff;
+
+   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, val);
REG_WRITE(REG_GLOBAL, GLOBAL_MAC_23, (addr[2] << 8) | addr[3]);
REG_WRITE(REG_GLOBAL, GLOBAL_MAC_45, (addr[4] << 8) | addr[5]);
 
-- 
2.14.2
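
Note on the register value in the patch above: the difference between the old and the fixed computation is easy to see on a concrete address. The helpers below are a hypothetical userspace sketch (`mac01_fixed()` and `mac01_old()` are made-up names); they only reproduce the arithmetic from the diff, not the register write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed computation: byte 0 in bits 15:8, byte 1 in bits 7:0, then
 * bit 8 (reused by the switch as "DiffAddr") forced to 0. */
static uint16_t mac01_fixed(const uint8_t *addr)
{
    uint16_t val;

    assert(addr != NULL);
    val = (uint16_t)(addr[0] << 8 | addr[1]);
    val &= 0xfeff;
    return val;
}

/* Old computation, kept for comparison: shifting by 9 moves every bit
 * of byte 0 one position too far and drops its top bit once the result
 * is truncated to 16 bits. */
static uint16_t mac01_old(const uint8_t *addr)
{
    assert(addr != NULL);
    return (uint16_t)((addr[0] << 9) | addr[1]);
}
```

For an address starting 03:45, the fixed helper yields 0x0245 (byte 0 in place, bit 8 cleared), while the old one yields 0x0645.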



[PATCH net-next v3 3/5] net: dsa: mv88e6060: setup random mac address

2017-10-13 Thread Vivien Didelot
As for mv88e6xxx, setup the switch from within the mv88e6060 driver with
a random MAC address, and remove the .set_addr implementation.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6060.c | 43 ++-
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index d64be2b83d3c..6173be889d95 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -9,6 +9,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -188,6 +189,27 @@ static int mv88e6060_setup_port(struct dsa_switch *ds, int p)
return 0;
 }
 
+static int mv88e6060_setup_addr(struct dsa_switch *ds)
+{
+   u8 addr[ETH_ALEN];
+   u16 val;
+
+   eth_random_addr(addr);
+
+   val = addr[0] << 8 | addr[1];
+
+   /* The multicast bit is always transmitted as a zero, so the switch uses
+* bit 8 for "DiffAddr", where 0 means all ports transmit the same SA.
+*/
+   val &= 0xfeff;
+
+   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, val);
+   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_23, (addr[2] << 8) | addr[3]);
+   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_45, (addr[4] << 8) | addr[5]);
+
+   return 0;
+}
+
 static int mv88e6060_setup(struct dsa_switch *ds)
 {
int ret;
@@ -203,6 +225,10 @@ static int mv88e6060_setup(struct dsa_switch *ds)
if (ret < 0)
return ret;
 
+   ret = mv88e6060_setup_addr(ds);
+   if (ret < 0)
+   return ret;
+
for (i = 0; i < MV88E6060_PORTS; i++) {
ret = mv88e6060_setup_port(ds, i);
if (ret < 0)
@@ -212,22 +238,6 @@ static int mv88e6060_setup(struct dsa_switch *ds)
return 0;
 }
 
-static int mv88e6060_set_addr(struct dsa_switch *ds, u8 *addr)
-{
-   u16 val = addr[0] << 8 | addr[1];
-
-   /* The multicast bit is always transmitted as a zero, so the switch uses
-* bit 8 for "DiffAddr", where 0 means all ports transmit the same SA.
-*/
-   val &= 0xfeff;
-
-   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, val);
-   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_23, (addr[2] << 8) | addr[3]);
-   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_45, (addr[4] << 8) | addr[5]);
-
-   return 0;
-}
-
 static int mv88e6060_port_to_phy_addr(int port)
 {
if (port >= 0 && port < MV88E6060_PORTS)
@@ -262,7 +272,6 @@ static const struct dsa_switch_ops mv88e6060_switch_ops = {
.get_tag_protocol = mv88e6060_get_tag_protocol,
.probe  = mv88e6060_drv_probe,
.setup  = mv88e6060_setup,
-   .set_addr   = mv88e6060_set_addr,
.phy_read   = mv88e6060_phy_read,
.phy_write  = mv88e6060_phy_write,
 };
-- 
2.14.2



RE: [PATCH net-next v2 2/4] net: dsa: mv88e6060: setup random mac address

2017-10-13 Thread Woojung.Huh
Hi Vivien,

> >> > +REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, (addr[0] << 9) | addr[1]);
> >>
> >> Is that supposed to be 9 ?
> >
> > Looks like it.
> > Check http://www.marvell.com/switching/assets/marvell_linkstreet_88E6060_datasheet.pdf
> 
> Hum, David is correct, there is a bug in the driver which needs to be
> addressed first. MAC address bit 40 is addr[0] & 0x1, thus we must
> shift byte 0 by 8 and mask it against 0xfe.
> 
> I'll respin this series including a fix for both net and net-next.

Yes, you are right. Missed description about bit 40.

Thanks.
Woojung


Re: [PATCH] tests: Remove bashisms (s/source/.)

2017-10-13 Thread Rustad, Mark D
> On Oct 8, 2017, at 7:39 AM, Petr Vorel  wrote:
> 
> diff --git a/testsuite/tests/ip/route/add_default_route.t b/testsuite/tests/ip/route/add_default_route.t
> index e5ea6473..0b566f1f 100755
> --- a/testsuite/tests/ip/route/add_default_route.t
> +++ b/testsuite/tests/ip/route/add_default_route.t
> @@ -1,6 +1,6 @@
> -#!/bin/sh
> +#!/bin/bash

Funny - ^^^ choosing bash explicitly while
    removing the bashism?

> -source lib/generic.sh
> +. lib/generic.sh
> 
> ts_log "[Testing add default route]"

I noticed a couple other files already specified /bin/bash, yet you removed the 
bashisms. But the above struck me as something that would not seem to have been 
intended.

--
Mark Rustad, Networking Division, Intel Corporation





Re: [PATCH v2 5/5] fsl/fman: add dpaa in module names

2017-10-13 Thread Florian Fainelli
On 10/13/2017 07:50 AM, Madalin Bucur wrote:
> Signed-off-by: Madalin Bucur 

You should provide a line or two to explain why you are making this
change. Is it to resolve modular build configurations?

> ---
>  drivers/net/ethernet/freescale/fman/Makefile | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fman/Makefile b/drivers/net/ethernet/freescale/fman/Makefile
> index 2c38119..4ae524a 100644
> --- a/drivers/net/ethernet/freescale/fman/Makefile
> +++ b/drivers/net/ethernet/freescale/fman/Makefile
> @@ -1,9 +1,9 @@
>  subdir-ccflags-y +=  -I$(srctree)/drivers/net/ethernet/freescale/fman
>  
> -obj-$(CONFIG_FSL_FMAN) += fsl_fman.o
> -obj-$(CONFIG_FSL_FMAN) += fsl_fman_port.o
> -obj-$(CONFIG_FSL_FMAN) += fsl_mac.o
> +obj-$(CONFIG_FSL_FMAN) += fsl_dpaa_fman.o
> +obj-$(CONFIG_FSL_FMAN) += fsl_dpaa_fman_port.o
> +obj-$(CONFIG_FSL_FMAN) += fsl_dpaa_mac.o
>  
> -fsl_fman-objs:= fman_muram.o fman.o fman_sp.o fman_keygen.o
> -fsl_fman_port-objs := fman_port.o
> -fsl_mac-objs:= mac.o fman_dtsec.o fman_memac.o fman_tgec.o
> +fsl_dpaa_fman-objs   := fman_muram.o fman.o fman_sp.o fman_keygen.o
> +fsl_dpaa_fman_port-objs := fman_port.o
> +fsl_dpaa_mac-objs:= mac.o fman_dtsec.o fman_memac.o fman_tgec.o
> 


-- 
Florian


[PATCH net] net: dsa: mv88e6060: fix switch MAC address

2017-10-13 Thread Vivien Didelot
The 88E6060 Ethernet switch always transmits the multicast bit of the
switch MAC address as a zero. It re-uses the corresponding bit 8 of the
register "Switch MAC Address Register Bytes 0 & 1" for "DiffAddr".

If the "DiffAddr" bit is 0, then all ports transmit the same source
address. If it is set to 1, then bit 2:0 are used for the port number.

The mv88e6060 driver is currently wrongly shifting the MAC address byte
0 by 9. To fix this, shift it by 8 as usual and clear its bit 0.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6060.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index 621cdc46ad81..d64be2b83d3c 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -214,8 +214,14 @@ static int mv88e6060_setup(struct dsa_switch *ds)
 
 static int mv88e6060_set_addr(struct dsa_switch *ds, u8 *addr)
 {
-   /* Use the same MAC Address as FD Pause frames for all ports */
-   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, (addr[0] << 9) | addr[1]);
+   u16 val = addr[0] << 8 | addr[1];
+
+   /* The multicast bit is always transmitted as a zero, so the switch uses
+* bit 8 for "DiffAddr", where 0 means all ports transmit the same SA.
+*/
+   val &= 0xfeff;
+
+   REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, val);
REG_WRITE(REG_GLOBAL, GLOBAL_MAC_23, (addr[2] << 8) | addr[3]);
REG_WRITE(REG_GLOBAL, GLOBAL_MAC_45, (addr[4] << 8) | addr[5]);
 
-- 
2.14.2



Re: [Patch net-next v2] tcp: add a tracepoint for tcp_retransmit_skb()

2017-10-13 Thread Cong Wang
On Thu, Oct 12, 2017 at 4:18 PM, Alexei Starovoitov
 wrote:
> On Thu, Oct 12, 2017 at 03:48:07PM -0700, Cong Wang wrote:
>> We need a real-time notification for tcp retransmission
>> for monitoring.
>>
>> Of course we could use ftrace to dynamically instrument this
>> kernel function too, however we can't retrieve the connection
>> information at the same time, for example perf-tools [1] reads
>> /proc/net/tcp for socket details, which is slow when we have
>> a lot of connections.
>>
>> Therefore, this patch adds a tracepoint for tcp_retransmit_skb()
>> and exposes src/dst IP addresses and ports of the connection.
>> This also makes it easier to integrate into perf.
>>
>> Note, I expose both IPv4 and IPv6 addresses at the same time:
>> for an IPv4 socket, v4 mapped address is used as IPv6 addresses,
>> for an IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
>> Also, add sk and skb pointers as they are useful for BPF.
>>
>> 1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans
>>
>> Cc: Eric Dumazet 
>> Cc: Alexei Starovoitov 
>> Cc: Hannes Frederic Sowa 
>> Cc: Brendan Gregg 
>> Cc: Neal Cardwell 
>> Signed-off-by: Cong Wang 
>> ---
>>  include/trace/events/tcp.h | 68 
>> ++
>>  net/core/net-traces.c  |  1 +
>>  net/ipv4/tcp_output.c  |  3 ++
>>  3 files changed, 72 insertions(+)
>>  create mode 100644 include/trace/events/tcp.h
>>
>> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
>> new file mode 100644
>> index ..749f93c542ab
>> --- /dev/null
>> +++ b/include/trace/events/tcp.h
>> @@ -0,0 +1,68 @@
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM tcp
>> +
>> +#if !defined(_TRACE_TCP_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_TCP_H
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +TRACE_EVENT(tcp_retransmit_skb,
>> +
>> + TP_PROTO(struct sock *sk, struct sk_buff *skb, int segs),
>> +
>> + TP_ARGS(sk, skb, segs),
>> +
>> + TP_STRUCT__entry(
>> + __field(void *, skbaddr)
>> + __field(void *, skaddr)
>> + __field(__u16, sport)
>> + __field(__u16, dport)
>> + __array(__u8, saddr, 4)
>> + __array(__u8, daddr, 4)
>> + __array(__u8, saddr_v6, 16)
>> + __array(__u8, daddr_v6, 16)
>> + ),
> ...
>>   if (likely(!err)) {
>>   TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
>> + trace_tcp_retransmit_skb(sk, skb, segs);
>
> looks great to me, but why 'segs' is there?
> It's unused.

Ah, I copy-and-pasted the tcp_retransmit_skb() prototype...
I will remove it.

Thanks.


Re: Fw: [Bug 197213] New: panic in interrupt after ioctl to tun

2017-10-13 Thread Cong Wang
On Fri, Oct 13, 2017 at 8:11 AM, Stephen Hemminger
 wrote:
> Hi,
>
> this is one more corner case found by syzkaller.
> I'm not sure that 'Networking' is the right category for this, but the panic
> was triggered by ioctl to /dev/net/tun...
>
>
> [   13.728009] BUG: unable to handle kernel NULL pointer dereference at
>   (null)
> [   13.728903] IP: run_timer_softirq+0x315/0x3f0
> [   13.729401] PGD 7bd8b067 P4D 7bd8b067 PUD 7bd7f067 PMD 0
> [   13.730040] Oops: 0002 [#1] SMP
> [   13.730400] Modules linked in:
> [   13.730747] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc4-with-tun 
> #1
> [   13.731533] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [   13.732672] task: a280f480 task.stack: a280
> [   13.72] RIP: 0010:run_timer_softirq+0x315/0x3f0
> [   13.733883] RSP: 0018:961b7fc03ed0 EFLAGS: 00010086
> [   13.734467] RAX: 961b7bf070c0 RBX: 961b7fc10cc0 RCX:
> 
> [   13.735265] RDX: dead0200 RSI: fe01 RDI:
> 961b7fc10cc0
> [   13.736059] RBP: 961b7fc03f50 R08: fffba1c0 R09:
> 961b7fc11168
> [   13.736857] R10: 961b7fc03ee8 R11: 961b7fc10d30 R12:
> 961b7fc03ee0
> [   13.737652] R13: dead0200 R14: 0001 R15:
> 961b7bf070c0
> [   13.738463] FS:  () GS:961b7fc0()
> knlGS:
> [   13.739017] CS:  0010 DS:  ES:  CR0: 80050033
> [   13.739339] CR2:  CR3: 7bcf8000 CR4:
> 06f0
> [   13.739741] Call Trace:
> [   13.739882]  
> [   13.74]  ? ktime_get+0x3b/0x90
> [   13.740196]  ? lapic_next_event+0x18/0x20
> [   13.740413]  __do_softirq+0xcf/0x2a8
> [   13.740606]  irq_exit+0xab/0xb0
> [   13.740778]  smp_apic_timer_interrupt+0x64/0x110
> [   13.741025]  apic_timer_interrupt+0x90/0xa0
> [   13.741250]  
> [   13.741367] RIP: 0010:default_idle+0x18/0xf0
> [   13.741596] RSP: 0018:a2803e60 EFLAGS: 0246 ORIG_RAX:
> ff10
> [   13.741998] RAX: 8000 RBX: a293f5e0 RCX:
> 
> [   13.742370] RDX:  RSI:  RDI:
> 
> [   13.742750] RBP: a2803e78 R08: 00040a453dcd R09:
> 9c324031f930
> [   13.743128] R10:  R11: 0069d14f9aee R12:
> 
> [   13.743504] R13:  R14: a2a37780 R15:
> 
> [   13.743883]  arch_cpu_idle+0xa/0x10
> [   13.744072]  default_idle_call+0x1e/0x30
> [   13.744284]  do_idle+0x14f/0x1a0
> [   13.744458]  cpu_startup_entry+0x18/0x20
> [   13.744670]  rest_init+0xa9/0xb0
> [   13.744845]  start_kernel+0x3c6/0x3d3
> [   13.745043]  x86_64_start_reservations+0x24/0x26
> [   13.745291]  x86_64_start_kernel+0x6f/0x72
> [   13.745512]  secondary_startup_64+0xa5/0xa5
> [   13.745741] Code: 88 4c 39 65 88 0f 84 3b ff ff ff 49 8b 04 24 48 85 c0 74
> 56 4d 8b 3c 24 4c 89 7b 08 0f 1f 44 00 00 49 8b 17 49 8b 4f 08 48 85 d2 <48> 
> 89
> 11 74 04 48 89 4a 08 41 f6 47 2a 20 49 c7 47 08 00 00 00
> [   13.746745] RIP: run_timer_softirq+0x315/0x3f0 RSP: 961b7fc03ed0
> [   13.747087] CR2: 
> [   13.747270] ---[ end trace 04d492145975c7cc ]---
> [   13.747516] Kernel panic - not syncing: Fatal exception in interrupt
> [   13.747946] Kernel Offset: 0x20a0 from 0x8100 (relocation
> range: 0x8000-0xbfff)
> [   13.748515] ---[ end Kernel panic - not syncing: Fatal exception in
> interrupt
>
> Reproducer:
>
> #include 
> #include 
> #include 
> #include 
>
> char addr[40] = {0xcf, 0x0b, 0x0b, 0x99, 0x22, 0x33, 0x96, 0xdf, 0xbd, 0x2e,
> 0x29, 0x1b, 0x4d, 0xc0, 0x2a, 0xee, 0x03};
>
> void test() {
> int fd = -1;
> fd = open("/dev/net/tun", 0, 0);
> syscall(__NR_ioctl, fd, 0x400454caul, addr);
> }
>
> #define max_iter 10
> int main(void) {
> int iter;
> for (iter = 0; iter < max_iter; iter++) {
> test();
> printf("done %d of %d\n", iter+1, max_iter);
> }
> return 0;
> }

I just made a patch to fix this; however, it uncovers another bug,
so I am trying to fix both of them (if not more)...


Thanks!


[PATCH net] l2tp: check ps->sock before running pppol2tp_session_ioctl()

2017-10-13 Thread Guillaume Nault
When pppol2tp_session_ioctl() is called by pppol2tp_tunnel_ioctl(),
the session may be unconnected. That is, it was created by
pppol2tp_session_create() and hasn't been connected with
pppol2tp_connect(). In this case, ps->sock is NULL, so we need to check
for this case in order to avoid dereferencing a NULL pointer.

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_ppp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index bc6e8bfc5be4..f50452b919d5 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -988,6 +988,9 @@ static int pppol2tp_session_ioctl(struct l2tp_session 
*session,
 session->name, cmd, arg);
 
sk = ps->sock;
+   if (!sk)
+   return -EBADR;
+
sock_hold(sk);
 
switch (cmd) {
-- 
2.15.0.rc0



Re: [PATCH] net: stmmac: dwmac_lib: fix interchanged sleep/timeout values in DMA reset function

2017-10-13 Thread David Miller
From: Emiliano Ingrassia 
Date: Thu, 12 Oct 2017 11:00:47 +0200

> The DMA reset timeout, used in read_poll_timeout, is
> ten times shorter than the sleep time.
> This patch fixes these values interchanging them, as it was
> before the read_poll_timeout introduction.
> 
> Fixes: 8a70aeca80c2 ("net: stmmac: Use readl_poll_timeout")
> 
> Signed-off-by: Emiliano Ingrassia 

Applied.


Re: [PATCH] [net] liquidio: fix timespec64_to_ns typo

2017-10-13 Thread David Miller
From: Arnd Bergmann 
Date: Thu, 12 Oct 2017 11:48:31 +0200

> While experimenting with changes to the timekeeping code, I
> ran into a build error in the liquidio driver:
> 
> drivers/net/ethernet/cavium/liquidio/lio_main.c: In function 
> 'liquidio_ptp_settime':
> drivers/net/ethernet/cavium/liquidio/lio_main.c:1850:22: error: passing 
> argument 1 of 'timespec_to_ns' from incompatible pointer type 
> [-Werror=incompatible-pointer-types]
> 
> The driver had a type mismatch since it was first merged, but
> this never caused problems because it is only built on 64-bit
> architectures that define timespec and timespec64 to the same
> type.
> 
> If we ever want to compile-test the driver on 32-bit or change
> the way that 64-bit timespec64 is defined, we need to fix it,
> so let's just do it now.
> 
> Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters")
> Signed-off-by: Arnd Bergmann 

Applied.


Re: [next-queue PATCH v7 4/6] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-10-13 Thread Vinicius Costa Gomes
Hi,

Eric Dumazet  writes:

[...]

>
> Your mixing of s64 and u64 is disturbing.
>
> do_div() handles u64, not s64.
>
> div64_s64() might be needed in place of do_div()

I wasn't very comfortable about the signal juggling either. Didn't know
about div64_s64(), looks much better. Will fix, thanks.
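
To make the concern concrete, here is a small userspace sketch (plain C, not
kernel code; both function names are invented for this example) of what goes
wrong when a negative s64 is pushed through an unsigned-only division, which is
effectively what passing it to do_div() would do:

```c
#include <stdint.h>

/* Stand-in for a signed 64-bit division such as div64_s64(). */
static int64_t div_signed(int64_t dividend, int64_t divisor)
{
	return dividend / divisor;
}

/*
 * Stand-in for an unsigned-only division: the signed dividend is
 * reinterpreted as u64 before dividing, so a negative value becomes a
 * huge positive one.
 */
static uint64_t div_as_unsigned(int64_t dividend, uint32_t divisor)
{
	return (uint64_t)dividend / divisor;
}
```

With a dividend of -1000 and a divisor of 10, the signed helper returns -100,
while the unsigned one returns a very large positive number, which is the kind
of silent error a negative credit value could trigger.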


Cheers,
--
Vinicius


Re: [PATCH v1] pch_gbe: Switch to new PCI IRQ allocation API

2017-10-13 Thread Andy Shevchenko
On Fri, 2017-10-13 at 20:02 +0300, Andy Shevchenko wrote:
> This removes custom flag handling.

Please discard this one; it slipped out too early.

> 
> Signed-off-by: Andy Shevchenko 
> ---
>  drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h|  3 +-
>  .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 42 +--
> ---
>  2 files changed, 17 insertions(+), 28 deletions(-)
> 
> diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
> b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
> index 8d710a3b4db0..697e29dd4bd3 100644
> --- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
> +++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
> @@ -613,7 +613,6 @@ struct pch_gbe_privdata {
>   * @rx_ring: Pointer of Rx descriptor ring structure
>   * @rx_buffer_len:   Receive buffer length
>   * @tx_queue_len:Transmit queue length
> - * @have_msi:PCI MSI mode flag
>   * @pch_gbe_privdata:PCI Device ID driver_data
>   */
>  
> @@ -623,6 +622,7 @@ struct pch_gbe_adapter {
>   atomic_t irq_sem;
>   struct net_device *netdev;
>   struct pci_dev *pdev;
> + int irq;
>   struct net_device *polling_netdev;
>   struct napi_struct napi;
>   struct pch_gbe_hw hw;
> @@ -637,7 +637,6 @@ struct pch_gbe_adapter {
>   struct pch_gbe_rx_ring *rx_ring;
>   unsigned long rx_buffer_len;
>   unsigned long tx_queue_len;
> - bool have_msi;
>   bool rx_stop_flag;
>   int hwts_tx_en;
>   int hwts_rx_en;
> diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
> b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
> index 5ae9681a2da7..0284a3bc019c 100644
> --- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
> +++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
> @@ -781,11 +781,8 @@ static void pch_gbe_free_irq(struct
> pch_gbe_adapter *adapter)
>  {
>   struct net_device *netdev = adapter->netdev;
>  
> - free_irq(adapter->pdev->irq, netdev);
> - if (adapter->have_msi) {
> - pci_disable_msi(adapter->pdev);
> - netdev_dbg(netdev, "call pci_disable_msi\n");
> - }
> + free_irq(adapter->irq, netdev);
> + pci_free_irq_vectors(adapter->pdev);
>  }
>  
>  /**
> @@ -799,7 +796,7 @@ static void pch_gbe_irq_disable(struct
> pch_gbe_adapter *adapter)
>   atomic_inc(&adapter->irq_sem);
>   iowrite32(0, &hw->reg->INT_EN);
>   ioread32(&hw->reg->INT_ST);
> - synchronize_irq(adapter->pdev->irq);
> + synchronize_irq(adapter->irq);
>  
>   netdev_dbg(adapter->netdev, "INT_EN reg : 0x%08x\n",
>  ioread32(&hw->reg->INT_EN));
> @@ -1903,30 +1900,23 @@ static int pch_gbe_request_irq(struct
> pch_gbe_adapter *adapter)
>  {
>   struct net_device *netdev = adapter->netdev;
>   int err;
> - int flags;
>  
> - flags = IRQF_SHARED;
> - adapter->have_msi = false;
> - err = pci_enable_msi(adapter->pdev);
> - netdev_dbg(netdev, "call pci_enable_msi\n");
> - if (err) {
> - netdev_dbg(netdev, "call pci_enable_msi - Error:
> %d\n", err);
> - } else {
> - flags = 0;
> - adapter->have_msi = true;
> - }
> - err = request_irq(adapter->pdev->irq, &pch_gbe_intr,
> -   flags, netdev->name, netdev);
> + err = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
> + if (err < 0)
> + return err;
> +
> + adapter->irq = pci_irq_vector(pdev, 0);
> +
> + err = request_irq(adapter->irq, &pch_gbe_intr, IRQF_SHARED,
> +   netdev->name, netdev);
>   if (err)
>   netdev_err(netdev, "Unable to allocate interrupt
> Error: %d\n",
>  err);
> - netdev_dbg(netdev,
> -"adapter->have_msi : %d  flags : 0x%04x  return :
> 0x%04x\n",
> -adapter->have_msi, flags, err);
> + netdev_dbg(netdev, "have_msi : %d  return : 0x%04x\n",
> +pci_dev_msi_enabled(adapter->pdev), err);
>   return err;
>  }
>  
> -
>  /**
>   * pch_gbe_up - Up GbE network device
>   * @adapter:  Board private structure
> @@ -2399,9 +2389,9 @@ static void pch_gbe_netpoll(struct net_device
> *netdev)
>  {
>   struct pch_gbe_adapter *adapter = netdev_priv(netdev);
>  
> - disable_irq(adapter->pdev->irq);
> - pch_gbe_intr(adapter->pdev->irq, netdev);
> - enable_irq(adapter->pdev->irq);
> + disable_irq(adapter->irq);
> + pch_gbe_intr(adapter->irq, netdev);
> + enable_irq(adapter->irq);
>  }
>  #endif
>  

-- 
Andy Shevchenko 
Intel Finland Oy


[PATCH v1] pch_gbe: Switch to new PCI IRQ allocation API

2017-10-13 Thread Andy Shevchenko
This removes custom flag handling.

Signed-off-by: Andy Shevchenko 
---
 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h|  3 +-
 .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 42 +-
 2 files changed, 17 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
index 8d710a3b4db0..697e29dd4bd3 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
@@ -613,7 +613,6 @@ struct pch_gbe_privdata {
  * @rx_ring:   Pointer of Rx descriptor ring structure
  * @rx_buffer_len: Receive buffer length
  * @tx_queue_len:  Transmit queue length
- * @have_msi:  PCI MSI mode flag
  * @pch_gbe_privdata:  PCI Device ID driver_data
  */
 
@@ -623,6 +622,7 @@ struct pch_gbe_adapter {
atomic_t irq_sem;
struct net_device *netdev;
struct pci_dev *pdev;
+   int irq;
struct net_device *polling_netdev;
struct napi_struct napi;
struct pch_gbe_hw hw;
@@ -637,7 +637,6 @@ struct pch_gbe_adapter {
struct pch_gbe_rx_ring *rx_ring;
unsigned long rx_buffer_len;
unsigned long tx_queue_len;
-   bool have_msi;
bool rx_stop_flag;
int hwts_tx_en;
int hwts_rx_en;
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 5ae9681a2da7..0284a3bc019c 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -781,11 +781,8 @@ static void pch_gbe_free_irq(struct pch_gbe_adapter 
*adapter)
 {
struct net_device *netdev = adapter->netdev;
 
-   free_irq(adapter->pdev->irq, netdev);
-   if (adapter->have_msi) {
-   pci_disable_msi(adapter->pdev);
-   netdev_dbg(netdev, "call pci_disable_msi\n");
-   }
+   free_irq(adapter->irq, netdev);
+   pci_free_irq_vectors(adapter->pdev);
 }
 
 /**
@@ -799,7 +796,7 @@ static void pch_gbe_irq_disable(struct pch_gbe_adapter 
*adapter)
atomic_inc(&adapter->irq_sem);
iowrite32(0, &hw->reg->INT_EN);
ioread32(&hw->reg->INT_ST);
-   synchronize_irq(adapter->pdev->irq);
+   synchronize_irq(adapter->irq);
 
netdev_dbg(adapter->netdev, "INT_EN reg : 0x%08x\n",
   ioread32(&hw->reg->INT_EN));
@@ -1903,30 +1900,23 @@ static int pch_gbe_request_irq(struct pch_gbe_adapter 
*adapter)
 {
struct net_device *netdev = adapter->netdev;
int err;
-   int flags;
 
-   flags = IRQF_SHARED;
-   adapter->have_msi = false;
-   err = pci_enable_msi(adapter->pdev);
-   netdev_dbg(netdev, "call pci_enable_msi\n");
-   if (err) {
-   netdev_dbg(netdev, "call pci_enable_msi - Error: %d\n", err);
-   } else {
-   flags = 0;
-   adapter->have_msi = true;
-   }
-   err = request_irq(adapter->pdev->irq, &pch_gbe_intr,
- flags, netdev->name, netdev);
+   err = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
+   if (err < 0)
+   return err;
+
+   adapter->irq = pci_irq_vector(pdev, 0);
+
+   err = request_irq(adapter->irq, &pch_gbe_intr, IRQF_SHARED,
+ netdev->name, netdev);
if (err)
netdev_err(netdev, "Unable to allocate interrupt Error: %d\n",
   err);
-   netdev_dbg(netdev,
-  "adapter->have_msi : %d  flags : 0x%04x  return : 0x%04x\n",
-  adapter->have_msi, flags, err);
+   netdev_dbg(netdev, "have_msi : %d  return : 0x%04x\n",
+  pci_dev_msi_enabled(adapter->pdev), err);
return err;
 }
 
-
 /**
  * pch_gbe_up - Up GbE network device
  * @adapter:  Board private structure
@@ -2399,9 +2389,9 @@ static void pch_gbe_netpoll(struct net_device *netdev)
 {
struct pch_gbe_adapter *adapter = netdev_priv(netdev);
 
-   disable_irq(adapter->pdev->irq);
-   pch_gbe_intr(adapter->pdev->irq, netdev);
-   enable_irq(adapter->pdev->irq);
+   disable_irq(adapter->irq);
+   pch_gbe_intr(adapter->irq, netdev);
+   enable_irq(adapter->irq);
 }
 #endif
 
-- 
2.14.2



RE: [PATCH net-next 5/5] net: dsa: split dsa_port's netdev member

2017-10-13 Thread Vivien Didelot
Hi David,

David Laight  writes:

> From: Vivien Didelot
>> Sent: 13 October 2017 16:29
>> Vivien Didelot  writes:
>> 
>> >>> How about using:
>> >>>
>> >>>  union {
>> >>>  struct net_device *master;
>> >>>  struct net_device *slave;
>> >>>  } netdev;
>> >> ...
>> >>
>> >> You can remove the 'netdev' all the compilers support unnamed unions.
>> >
>> > There are issues with older GCC versions, see the commit 42275bd8fcb3
>> > ("switchdev: don't use anonymous union on switchdev attr/obj structs")
>> >
>> > That's why I kept it in the v2 I sent.
>> 
>> At the same time, I can see that struct sk_buff uses anonym union a lot.
>> 
>> It seems weird that one raised a compiler issue for switchdev but not
>> for skbuff.h... Do you think it is viable to drop the name here then?
>
> I believe the problem is with initialisers for static structures
> that contain anonymous unions.

The dsa_port structures are dynamically allocated, so it seems safe to
use an anonymous union here.
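
For illustration only, a minimal userspace sketch of that pattern (the struct
and function names below are hypothetical and much smaller than the real
struct dsa_port): the union members are reached without an intermediate name,
and since the object is filled in at run time, no static initialiser ever has
to touch the anonymous union.

```c
#include <stdlib.h>

struct net_device;	/* opaque here; only pointers to it are used */

/* Hypothetical cut-down port structure, not the real struct dsa_port. */
struct example_port {
	int index;
	union {
		struct net_device *master;	/* meaningful for a CPU port */
		struct net_device *slave;	/* meaningful for a user port */
	};
};

/* Allocate a port dynamically and assign the union member at run time. */
static struct example_port *example_port_alloc(int index,
					       struct net_device *dev)
{
	struct example_port *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;

	p->index = index;
	p->slave = dev;	/* no 'netdev.' prefix needed */
	return p;
}
```

Both members alias the same storage, so p->master and p->slave read back the
same pointer; which one is meaningful depends on the port type.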

BTW v2 never left my computer in fact, so this will be fixed up in v2.


Thanks!

Vivien


[RFC PATCH] net: realtek: r8169: implement set_link_ksettings()

2017-10-13 Thread Tobias Jakobi
Commit 6fa1ba61520576cf1346c4ff09a056f2950cb3bf partially
implemented the new ethtool API, by replacing get_settings()
with get_link_ksettings(). This breaks ethtool, since the
userspace tool (according to the new API specs) never tries
the legacy set() call, when the new get() call succeeds.

All attempts to change some setting from userspace result in:
> Cannot set new settings: Operation not supported

Implement the missing set() call.

Signed-off-by: Tobias Jakobi 
---
 drivers/net/ethernet/realtek/r8169.c | 40 +---
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index e03fcf914690..24e8f7133038 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -2025,21 +2025,6 @@ static int rtl8169_set_speed(struct net_device *dev,
return ret;
 }
 
-static int rtl8169_set_settings(struct net_device *dev, struct ethtool_cmd 
*cmd)
-{
-   struct rtl8169_private *tp = netdev_priv(dev);
-   int ret;
-
-   del_timer_sync(&tp->timer);
-
-   rtl_lock_work(tp);
-   ret = rtl8169_set_speed(dev, cmd->autoneg, ethtool_cmd_speed(cmd),
-   cmd->duplex, cmd->advertising);
-   rtl_unlock_work(tp);
-
-   return ret;
-}
-
 static netdev_features_t rtl8169_fix_features(struct net_device *dev,
netdev_features_t features)
 {
@@ -2166,6 +2151,29 @@ static int rtl8169_get_link_ksettings(struct net_device 
*dev,
return rc;
 }
 
+static int rtl8169_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
+{
+   struct rtl8169_private *tp = netdev_priv(dev);
+   int rc;
+   u32 advertising;
+
+   if (!ethtool_convert_link_mode_to_legacy_u32(&advertising,
+   cmd->link_modes.advertising))
+   return -EINVAL;
+
+   del_timer_sync(&tp->timer);
+
+   rtl_lock_work(tp);
+
+   rc = rtl8169_set_speed(dev, cmd->base.autoneg, cmd->base.speed,
+  cmd->base.duplex, advertising);
+
+   rtl_unlock_work(tp);
+
+   return rc;
+}
+
 static void rtl8169_get_regs(struct net_device *dev, struct ethtool_regs *regs,
 void *p)
 {
@@ -2367,7 +2375,6 @@ static const struct ethtool_ops rtl8169_ethtool_ops = {
.get_drvinfo= rtl8169_get_drvinfo,
.get_regs_len   = rtl8169_get_regs_len,
.get_link   = ethtool_op_get_link,
-   .set_settings   = rtl8169_set_settings,
.get_msglevel   = rtl8169_get_msglevel,
.set_msglevel   = rtl8169_set_msglevel,
.get_regs   = rtl8169_get_regs,
@@ -2379,6 +2386,7 @@ static const struct ethtool_ops rtl8169_ethtool_ops = {
.get_ts_info= ethtool_op_get_ts_info,
.nway_reset = rtl8169_nway_reset,
.get_link_ksettings = rtl8169_get_link_ksettings,
+   .set_link_ksettings = rtl8169_set_link_ksettings,
 };
 
 static void rtl8169_get_mac_version(struct rtl8169_private *tp,
-- 
2.13.5



RE: [PATCH net-next v2 2/4] net: dsa: mv88e6060: setup random mac address

2017-10-13 Thread Vivien Didelot
Hi David, Woojung,

woojung@microchip.com writes:

>> From: Vivien Didelot
>> > Sent: 13 October 2017 02:41
>> > As for mv88e6xxx, setup the switch from within the mv88e6060 driver with
>> > a random MAC address, and remove the .set_addr implementation.
>> >
>> > Signed-off-by: Vivien Didelot 
>> > ---
>> >  drivers/net/dsa/mv88e6060.c | 30 +++---
>> >  1 file changed, 19 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
>> > index 621cdc46ad81..2f9d5e6a0f97 100644
>> > --- a/drivers/net/dsa/mv88e6060.c
>> > +++ b/drivers/net/dsa/mv88e6060.c
>> ...
>> > +  REG_WRITE(REG_GLOBAL, GLOBAL_MAC_01, (addr[0] << 9) |
>> addr[1]);
>> 
>> Is that supposed to be 9 ?
>
> Looks like it.
> Check 
> http://www.marvell.com/switching/assets/marvell_linkstreet_88E6060_datasheet.pdf

Hum, David is correct, there is a bug in the driver which needs to be
addressed first. MAC address bit 40 is addr[0] & 0x1, thus we must
shift byte 0 by 8 and mask it against 0xfe.
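
As a rough illustration of the fix described above (a userspace sketch, not the
driver code; the helper name mac01_reg_value is made up for this example), the
value for the first MAC register can be built so that bit 8 always ends up
cleared:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack MAC bytes 0 and 1 into the 16-bit value
 * written to a register such as GLOBAL_MAC_01. Byte 0 goes into the
 * upper eight bits; its least significant bit (the multicast bit, MAC
 * address bit 40) is masked off so that bit 8 of the result is zero.
 */
static uint16_t mac01_reg_value(const uint8_t addr[6])
{
	assert(addr != NULL);

	return (uint16_t)(((addr[0] & 0xfe) << 8) | addr[1]);
}
```

With addr[0] = 0xff and addr[1] = 0x12 this gives 0xfe12, the same result as
computing (addr[0] << 8) | addr[1] and then masking with 0xfeff.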

I'll respin this series including a fix for both net and net-next.


Thanks,

Vivien

