Re: [PATCH net-next 1/8] net/ncsi: Avoid unused-value build warning from ia64-linux-gcc

2016-09-28 Thread David Miller
From: Gavin Shan 
Date: Thu, 29 Sep 2016 15:03:08 +1000

> This replaces the atomic access to the NCSI channel's state with READ_ONCE()
> and WRITE_ONCE() to avoid the above build warning. We also needn't hold the
> channel's lock when updating its state. No logical changes introduced.

I don't understand this.

If it's important to take the lock for the list add/del, then it must
be important to make the state change appear atomic wrt. that lock as
well.

Can parallel threads of control enter these functions which change the
state?  If so, then you need to make the state changes under the lock.
In fact, you probably have to make the state tests under the locks as
well.

If not, please explain what prevents it from happening.

Thanks.
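
For illustration, here is a minimal sketch (hypothetical helper, not taken
from the series) of the pattern being asked for, with the state transition
made under the same lock that protects the list add:

static void example_deactivate_channel(struct ncsi_dev_priv *ndp,
				       struct ncsi_channel *nc)
{
	unsigned long flags;

	spin_lock_irqsave(&ndp->lock, flags);
	nc->state = NCSI_CHANNEL_INACTIVE;	/* state change under the lock */
	list_add_tail_rcu(&nc->link, &ndp->channel_queue);
	spin_unlock_irqrestore(&ndp->lock, flags);
}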


Re: [PATCH 0/3] [v2] Add basic ACPI support to the Qualcomm Technologies EMAC driver

2016-09-28 Thread David Miller
From: Timur Tabi 
Date: Wed, 28 Sep 2016 11:58:41 -0500

> This patch series adds support to the EMAC driver for extracting addresses,
> interrupts, and some _DSDs (properties) from ACPI.  The first two patches
> clean up the code, and the third patch adds ACPI-specific functionality.
> 
> The first patch fixes a bug with handling the platform_device for the
> internal PHY.  This phy is treated as a separate device in both DT and
> ACPI, but since the platform_device is not released automatically when the
> driver unloads, managed functions like devm_ioremap_resource cannot be
> used.
> 
> The second patch replaces of_get_mac_address with its platform-independent
> equivalent device_get_mac_address.  
> 
> The third patch parses the ACPI tables to obtain the platform_device for
> the primary EMAC node ("QCOM8070") and the internal phy node ("QCOM8071").

Series applied, thanks.


Re: [PATCH net-next v5] bpf: allow access into map value arrays

2016-09-28 Thread David Miller
From: Josef Bacik 
Date: Wed, 28 Sep 2016 10:54:32 -0400

> Suppose you have a map array value that is something like this
> 
> struct foo {
>   unsigned iter;
>   int array[SOME_CONSTANT];
> };
> 
> You can easily insert this into an array, but you cannot modify the contents
> of foo->array[] after the fact.  This is because we have no way to verify we
> won't go off the end of the array at verification time.  This patch provides
> a start for this work.  We accomplish this by keeping track of a minimum and
> maximum value a register could be while we're checking the code.  Then at the
> time we try to do an access into a MAP_VALUE we verify that the maximum
> offset into that region is a valid access into that memory region.  So in
> practice, code such as this
> 
> unsigned index = 0;
> 
> if (foo->iter >= SOME_CONSTANT)
>   foo->iter = index;
> else
>   index = foo->iter++;
> foo->array[index] = bar;
> 
> would be allowed, as we can verify that index will always be between 0 and
> SOME_CONSTANT-1.  If you wish to use signed values you'll have to have an
> extra check to make sure the index isn't less than 0, or do something like
> index %= SOME_CONSTANT.
> 
> Signed-off-by: Josef Bacik 
> Acked-by: Alexei Starovoitov 

Applied, thanks.
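
For reference, a minimal sketch of the access pattern the verifier is being
taught to accept; the helper function and the array-size value are
hypothetical, and only struct foo and the bounds check come from the commit
message above:

#define SOME_CONSTANT 16		/* illustrative array size */

struct foo {
	unsigned int iter;
	int array[SOME_CONSTANT];
};

/* 'value' would come from bpf_map_lookup_elem() on an array map. */
static int example_update(struct foo *value, int bar)
{
	unsigned int index = 0;

	if (value->iter >= SOME_CONSTANT)
		value->iter = index;
	else
		index = value->iter++;

	/* index is provably in [0, SOME_CONSTANT - 1], so the verifier can
	 * prove the store below stays inside the map value.
	 */
	value->array[index] = bar;
	return 0;
}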


[PATCH kernel v2] PCI: Enable access to custom VPD for Chelsio devices (cxgb3)

2016-09-28 Thread Alexey Kardashevskiy
There is at least one Chelsio 10Gb card which uses the VPD area to store
some custom blocks (example below). However, pci_vpd_size() returns
the length of the first block only, assuming that there can be only
one VPD "End Tag", and VFIO blocks access beyond that offset
(since 4e1a63555), which leads to the situation where the guest "cxgb3"
driver fails to probe the device. The host system does not have this
problem as the driver accesses the config space directly without
pci_read_vpd()/...

This adds a quirk to override the VPD size to a bigger value.
The maximum size is taken from EEPROMSIZE in
drivers/net/ethernet/chelsio/cxgb3/common.h. We do not read the tag
as the cxgb3 driver does, since the driver supports writing to EEPROM/VPD
and, when it writes, it only checks the 8192-byte boundary. The quirk
is registered for all devices supported by the cxgb3 driver.

This adds a quirk to the PCI layer (not to the cxgb3 driver) as
the cxgb3 driver itself accesses VPD directly and the problem only exists
with the vfio-pci driver (when cxgb3 is not running on the host and
may not even be loaded), which blocks accesses beyond the first block
of VPD data. However, vfio-pci itself does not have a quirks mechanism,
so we add the quirk to PCI.
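
As an illustration of the consumer side (the helper is hypothetical, the
0xc00 offset is taken from the VPD dump below), a read past the first block
goes through pci_read_vpd() and gets cut off at the computed VPD size
unless the quirk extends it:

static ssize_t example_read_second_block(struct pci_dev *pdev,
					 void *buf, size_t len)
{
	/* Without the quirk this read is truncated/refused because the
	 * computed VPD size ends at the first "End Tag".
	 */
	return pci_read_vpd(pdev, 0xc00, len, buf);
}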

Tested on:
Ethernet controller [0200]: Chelsio Communications Inc T310 10GbE Single Port 
Adapter [1425:0030]

This is its VPD:
 Large item 42 bytes; name 0x2 Identifier String
b'10 Gigabit Ethernet-SR PCI Express Adapter'
#00 [EC] len=7: b'D76809 '
#0a [FN] len=7: b'46K7897'
#14 [PN] len=7: b'46K7897'
#1e [MN] len=4: b'1037'
#25 [FC] len=4: b'5769'
#2c [SN] len=12: b'YL102035603V'
#3b [NA] len=12: b'00145E992ED1'

0c00 Large item 16 bytes; name 0x2 Identifier String
b'S310E-SR-X  '
0c13 Large item 234 bytes; name 0x10
#00 [PN] len=16: b'TBD '
#13 [EC] len=16: b'110107730D2 '
#26 [SN] len=16: b'97YL102035603V  '
#39 [NA] len=12: b'00145E992ED1'
#48 [V0] len=6: b'175000'
#51 [V1] len=6: b'26'
#5a [V2] len=6: b'26'
#63 [V3] len=6: b'2000  '
#6c [V4] len=2: b'1 '
#71 [V5] len=6: b'c2'
#7a [V6] len=6: b'0 '
#83 [V7] len=2: b'1 '
#88 [V8] len=2: b'0 '
#8d [V9] len=2: b'0 '
#92 [VA] len=2: b'0 '
#97 [RV] len=80: 
b's\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
0d00 Large item 252 bytes; name 0x11
#00 [VC] len=16: b'122310_1222 dp  '
#13 [VD] len=16: b'610-0001-00 H1\x00\x00'
#26 [VE] len=16: b'122310_1353 fp  '
#39 [VF] len=16: b'610-0001-00 H1\x00\x00'
#4c [RW] len=173: 
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
0dff Small item 0 bytes; name 0xf End Tag

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* used pci_set_vpd_size() helper
* added explicit list of IDs from cxgb3 driver
* added a note in the commit log why the quirk is not in cxgb3
---
 drivers/pci/quirks.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44e0ff3..b22fce5 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3243,6 +3243,28 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 
PCI_DEVICE_ID_INTEL_CACTUS_RIDGE_4C
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PORT_RIDGE,
quirk_thunderbolt_hotplug_msi);
 
+static void quirk_chelsio_extend_vpd(struct pci_dev *dev)
+{
+   if (!dev->vpd)
+   return;
+
+   pci_set_vpd_size(dev, max_t(unsigned int, dev->vpd->len, 8192));
+}
+
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x20, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x21, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x22, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x23, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x24, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x25, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x26, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x30, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x31, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x32, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x35, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x36, quirk_chelsio_extend_vpd);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_CHELSIO, 0x37, quirk_chelsio_extend_vpd);
+
 #ifdef CONFIG_ACPI
 /*
  * Apple: Shutdown Cactus Ridge Thunderbolt controller.
-- 
2.5.0.rc3



[PATCH net-next 7/8] net/ncsi: Introduce ncsi_stop_dev()

2016-09-28 Thread Gavin Shan
This introduces ncsi_stop_dev(), as a counterpart to ncsi_start_dev(),
to stop the NCSI device so that it can be re-enabled in the future. This
API should be called when the network device driver is going to
shut down the device. The function does three things: stop the channel
monitoring, reset channels to the inactive state and report the NCSI
link as down.
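
As a hedged usage sketch (driver names hypothetical; patch 8/8 does this
for ftgmac100), the caller side looks roughly like:

static int example_ndo_stop(struct net_device *netdev)
{
	struct example_priv *priv = netdev_priv(netdev);

	if (priv->use_ncsi)
		ncsi_stop_dev(priv->ndev);	/* stop monitor, reset channels,
						 * report link down */

	/* ... usual datapath teardown follows ... */
	return 0;
}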

Signed-off-by: Gavin Shan 
Reviewed-by: Joel Stanley 
---
 include/net/ncsi.h |  5 +
 net/ncsi/ncsi-manage.c | 33 ++---
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/net/ncsi.h b/include/net/ncsi.h
index 1dbf42f..68680ba 100644
--- a/include/net/ncsi.h
+++ b/include/net/ncsi.h
@@ -31,6 +31,7 @@ struct ncsi_dev {
 struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
   void (*notifier)(struct ncsi_dev *nd));
 int ncsi_start_dev(struct ncsi_dev *nd);
+void ncsi_stop_dev(struct ncsi_dev *nd);
 void ncsi_unregister_dev(struct ncsi_dev *nd);
 #else /* !CONFIG_NET_NCSI */
 static inline struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
@@ -44,6 +45,10 @@ static inline int ncsi_start_dev(struct ncsi_dev *nd)
return -ENOTTY;
 }
 
+static void ncsi_stop_dev(struct ncsi_dev *nd)
+{
+}
+
 static inline void ncsi_unregister_dev(struct ncsi_dev *nd)
 {
 }
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 1b797c9..6a96873 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1161,9 +1161,7 @@ EXPORT_SYMBOL_GPL(ncsi_register_dev);
 int ncsi_start_dev(struct ncsi_dev *nd)
 {
struct ncsi_dev_priv *ndp = TO_NCSI_DEV_PRIV(nd);
-   struct ncsi_package *np;
-   struct ncsi_channel *nc;
-   int old_state, ret;
+   int ret;
 
if (nd->state != ncsi_dev_state_registered &&
nd->state != ncsi_dev_state_functional)
@@ -1175,9 +1173,27 @@ int ncsi_start_dev(struct ncsi_dev *nd)
return 0;
}
 
-   /* Reset channel's state and start over */
+   if (ndp->flags & NCSI_DEV_HWA)
+   ret = ncsi_enable_hwa(ndp);
+   else
+   ret = ncsi_choose_active_channel(ndp);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(ncsi_start_dev);
+
+void ncsi_stop_dev(struct ncsi_dev *nd)
+{
+   struct ncsi_dev_priv *ndp = TO_NCSI_DEV_PRIV(nd);
+   struct ncsi_package *np;
+   struct ncsi_channel *nc;
+   int old_state;
+
+   /* Stop the channel monitor and reset channel's state */
NCSI_FOR_EACH_PACKAGE(ndp, np) {
NCSI_FOR_EACH_CHANNEL(np, nc) {
+   ncsi_stop_channel_monitor(nc);
+
old_state = READ_ONCE(nc->state);
WRITE_ONCE(nc->state, NCSI_CHANNEL_INACTIVE);
WARN_ON_ONCE(!list_empty(&nc->link) ||
@@ -1185,14 +1201,9 @@ int ncsi_start_dev(struct ncsi_dev *nd)
}
}
 
-   if (ndp->flags & NCSI_DEV_HWA)
-   ret = ncsi_enable_hwa(ndp);
-   else
-   ret = ncsi_choose_active_channel(ndp);
-
-   return ret;
+   ncsi_report_link(ndp, true);
 }
-EXPORT_SYMBOL_GPL(ncsi_start_dev);
+EXPORT_SYMBOL_GPL(ncsi_stop_dev);
 
 void ncsi_unregister_dev(struct ncsi_dev *nd)
 {
-- 
2.1.0



[PATCH net-next 8/8] net/faraday: Stop NCSI device on shutdown

2016-09-28 Thread Gavin Shan
This stops the NCSI device when closing the network device so that the
NCSI device can be re-enabled later.

Signed-off-by: Gavin Shan 
Reviewed-by: Joel Stanley 
---
 drivers/net/ethernet/faraday/ftgmac100.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
b/drivers/net/ethernet/faraday/ftgmac100.c
index 90f9c54..2625872 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1190,6 +1190,8 @@ static int ftgmac100_stop(struct net_device *netdev)
napi_disable(&priv->napi);
if (netdev->phydev)
phy_stop(netdev->phydev);
+   else if (priv->use_ncsi)
+   ncsi_stop_dev(priv->ndev);
 
ftgmac100_stop_hw(priv);
free_irq(priv->irq, netdev);
-- 
2.1.0



[PATCH net-next 0/8] net/ncsi: NCSI Improvement and bug fixes

2016-09-28 Thread Gavin Shan
This series of patches improves the NCSI stack according to the comments
I received after the NCSI code was merged into 4.8-rc1:

  * PATCH[1/8] fixes the build warning caused by xchg() with ia64-linux-gcc.
The atomic operations are replaced with {READ, WRITE}_ONCE().
  * Channel ID 0x1f is reserved and cannot be a valid channel ID, so we
needn't try to probe the channel whose ID is 0x1f. PATCH[2/8] and
PATCH[3/8] address this issue.
  * The request IDs are assigned in round-robin fashion, but it's broken.
PATCH[4/8] makes it work.
  * PATCH[5/8] and PATCH[6/8] rework the channel monitoring to improve the
code's readability and robustness.
  * PATCH[7/8] and PATCH[8/8] introduce ncsi_stop_dev() so that the network
device can be closed and opened afterwards without errors.

Gavin Shan (8):
  net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
  net/ncsi: Introduce NCSI_RESERVED_CHANNEL
  net/ncsi: Don't probe on the reserved channel ID (0x1f)
  net/ncsi: Rework request index allocation
  net/ncsi: Allow to extend NCSI request properties
  net/ncsi: Rework the channel monitoring
  net/ncsi: Introduce ncsi_stop_dev()
  net/faraday: Stop NCSI device on shutdown

 drivers/net/ethernet/faraday/ftgmac100.c |   2 +
 include/net/ncsi.h   |   5 +
 net/ncsi/internal.h  |  22 ++--
 net/ncsi/ncsi-aen.c  |  19 ++--
 net/ncsi/ncsi-cmd.c  |   2 +-
 net/ncsi/ncsi-manage.c   | 166 ++-
 net/ncsi/ncsi-rsp.c  |  17 +---
 7 files changed, 137 insertions(+), 96 deletions(-)

-- 
2.1.0



[PATCH net-next 1/8] net/ncsi: Avoid unused-value build warning from ia64-linux-gcc

2016-09-28 Thread Gavin Shan
xchg() is used to set the NCSI channel's state so that the state is
accessed consistently. xchg()'s return value should be used; otherwise
a build warning is raised (with -Wunused-value), as the message below
indicates. It is reported by ia64-linux-gcc (GCC) 4.9.0.

 net/ncsi/ncsi-manage.c: In function 'ncsi_channel_monitor':
 arch/ia64/include/uapi/asm/cmpxchg.h:56:2: warning: value computed is \
 not used [-Wunused-value]
  ((__typeof__(*(ptr))) __xchg((unsigned long) (x), (ptr), sizeof(*(ptr
   ^
 net/ncsi/ncsi-manage.c:202:3: note: in expansion of macro 'xchg'
  xchg(&nc->state, NCSI_CHANNEL_INACTIVE);

This replaces the atomic access to the NCSI channel's state with READ_ONCE()
and WRITE_ONCE() to avoid the above build warning. We also needn't hold the
channel's lock when updating its state. No logical changes introduced.
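
A minimal sketch of the resulting pattern (illustrative helper, not a
function in the patch):

static void example_set_channel_inactive(struct ncsi_channel *nc)
{
	int old_state = READ_ONCE(nc->state);	/* read, instead of xchg() */

	if (old_state != NCSI_CHANNEL_INACTIVE)
		WRITE_ONCE(nc->state, NCSI_CHANNEL_INACTIVE);
}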

Signed-off-by: Gavin Shan 
Reviewed-by: Joel Stanley 
---
 net/ncsi/ncsi-aen.c| 19 +--
 net/ncsi/ncsi-manage.c | 37 +
 net/ncsi/ncsi-rsp.c| 13 +++--
 3 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/net/ncsi/ncsi-aen.c b/net/ncsi/ncsi-aen.c
index d463468..abc5473 100644
--- a/net/ncsi/ncsi-aen.c
+++ b/net/ncsi/ncsi-aen.c
@@ -53,6 +53,7 @@ static int ncsi_aen_handler_lsc(struct ncsi_dev_priv *ndp,
struct ncsi_aen_lsc_pkt *lsc;
struct ncsi_channel *nc;
struct ncsi_channel_mode *ncm;
+   int old_state;
unsigned long old_data;
unsigned long flags;
 
@@ -70,12 +71,14 @@ static int ncsi_aen_handler_lsc(struct ncsi_dev_priv *ndp,
if (!((old_data ^ ncm->data[2]) & 0x1) ||
    !list_empty(&nc->link))
return 0;
-   if (!(nc->state == NCSI_CHANNEL_INACTIVE && (ncm->data[2] & 0x1)) &&
-   !(nc->state == NCSI_CHANNEL_ACTIVE && !(ncm->data[2] & 0x1)))
+
+   old_state = READ_ONCE(nc->state);
+   if (!(old_state == NCSI_CHANNEL_INACTIVE && (ncm->data[2] & 0x1)) &&
+   !(old_state == NCSI_CHANNEL_ACTIVE && !(ncm->data[2] & 0x1)))
return 0;
 
if (!(ndp->flags & NCSI_DEV_HWA) &&
-   nc->state == NCSI_CHANNEL_ACTIVE)
+   old_state == NCSI_CHANNEL_ACTIVE)
ndp->flags |= NCSI_DEV_RESHUFFLE;
 
ncsi_stop_channel_monitor(nc);
@@ -90,6 +93,7 @@ static int ncsi_aen_handler_cr(struct ncsi_dev_priv *ndp,
   struct ncsi_aen_pkt_hdr *h)
 {
struct ncsi_channel *nc;
+   int old_state;
unsigned long flags;
 
/* Find the NCSI channel */
@@ -97,13 +101,14 @@ static int ncsi_aen_handler_cr(struct ncsi_dev_priv *ndp,
if (!nc)
return -ENODEV;
 
+   old_state = READ_ONCE(nc->state);
if (!list_empty(&nc->link) ||
-   nc->state != NCSI_CHANNEL_ACTIVE)
+   old_state != NCSI_CHANNEL_ACTIVE)
return 0;
 
ncsi_stop_channel_monitor(nc);
+   WRITE_ONCE(nc->state, NCSI_CHANNEL_INACTIVE);
spin_lock_irqsave(&ndp->lock, flags);
-   xchg(&nc->state, NCSI_CHANNEL_INACTIVE);
list_add_tail_rcu(&nc->link, &ndp->channel_queue);
spin_unlock_irqrestore(&ndp->lock, flags);
 
@@ -116,6 +121,7 @@ static int ncsi_aen_handler_hncdsc(struct ncsi_dev_priv 
*ndp,
struct ncsi_channel *nc;
struct ncsi_channel_mode *ncm;
struct ncsi_aen_hncdsc_pkt *hncdsc;
+   int old_state;
unsigned long flags;
 
/* Find the NCSI channel */
@@ -127,8 +133,9 @@ static int ncsi_aen_handler_hncdsc(struct ncsi_dev_priv 
*ndp,
ncm = &nc->modes[NCSI_MODE_LINK];
hncdsc = (struct ncsi_aen_hncdsc_pkt *)h;
ncm->data[3] = ntohl(hncdsc->status);
+   old_state = READ_ONCE(nc->state);
if (!list_empty(&nc->link) ||
-   nc->state != NCSI_CHANNEL_ACTIVE ||
+   old_state != NCSI_CHANNEL_ACTIVE ||
(ncm->data[3] & 0x1))
return 0;
 
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index ef017b8..fc0ae54 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -143,7 +143,7 @@ static void ncsi_report_link(struct ncsi_dev_priv *ndp, 
bool force_down)
NCSI_FOR_EACH_PACKAGE(ndp, np) {
NCSI_FOR_EACH_CHANNEL(np, nc) {
if (!list_empty(&nc->link) ||
-   nc->state != NCSI_CHANNEL_ACTIVE)
+   READ_ONCE(nc->state) != NCSI_CHANNEL_ACTIVE)
continue;
 
if (nc->modes[NCSI_MODE_LINK].data[2] & 0x1) {
@@ -166,7 +166,7 @@ static void ncsi_channel_monitor(unsigned long data)
bool enabled;
unsigned int timeout;
unsigned long flags;
-   int ret;
+   int old_state, ret;
 
spin_lock_irqsave(&nc->lock, flags);
timeout = nc->timeout;
@@ -175,8 +175,10 @@ static void ncsi_channel_monitor(unsigned long data)
 
if (!enabled || !list_empty(&nc->link))

[PATCH net-next 6/8] net/ncsi: Rework the channel monitoring

2016-09-28 Thread Gavin Shan
The original NCSI channel monitoring was implemented based on a
backoff algorithm: the GLS response should be received in the
specified interval. Otherwise, the channel is regarded as dead
and failover should be taken if the current channel is an active one.
There are several problems in the implementation: (A) On BCM5718,
we found that when the IID (Instance ID) in the GLS command packet
changes from 255 to 1, the response corresponding to IID#1 never
comes in. It means we cannot make the unfair judgement that the
channel is dead when one response is missed. (B) The code's
readability should be improved. (C) We should do failover when
the current channel is an active one, and the channel monitoring
should be marked as disabled before doing failover.

This reworks the channel monitoring to address all of the above issues.
The fields for channel monitoring are put into a separate struct
and the states of channel monitoring are predefined. The channel
is regarded as alive if the network controller responds to one of
two GLS commands, or to both of them, within 5 seconds.
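
A rough sketch of the monitor's per-tick decision, using the state values
defined in the patch below (the helper itself is hypothetical):

/* Called once per second from the monitor timer in this sketch. */
static bool example_monitor_send_gls(unsigned int monitor_state)
{
	switch (monitor_state) {
	case NCSI_CHANNEL_MONITOR_START:
	case NCSI_CHANNEL_MONITOR_RETRY:
		return true;	/* send (or resend) Get Link Status */
	case NCSI_CHANNEL_MONITOR_WAIT ... NCSI_CHANNEL_MONITOR_WAIT_MAX:
		return false;	/* keep waiting for a response */
	default:
		return false;	/* window expired: channel treated as dead */
	}
}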

Signed-off-by: Gavin Shan 
Reviewed-by: Joel Stanley 
---
 net/ncsi/internal.h| 12 +---
 net/ncsi/ncsi-manage.c | 46 ++
 net/ncsi/ncsi-rsp.c|  2 +-
 3 files changed, 36 insertions(+), 24 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 26e9295..13290a7 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -187,9 +187,15 @@ struct ncsi_channel {
struct ncsi_channel_modemodes[NCSI_MODE_MAX];
struct ncsi_channel_filter  *filters[NCSI_FILTER_MAX];
struct ncsi_channel_stats   stats;
-   struct timer_list   timer;  /* Link monitor timer  */
-   boolenabled;/* Timer is enabled*/
-   unsigned inttimeout;/* Times of timeout*/
+   struct {
+   struct timer_list   timer;
+   boolenabled;
+   unsigned intstate;
+#define NCSI_CHANNEL_MONITOR_START 0
+#define NCSI_CHANNEL_MONITOR_RETRY 1
+#define NCSI_CHANNEL_MONITOR_WAIT  2
+#define NCSI_CHANNEL_MONITOR_WAIT_MAX  5
+   } monitor;
struct list_headnode;
struct list_headlink;
 };
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 41d6af9..1b797c9 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -164,13 +164,13 @@ static void ncsi_channel_monitor(unsigned long data)
struct ncsi_dev_priv *ndp = np->ndp;
struct ncsi_cmd_arg nca;
bool enabled;
-   unsigned int timeout;
+   unsigned int monitor_state;
unsigned long flags;
int old_state, ret;
 
spin_lock_irqsave(&nc->lock, flags);
-   timeout = nc->timeout;
-   enabled = nc->enabled;
+   enabled = nc->monitor.enabled;
+   monitor_state = nc->monitor.state;
spin_unlock_irqrestore(&nc->lock, flags);
 
if (!enabled || !list_empty(&nc->link))
@@ -181,7 +181,9 @@ static void ncsi_channel_monitor(unsigned long data)
old_state != NCSI_CHANNEL_ACTIVE)
return;
 
-   if (!(timeout % 2)) {
+   switch (monitor_state) {
+   case NCSI_CHANNEL_MONITOR_START:
+   case NCSI_CHANNEL_MONITOR_RETRY:
nca.ndp = ndp;
nca.package = np->id;
nca.channel = nc->id;
@@ -193,14 +195,18 @@ static void ncsi_channel_monitor(unsigned long data)
   ret);
return;
}
-   }
 
-   if (timeout + 1 >= 3) {
+   break;
+   case NCSI_CHANNEL_MONITOR_WAIT ... NCSI_CHANNEL_MONITOR_WAIT_MAX:
+   break;
+   default:
if (!(ndp->flags & NCSI_DEV_HWA) &&
-   old_state == NCSI_CHANNEL_ACTIVE)
+   old_state == NCSI_CHANNEL_ACTIVE) {
ncsi_report_link(ndp, true);
+   ndp->flags |= NCSI_DEV_RESHUFFLE;
+   }
 
-   WRITE_ONCE(nc->state, NCSI_CHANNEL_INACTIVE);
+   WRITE_ONCE(nc->state, NCSI_CHANNEL_ACTIVE);
spin_lock_irqsave(&ndp->lock, flags);
list_add_tail_rcu(&nc->link, &ndp->channel_queue);
spin_unlock_irqrestore(&ndp->lock, flags);
@@ -209,10 +215,9 @@ static void ncsi_channel_monitor(unsigned long data)
}
 
spin_lock_irqsave(&nc->lock, flags);
-   nc->timeout = timeout + 1;
-   nc->enabled = true;
+   nc->monitor.state++;
spin_unlock_irqrestore(&nc->lock, flags);
-   mod_timer(&nc->timer, jiffies + HZ * (1 << (nc->timeout / 2)));
+   mod_timer(&nc->monitor.timer, jiffies + HZ);
 }
 
 void ncsi_start_channel_monitor(struct ncsi_channel *nc)
@@ -220,12 +225,12 @@ void ncsi_start_channel_monitor(struct ncsi_channel *nc)
unsigned long flags;
 
spin_lock_irqsave(&nc->lock, flags);
-   

[PATCH net-next 5/8] net/ncsi: Allow to extend NCSI request properties

2016-09-28 Thread Gavin Shan
There is only one NCSI request property for now: whether the response to
the sent command needs to drive the workqueue or not. So we had one
field (@driven) for that purpose, but lost the flexibility to extend
NCSI request properties.

This replaces @driven with @flags in the NCSI request struct and
@req_flags in the NCSI command argument struct. Each bit of the newly
introduced fields can be used for one property. No functional changes
introduced.
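
A minimal sketch of how a property is now tested (the helper is
hypothetical; the flag value comes from the patch):

static bool example_req_is_event_driven(const struct ncsi_request *nr)
{
	return !!(nr->flags & NCSI_REQ_FLAG_EVENT_DRIVEN);
}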

Signed-off-by: Gavin Shan 
Reviewed-by: Joel Stanley 
---
 net/ncsi/internal.h|  8 +---
 net/ncsi/ncsi-cmd.c|  2 +-
 net/ncsi/ncsi-manage.c | 19 ++-
 net/ncsi/ncsi-rsp.c|  2 +-
 4 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index c956fe8..26e9295 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -207,7 +207,8 @@ struct ncsi_package {
 struct ncsi_request {
unsigned charid;  /* Request ID - 0 to 255   */
bool used;/* Request that has been assigned  */
-   bool driven;  /* Drive state machine */
+   unsigned int flags;   /* NCSI request property   */
+#define NCSI_REQ_FLAG_EVENT_DRIVEN 1
struct ncsi_dev_priv *ndp;/* Associated NCSI device  */
struct sk_buff   *cmd;/* Associated NCSI command packet  */
struct sk_buff   *rsp;/* Associated NCSI response packet */
@@ -276,7 +277,7 @@ struct ncsi_cmd_arg {
unsigned charpackage; /* Destination package ID*/
unsigned charchannel; /* Detination channel ID or 0x1f */
unsigned short   payload; /* Command packet payload length */
-   bool driven;  /* Drive the state machine?  */
+   unsigned int req_flags;   /* NCSI request properties   */
union {
unsigned char  bytes[16]; /* Command packet specific data  */
unsigned short words[8];
@@ -315,7 +316,8 @@ void ncsi_find_package_and_channel(struct ncsi_dev_priv 
*ndp,
   unsigned char id,
   struct ncsi_package **np,
   struct ncsi_channel **nc);
-struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp, bool 
driven);
+struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp,
+   unsigned int req_flags);
 void ncsi_free_request(struct ncsi_request *nr);
 struct ncsi_dev *ncsi_find_dev(struct net_device *dev);
 int ncsi_process_next_channel(struct ncsi_dev_priv *ndp);
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 21057a8..db7083b 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -272,7 +272,7 @@ static struct ncsi_request *ncsi_alloc_command(struct 
ncsi_cmd_arg *nca)
struct sk_buff *skb;
struct ncsi_request *nr;
 
-   nr = ncsi_alloc_request(ndp, nca->driven);
+   nr = ncsi_alloc_request(ndp, nca->req_flags);
if (!nr)
return NULL;
 
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 64aab38..41d6af9 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -186,7 +186,7 @@ static void ncsi_channel_monitor(unsigned long data)
nca.package = np->id;
nca.channel = nc->id;
nca.type = NCSI_PKT_CMD_GLS;
-   nca.driven = false;
+   nca.req_flags = 0;
ret = ncsi_xmit_cmd(&nca);
if (ret) {
netdev_err(ndp->ndev.dev, "Error %d sending GLS\n",
@@ -407,7 +407,8 @@ void ncsi_find_package_and_channel(struct ncsi_dev_priv 
*ndp,
  * be same. Otherwise, the bogus response might be replied. So
  * the available IDs are allocated in round-robin fashion.
  */
-struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp, bool driven)
+struct ncsi_request *ncsi_alloc_request(struct ncsi_dev_priv *ndp,
+   unsigned int req_flags)
 {
struct ncsi_request *nr = NULL;
int i, limit = ARRAY_SIZE(ndp->requests);
@@ -421,7 +422,7 @@ struct ncsi_request *ncsi_alloc_request(struct 
ncsi_dev_priv *ndp, bool driven)
 
nr = &ndp->requests[i];
nr->used = true;
-   nr->driven = driven;
+   nr->flags = req_flags;
ndp->request_id = i + 1;
goto found;
}
@@ -433,7 +434,7 @@ struct ncsi_request *ncsi_alloc_request(struct 
ncsi_dev_priv *ndp, bool driven)
 
nr = &ndp->requests[i];
nr->used = true;
-   nr->driven = driven;
+   nr->flags = req_flags;
ndp->request_id = i + 1;
goto found;
}
@@ -461,7 +462,7 @@ void ncsi_free_request(struct ncsi_request *nr)
nr->cmd = NULL;
nr->rsp = 

[PATCH net-next 4/8] net/ncsi: Rework request index allocation

2016-09-28 Thread Gavin Shan
The NCSI request index (struct ncsi_request::id) is put into the instance
ID (IID) field when sending a NCSI command packet. The design was for the
available IDs to be given out in round-robin fashion. @ndp->request_id was
introduced to represent the next available ID, but it has been used
as the number of successively allocated IDs, which breaks the round-robin
design. Besides, we shouldn't put 0 into the NCSI command packet's IID
field: ID#0 is reserved according to section 6.3.1.1 of the NCSI spec
(v1.1.0).

This fixes the above two issues. With it applied, the available IDs will
be assigned in round-robin fashion and ID#0 won't be assigned.
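
An illustrative model of the allocator after this change (standalone
sketch, not the kernel function): search from the cursor to the ceiling,
then wrap around to index 1 so that ID#0 is never handed out.

static int example_alloc_request_id(bool used[256], unsigned int *cursor)
{
	unsigned int i;

	for (i = *cursor; i < 256; i++)
		if (!used[i])
			goto found;
	for (i = 1 /* NCSI_REQ_START_IDX */; i < *cursor; i++)
		if (!used[i])
			goto found;
	return -1;			/* all IDs currently in flight */
found:
	used[i] = true;
	*cursor = i + 1;		/* next search starts after this ID */
	return i;
}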

Signed-off-by: Gavin Shan 
Reviewed-by: Joel Stanley 
---
 net/ncsi/internal.h|  1 +
 net/ncsi/ncsi-manage.c | 17 +
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 66dc851..c956fe8 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -259,6 +259,7 @@ struct ncsi_dev_priv {
struct list_headpackages;/* List of packages   */
struct ncsi_request requests[256];   /* Request table  */
unsigned intrequest_id;  /* Last used request ID   */
+#define NCSI_REQ_START_IDX 1
unsigned intpending_req_num; /* Number of pending requests */
struct ncsi_package *active_package; /* Currently handled package  */
struct ncsi_channel *active_channel; /* Currently handled channel  */
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 3eeb915..64aab38 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -415,30 +415,31 @@ struct ncsi_request *ncsi_alloc_request(struct 
ncsi_dev_priv *ndp, bool driven)
 
/* Check if there is one available request until the ceiling */
spin_lock_irqsave(&ndp->lock, flags);
-   for (i = ndp->request_id; !nr && i < limit; i++) {
+   for (i = ndp->request_id; i < limit; i++) {
if (ndp->requests[i].used)
continue;
 
nr = &ndp->requests[i];
nr->used = true;
nr->driven = driven;
-   if (++ndp->request_id >= limit)
-   ndp->request_id = 0;
+   ndp->request_id = i + 1;
+   goto found;
}
 
/* Fail back to check from the starting cursor */
-   for (i = 0; !nr && i < ndp->request_id; i++) {
+   for (i = NCSI_REQ_START_IDX; i < ndp->request_id; i++) {
if (ndp->requests[i].used)
continue;
 
nr = &ndp->requests[i];
nr->used = true;
nr->driven = driven;
-   if (++ndp->request_id >= limit)
-   ndp->request_id = 0;
+   ndp->request_id = i + 1;
+   goto found;
}
-   spin_unlock_irqrestore(&ndp->lock, flags);
 
+found:
+   spin_unlock_irqrestore(&ndp->lock, flags);
return nr;
 }
 
@@ -1122,7 +1123,7 @@ struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
/* Initialize private NCSI device */
spin_lock_init(&ndp->lock);
INIT_LIST_HEAD(&ndp->packages);
-   ndp->request_id = 0;
+   ndp->request_id = NCSI_REQ_START_IDX;
for (i = 0; i < ARRAY_SIZE(ndp->requests); i++) {
ndp->requests[i].id = i;
ndp->requests[i].ndp = ndp;
-- 
2.1.0



[PATCH net-next 3/8] net/ncsi: Don't probe on the reserved channel ID (0x1f)

2016-09-28 Thread Gavin Shan
We needn't send the CIS (Clear Initial State) command to the NCSI
reserved channel (0x1f) during enumeration, as we would never receive
a valid response to CIS on NCSI channel 0x1f.

Signed-off-by: Gavin Shan 
Reviewed-by: Joel Stanley 
---
 net/ncsi/ncsi-manage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 73883ed..3eeb915 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -887,12 +887,12 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
nd->state = ncsi_dev_state_probe_cis;
break;
case ncsi_dev_state_probe_cis:
-   ndp->pending_req_num = 32;
+   ndp->pending_req_num = NCSI_RESERVED_CHANNEL;
 
/* Clear initial state */
nca.type = NCSI_PKT_CMD_CIS;
nca.package = ndp->active_package->id;
-   for (index = 0; index < 0x20; index++) {
+   for (index = 0; index < NCSI_RESERVED_CHANNEL; index++) {
nca.channel = index;
ret = ncsi_xmit_cmd(&nca);
if (ret)
-- 
2.1.0



[PATCH net-next 2/8] net/ncsi: Introduce NCSI_RESERVED_CHANNEL

2016-09-28 Thread Gavin Shan
This defines NCSI_RESERVED_CHANNEL as the reserved NCSI channel
ID (0x1f). No logical changes introduced.

Signed-off-by: Gavin Shan 
Reviewed-by: Joel Stanley 
---
 net/ncsi/internal.h|  1 +
 net/ncsi/ncsi-manage.c | 14 +++---
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 33738c0..66dc851 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -170,6 +170,7 @@ struct ncsi_package;
 
 #define NCSI_PACKAGE_SHIFT 5
 #define NCSI_PACKAGE_INDEX(c)  (((c) >> NCSI_PACKAGE_SHIFT) & 0x7)
+#define NCSI_RESERVED_CHANNEL  0x1f
 #define NCSI_CHANNEL_INDEX(c)  ((c) & ((1 << NCSI_PACKAGE_SHIFT) - 1))
 #define NCSI_TO_CHANNEL(p, c)  (((p) << NCSI_PACKAGE_SHIFT) | (c))
 
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index fc0ae54..73883ed 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -529,7 +529,7 @@ static void ncsi_suspend_channel(struct ncsi_dev_priv *ndp)
nca.package = np->id;
if (nd->state == ncsi_dev_state_suspend_select) {
nca.type = NCSI_PKT_CMD_SP;
-   nca.channel = 0x1f;
+   nca.channel = NCSI_RESERVED_CHANNEL;
if (ndp->flags & NCSI_DEV_HWA)
nca.bytes[0] = 0;
else
@@ -546,7 +546,7 @@ static void ncsi_suspend_channel(struct ncsi_dev_priv *ndp)
nd->state = ncsi_dev_state_suspend_deselect;
} else if (nd->state == ncsi_dev_state_suspend_deselect) {
nca.type = NCSI_PKT_CMD_DP;
-   nca.channel = 0x1f;
+   nca.channel = NCSI_RESERVED_CHANNEL;
nd->state = ncsi_dev_state_suspend_done;
}
 
@@ -592,7 +592,7 @@ static void ncsi_configure_channel(struct ncsi_dev_priv 
*ndp)
else
nca.bytes[0] = 1;
nca.package = np->id;
-   nca.channel = 0x1f;
+   nca.channel = NCSI_RESERVED_CHANNEL;
ret = ncsi_xmit_cmd(&nca);
if (ret)
goto error;
@@ -810,7 +810,7 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
 
/* Deselect all possible packages */
nca.type = NCSI_PKT_CMD_DP;
-   nca.channel = 0x1f;
+   nca.channel = NCSI_RESERVED_CHANNEL;
for (index = 0; index < 8; index++) {
nca.package = index;
ret = ncsi_xmit_cmd(&nca);
@@ -826,7 +826,7 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
/* Select all possible packages */
nca.type = NCSI_PKT_CMD_SP;
nca.bytes[0] = 1;
-   nca.channel = 0x1f;
+   nca.channel = NCSI_RESERVED_CHANNEL;
for (index = 0; index < 8; index++) {
nca.package = index;
ret = ncsi_xmit_cmd(&nca);
@@ -879,7 +879,7 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
nca.type = NCSI_PKT_CMD_SP;
nca.bytes[0] = 1;
nca.package = ndp->active_package->id;
-   nca.channel = 0x1f;
+   nca.channel = NCSI_RESERVED_CHANNEL;
ret = ncsi_xmit_cmd(&nca);
if (ret)
goto error;
@@ -936,7 +936,7 @@ static void ncsi_probe_channel(struct ncsi_dev_priv *ndp)
/* Deselect the active package */
nca.type = NCSI_PKT_CMD_DP;
nca.package = ndp->active_package->id;
-   nca.channel = 0x1f;
+   nca.channel = NCSI_RESERVED_CHANNEL;
ret = ncsi_xmit_cmd(&nca);
if (ret)
goto error;
-- 
2.1.0



Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-28 Thread Eric Dumazet
On Wed, 2016-09-28 at 20:54 -0700, Tom Herbert wrote:
> xps_flows maintains a per device flow table that is indexed by the
> skbuff hash. The table is only consulted when there is no queue saved in
> a transmit socket for an skbuff.
> 
> Each entry in the flow table contains a queue index and a queue
> pointer. The queue pointer is set when a queue is chosen using a
> flow table entry. This pointer is set to the head pointer in the
> transmit queue (which is maintained by BQL).
> 
> The new function get_xps_flows_index looks up flows in the
> xps_flows table. The entry returned gives the last queue a matching flow
> used. The returned queue is compared against the normal XPS queue. If
> they are different, then we only switch if the tail pointer in the TX
> queue has advanced past the pointer saved in the entry. In this
> way OOO should be avoided when XPS wants to use a different queue.

There is something I do not understand with this.

If this OOO avoidance is tied to BQL, it means that packets sitting in a
qdisc won't be part of the detection.

So packets of flow X could possibly be queued on multiple qdiscs.





[PATCH v2 net-next] mlx5: Add ndo_poll_controller() implementation

2016-09-28 Thread Calvin Owens
This implements ndo_poll_controller in net_device_ops callbacks for mlx5,
which is necessary to use netconsole with this driver.

Acked-By: Saeed Mahameed 
Signed-off-by: Calvin Owens 
---
Changes in v2:
* Only iterate channels to avoid redundant napi_schedule() calls

drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b58cfe3..7eaf380 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3188,6 +3188,20 @@ static int mlx5e_xdp(struct net_device *dev, struct 
netdev_xdp *xdp)
}
 }
 
+#ifdef CONFIG_NET_POLL_CONTROLLER
+/* Fake "interrupt" called by netpoll (eg netconsole) to send skbs without
+ * reenabling interrupts.
+ */
+static void mlx5e_netpoll(struct net_device *dev)
+{
+   struct mlx5e_priv *priv = netdev_priv(dev);
+   int i;
+
+   for (i = 0; i < priv->params.num_channels; i++)
+   napi_schedule(&priv->channel[i]->napi);
+}
+#endif
+
 static const struct net_device_ops mlx5e_netdev_ops_basic = {
.ndo_open= mlx5e_open,
.ndo_stop= mlx5e_close,
@@ -3208,6 +3222,9 @@ static const struct net_device_ops mlx5e_netdev_ops_basic 
= {
 #endif
.ndo_tx_timeout  = mlx5e_tx_timeout,
.ndo_xdp = mlx5e_xdp,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+   .ndo_poll_controller = mlx5e_netpoll,
+#endif
 };
 
 static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -3240,6 +3257,9 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov 
= {
.ndo_get_vf_stats= mlx5e_get_vf_stats,
.ndo_tx_timeout  = mlx5e_tx_timeout,
.ndo_xdp = mlx5e_xdp,
+#ifdef CONFIG_NET_POLL_CONTROLLER
+   .ndo_poll_controller = mlx5e_netpoll,
+#endif
 };
 
 static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
-- 
2.9.3



Re: Explaining RX-stages for XDP

2016-09-28 Thread Alexei Starovoitov
On Wed, Sep 28, 2016 at 12:44:31PM +0200, Jesper Dangaard Brouer wrote:
> 
> The idea is quite different.  It has nothing to do with Edward's
> proposal[3].  The RX packet-vector is simply an array, either of pointers
> or index numbers (into the RX-ring).  The needed changes are completely
> contained inside the driver.
> 
> > As far as intermixed XDP vs stack traffic, I think for DoS case the
> > traffic patterns are binary. Either all of it is good or under attack
> > most of the traffic is bad, so makes sense to optimize for these two.
> > 50/50 case I think is artificial and not worth optimizing for.
> 
> Sorry, but I feel you have completely misunderstood the concept of my
> idea.  It does not matter what traffic pattern you believe or don't
> believe in, it is irrelevant.  The fact is that intermixed traffic is
> possible with the current solution.  The core of my idea is to remove
> the possibility for this intermixed traffic to occur, simply by seeing
> XDP as a RX-stage before the stack.

Is the idea to add two extra 'to_stack' and 'not_to_stack' arrays
after XDP program made a decision ?
Sounds interesting, but I cannot evaluate it properly, since I fail
to see what problem it solves and how it can be benchmarked.
Since you're saying the percentage of good vs bad traffic is irrelevant
then this idea should help all types of traffic in all cases?
That would be truly awesome.

I think the discussion around performance optimizations
should start from:
1. describing the problem: What is the use case?
What is the traffic pattern?
After problem is understood and accepted as a valid problem
and not some research subject then
2. define a benchmark that simulates it
when benchmarking methodology is agreed upon
we can go the next step of
3. come up with multiple ideas/proposals to solve the problem 1.
and finally discuss
4. different implementations of the best proposal from step 3

Without 1 it's easy to come up with a fake benchmark in step 2
that can justify addition of any code in 4
Without 2, the comparison of different proposals will be
based on subjective opinions instead of hard data.
Without 3 the proposed patch will be a hard sell, since
other alternatives were not on the table.

For new features the steps are different, of course.

When we started on XDP among everything else the first two problems
that it's supposed to solve were DoS and loadbalancer/ila_router.
For former the attack traffic is mostly dropped after parsing.
That behavior is simulated by samples/bpf/xdp1_kern.c
For load balancer most of the traffic is parsed and transmitted
out on the same device after packet rewrite. That scenario
is simulated by samples/bpf/xdp2_kern.c
Therefore the benchmarking and performance numbers were
centered around these two programs and I believe XDP itself
is a good solution for DoS and ila_router based on
the performance numbers from these two benchmarks.
I think XDP is _not_ a good solution for many other use cases.

The DoS protection feature when not under attack was
not benchmarked, so this is the area to work on in the future.
(the appropriate benchmark is xdp_pass+tcp+netperf)

Another area that was not benchmarked is how XDP drop/tx
interacts with normal control plane traffic.
We assumed that different hw tx queues should provide
sufficient separation. That is certainly to be tested.
A lot of work to do, no doubt.

> > Optimizing xdp for 'mostly good' traffic is indeed a challenge.
> > We'd need all the tricks to make it as good as normal skb-based traffic.
> >
> > I haven't seen any tests yet comparing xdp with 'return XDP_PASS' program
> > vs no xdp at all running netperf tcp/udp in user space. It shouldn't
> > be too far off.
> 
> Well, I did post numbers to the list with a 'return XDP_PASS' program[4]:
>  https://mid.mail-archive.com/netdev@vger.kernel.org/msg122350.html
> 
> Wake-up and smell the coffee, please revise your assumptions:
>  * It showed that the performance reduction is 25.98%!!!
>(AB comparison dropping packets in iptables raw)

sure. iptables drop is slow with xdp_pass vs iptables without xdp.
This is not a benchmark I was interested in, since I don't
understand what use case it simulates.

> Conclusion: These measurements confirm that we need a page recycle
> facility for the drivers before switching to order-0 allocations.
...
> page_pool work
>  - iptables-raw-drop: driver mlx5
>* 4,487,518 pps - baseline-before =>  100.0%
>* 3,624,237 pps - mlx5 order0-patch   => - 19.2% (slower)
>* 4,806,142 pps - PoC page_pool patch =>   +7.1% (faster)

I agree that generic page_pool is a useful facility, but
I don't think it's the right approach to design it based on
iptables_drop+xdp_pass benchmark.
You're saying the protoype gives 7.1% improvement. Ok,
but what is the problem being solved?
If the use case is DoS and fast as possible drop, then XDP_DROP
is a better alternative. Why design/benchmark page_pool
based on iptables_drop ?
If page_pool replaces 

[PATCH v2 net-next 5/5] xps: Documentation for transmit socketless flow steering

2016-09-28 Thread Tom Herbert
Signed-off-by: Tom Herbert 
---
 Documentation/ABI/testing/sysfs-class-net |  8 
 Documentation/networking/scaling.txt  | 26 ++
 2 files changed, 34 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net 
b/Documentation/ABI/testing/sysfs-class-net
index 668604f..0d2a6a9 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -251,3 +251,11 @@ Contact:   netdev@vger.kernel.org
 Description:
Indicates the unique physical switch identifier of a switch this
port belongs to, as a string.
+
+What:  /sys/class/net//xps_dev_flow_table_cnt
+Date:  October 2016
+KernelVersion: 4.8
+Contact:   netdev@vger.kernel.org
+Description:
+   Indicates the number of entries in the XPS socketless flow
+   table.
diff --git a/Documentation/networking/scaling.txt 
b/Documentation/networking/scaling.txt
index 59f4db2..7f9590c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,6 +400,26 @@ transport layer is responsible for setting ooo_okay 
appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+XPS: Transmit Socketless Flow Steering
+==
+
+Not all flows have an associated socket in which the transmit queue and
+ooo information can be saved. For instance, packets being transmitted over
+OVS for VMs do not have a socket that is visible to the kernel. To allow
+XPS to be used effectively a flow hash table is employed to save queue and
+ooo information for such socketless flows.
+
+The idea is that transmit queues are saved for socketless flows in a
+flow table that is indexed by skbuff hash. The flow table entries have
+two values: the queue_index and the head cnt of packets from the TX
+queue. When a packet is being sent without an associated socket the
+hash table is consulted which returns the queue_index to use. The
+returned queue is compared to the queue that would be selected by XPS.
+If they are different, the XPS queue is selected only if the tail cnt
+in the TX queue advances beyond the recorded head cnt. This checks if
+sending the packet on the new queue could cause ooo packets for the
+flow.
+
  XPS Configuration
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
@@ -409,6 +429,12 @@ queue is configured using the sysfs file entry:
 
 /sys/class/net//queues/tx-/xps_cpus
 
+Transmit socketless flow steering is enabled by setting the number of
+entries in the XPS flow table. This is configured in the sysfs file
+entry:
+
+/sys/class/net//xps_dev_flow_table_cnt
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
-- 
2.9.3



[PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-28 Thread Tom Herbert
xps_flows maintains a per device flow table that is indexed by the
skbuff hash. The table is only consulted when there is no queue saved in
a transmit socket for an skbuff.

Each entry in the flow table contains a queue index and a queue
pointer. The queue pointer is set when a queue is chosen using a
flow table entry. This pointer is set to the head pointer in the
transmit queue (which is maintained by BQL).

The new function get_xps_flows_index looks up flows in the
xps_flows table. The entry returned gives the last queue a matching flow
used. The returned queue is compared against the normal XPS queue. If
they are different, then we only switch if the tail pointer in the TX
queue has advanced past the pointer saved in the entry. In this
way OOO should be avoided when XPS wants to use a different queue.
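
A hedged sketch of the core check (the helper is hypothetical; the dql
counter names come from patch 2/5):

/* A flow may move to a different queue only once the old queue's
 * completion counter has passed the value recorded when the flow last
 * queued a packet there, i.e. nothing of the flow is still in flight.
 */
static bool example_can_switch_queue(const struct netdev_queue *old_txq,
				     unsigned int recorded_head_cnt)
{
	return (int)(old_txq->dql.num_completed_ops - recorded_head_cnt) >= 0;
}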

Signed-off-by: Tom Herbert 
---
 net/Kconfig|  6 
 net/core/dev.c | 87 +-
 2 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..f77fad1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -255,6 +255,12 @@ config XPS
depends on SMP
default y
 
+config XPS_FLOWS
+   bool
+   depends on XPS
+   depends on BQL
+   default y
+
 config HWBM
bool
 
diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f..1ca08b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3210,6 +3210,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct 
net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
+/* Must be called with RCU read_lock */
 static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
@@ -3217,7 +3218,6 @@ static inline int get_xps_queue(struct net_device *dev, 
struct sk_buff *skb)
struct xps_map *map;
int queue_index = -1;
 
-   rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_maps);
if (dev_maps) {
map = rcu_dereference(
@@ -3228,15 +3228,62 @@ static inline int get_xps_queue(struct net_device *dev, 
struct sk_buff *skb)
else
queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
   map->len)];
-   if (unlikely(queue_index >= dev->real_num_tx_queues))
-   queue_index = -1;
+   if (queue_index >= 0 &&
+   likely(queue_index < dev->real_num_tx_queues))
+   return queue_index;
}
}
-   rcu_read_unlock();
+#endif
+   return skb_tx_hash(dev, skb);
+}
+
+/* Must be called with RCU read_lock */
+static int get_xps_flows_index(struct net_device *dev, struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS_FLOWS
+   struct xps_dev_flow_table *flow_table;
+   struct xps_dev_flow ent;
+   int queue_index;
+   struct netdev_queue *txq;
+   u32 hash;
+
+   queue_index = get_xps_queue(dev, skb);
+   if (queue_index < 0)
+   return -1;
+
+   flow_table = rcu_dereference(dev->xps_flow_table);
+   if (!flow_table)
+   return queue_index;
+
+   hash = skb_get_hash(skb);
+   if (!hash)
+   return queue_index;
+
+   ent.v64 = flow_table->flows[hash & flow_table->mask].v64;
+
+   if (queue_index != ent.queue_index &&
+   ent.queue_index >= 0 &&
+   ent.queue_index < dev->real_num_tx_queues) {
+   txq = netdev_get_tx_queue(dev, ent.queue_index);
+   if ((int)(txq->dql.num_completed_ops - ent.queue_ptr) < 0)  {
+   /* The current queue's tail has not advanced beyond the
+* last packet that was enqueued using the table entry.
+* We can't change queues without risking OOO. Stick
+* with the queue listed in the flow table.
+*/
+   queue_index = ent.queue_index;
+   }
+   }
+
+   /* Save the updated entry */
+   txq = netdev_get_tx_queue(dev, queue_index);
+   ent.queue_index = queue_index;
+   ent.queue_ptr = txq->dql.num_enqueue_ops;
+   flow_table->flows[hash & flow_table->mask].v64 = ent.v64;
 
return queue_index;
 #else
-   return -1;
+   return get_xps_queue(dev, skb);
 #endif
 }
 
@@ -3244,22 +3291,24 @@ static u16 __netdev_pick_tx(struct net_device *dev, 
struct sk_buff *skb)
 {
struct sock *sk = skb->sk;
int queue_index = sk_tx_queue_get(sk);
-
-   if (queue_index < 0 || skb->ooo_okay ||
-   queue_index >= dev->real_num_tx_queues) {
-   int new_index = get_xps_queue(dev, skb);
-   if (new_index < 0)
-   new_index = skb_tx_hash(dev, skb);
-
-   if (queue_index != new_index && sk &&
-

[PATCH v2 net-next 1/5] net: Set SW hash in skb_set_hash_from_sk

2016-09-28 Thread Tom Herbert
Use the __skb_set_sw_hash to set the hash in an skbuff from the socket
txhash.

Signed-off-by: Tom Herbert 
---
 include/net/sock.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ebf75db..17d379a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1920,10 +1920,8 @@ static inline void sock_poll_wait(struct file *filp,
 
 static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
 {
-   if (sk->sk_txhash) {
-   skb->l4_hash = 1;
-   skb->hash = sk->sk_txhash;
-   }
+   if (sk->sk_txhash)
+   __skb_set_sw_hash(skb, sk->sk_txhash, true);
 }
 
 void skb_set_owner_w(struct sk_buff *skb, struct sock *sk);
-- 
2.9.3



[PATCH v2 net-next 2/5] dql: Add counters for number of queuing and completion operations

2016-09-28 Thread Tom Herbert
Add two new counters to struct dql: num_enqueue_ops and num_completed_ops.
num_enqueue_ops is incremented by one in each call to dql_queued.
num_completed_ops is incremented in dql_completed, which takes a new
argument indicating the number of operations completed. These counters
are only intended for statistics and do not impact the BQL algorithm.

We add a new sysfs entry in byte_queue_limits named inflight_pkts.
This provides the number of packets in flight for the queue, computed as
dql->num_enqueue_ops - dql->num_completed_ops.
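
A small sketch of the relationship (field names as defined in the patch;
the helper itself is hypothetical):

/* Packets in flight = enqueue operations minus completion operations. */
static unsigned int example_inflight_pkts(const struct dql *dql)
{
	return dql->num_enqueue_ops - dql->num_completed_ops;
}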

Signed-off-by: Tom Herbert 
---
 include/linux/dynamic_queue_limits.h |  7 ++-
 include/linux/netdevice.h|  2 +-
 lib/dynamic_queue_limits.c   |  3 ++-
 net/core/net-sysfs.c | 14 ++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/dynamic_queue_limits.h 
b/include/linux/dynamic_queue_limits.h
index a4be703..b6a4804 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -43,6 +43,8 @@ struct dql {
unsigned intadj_limit;  /* limit + num_completed */
unsigned intlast_obj_cnt;   /* Count at last queuing */
 
+   unsigned intnum_enqueue_ops;/* Number of queue operations */
+
/* Fields accessed only by completion path (dql_completed) */
 
unsigned intlimit cacheline_aligned_in_smp; /* Current limit */
@@ -55,6 +57,8 @@ struct dql {
unsigned intlowest_slack;   /* Lowest slack found */
unsigned long   slack_start_time;   /* Time slacks seen */
 
+   unsigned intnum_completed_ops;  /* Number of complete ops */
+
/* Configuration */
unsigned intmax_limit;  /* Max limit */
unsigned intmin_limit;  /* Minimum limit */
@@ -83,6 +87,7 @@ static inline void dql_queued(struct dql *dql, unsigned int 
count)
barrier();
 
dql->num_queued += count;
+   dql->num_enqueue_ops++;
 }
 
 /* Returns how many objects can be queued, < 0 indicates over limit. */
@@ -92,7 +97,7 @@ static inline int dql_avail(const struct dql *dql)
 }
 
 /* Record number of completed objects and recalculate the limit. */
-void dql_completed(struct dql *dql, unsigned int count);
+void dql_completed(struct dql *dql, unsigned int count, unsigned int ops);
 
 /* Reset dql state */
 void dql_reset(struct dql *dql);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 136ae6bb..9567107 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3015,7 +3015,7 @@ static inline void netdev_tx_completed_queue(struct 
netdev_queue *dev_queue,
if (unlikely(!bytes))
return;
 
-   dql_completed(&dev_queue->dql, bytes);
+   dql_completed(&dev_queue->dql, bytes, pkts);
 
/*
 * Without the memory barrier there is a small possiblity that
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index f346715..d5e7a27 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -14,7 +14,7 @@
 #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
 
 /* Records completed count and recalculates the queue limit */
-void dql_completed(struct dql *dql, unsigned int count)
+void dql_completed(struct dql *dql, unsigned int count, unsigned int ops)
 {
unsigned int inprogress, prev_inprogress, limit;
unsigned int ovlimit, completed, num_queued;
@@ -108,6 +108,7 @@ void dql_completed(struct dql *dql, unsigned int count)
dql->prev_ovlimit = ovlimit;
dql->prev_last_obj_cnt = dql->last_obj_cnt;
dql->num_completed = completed;
+   dql->num_completed_ops += ops;
dql->prev_num_queued = num_queued;
 }
 EXPORT_SYMBOL(dql_completed);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 6e4f347..ab7b0b6 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1147,6 +1147,19 @@ static ssize_t bql_show_inflight(struct netdev_queue 
*queue,
 static struct netdev_queue_attribute bql_inflight_attribute =
__ATTR(inflight, S_IRUGO, bql_show_inflight, NULL);
 
+static ssize_t bql_show_inflight_pkts(struct netdev_queue *queue,
+ struct netdev_queue_attribute *attr,
+ char *buf)
+{
+   struct dql *dql = &queue->dql;
+
+   return sprintf(buf, "%u\n",
+  dql->num_enqueue_ops - dql->num_completed_ops);
+}
+
+static struct netdev_queue_attribute bql_inflight_pkts_attribute =
+   __ATTR(inflight_pkts, S_IRUGO, bql_show_inflight_pkts, NULL);
+
 #define BQL_ATTR(NAME, FIELD)  \
 static ssize_t bql_show_ ## NAME(struct netdev_queue *queue,   \
 struct netdev_queue_attribute *attr,   \
@@ -1176,6 +1189,7 @@ static struct attribute *dql_attrs[] = {
&bql_limit_min_attribute.attr,

[PATCH v2 net-next 3/5] net: Add xps_dev_flow_table_cnt

2016-09-28 Thread Tom Herbert
Add infrastructure and definitions to create XPS flow tables. This
creates the new sysfs entry /sys/class/net/eth*/xps_dev_flow_table_cnt.

Signed-off-by: Tom Herbert 
---
 include/linux/netdevice.h | 24 +
 net/core/net-sysfs.c  | 89 +++
 2 files changed, 113 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9567107..60063b3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -736,6 +736,27 @@ struct xps_dev_maps {
 (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */
 
+#ifdef CONFIG_XPS_FLOWS
+struct xps_dev_flow {
+   union {
+   u64 v64;
+   struct {
+   int queue_index;
+   unsigned intqueue_ptr;
+   };
+   };
+};
+
+struct xps_dev_flow_table {
+   unsigned int mask;
+   struct rcu_head rcu;
+   struct xps_dev_flow flows[0];
+};
+#define XPS_DEV_FLOW_TABLE_SIZE(_num) (sizeof(struct xps_dev_flow_table) + \
+   ((_num) * sizeof(struct xps_dev_flow)))
+
+#endif /* CONFIG_XPS_FLOWS */
+
 #define TC_MAX_QUEUE   16
 #define TC_BITMASK 15
 /* HW offloaded queuing disciplines txq count and offset maps */
@@ -1825,6 +1846,9 @@ struct net_device {
 #ifdef CONFIG_XPS
struct xps_dev_maps __rcu *xps_maps;
 #endif
+#ifdef CONFIG_XPS_FLOWS
+   struct xps_dev_flow_table __rcu *xps_flow_table;
+#endif
 #ifdef CONFIG_NET_CLS_ACT
struct tcf_proto __rcu  *egress_cl_list;
 #endif
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index ab7b0b6..0d00b9c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -503,6 +503,92 @@ static ssize_t phys_switch_id_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(phys_switch_id);
 
+#ifdef CONFIG_XPS_FLOWS
+static void xps_dev_flow_table_release(struct rcu_head *rcu)
+{
+   struct xps_dev_flow_table *table = container_of(rcu,
+   struct xps_dev_flow_table, rcu);
+   vfree(table);
+}
+
+static int change_xps_dev_flow_table_cnt(struct net_device *dev,
+unsigned long count)
+{
+   unsigned long mask;
+   struct xps_dev_flow_table *table, *old_table;
+   static DEFINE_SPINLOCK(xps_dev_flow_lock);
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (count) {
+   mask = count - 1;
+   /* mask = roundup_pow_of_two(count) - 1;
+* without overflows...
+*/
+   while ((mask | (mask >> 1)) != mask)
+   mask |= (mask >> 1);
+   /* On 64 bit arches, must check mask fits in table->mask (u32),
+* and on 32bit arches, must check
+* XPS_DEV_FLOW_TABLE_SIZE(mask + 1) doesn't overflow.
+*/
+#if BITS_PER_LONG > 32
+   if (mask > (unsigned long)(u32)mask)
+   return -EINVAL;
+#else
+   if (mask > (ULONG_MAX - XPS_DEV_FLOW_TABLE_SIZE(1))
+   / sizeof(struct xps_dev_flow)) {
+   /* Enforce a limit to prevent overflow */
+   return -EINVAL;
+   }
+#endif
+   table = vmalloc(XPS_DEV_FLOW_TABLE_SIZE(mask + 1));
+   if (!table)
+   return -ENOMEM;
+
+   table->mask = mask;
+   for (count = 0; count <= mask; count++)
+   table->flows[count].queue_index = -1;
+   } else
+   table = NULL;
+
+   spin_lock(&xps_dev_flow_lock);
+   old_table = rcu_dereference_protected(dev->xps_flow_table,
+   lockdep_is_held(&xps_dev_flow_lock));
+   rcu_assign_pointer(dev->xps_flow_table, table);
+   spin_unlock(&xps_dev_flow_lock);
+
+   if (old_table)
+   call_rcu(&old_table->rcu, xps_dev_flow_table_release);
+
+   return 0;
+}
+
+static ssize_t xps_dev_flow_table_cnt_store(struct device *dev,
+   struct device_attribute *attr,
+   const char *buf, size_t len)
+{
+   return netdev_store(dev, attr, buf, len, change_xps_dev_flow_table_cnt);
+}
+
+static ssize_t xps_dev_flow_table_cnt_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct net_device *netdev = to_net_dev(dev);
+   struct xps_dev_flow_table *table;
+   unsigned int cnt = 0;
+
+   rcu_read_lock();
+   table = rcu_dereference(netdev->xps_flow_table);
+   if (table)
+   cnt = table->mask + 1;
+   rcu_read_unlock();
+
+   return sprintf(buf, fmt_dec, cnt);
+}
+DEVICE_ATTR_RW(xps_dev_flow_table_cnt);
+#endif /* CONFIG_XPS_FLOWS */
+
 static struct attribute *net_class_attrs[] = {
&dev_attr_netdev_group.attr,
&dev_attr_type.attr,
@@ -531,6 +617,9 

[PATCH v2 net-next 0/5] xps_flows: XPS flow steering when there is no socket

2016-09-28 Thread Tom Herbert
This patch set introduces transmit flow steering for socketless packets.
The idea is that we record the transmit queues in a flow table that is
indexed by skbuff hash.  The flow table entries have two values: the
queue_index and the head cnt of packets from the TX queue. We only allow
a queue to change for a flow if the tail cnt in the TX queue advances
beyond the recorded head cnt. That is the condition that should indicate
that all outstanding packets for the flow have completed transmission so
the queue can change.
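
To make that concrete, here is a minimal sketch of the steering check (this
is not the code in the patches; the xps_dev_flow fields follow patch 3, and
txq_enqueue_ops()/txq_complete_ops() are assumed stand-ins for the DQL
counters added in patch 2, with counter wrap-around handling omitted):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Assumed helpers standing in for dql num_enqueue_ops/num_completed_ops. */
static unsigned int txq_enqueue_ops(struct net_device *dev, int queue);
static unsigned int txq_complete_ops(struct net_device *dev, int queue);

/* Sketch: pick a TX queue for a socketless skb via the flow table. */
static int xps_flow_pick_queue(struct net_device *dev, struct sk_buff *skb)
{
	struct xps_dev_flow_table *table;
	struct xps_dev_flow *flow;
	int queue = -1;

	rcu_read_lock();
	table = rcu_dereference(dev->xps_flow_table);
	if (table) {
		flow = &table->flows[skb_get_hash(skb) & table->mask];
		queue = flow->queue_index;

		/* Only move the flow once every packet recorded for it has
		 * completed, i.e. the completion count has caught up with
		 * the enqueue count stored last time.
		 */
		if (queue < 0 ||
		    txq_complete_ops(dev, queue) >= flow->queue_ptr) {
			queue = skb_tx_hash(dev, skb);
			flow->queue_index = queue;
		}
		flow->queue_ptr = txq_enqueue_ops(dev, queue);
	}
	rcu_read_unlock();

	return queue;
}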

Tracking the inflight queue is performed as part of DQL. Two fields are
added to the dql structure: num_enqueue_ops and num_completed_ops.
num_enqueue_ops incremented in dql_queued and num_completed_ops is
incremented in dql_completed by the number of operations completed (a
new argument to the function).

This patch set creates /sys/class/net/eth*/xps_dev_flow_table_cnt,
which sets the number of entries in the XPS flow table.

Note that the functionality here is technically best effort (for
instance we don't obtain a lock while processing a flow table entry).
Under high load it is possible that OOO packets can still be generated
due to XPS if two threads are hammering on the same flow table entry.
The assumption of these patches is that OOO packets are not the end of
the world and these should prevent OOO in most common use cases with
XPS.

This is a followup to previous RFC version. Fixes from RFC are:

  - Move counters to DQL
  - Fixed typo
  - Simplified get flow index function
  - Fixed sysfs flow_table_cnt to properly use DEVICE_ATTR_RW
  - Renamed the mechanism

V2:
  - Added documentation in scaling.txt and sysfs documentation
  - Call skb_tx_hash directly from get_xps_queue. This allows
the socketless transmit flow steering to work properly if
a flow is bouncing between non-XPS and XPS CPUS. (suggested
by Alexander Duyck).
  - Added a whole bunch of test results provided by Rick Jones
(Thanks Rick!)

Tested:
  Manually forced all packets to go through the xps_flows path.
  Observed that some flows were deferred to change queues because
  packets were in flight within the flow bucket.

Testing done by Rick Jones:

  Here is a quick look at performance tests for the result of trying the
  prototype fix for the packet reordering problem with VMs sending over
  an XPS-configured NIC.  In particular, the Emulex/Avago/Broadcom
  Skyhawk.  The fix was applied to a 4.4 kernel.

  Before: 3884 Mbit/s
  After: 8897 Mbit/s

  That was from a VM on a node with a Skyhawk and 2 E5-2640 processors
  to baremetal E5-2640 with a BE3.  Physical MTU was 1500, the VM's
  vNIC's MTU was 1400.  Systems were HPE ProLiants in OS Control Mode
  for power management, with the "performance" frequency governor
  loaded. An OpenStack Mitaka setup with Distributed Virtual Router.

  We had some other NIC types in the setup as well.  XPS was also
  enabled on the ConnectX3-Pro.  It was not enabled on the 82599ES (a
  function of the kernel being used, which had it disabled from the
  first reports of XPS negatively affecting VM traffic at the beginning
  of the year)

  Average Mbit/s From NIC type To Bare Metal BE3:

   NIC Type,
   CPU on VM Host               Before    After
   ---------------------------------------------
   ConnectX-3 Pro, E5-2670v3      9224     9271
   BE3, E5-2640                   9016     9022
   82599, E5-2640                 9192     9003
   BCM57840, E5-2640              9213     9153
   Skyhawk, E5-2640               3884     8897

  For completeness:
  Average Mbit/s To NIC type from Bare Metal BE3:

  NIC Type,
  CPU on VM Host               Before    After
  ---------------------------------------------
  ConnectX-3 Pro, E5-2670v3      9322     9144
  BE3, E5-2640                   9074     9017
  82599, E5-2640                 8670     8564
  BCM57840, E5-2640              2468 *   7979
  Skyhawk, E5-2640               8897     9269

  * This is the Busted bnx2x NIC FW GRO implementation issue.  It was
not visible in the "After" because the system was setup to disable
the NIC FW GRO by the time it booted on the fix kernel.

  Average Transactions/s Between NIC type and Bare Metal BE3:

  NIC Type,
  CPU on VM Host               Before    After
  ---------------------------------------------
  ConnectX-3 Pro, E5-2670v3     12421    12612
  BE3, E5-2640                   8178     8484
  82599, E5-2640                 8499     8549
  BCM57840, E5-2640              8544     8560
  Skyhawk, E5-2640               8537     8701

Tom Herbert (5):
  net: Set SW hash in skb_set_hash_from_sk
  dql: Add counters for number of queuing and completion operations
  net: Add xps_dev_flow_table_cnt
  xps_flows: XPS for packets that don't have a socket
  xps: Documentation for transmit socketless flow steering

 Documentation/ABI/testing/sysfs-class-net |   8 +++
 Documentation/networking/scaling.txt  |  26 
 include/linux/dynamic_queue_limits.h 

Re: [PATCH net v2] L2TP:Adjust intf MTU,factor underlay L3,overlay L2

2016-09-28 Thread R. Parameswaran


Hi David,

Please see inline:

On Wed, 28 Sep 2016, David Miller wrote:

> From: "R. Parameswaran" 
> Date: Tue, 27 Sep 2016 12:17:21 -0700 (PDT)
> 
> > Later, in vxlan_dev_configure(), called from vxlan_dev_create(), it gets 
> > adjusted to account for the headers:
> > 
> > vxlan_dev_configure():
> > ...
> >  if (!conf->mtu)
> > dev->mtu = lowerdev->mtu - (use_ipv6 ? 
> > VXLAN6_HEADROOM : VXLAN_HEADROOM);
> > 
> > 
> > where VXLAN_HEADROOM is defined as follows: 
> > 
> > /* IP header + UDP + VXLAN + Ethernet header */
> > #define VXLAN_HEADROOM (20 + 8 + 8 + 14)
> > /* IPv6 header + UDP + VXLAN + Ethernet header */
> > #define VXLAN6_HEADROOM (40 + 8 + 8 + 14)
> 
> Right but I don't see it going through the effort to make use of the
> PMTU like you are.
> 
> I have another strong concern related to this.  There seems to be no
> mechanism used to propagate any PMTU events into the device's MTU.
> 
> Because if there is a limiting nexthop in the route to the other end
> of the UDP tunnel, you won't learn the PMTU until you (or some other
> entity on the machine) actually starts sending traffic to the tunnel's
> endpoint.
> 
> If the PMTU events aren't propagated into the tunnel's MTU or similar
> I think this is an ad-hoc solution.
> 
> I would suggest that you either:
> 
> 1) Do what VXLAN appears to do and ignore the PMTU
> 

I'd like to point out one difference with VXLAN - in VXLAN, the 
local physical interface is directly specified at the time of 
creation of the tunnel, and the data structure seems to have the ifindex 
of the local interface with which it is able to directly pull up the 
underlay interface device. Whereas in L2TP, we only have the IP
address of the remote tunnel end-point and thus only the socket and the 
dst from which we need to derive this. 

Also, dst_mtu references dst->ops->mtu, which if I followed the pointer
chain correctly, will dereference to ipv4_mtu() (for the IPv4 case, as
an example). The code in ipv4_mtu looks like the following:

ipv4_mtu():

unsigned int mtu = rt->rt_pmtu;

if (!mtu || time_after_eq(jiffies, rt->dst.expires))
mtu = dst_metric_raw(dst, RTAX_MTU);

if (mtu)
return mtu;

mtu = dst->dev->mtu;

if (unlikely(dst_metric_locked(dst, RTAX_MTU))) {
if (rt->rt_uses_gateway && mtu > 576)
mtu = 576;
}

return min_t(unsigned int, mtu, IP_MAX_MTU);

The code above does not depend on PMTU discovery working. If no
PMTU-discovered MTU exists, it eventually falls back to the local
underlay device MTU - and this is the mode in which I tested the fix - PMTU 
was off in my testbed, but it was picking up the local device MTU correctly.

Basically, this looks better than the VXLAN handling as far as I can 
tell - at least it will pick up the existing discovered PMTU on a best 
effort basis, while falling back to the underlay device if all else fails. 
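
Just to illustrate the shape of that derivation (a sketch only, not the
submitted patch; "overhead" stands for the outer IP/UDP/L2TP/Ethernet
header cost the patch computes):

#include <net/dst.h>
#include <net/sock.h>
#include <linux/if_ether.h>

/* Sketch: derive the L2TP Ethernet pseudowire MTU from the tunnel socket's
 * route. dst_mtu() returns the discovered PMTU when one exists and the
 * underlay device MTU otherwise (see ipv4_mtu() above).
 */
static unsigned int l2tp_eth_guess_mtu(struct sock *tunnel_sk,
				       unsigned int overhead)
{
	struct dst_entry *dst = sk_dst_get(tunnel_sk);
	unsigned int mtu = ETH_DATA_LEN;

	if (dst) {
		mtu = dst_mtu(dst);
		dst_release(dst);
	}

	return mtu > overhead ? mtu - overhead : ETH_DATA_LEN;
}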

I agree that something like 2. below would be needed in the long run (it 
will need some effort and redesign -e.g. how do I lookup the parent tunnel 
from the socket when receiving a PMTU update, existing pointer chain runs 
from tunnel to socket).  

But since the existing (Ethernet over L2TP) MTU derivation is incorrect, I am 
hoping this may be acceptable as an interim solution. 

thanks,

Ramkumar


> 2) Add code to handle PMTU events that land on the UDP tunnel
>socket.
> 
> Thanks.
> 


Re: [PATCH] netfilter: bridge: clarify bridge/netfilter message

2016-09-28 Thread Florian Westphal
Stefan Agner  wrote:
> When using bridge without bridge netfilter enabled the message
> displayed is rather confusing and leads one to believe that a deprecated
> feature is in use. Use IS_MODULE to be explicit that the message only
> affects users which use bridge netfilter as module and reword the
> message.

Acked-by: Florian Westphal 



Re: UDP weirdness around skb_copy_and_csum_datagram_msg()

2016-09-28 Thread Jay Smith
Actually, on a little more searching of this list's archives, I think
that this discussion:  https://patchwork.kernel.org/patch/9260733/ is
about exactly the same issue I've found, except from the TCP side. I'm
cc'ing a few of the participants from that discussion.

So is the patch proposed there (copying and restoring the entire
iov_iter in skb_copy_and_csum_datagram_msg()) being considered as a
fix?  If not, would an alternate one that concealed the
save-and-restore logic inside iov_iter.c be more acceptable?  I'd be
happy to produce whatever's needed, or yield to someone with stronger
feelings about where it should go...

On Wed, Sep 28, 2016 at 6:24 PM, Eric Dumazet  wrote:
> On Wed, 2016-09-28 at 17:18 -0700, Jay Smith wrote:
>> I've spent the last week or so trying to track down a recurring
>> problem I'm seeing with UDP datagram handling.  I'm new to the
>> internals of the Linux network stack, but it appears to me that
>> there's a substantial error in recent kernels' handling of UDP
>> checksum errors.
>>
>> The behavior I'm seeing:  I have a UDP server that receives lots of
>> datagrams from many devices on a single port. A small number of those
>> devices occasionally send packets with bad UDP checksums.  After I
>> receive one of these bad packets, the next recvmsg() made on the
>> socket returns data from the bad-checksum packet, but the
>> source-address and length of the next (good) packet that arrived at
>> the port.
>>
>> I believe this issue was introduced between linux 3.18 and 3.19, by a
>> set of changes to net/core/datagram.c that made
>> skb_copy_and_csum_datagram_msg() and related functions use the
>> iov_iter structure to copy data to user buffers.  In the case where
>> those functions copy a datagram and then conclude that the checksum is
>> invalid, they don't remove the already-copied data from the user's
>> iovec; when udp_recvmsg() calls skb_copy_and_csum_datagram_msg() for a
>> second time, looking at the next datagram on the queue, that second
>> datagram's data is appended to the first datagram's.  So when
>> recvmsg() returns to the user, the return value and msg_name reflect
>> the second datagram, but the first bytes in the user's iovec come from
>> the first (bad) datagram.
>>
>> (I've attached a test program that demonstrates the problem.  Note
>> that it sees correct behavior unless the bad-checksum packet has > 68
>> bytes of UDP data; otherwise, the packet doesn't make it past the
>> CHECKSUM_BREAK test, and never enters
>> skb_copy_and_csum_datagram_msg().)
>>
>> The fix for UDP seems pretty simple; the iov_iter's iov_offset member
>> just needs to be set back to zero on a checksum failure.  But it looks
>> like skb_copy_and_csum_datagram_msg() is also called from tcp_input.c,
>> where I assume that multiple sk_buffs can be copied-and-csum'd into
>> the same iov -- if that's right, it seems like iov_iter needs some
>> additional state to support rolling-back the most recent copy without
>> losing previous ones.
>>
>> Any thoughts?
>
> Nice catch !
>
> What about clearing iov_offset from UDP (v4 & v6) only ?
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 
> 7d96dc2d3d08fa909f247dfbcbd0fc1eeb59862b..928da2fbb3caa6de4d0e1d889c237590f71607ea
>  100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1342,6 +1342,7 @@ csum_copy_err:
> /* starting over for a new packet, but check if we need to yield */
> cond_resched();
> msg->msg_flags &= ~MSG_TRUNC;
> +   msg->msg_iter.iov_offset = 0;
> goto try_again;
>  }
>
>
> (similar for ipv6)
>
>


Re: [PATCH net-next v3 2/3] udp: implement memory accounting helpers

2016-09-28 Thread Eric Dumazet
On Wed, 2016-09-28 at 12:52 +0200, Paolo Abeni wrote:

> +static void udp_rmem_release(struct sock *sk, int partial)
> +{
> + struct udp_sock *up = udp_sk(sk);
> + int fwd, amt;
> +
> + if (partial && !udp_under_memory_pressure(sk))
> + return;
> +
> + /* we can have concurrent release; if we catch any conflict
> +  * we let only one of them do the work
> +  */
> + if (atomic_dec_if_positive(&up->can_reclaim) < 0)
> + return;
> +
> + fwd = __udp_forward(up, atomic_read(&sk->sk_rmem_alloc));
> + if (fwd < SK_MEM_QUANTUM + partial) {
> + atomic_inc(&up->can_reclaim);
> + return;
> + }
> +
> + amt = (fwd - partial) & ~(SK_MEM_QUANTUM - 1);
> + atomic_sub(amt, &up->mem_allocated);
> + atomic_inc(&up->can_reclaim);
> +
> + __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
> + sk->sk_forward_alloc = fwd - amt;
> +}


This is racy... all these atomics make me nervous...





Re: UDP weirdness around skb_copy_and_csum_datagram_msg()

2016-09-28 Thread Eric Dumazet
On Wed, 2016-09-28 at 17:18 -0700, Jay Smith wrote:
> I've spent the last week or so trying to track down a recurring
> problem I'm seeing with UDP datagram handling.  I'm new to the
> internals of the Linux network stack, but it appears to me that
> there's a substantial error in recent kernels' handling of UDP
> checksum errors.
> 
> The behavior I'm seeing:  I have a UDP server that receives lots of
> datagrams from many devices on a single port. A small number of those
> devices occasionally send packets with bad UDP checksums.  After I
> receive one of these bad packets, the next recvmsg() made on the
> socket returns data from the bad-checksum packet, but the
> source-address and length of the next (good) packet that arrived at
> the port.
> 
> I believe this issue was introduced between linux 3.18 and 3.19, by a
> set of changes to net/core/datagram.c that made
> skb_copy_and_csum_datagram_msg() and related functions use the
> iov_iter structure to copy data to user buffers.  In the case where
> those functions copy a datagram and then conclude that the checksum is
> invalid, they don't remove the already-copied data from the user's
> iovec; when udp_recvmsg() calls skb_copy_and_csum_datagram_msg() for a
> second time, looking at the next datagram on the queue, that second
> datagram's data is appended to the first datagram's.  So when
> recvmsg() returns to the user, the return value and msg_name reflect
> the second datagram, but the first bytes in the user's iovec come from
> the first (bad) datagram.
> 
> (I've attached a test program that demonstrates the problem.  Note
> that it sees correct behavior unless the bad-checksum packet has > 68
> bytes of UDP data; otherwise, the packet doesn't make it past the
> CHECKSUM_BREAK test, and never enters
> skb_copy_and_csum_datagram_msg().)
> 
> The fix for UDP seems pretty simple; the iov_iter's iov_offset member
> just needs to be set back to zero on a checksum failure.  But it looks
> like skb_copy_and_csum_datagram_msg() is also called from tcp_input.c,
> where I assume that multiple sk_buffs can be copied-and-csum'd into
> the same iov -- if that's right, it seems like iov_iter needs some
> additional state to support rolling-back the most recent copy without
> losing previous ones.
> 
> Any thoughts?

Nice catch !

What about clearing iov_offset from UDP (v4 & v6) only ?

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 
7d96dc2d3d08fa909f247dfbcbd0fc1eeb59862b..928da2fbb3caa6de4d0e1d889c237590f71607ea
 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1342,6 +1342,7 @@ csum_copy_err:
/* starting over for a new packet, but check if we need to yield */
cond_resched();
msg->msg_flags &= ~MSG_TRUNC;
+   msg->msg_iter.iov_offset = 0;
goto try_again;
 }
 

(similar for ipv6)




Re: linux-next: manual merge of the netfilter-next tree with the net tree

2016-09-28 Thread Stephen Rothwell
Hi all,

On Tue, 13 Sep 2016 10:12:50 +1000 Stephen Rothwell  
wrote:
>
> Today's linux-next merge of the netfilter-next tree got a conflict in:
> 
>   net/netfilter/nf_tables_netdev.c
> 
> between commit:
> 
>   c73c24849011 ("netfilter: nf_tables_netdev: remove redundant ip_hdr 
> assignment")
> 
> from the net tree and commit:
> 
>   ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4, 
> ipv6}_validate()")
> 
> from the netfilter-next tree.
> 
> I fixed it up (I used the latter version of this file and applied the
> patch below) and can carry the fix as necessary. This is now fixed as
> far as linux-next is concerned, but any non trivial conflicts should be
> mentioned to your upstream maintainer when your tree is submitted for
> merging.  You may also want to consider cooperating with the maintainer
> of the conflicting tree to minimise any particularly complex conflicts.
> 
> From: Stephen Rothwell 
> Date: Tue, 13 Sep 2016 10:08:58 +1000
> Subject: [PATCH] netfilter: merge fixup for "nf_tables_netdev: remove
>  redundant ip_hdr assignment"
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  include/net/netfilter/nf_tables_ipv4.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/include/net/netfilter/nf_tables_ipv4.h 
> b/include/net/netfilter/nf_tables_ipv4.h
> index 968f00b82fb5..25e33aee91e7 100644
> --- a/include/net/netfilter/nf_tables_ipv4.h
> +++ b/include/net/netfilter/nf_tables_ipv4.h
> @@ -33,7 +33,6 @@ __nft_set_pktinfo_ipv4_validate(struct nft_pktinfo *pkt,
>   if (!iph)
>   return -1;
>  
> - iph = ip_hdr(skb);
>   if (iph->ihl < 5 || iph->version != 4)
>   return -1;
>  
> -- 
> 2.8.1

The above merge fix patch is now needed when the net-next tree is
merged with Linus' tree.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH net-next] net: do not export sk_stream_write_space

2016-09-28 Thread David Miller
From: Eric Dumazet 
Date: Wed, 28 Sep 2016 08:41:16 -0700

> From: Eric Dumazet 
> 
> Since commit 900f65d361d3 ("tcp: move duplicate code from
> tcp_v4_init_sock()/tcp_v6_init_sock()") we no longer need
> to export sk_stream_write_space()
> 
> From: Eric Dumazet 

Applied, thanks Eric.


UDP weirdness around skb_copy_and_csum_datagram_msg()

2016-09-28 Thread Jay Smith
I've spent the last week or so trying to track down a recurring
problem I'm seeing with UDP datagram handling.  I'm new to the
internals of the Linux network stack, but it appears to me that
there's a substantial error in recent kernels' handling of UDP
checksum errors.

The behavior I'm seeing:  I have a UDP server that receives lots of
datagrams from many devices on a single port. A small number of those
devices occasionally send packets with bad UDP checksums.  After I
receive one of these bad packets, the next recvmsg() made on the
socket returns data from the bad-checksum packet, but the
source-address and length of the next (good) packet that arrived at
the port.

I believe this issue was introduced between linux 3.18 and 3.19, by a
set of changes to net/core/datagram.c that made
skb_copy_and_csum_datagram_msg() and related functions use the
iov_iter structure to copy data to user buffers.  In the case where
those functions copy a datagram and then conclude that the checksum is
invalid, they don't remove the already-copied data from the user's
iovec; when udp_recvmsg() calls skb_copy_and_csum_datagram_msg() for a
second time, looking at the next datagram on the queue, that second
datagram's data is appended to the first datagram's.  So when
recvmsg() returns to the user, the return value and msg_name reflect
the second datagram, but the first bytes in the user's iovec come from
the first (bad) datagram.

(I've attached a test program that demonstrates the problem.  Note
that it sees correct behavior unless the bad-checksum packet has > 68
bytes of UDP data; otherwise, the packet doesn't make it past the
CHECKSUM_BREAK test, and never enters
skb_copy_and_csum_datagram_msg().)

The fix for UDP seems pretty simple; the iov_iter's iov_offset member
just needs to be set back to zero on a checksum failure.  But it looks
like skb_copy_and_csum_datagram_msg() is also called from tcp_input.c,
where I assume that multiple sk_buffs can be copied-and-csum'd into
the same iov -- if that's right, it seems like iov_iter needs some
additional state to support rolling-back the most recent copy without
losing previous ones.

Any thoughts?
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 


int main(int argc, char **argv)
{
  int ret;

  if (argc < 2) {
fprintf(stderr, "Usage: csumtest \n");
exit(1);
  }

  struct sockaddr_in addr;
  socklen_t addrLen = sizeof(addr);  
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = inet_addr("127.0.0.1");
  addr.sin_port = htons(0);

  int listen = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
  if (listen < 0)
  {
fprintf(stderr, "failed to create listening socket (err=%d: %s)\n", errno, strerror(errno));
exit(1);
  }

  ret = bind(listen, (struct sockaddr *) &addr, addrLen);
  if (ret != 0)
  {
fprintf(stderr, "failed to bind listening socket socket (err=%d: %s)\n", errno, strerror(errno));
exit(1);
  }

  ret = getsockname(listen, (struct sockaddr *) &addr, &addrLen);
  if (ret != 0)
  {
fprintf(stderr, "getsockname failed for listening socket socket (err=%d: %s)\n", errno, strerror(errno));
exit(1);
  }
  else
  {
printf("listening on port %d\n", ntohs(addr.sin_port));
  }


  
  // Send a packet with deliberately bad checksum
  char rawBuf[4096];
  memset(rawBuf, 0, 4096);
  struct udphdr *udph = (struct udphdr *) rawBuf;

  // Send as UDP data the string "BAD DATA", repeated as necessary
  // to fill the number of bytes requested on the command-line
  char *data = rawBuf + sizeof(struct udphdr);
  int badBytes = atoi(argv[1]);
  int i;
  for (i = 0; i < badBytes; i += 8)
  {
strncpy(data + i, "BAD DATA", 8);
  }

  // UDP headers contain a bogus checksum, but are otherwise reasonable
  udph->check = htons(0xbadd);
  udph->source = htons(18);
  udph->dest = addr.sin_port;
  udph->len = htons(sizeof(struct udphdr) + badBytes);

  int raw = socket(AF_INET, SOCK_RAW, IPPROTO_UDP);
  ret = sendto(raw, rawBuf,
	   sizeof(struct udphdr) + badBytes, 0,
	   (struct sockaddr *) &addr, addrLen);
  if (ret < 0)
  {
fprintf(stderr, "raw send failed (err=%d: %s)\n", errno, strerror(errno));
exit(1);
  }
  
  
  // Send a second, legitimate UDP packet
  int cooked = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
  char *second = "Good data";
  ret = sendto(cooked, second, strlen(second), 0,
	   (struct sockaddr *) &addr, addrLen);
  if (ret < 0)
  {
fprintf(stderr, "second cooked send failed (err=%d: %s)\n", errno, strerror(errno));
exit(1);
  }

  // Read what's queued up for us.
  while (1)
  {
struct msghdr msg;
struct iovec  iov;
char readBuf[4096];
ssize_t bytes;

memset(readBuf, 0, 4096);

iov.iov_base = readBuf;
iov.iov_len = sizeof(readBuf);

msg.msg_name = NULL;
msg.msg_namelen = 0;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_flags = 0;

bytes = recvmsg(listen, &msg, 0);

Re: mv643xx_eth driver, failed to linearize skb with tiny unaligned fragment

2016-09-28 Thread Eric Dumazet
On Wed, 2016-09-28 at 22:37 +0200, Frans van de Wiel wrote:
> We compiled kernel 4.6.6 for our NAS devices running on ARM Kirkwood CPUs.
> These NAS devices have only 256 MB of RAM, so memory resources are limited.
> When we fully load the CPU and are copying files via the interface, we
> frequently get this "error" message in the dmesg log:
> 
> [ 1978.149929] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize 
> skb with tiny unaligned fragment
> [ 1978.150138] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize 
> skb with tiny unaligned fragment
> [ 1978.150318] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize 
> skb with tiny unaligned fragment
> [ 1978.150360] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize 
> skb with tiny unaligned fragment
> 
> ...
> 
> CPU[|||98.9%] Tasks: 29, 0 thr; 2 
> running
>   Mem[30/244MB] Load average: 1.36 1.29 
> 0.92
>   Swp[ 0/511MB] Uptime: 00:38:11
> 
> We analyzed the driver code and think that this does not seem to be an
> error but a warning; we want to verify this, and whether this warning can
> be taken out without risk.
> 
> The message originates from this part of the code of the driver
> 
> 1023 if (has_tiny_unaligned_frags(skb) && __skb_linearize(skb)) {
> 1024 netdev_printk(KERN_DEBUG, dev,
> 1025   "failed to linearize skb with tiny 
> unaligned fragment\n");
> 1026 return NETDEV_TX_BUSY;
> 1027 }
> 
> __skb_linearize is set in skbuff.h and comments are useful
> 
> static inline int __skb_linearize(struct sk_buff *skb)
> 1636 {
> 1637 return __pskb_pull_tail(skb, skb->data_len) ? 0 : -ENOMEM;
> 1638 }
> 1639
> 1640 /**
> 1641  *  skb_linearize - convert paged skb to linear one
> 1642  *  @skb: buffer to linarize
> 1643  *
> 1644  *  If there is no free memory -ENOMEM is returned, otherwise zero
> 1645  *  is returned and the old skb data released.
> 1646  */
> 1647 static inline int skb_linearize(struct sk_buff *skb)
> 1648 {
> 
> So return a false state (0) if there is enough free memory and true on the 
> other case.
> 
> Please note that mv643xx_eth.c also returns a busy state, so I assume in
> this case the driver repeats this step until it succeeds.
> 
> So in my opinion this is not an error message but a warning, it does not
> mean corrupted data, and I think it should be possible to remove it.
> Is this conclusion correct?

Well, linearizing an skb only because one fragment is tiny and not
aligned is unfortunate.

This driver should copy the ~7 bytes into temp storage instead.

It would be faster, and would avoid dropping packet when the (possibly
huge) allocation fails.
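
Roughly along these lines (a sketch under the assumption that the queue owns
a small aligned scratch buffer; DMA mapping of that buffer is omitted):

#include <linux/skbuff.h>
#include <linux/string.h>

/* Sketch: instead of linearizing the whole skb, copy only the offending
 * tiny fragment into caller-provided aligned scratch storage and point
 * the TX descriptor at the copy.
 */
static void *bounce_tiny_frag(void *scratch, const skb_frag_t *frag)
{
	unsigned int len = skb_frag_size(frag);	/* only a handful of bytes */

	memcpy(scratch, skb_frag_address(frag), len);
	return scratch;
}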

Strange thing is that txq_submit_tso() can happily present 'tiny frags'
after segmentation is done on a 'big frag', so I wonder if the
has_tiny_unaligned_frags() is really needed.

If this is needed, then a fix in txq_submit_tso() is also needed.




How to submit potential patch in linux kernel

2016-09-28 Thread Shyam Saini
Hi everyone,

I'm Shyam, a final year undergraduate student. I wanted to know how one
can submit a potential Linux kernel patch to the networking subsystem.

Thanks,
Shyam


[PATCH net-next] nfp: bpf: zero extend 4 byte context loads

2016-09-28 Thread Jakub Kicinski
Set upper 32 bits of destination register to zeros after
load from the context structure.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c 
b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
index 3de819acd68c..f8df5300f49c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_bpf_jit.c
@@ -1134,6 +1134,8 @@ static int mem_ldx4(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
else
return -ENOTSUPP;
 
+   wrp_immed(nfp_prog, reg_both(meta->insn.dst_reg * 2 + 1), 0);
+
return 0;
 }
 
-- 
1.9.1



Re: [PATCH] Add netdev all_adj_list refcnt propagation to fix panic

2016-09-28 Thread Andrew Collins

On 09/28/2016 12:06 PM, David Ahern wrote:

Andrew Collins posted this patch as RFC in March:
http://patchwork.ozlabs.org/patch/603101/

It has apparently fallen through the cracks and was never applied.

It solves a refcnt problem (thanks Nik for pointing out this patch)


I have been running the patch in production for the past few months without 
issue, in spite of my concerns about the potential side effects of the change.  
That said, I suspect my use cases are a subset of yours.

If nobody has a better fix yet, I'm happy to respin this patch as non-RFC with 
proper signoff.

Regards,
Andrew Collins


[PATCH v2 net-next 2/2] net: deprecate eth_change_mtu, remove usage

2016-09-28 Thread Jarod Wilson
With centralized MTU checking, there's nothing productive done by
eth_change_mtu that isn't already done in dev_set_mtu, so mark it as
deprecated and remove all usage of it in the kernel. All callers have been
audited for calls to alloc_etherdev* or ether_setup directly, which means
they all have a valid dev->min_mtu and dev->max_mtu. Now eth_change_mtu
prints out a netdev_warn about being deprecated, for the benefit of
out-of-tree drivers that might be utilizing it.

Of note, dvb_net.c actually had dev->mtu = 4096, while using
eth_change_mtu, meaning that if you ever tried changing its mtu, you
couldn't set it above 1500 anymore. It's now getting dev->max_mtu also set
to 4096 to remedy that.

v2: fix up lantiq_etop, missed breakage due to driver not compiling on x86

CC: "David S. Miller" 
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 arch/m68k/emu/nfeth.c |  1 -
 drivers/isdn/hysdn/hysdn_net.c|  1 -
 drivers/media/dvb-core/dvb_net.c  |  2 +-
 drivers/net/appletalk/ipddp.c |  1 -
 drivers/net/cris/eth_v10.c|  1 -
 drivers/net/ethernet/3com/3c509.c |  1 -
 drivers/net/ethernet/3com/3c515.c |  1 -
 drivers/net/ethernet/3com/3c574_cs.c  |  1 -
 drivers/net/ethernet/3com/3c589_cs.c  |  1 -
 drivers/net/ethernet/3com/3c59x.c |  2 --
 drivers/net/ethernet/3com/typhoon.c   |  1 -
 drivers/net/ethernet/8390/8390.c  |  1 -
 drivers/net/ethernet/8390/8390p.c |  1 -
 drivers/net/ethernet/8390/ax88796.c   |  1 -
 drivers/net/ethernet/8390/axnet_cs.c  |  1 -
 drivers/net/ethernet/8390/etherh.c|  1 -
 drivers/net/ethernet/8390/hydra.c |  1 -
 drivers/net/ethernet/8390/mac8390.c   |  1 -
 drivers/net/ethernet/8390/mcf8390.c   |  1 -
 drivers/net/ethernet/8390/ne2k-pci.c  |  1 -
 drivers/net/ethernet/8390/pcnet_cs.c  |  1 -
 drivers/net/ethernet/8390/smc-ultra.c |  1 -
 drivers/net/ethernet/8390/wd.c|  1 -
 drivers/net/ethernet/8390/zorro8390.c |  1 -
 drivers/net/ethernet/adaptec/starfire.c   |  1 -
 drivers/net/ethernet/adi/bfin_mac.c   |  1 -
 drivers/net/ethernet/allwinner/sun4i-emac.c   |  1 -
 drivers/net/ethernet/amd/a2065.c  |  1 -
 drivers/net/ethernet/amd/am79c961a.c  |  1 -
 drivers/net/ethernet/amd/ariadne.c|  1 -
 drivers/net/ethernet/amd/atarilance.c |  1 -
 drivers/net/ethernet/amd/au1000_eth.c |  1 -
 drivers/net/ethernet/amd/declance.c   |  1 -
 drivers/net/ethernet/amd/hplance.c|  1 -
 drivers/net/ethernet/amd/lance.c  |  1 -
 drivers/net/ethernet/amd/mvme147.c|  1 -
 drivers/net/ethernet/amd/ni65.c   |  1 -
 drivers/net/ethernet/amd/nmclan_cs.c  |  1 -
 drivers/net/ethernet/amd/pcnet32.c|  1 -
 drivers/net/ethernet/amd/sun3lance.c  |  1 -
 drivers/net/ethernet/amd/sunlance.c   |  1 -
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c  |  1 -
 drivers/net/ethernet/apple/bmac.c |  1 -
 drivers/net/ethernet/apple/mace.c |  1 -
 drivers/net/ethernet/apple/macmace.c  |  1 -
 drivers/net/ethernet/aurora/nb8800.c  |  1 -
 drivers/net/ethernet/cadence/macb.c   |  1 -
 drivers/net/ethernet/cirrus/cs89x0.c  |  1 -
 drivers/net/ethernet/cirrus/ep93xx_eth.c  |  1 -
 drivers/net/ethernet/cirrus/mac89x0.c |  1 -
 drivers/net/ethernet/davicom/dm9000.c |  1 -
 drivers/net/ethernet/dec/tulip/de2104x.c  |  1 -
 drivers/net/ethernet/dec/tulip/de4x5.c|  1 -
 drivers/net/ethernet/dec/tulip/dmfe.c |  1 -
 drivers/net/ethernet/dec/tulip/tulip_core.c   |  1 -
 drivers/net/ethernet/dec/tulip/uli526x.c  |  1 -
 drivers/net/ethernet/dec/tulip/winbond-840.c  |  1 -
 drivers/net/ethernet/dec/tulip/xircom_cb.c|  1 -
 drivers/net/ethernet/dnet.c   |  1 -
 drivers/net/ethernet/ec_bhf.c |  1 -
 drivers/net/ethernet/fealnx.c |  1 -
 drivers/net/ethernet/freescale/fec_main.c |  1 -
 drivers/net/ethernet/freescale/fec_mpc52xx.c  |  1 -
 drivers/net/ethernet/freescale/fs_enet/fs_enet-main.c |  1 -
 drivers/net/ethernet/freescale/ucc_geth.c |  1 -
 drivers/net/ethernet/fujitsu/fmvj18x_cs.c |  1 -
 

[PATCH v2 net-next 0/2] net: centralize net_device MTU bounds checking

2016-09-28 Thread Jarod Wilson
While looking into an MTU issue with sfc, I started noticing that almost
every NIC driver with an ndo_change_mtu function implemented almost
exactly the same range checks, and in many cases, that was the only
practical thing their ndo_change_mtu function was doing. Quite a few
drivers have either 68, 64, 60 or 46 as their minimum MTU value checked,
and then various sizes from 1500 to 65535 for their maximum MTU value. We
can remove a whole lot of redundant code here if we simply store min_mtu
and max_mtu in net_device, and check against those in net/core/dev.c's
dev_set_mtu().
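
As an illustration of what the follow-on conversions then look like in a
driver (hypothetical driver, not one of the actual patches):

#include <linux/netdevice.h>
#include <linux/if_ether.h>

/* Illustrative setup fragment: advertise the supported range and let
 * dev_set_mtu() do the checking; no ndo_change_mtu needed any more.
 */
static void example_nic_setup_mtu(struct net_device *dev)
{
	dev->mtu     = ETH_DATA_LEN;	/* 1500 by default */
	dev->min_mtu = 68;		/* minimum IPv4 MTU */
	dev->max_mtu = 9000;		/* jumbo frames supported by this hw */
}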

This pair of patches looks to introduce centralized MTU range checking
infrastructure, while maintaining compatibility with all existing drivers,
and start to make use of it, converting all eth_change_mtu/ether_setup users
over to this new infra.

Assuming these pass review muster, I've got a ton of follow-on patches to
clean up MTU settings for everything in the kernel with an ndo_change_mtu.

This work is all staged in a (rebasing) git tree here:

https://github.com/jarodwilson/linux-muck

The master branch is based on net-next from today, and carries these two
patches, plus a ton of follow-on patches to eliminate MTU range checks
and change_mtu functions where possible. All patches were successfully
built across 160 various arch and config combos by the 0-day folks.
(Thanks to Andrew Lunn for the suggestion to get that going).

Resending, because I forgot to add CC list to the actual patches.

Jarod Wilson (2):
  net: centralize net_device min/max MTU checking
  net: deprecate eth_change_mtu, remove usage

 arch/m68k/emu/nfeth.c |  1 -
 drivers/isdn/hysdn/hysdn_net.c|  1 -
 drivers/media/dvb-core/dvb_net.c  |  2 +-
 drivers/net/appletalk/ipddp.c |  1 -
 drivers/net/cris/eth_v10.c|  1 -
 drivers/net/ethernet/3com/3c509.c |  1 -
 drivers/net/ethernet/3com/3c515.c |  1 -
 drivers/net/ethernet/3com/3c574_cs.c  |  1 -
 drivers/net/ethernet/3com/3c589_cs.c  |  1 -
 drivers/net/ethernet/3com/3c59x.c |  2 --
 drivers/net/ethernet/3com/typhoon.c   |  1 -
 drivers/net/ethernet/8390/8390.c  |  1 -
 drivers/net/ethernet/8390/8390p.c |  1 -
 drivers/net/ethernet/8390/ax88796.c   |  1 -
 drivers/net/ethernet/8390/axnet_cs.c  |  1 -
 drivers/net/ethernet/8390/etherh.c|  1 -
 drivers/net/ethernet/8390/hydra.c |  1 -
 drivers/net/ethernet/8390/mac8390.c   |  1 -
 drivers/net/ethernet/8390/mcf8390.c   |  1 -
 drivers/net/ethernet/8390/ne2k-pci.c  |  1 -
 drivers/net/ethernet/8390/pcnet_cs.c  |  1 -
 drivers/net/ethernet/8390/smc-ultra.c |  1 -
 drivers/net/ethernet/8390/wd.c|  1 -
 drivers/net/ethernet/8390/zorro8390.c |  1 -
 drivers/net/ethernet/adaptec/starfire.c   |  1 -
 drivers/net/ethernet/adi/bfin_mac.c   |  1 -
 drivers/net/ethernet/allwinner/sun4i-emac.c   |  1 -
 drivers/net/ethernet/amd/a2065.c  |  1 -
 drivers/net/ethernet/amd/am79c961a.c  |  1 -
 drivers/net/ethernet/amd/ariadne.c|  1 -
 drivers/net/ethernet/amd/atarilance.c |  1 -
 drivers/net/ethernet/amd/au1000_eth.c |  1 -
 drivers/net/ethernet/amd/declance.c   |  1 -
 drivers/net/ethernet/amd/hplance.c|  1 -
 drivers/net/ethernet/amd/lance.c  |  1 -
 drivers/net/ethernet/amd/mvme147.c|  1 -
 drivers/net/ethernet/amd/ni65.c   |  1 -
 drivers/net/ethernet/amd/nmclan_cs.c  |  1 -
 drivers/net/ethernet/amd/pcnet32.c|  1 -
 drivers/net/ethernet/amd/sun3lance.c  |  1 -
 drivers/net/ethernet/amd/sunlance.c   |  1 -
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c  |  1 -
 drivers/net/ethernet/apple/bmac.c |  1 -
 drivers/net/ethernet/apple/mace.c |  1 -
 drivers/net/ethernet/apple/macmace.c  |  1 -
 drivers/net/ethernet/aurora/nb8800.c  |  1 -
 drivers/net/ethernet/cadence/macb.c   |  1 -
 drivers/net/ethernet/cirrus/cs89x0.c  |  1 -
 drivers/net/ethernet/cirrus/ep93xx_eth.c  |  1 -
 drivers/net/ethernet/cirrus/mac89x0.c |  1 -
 drivers/net/ethernet/davicom/dm9000.c |  1 -
 drivers/net/ethernet/dec/tulip/de2104x.c  |  1 -
 drivers/net/ethernet/dec/tulip/de4x5.c|  1 -
 drivers/net/ethernet/dec/tulip/dmfe.c |  1 -
 

[PATCH v2 net-next 1/2] net: centralize net_device min/max MTU checking

2016-09-28 Thread Jarod Wilson
While looking into an MTU issue with sfc, I started noticing that almost
every NIC driver with an ndo_change_mtu function implemented almost
exactly the same range checks, and in many cases, that was the only
practical thing their ndo_change_mtu function was doing. Quite a few
drivers have either 68, 64, 60 or 46 as their minimum MTU value checked,
and then various sizes from 1500 to 65535 for their maximum MTU value. We
can remove a whole lot of redundant code here if we simply store min_mtu
and max_mtu in net_device, and check against those in net/core/dev.c's
dev_set_mtu().

In theory, there should be zero functional change with this patch, it just
puts the infrastructure in place. Subsequent patches will attempt to start
using said infrastructure, with theoretically zero change in
functionality.

CC: "David S. Miller" 
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 include/linux/netdevice.h |  4 
 net/core/dev.c| 12 ++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 136ae6bb..fbdf923 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1506,6 +1506,8 @@ enum netdev_priv_flags {
  * @if_port:   Selectable AUI, TP, ...
  * @dma:   DMA channel
  * @mtu:   Interface MTU value
+ * @min_mtu:   Interface Minimum MTU value
+ * @max_mtu:   Interface Maximum MTU value
  * @type:  Interface hardware type
  * @hard_header_len: Maximum hardware header length.
  *
@@ -1726,6 +1728,8 @@ struct net_device {
unsigned char   dma;
 
unsigned intmtu;
+   unsigned intmin_mtu;
+   unsigned intmax_mtu;
unsigned short  type;
unsigned short  hard_header_len;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f..5343799 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6493,9 +6493,17 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
if (new_mtu == dev->mtu)
return 0;
 
-   /*  MTU must be positive.*/
-   if (new_mtu < 0)
+   if (new_mtu < dev->min_mtu) {
+   net_err_ratelimited("%s: Invalid MTU %d requested, hw min %d\n",
+   dev->name, new_mtu, dev->min_mtu);
return -EINVAL;
+   }
+
+   if (dev->max_mtu > 0 && new_mtu > dev->max_mtu) {
+   net_err_ratelimited("%s: Invalid MTU %d requested, hw max %d\n",
+   dev->name, new_mtu, dev->max_mtu);
+   return -EINVAL;
+   }
 
if (!netif_device_present(dev))
return -ENODEV;
-- 
2.10.0



[PATCH] netfilter: bridge: clarify bridge/netfilter message

2016-09-28 Thread Stefan Agner
When using bridge without bridge netfilter enabled the message
displayed is rather confusing and leads one to believe that a deprecated
feature is in use. Use IS_MODULE to be explicit that the message only
affects users which use bridge netfilter as module and reword the
message.

Signed-off-by: Stefan Agner 
---
 net/bridge/br.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br.c b/net/bridge/br.c
index 3addc05..889e564 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -227,9 +227,11 @@ static int __init br_init(void)
br_fdb_test_addr_hook = br_fdb_test_addr;
 #endif
 
-   pr_info("bridge: automatic filtering via arp/ip/ip6tables has been "
-   "deprecated. Update your scripts to load br_netfilter if you "
+#if IS_MODULE(CONFIG_BRIDGE_NETFILTER)
+   pr_info("bridge: filtering via arp/ip/ip6tables is no longer available "
+   "by default. Update your scripts to load br_netfilter if you "
"need this.\n");
+#endif
 
return 0;
 
-- 
2.10.0



Re: [PATCH RFC 5/6] net: phy: Trigger state machine on state change and not polling.

2016-09-28 Thread Florian Fainelli
On 09/28/2016 01:32 AM, Andrew Lunn wrote:
> The phy_start() is used to indicate the PHY is now ready to do its
> work. The state is changed, normally to PHY_UP which means that both
> the MAC and the PHY are ready.
> 
> If the phy driver is using polling, when the next poll happens, the
> state machine notices the PHY is now in PHY_UP, and kicks off
> auto-negotiation, if needed.
> 
> If however, the PHY is using interrupts, there is no polling. The phy
> is stuck in PHY_UP until the next interrupt comes along. And there is
> no reason for the PHY to interrupt.
> 
> Have phy_start() schedule the state machine to run, which both speeds
> up the polling use case, and makes the interrupt use case actually
> work.
> 
> This problem exists whenever there is a state change which will not
> cause an interrupt. Trigger the state machine in these cases,
> e.g. phy_error().
> 
> Signed-off-by: Andrew Lunn 

No particular objections, this should also fix this:

http://lists.openwall.net/netdev/2016/05/17/147

> ---
>  drivers/net/phy/phy.c | 22 --
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 940450c6cd2c..f446ce04caf3 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -608,6 +608,21 @@ void phy_start_machine(struct phy_device *phydev)
>  }
>  
>  /**
> + * phy_trigger_machine - trigger the state machine to run
> + *
> + * @phydev: the phy_device struct
> + *
> + * Description: There has been a change in state which requires that the
> + *   state machine runs.
> + */
> +
> +static void phy_trigger_machine(struct phy_device *phydev)
> +{
> + cancel_delayed_work_sync(&phydev->state_queue);
> + queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
> +}
> +
> +/**
>   * phy_stop_machine - stop the PHY state machine tracking
>   * @phydev: target phy_device struct
>   *
> @@ -639,6 +654,8 @@ static void phy_error(struct phy_device *phydev)
>   mutex_lock(&phydev->lock);
>   phydev->state = PHY_HALTED;
>   mutex_unlock(&phydev->lock);
> +
> + phy_trigger_machine(phydev);
>  }
>  
>  /**
> @@ -785,8 +802,7 @@ void phy_change(struct phy_device *phydev)
>   }
>  
>   /* reschedule state queue work to run as soon as possible */
> - cancel_delayed_work_sync(&phydev->state_queue);
> - queue_delayed_work(system_power_efficient_wq, &phydev->state_queue, 0);
> + phy_trigger_machine(phydev);
>   return;
>  
>  ignore:
> @@ -875,6 +891,8 @@ void phy_start(struct phy_device *phydev)
>   /* if phy was suspended, bring the physical link up again */
>   if (do_resume)
>   phy_resume(phydev);
> +
> + phy_trigger_machine(phydev);
>  }
>  EXPORT_SYMBOL(phy_start);
>  
> 


-- 
Florian


[PATCH v2 net-next 0/2] net: centralize net_device MTU bounds checking

2016-09-28 Thread Jarod Wilson
While looking into an MTU issue with sfc, I started noticing that almost
every NIC driver with an ndo_change_mtu function implemented almost
exactly the same range checks, and in many cases, that was the only
practical thing their ndo_change_mtu function was doing. Quite a few
drivers have either 68, 64, 60 or 46 as their minimum MTU value checked,
and then various sizes from 1500 to 65535 for their maximum MTU value. We
can remove a whole lot of redundant code here if we simply store min_mtu
and max_mtu in net_device, and check against those in net/core/dev.c's
dev_set_mtu().

This pair of patches looks to introduce centralized MTU range checking
infrastructure, while maintaining compatibility with all existing drivers,
and start to make use of it, converting all eth_change_mtu/ether_setup users
over to this new infra.

Assuming these pass review muster, I've got a ton of follow-on patches to
clean up MTU settings for everything in the kernel with an ndo_change_mtu.

This work is all staged in a (rebasing) git tree here:

https://github.com/jarodwilson/linux-muck

The master branch is based on net-next from today, and carries these two
patches, plus a ton of follow-on patches to eliminate MTU range checks
and change_mtu functions where possible. All patches were successfully
built across 160 various arch and config combos by the 0-day folks.

Jarod Wilson (2):
  net: centralize net_device min/max MTU checking
  net: deprecate eth_change_mtu, remove usage

 arch/m68k/emu/nfeth.c |  1 -
 drivers/isdn/hysdn/hysdn_net.c|  1 -
 drivers/media/dvb-core/dvb_net.c  |  2 +-
 drivers/net/appletalk/ipddp.c |  1 -
 drivers/net/cris/eth_v10.c|  1 -
 drivers/net/ethernet/3com/3c509.c |  1 -
 drivers/net/ethernet/3com/3c515.c |  1 -
 drivers/net/ethernet/3com/3c574_cs.c  |  1 -
 drivers/net/ethernet/3com/3c589_cs.c  |  1 -
 drivers/net/ethernet/3com/3c59x.c |  2 --
 drivers/net/ethernet/3com/typhoon.c   |  1 -
 drivers/net/ethernet/8390/8390.c  |  1 -
 drivers/net/ethernet/8390/8390p.c |  1 -
 drivers/net/ethernet/8390/ax88796.c   |  1 -
 drivers/net/ethernet/8390/axnet_cs.c  |  1 -
 drivers/net/ethernet/8390/etherh.c|  1 -
 drivers/net/ethernet/8390/hydra.c |  1 -
 drivers/net/ethernet/8390/mac8390.c   |  1 -
 drivers/net/ethernet/8390/mcf8390.c   |  1 -
 drivers/net/ethernet/8390/ne2k-pci.c  |  1 -
 drivers/net/ethernet/8390/pcnet_cs.c  |  1 -
 drivers/net/ethernet/8390/smc-ultra.c |  1 -
 drivers/net/ethernet/8390/wd.c|  1 -
 drivers/net/ethernet/8390/zorro8390.c |  1 -
 drivers/net/ethernet/adaptec/starfire.c   |  1 -
 drivers/net/ethernet/adi/bfin_mac.c   |  1 -
 drivers/net/ethernet/allwinner/sun4i-emac.c   |  1 -
 drivers/net/ethernet/amd/a2065.c  |  1 -
 drivers/net/ethernet/amd/am79c961a.c  |  1 -
 drivers/net/ethernet/amd/ariadne.c|  1 -
 drivers/net/ethernet/amd/atarilance.c |  1 -
 drivers/net/ethernet/amd/au1000_eth.c |  1 -
 drivers/net/ethernet/amd/declance.c   |  1 -
 drivers/net/ethernet/amd/hplance.c|  1 -
 drivers/net/ethernet/amd/lance.c  |  1 -
 drivers/net/ethernet/amd/mvme147.c|  1 -
 drivers/net/ethernet/amd/ni65.c   |  1 -
 drivers/net/ethernet/amd/nmclan_cs.c  |  1 -
 drivers/net/ethernet/amd/pcnet32.c|  1 -
 drivers/net/ethernet/amd/sun3lance.c  |  1 -
 drivers/net/ethernet/amd/sunlance.c   |  1 -
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c  |  1 -
 drivers/net/ethernet/apple/bmac.c |  1 -
 drivers/net/ethernet/apple/mace.c |  1 -
 drivers/net/ethernet/apple/macmace.c  |  1 -
 drivers/net/ethernet/aurora/nb8800.c  |  1 -
 drivers/net/ethernet/cadence/macb.c   |  1 -
 drivers/net/ethernet/cirrus/cs89x0.c  |  1 -
 drivers/net/ethernet/cirrus/ep93xx_eth.c  |  1 -
 drivers/net/ethernet/cirrus/mac89x0.c |  1 -
 drivers/net/ethernet/davicom/dm9000.c |  1 -
 drivers/net/ethernet/dec/tulip/de2104x.c  |  1 -
 drivers/net/ethernet/dec/tulip/de4x5.c|  1 -
 drivers/net/ethernet/dec/tulip/dmfe.c |  1 -
 drivers/net/ethernet/dec/tulip/tulip_core.c   |  1 -
 drivers/net/ethernet/dec/tulip/uli526x.c  |  1 -
 

Re: [PATCH RFC 2/6] net: phy: Use threaded IRQ, to allow IRQ from sleeping devices

2016-09-28 Thread Andrew Lunn
On Wed, Sep 28, 2016 at 02:38:03PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 9/28/2016 11:32 AM, Andrew Lunn wrote:
> 
> >The interrupt lines from PHYs may be connected to I2C bus expanders, or
> >from switches on MDIO busses. Such interrupts are sourced from devices
> >which sleep, so use threaded interrupts. Threaded interrupts require
> >that the interrupt requester also uses the threaded API. Change the
> >phylib to use the threaded API, which is backwards compatible with
> >non-threaded IRQs.
> >
> >Signed-off-by: Andrew Lunn 
> >---
> > drivers/net/phy/phy.c | 7 +++
> > 1 file changed, 3 insertions(+), 4 deletions(-)
> >
> >diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> >index c6f66832a1a6..5c29ed72f721 100644
> >--- a/drivers/net/phy/phy.c
> >+++ b/drivers/net/phy/phy.c
> >@@ -722,10 +722,9 @@ phy_err:
> > int phy_start_interrupts(struct phy_device *phydev)
> > {
> > atomic_set(&phydev->irq_disable, 0);
> >-if (request_irq(phydev->irq, phy_interrupt,
> >-IRQF_SHARED,
> >-"phy_interrupt",
> >-phydev) < 0) {
> >+if (request_threaded_irq(phydev->irq, NULL, phy_interrupt,
> >+ IRQF_ONESHOT, "phy_interrupt",
> 
>What about IRQF_SHARED?

IRQF_ONESHOT | IRQF_SHARED does work.

However, it requires all uses of the interrupt use the same flags.
The commit message suggests IRQF_SHARED is there because of boards
which have multiple PHYs using one IRQ. That case will work, since all
PHYs will be using this code.

What will be an issue is if the sharing is between a PHY and something
else. That something else will also need to use IRQF_ONESHOT.  Anybody
know if this situation actually exists?

Thanks
 Andrew


mv643xx_eth driver, failed to linearize skb with tiny unaligned fragment

2016-09-28 Thread Frans van de Wiel
We compiled kernel 4.6.6 for our NAS devices running on ARM Kirkwood CPUs.
These NAS devices have only 256 MB of RAM, so memory resources are limited.
When we fully load the CPU and are copying files via the interface, we
frequently get this "error" message in the dmesg log:


[ 1978.149929] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize 
skb with tiny unaligned fragment
[ 1978.150138] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize 
skb with tiny unaligned fragment
[ 1978.150318] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize 
skb with tiny unaligned fragment
[ 1978.150360] mv643xx_eth_port mv643xx_eth_port.0 eth0: failed to linearize 
skb with tiny unaligned fragment


...

CPU[|||98.9%] Tasks: 29, 0 thr; 2 
running
 Mem[30/244MB] Load average: 1.36 1.29 
0.92

 Swp[ 0/511MB] Uptime: 00:38:11

We analyzed the driver code and think that this does not seem to be an
error but a warning; we want to verify this, and whether this warning can
be taken out without risk.


The message originates from this part of the code of the driver

1023 if (has_tiny_unaligned_frags(skb) && __skb_linearize(skb)) {
1024 netdev_printk(KERN_DEBUG, dev,
1025   "failed to linearize skb with tiny 
unaligned fragment\n");

1026 return NETDEV_TX_BUSY;
1027 }

__skb_linearize is set in skbuff.h and comments are useful

static inline int __skb_linearize(struct sk_buff *skb)
1636 {
1637 return __pskb_pull_tail(skb, skb->data_len) ? 0 : -ENOMEM;
1638 }
1639
1640 /**
1641  *  skb_linearize - convert paged skb to linear one
1642  *  @skb: buffer to linarize
1643  *
1644  *  If there is no free memory -ENOMEM is returned, otherwise zero
1645  *  is returned and the old skb data released.
1646  */
1647 static inline int skb_linearize(struct sk_buff *skb)
1648 {

So return a false state (0) if there is enough free memory and true on the 
other case.


Please note that mv643xx_eth.c also returns a busy state, so I assume in
this case the driver repeats this step until it succeeds.


So in my opinion this is not an error message but a warning, it does not
mean corrupted data, and I think it should be possible to remove it.

Is this conclusion correct?



Re: [PATCH net-next 2/2] net: phy: Add PHY Auto/Mdi/Mdix set driver for Microsemi PHYs.

2016-09-28 Thread Andrew Lunn
> + reg_val = phy_read(phydev, MSCC_PHY_BYPASS_CONTROL);
> + if ((mdix == ETH_TP_MDI) || (mdix == ETH_TP_MDI_X)) {
> + reg_val |= (DISABLE_PAIR_SWAP_CORR_MASK |
> + DISABLE_POLARITY_CORR_MASK  |
> + DISABLE_HP_AUTO_MDIX_MASK);
> + } else {
> + reg_val &= ~(DISABLE_PAIR_SWAP_CORR_MASK |
> +  DISABLE_POLARITY_CORR_MASK  |
> +  DISABLE_HP_AUTO_MDIX_MASK);
> + }
> + rc = phy_write(phydev, MSCC_PHY_BYPASS_CONTROL, reg_val);
> + if (rc != 0)
> + goto out_unlock;
> +
> + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
> + if (rc != 0)
> + goto out_unlock;
> +
> + reg_val = phy_read(phydev, MSCC_PHY_EXT_MODE_CNTL);
> + reg_val &= ~(FORCE_MDI_CROSSOVER_MASK);
> + if (mdix == ETH_TP_MDI)
> + reg_val |= FORCE_MDI_CROSSOVER_MDI;
> + else if (mdix == ETH_TP_MDI_X)
> + reg_val |= FORCE_MDI_CROSSOVER_MDIX;
> + rc = phy_write(phydev, MSCC_PHY_EXT_MODE_CNTL, reg_val);
> + if (rc != 0)
> + goto out_unlock;
> +
> + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
> +
> +out_unlock:

out_unlock seems a bit of an odd name, since you are not unlocking
anything.

I also wonder if you should try to reset to MSCC_PHY_PAGE_STANDARD in
the error condition?

> +
> + return rc;
> +}
> +
>  static int vsc85xx_wol_set(struct phy_device *phydev,
>  struct ethtool_wolinfo *wol)
>  {
> @@ -227,6 +281,7 @@ static int vsc85xx_default_config(struct phy_device 
> *phydev)
>   int rc;
>   u16 reg_val;
>  
> + phydev->mdix = ETH_TP_MDI_AUTO;

Humm, interesting. The only other driver supporting mdix is the
Marvell one. It does not do this, it leaves it to its default value of
ETH_TP_MDI_INVALID. It does however interpret ETH_TP_MDI_INVALID as
meaning ETH_TP_MDI_AUTO.

There needs to be consistency here. You either need to do the same as
the Marvell driver, or you need to modify the Marvell driver to also
set phydev->mdix to ETH_TP_MDI_AUTO.

I don't yet know which of these two is the right thing to do.

Florian?

Andrew


Re: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-28 Thread Vlastimil Babka
On 09/28/2016 06:30 PM, David Laight wrote:
> From: Vlastimil Babka
>> Sent: 27 September 2016 12:51
> ...
>> Process name suggests it's part of db2 database. It seems it has to implement
>> its own interface to select() syscall, because glibc itself seems to have a
>> FD_SETSIZE limit of 1024, which is probably why this wasn't an issue for all 
>> the
>> years...
> 
> ISTR the canonical way to increase the size being to set FD_SETSIZE
> to a larger value before including any of the headers.
> 
> Or doesn't that work with linux and glibc ??

Doesn't seem so.

> 
>   David
> 



Re: [PATCH net-next] net/vxlan: Avoid unaligned access in vxlan_build_skb()

2016-09-28 Thread Sowmini Varadhan
On (09/28/16 11:08), Alexander Duyck wrote:
> > - then udp_gro_receive -> vxlan_gro_receive pulls up vxlan header
> >   into linear part, and then..
> 
> This is the point where we need to stop, drop the existing headers,
> call skb_reserve(NET_IP_ALIGN), and then pick back up where we left
> off.  We just have to make sure the skbuff isn't shared.
  :
> In general the interface behind the tunnel shouldn't need to know
> about any of the data in front of the tunnel, so you should be able to
> drop the original header offsets and reset things if you want so you
> could overwrite the old headers.

Except that some of these encapsulations may be doing things like
[inner-mac, outer-ip] fdb learning, so we may not be able to lose
the outer headers at this early point in all cases.

> If I am not mistaken I think the multi-buffer approach is the approach
> taken by other OSes, although for us it is more difficult since we

yes, I think that's approximately what gets done with mblk/mbufs
(with pullups as needed). But as you point out, this type of
->frag_list chaining is a non-trivial change.

> I'm sure we are all going to be talking about this in great detail
> next week at netdev/netconf.. :-)

Agree.

--Sowmini



[PATCH net] sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock

2016-09-28 Thread Xin Long
When sctp dumps all the ep->assocs, it needs to lock_sock first,
but now it locks sock in rcu_read_lock, and lock_sock may sleep,
which would break rcu_read_lock.

This patch gets and holds one sock while traversing the list. After
that it gets out of rcu_read_lock, then locks and dumps it. Then it
traverses the list again to get the next one, until all sctp socks
are dumped.
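
In isolation the pattern is roughly this (a generic sketch, not the diff
below; find_next_candidate() is a placeholder for the transport traversal):

#include <net/sock.h>
#include <linux/rcupdate.h>

static struct sock *find_next_candidate(void);	/* placeholder */

/* Sketch: pin the socket while still inside the RCU read-side section,
 * leave RCU, and only then take the sleeping socket lock.
 */
static void dump_next_sock(void)
{
	struct sock *sk;

	rcu_read_lock();
	sk = find_next_candidate();
	if (sk)
		sock_hold(sk);		/* pin before dropping RCU */
	rcu_read_unlock();

	if (!sk)
		return;

	lock_sock(sk);			/* may sleep - now safe */
	/* ... dump this socket's associations ... */
	release_sock(sk);
	sock_put(sk);
}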

For sctp_diag_dump_one, it fixes this issue by holding asoc and
moving cb() out of rcu_read_lock in sctp_transport_lookup_process.

Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file")
Signed-off-by: Xin Long 
---
 net/sctp/sctp_diag.c | 58 
 net/sctp/socket.c| 10 ++---
 2 files changed, 47 insertions(+), 21 deletions(-)

diff --git a/net/sctp/sctp_diag.c b/net/sctp/sctp_diag.c
index f3508aa..cef0cee 100644
--- a/net/sctp/sctp_diag.c
+++ b/net/sctp/sctp_diag.c
@@ -272,28 +272,17 @@ out:
return err;
 }
 
-static int sctp_tsp_dump(struct sctp_transport *tsp, void *p)
+static int sctp_sock_dump(struct sock *sk, void *p)
 {
-   struct sctp_endpoint *ep = tsp->asoc->ep;
+   struct sctp_endpoint *ep = sctp_sk(sk)->ep;
struct sctp_comm_param *commp = p;
-   struct sock *sk = ep->base.sk;
struct sk_buff *skb = commp->skb;
struct netlink_callback *cb = commp->cb;
const struct inet_diag_req_v2 *r = commp->r;
-   struct sctp_association *assoc =
-   list_entry(ep->asocs.next, struct sctp_association, asocs);
+   struct sctp_association *assoc;
int err = 0;
 
-   /* find the ep only once through the transports by this condition */
-   if (tsp->asoc != assoc)
-   goto out;
-
-   if (r->sdiag_family != AF_UNSPEC && sk->sk_family != r->sdiag_family)
-   goto out;
-
lock_sock(sk);
-   if (sk != assoc->base.sk)
-   goto release;
list_for_each_entry(assoc, &ep->asocs, asocs) {
if (cb->args[4] < cb->args[1])
goto next;
@@ -312,7 +301,7 @@ static int sctp_tsp_dump(struct sctp_transport *tsp, void 
*p)
cb->nlh->nlmsg_seq,
NLM_F_MULTI, cb->nlh) < 0) {
cb->args[3] = 1;
-   err = 2;
+   err = 1;
goto release;
}
cb->args[3] = 1;
@@ -321,7 +310,7 @@ static int sctp_tsp_dump(struct sctp_transport *tsp, void 
*p)
sk_user_ns(NETLINK_CB(cb->skb).sk),
NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, 0, cb->nlh) < 0) {
-   err = 2;
+   err = 1;
goto release;
}
 next:
@@ -333,10 +322,35 @@ next:
cb->args[4] = 0;
 release:
release_sock(sk);
+   sock_put(sk);
return err;
+}
+
+static int sctp_get_sock(struct sctp_transport *tsp, void *p)
+{
+   struct sctp_endpoint *ep = tsp->asoc->ep;
+   struct sctp_comm_param *commp = p;
+   struct sock *sk = ep->base.sk;
+   struct netlink_callback *cb = commp->cb;
+   const struct inet_diag_req_v2 *r = commp->r;
+   struct sctp_association *assoc =
+   list_entry(ep->asocs.next, struct sctp_association, asocs);
+
+   /* find the ep only once through the transports by this condition */
+   if (tsp->asoc != assoc)
+   goto out;
+
+   if (r->sdiag_family != AF_UNSPEC && sk->sk_family != r->sdiag_family)
+   goto out;
+
+   sock_hold(sk);
+   cb->args[5] = (long)sk;
+
+   return 1;
+
 out:
cb->args[2]++;
-   return err;
+   return 0;
 }
 
 static int sctp_ep_dump(struct sctp_endpoint *ep, void *p)
@@ -472,10 +486,18 @@ skip:
 * 2 : to record the transport pos of this time's traversal
 * 3 : to mark if we have dumped the ep info of the current asoc
 * 4 : to work as a temporary variable to traversal list
+* 5 : to save the sk we get from traversing the tsp list.
 */
if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)))
goto done;
-   sctp_for_each_transport(sctp_tsp_dump, net, cb->args[2], &commp);
+
+next:
+   cb->args[5] = 0;
+   sctp_for_each_transport(sctp_get_sock, net, cb->args[2], &commp);
+
+   if (cb->args[5] && !sctp_sock_dump((struct sock *)cb->args[5], &commp))
+   goto next;
+
 done:
cb->args[1] = cb->args[4];
cb->args[4] = 0;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 9fc417a..8ed2d99 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4469,17 +4469,21 @@ int sctp_transport_lookup_process(int (*cb)(struct 
sctp_transport *, void *),
  const union sctp_addr *paddr, void *p)
 {
struct 

[PATCHv3 net 0/3] sctp: a bunch of fixes for prsctp policies

2016-09-28 Thread Xin Long
This patchset fixes 2 issues with the prsctp policies:

  1. patches 1 and 2 fix the "netperf-Throughput_Mbps -37.2% regression"
     issue when overloading the CPU.

  2. patch 3 fixes "prsctp policies should check both sides' prsctp_capable,
     instead of only local side".

Xin Long (3):
  sctp: move sent_count to the memory hole in sctp_chunk
  sctp: remove prsctp_param from sctp_chunk
  sctp: change to check peer prsctp_capable when using prsctp policies

 include/net/sctp/structs.h | 13 +++--
 net/sctp/chunk.c   | 11 ---
 net/sctp/outqueue.c| 12 ++--
 net/sctp/sm_make_chunk.c   | 15 ---
 4 files changed, 17 insertions(+), 34 deletions(-)

-- 
2.1.0



[PATCHv3 net 3/3] sctp: change to check peer prsctp_capable when using prsctp policies

2016-09-28 Thread Xin Long
Before using the prsctp policies, sctp currently uses asoc->prsctp_enable
to check if prsctp is enabled. However, asoc->prsctp_enable being set only
means the local host supports prsctp; sctp should not abandon packets if
the peer host doesn't enable prsctp.

So this patch uses asoc->peer.prsctp_capable, instead of
asoc->prsctp_enable, to check that prsctp is enabled on both sides, as
asoc's peer.prsctp_capable is set only when local and peer both enable
prsctp.

Fixes: a6c2f792873a ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long 
---
 net/sctp/chunk.c| 4 ++--
 net/sctp/outqueue.c | 8 
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 14990e2..0a3dbec 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -179,7 +179,7 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
sctp_association *asoc,
 msg, msg->expires_at, jiffies);
}
 
-   if (asoc->prsctp_enable &&
+   if (asoc->peer.prsctp_capable &&
SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags))
msg->expires_at =
jiffies + msecs_to_jiffies(sinfo->sinfo_timetolive);
@@ -340,7 +340,7 @@ errout:
 /* Check whether this message has expired. */
 int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 {
-   if (!chunk->asoc->prsctp_enable ||
+   if (!chunk->asoc->peer.prsctp_capable ||
!SCTP_PR_POLICY(chunk->sinfo.sinfo_flags)) {
struct sctp_datamsg *msg = chunk->msg;
 
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index e084f35..107233d 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -326,7 +326,7 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk 
*chunk, gfp_t gfp)
 
sctp_chunk_hold(chunk);
sctp_outq_tail_data(q, chunk);
-   if (chunk->asoc->prsctp_enable &&
+   if (chunk->asoc->peer.prsctp_capable &&
SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
chunk->asoc->sent_cnt_removable++;
if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED)
@@ -442,7 +442,7 @@ void sctp_prsctp_prune(struct sctp_association *asoc,
 {
struct sctp_transport *transport;
 
-   if (!asoc->prsctp_enable || !asoc->sent_cnt_removable)
+   if (!asoc->peer.prsctp_capable || !asoc->sent_cnt_removable)
return;
 
msg_len = sctp_prsctp_prune_sent(asoc, sinfo,
@@ -1055,7 +1055,7 @@ static int sctp_outq_flush(struct sctp_outq *q, int 
rtx_timeout, gfp_t gfp)
 
/* Mark as failed send. */
sctp_chunk_fail(chunk, SCTP_ERROR_INV_STRM);
-   if (asoc->prsctp_enable &&
+   if (asoc->peer.prsctp_capable &&

SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
asoc->sent_cnt_removable--;
sctp_chunk_free(chunk);
@@ -1347,7 +1347,7 @@ int sctp_outq_sack(struct sctp_outq *q, struct sctp_chunk 
*chunk)
tsn = ntohl(tchunk->subh.data_hdr->tsn);
if (TSN_lte(tsn, ctsn)) {
list_del_init(&tchunk->transmitted_list);
-   if (asoc->prsctp_enable &&
+   if (asoc->peer.prsctp_capable &&
SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
asoc->sent_cnt_removable--;
sctp_chunk_free(tchunk);
-- 
2.1.0



[PATCHv3 net 1/3] sctp: move sent_count to the memory hole in sctp_chunk

2016-09-28 Thread Xin Long
Running pahole on sctp_chunk shows it has 2 memory holes:
   struct sctp_chunk {
struct list_head   list;
atomic_t   refcnt;
/* XXX 4 bytes hole, try to pack */
...
long unsigned int  prsctp_param;
intsent_count;
/* XXX 4 bytes hole, try to pack */

This patch moves sent_count up to fill the 1st hole and eliminates
the 2nd one.

It's not just another struct compaction, it also fixes the "netperf-
Throughput_Mbps -37.2% regression" issue when overloading the CPU.
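
The same effect in miniature, for readers who have not used pahole (toy structs, 64-bit
layout assumed):

/* pahole would report a 4-byte hole after `a' in `bad', none in `good' */
struct bad {
	int a;		/* 4 bytes, then a 4-byte hole ... */
	void *p;	/* ... because this must be 8-byte aligned */
	int b;		/* 4 bytes, plus 4 bytes of tail padding */
};			/* sizeof == 24 */

struct good {
	int a;		/* 4 bytes */
	int b;		/* 4 bytes, fills what was the hole */
	void *p;	/* 8 bytes */
};			/* sizeof == 16 */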

Fixes: a6c2f792873a ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index ce93c4b..4f097f5 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -554,6 +554,9 @@ struct sctp_chunk {
 
atomic_t refcnt;
 
+   /* How many times this chunk have been sent, for prsctp RTX policy */
+   int sent_count;
+
/* This is our link to the per-transport transmitted list.  */
struct list_head transmitted_list;
 
@@ -610,9 +613,6 @@ struct sctp_chunk {
 */
unsigned long prsctp_param;
 
-   /* How many times this chunk have been sent, for prsctp RTX policy */
-   int sent_count;
-
/* Which association does this belong to?  */
struct sctp_association *asoc;
 
-- 
2.1.0



[PATCHv3 net 2/3] sctp: remove prsctp_param from sctp_chunk

2016-09-28 Thread Xin Long
sctp currently uses chunk->prsctp_param to save the prsctp param for all
the prsctp policies, but there was no need to introduce prsctp_param to
sctp_chunk at all. We can just use chunk->sinfo.sinfo_timetolive for the
RTX and BUF policies, and reuse msg->expires_at for the TTL policy, as
the prsctp policies and the old expires policy are mutually exclusive.

This patch removes prsctp_param from sctp_chunk, and reuses msg's
expires_at for TTL and chunk's sinfo.sinfo_timetolive for the RTX and
BUF policies.

Note that sctp can't use chunk's sinfo.sinfo_timetolive for the TTL
policy, as that needs a u64 variable to save the expires_at time.

This one also fixes the "netperf-Throughput_Mbps -37.2% regression"
issue.

Fixes: a6c2f792873a ("sctp: implement prsctp TTL policy")
Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h |  7 ---
 net/sctp/chunk.c   |  9 +++--
 net/sctp/outqueue.c|  4 ++--
 net/sctp/sm_make_chunk.c   | 15 ---
 4 files changed, 9 insertions(+), 26 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 4f097f5..ced0df3 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -606,13 +606,6 @@ struct sctp_chunk {
/* This needs to be recoverable for SCTP_SEND_FAILED events. */
struct sctp_sndrcvinfo sinfo;
 
-   /* We use this field to record param for prsctp policies,
-* for TTL policy, it is the time_to_drop of this chunk,
-* for RTX policy, it is the max_sent_count of this chunk,
-* for PRIO policy, it is the priority of this chunk.
-*/
-   unsigned long prsctp_param;
-
/* Which association does this belong to?  */
struct sctp_association *asoc;
 
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index a55e547..14990e2 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -179,6 +179,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct 
sctp_association *asoc,
 msg, msg->expires_at, jiffies);
}
 
+   if (asoc->prsctp_enable &&
+   SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags))
+   msg->expires_at =
+   jiffies + msecs_to_jiffies(sinfo->sinfo_timetolive);
+
/* This is the biggest possible DATA chunk that can fit into
 * the packet
 */
@@ -349,14 +354,14 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
}
 
if (SCTP_PR_TTL_ENABLED(chunk->sinfo.sinfo_flags) &&
-   time_after(jiffies, chunk->prsctp_param)) {
+   time_after(jiffies, chunk->msg->expires_at)) {
if (chunk->sent_count)
chunk->asoc->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
else
chunk->asoc->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
return 1;
} else if (SCTP_PR_RTX_ENABLED(chunk->sinfo.sinfo_flags) &&
-  chunk->sent_count > chunk->prsctp_param) {
+  chunk->sent_count > chunk->sinfo.sinfo_timetolive) {
chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
return 1;
}
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 72e54a4..e084f35 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -383,7 +383,7 @@ static int sctp_prsctp_prune_sent(struct sctp_association 
*asoc,
 
list_for_each_entry_safe(chk, temp, queue, transmitted_list) {
if (!SCTP_PR_PRIO_ENABLED(chk->sinfo.sinfo_flags) ||
-   chk->prsctp_param <= sinfo->sinfo_timetolive)
+   chk->sinfo.sinfo_timetolive <= sinfo->sinfo_timetolive)
continue;
 
list_del_init(&chk->transmitted_list);
@@ -418,7 +418,7 @@ static int sctp_prsctp_prune_unsent(struct sctp_association 
*asoc,
 
list_for_each_entry_safe(chk, temp, queue, list) {
if (!SCTP_PR_PRIO_ENABLED(chk->sinfo.sinfo_flags) ||
-   chk->prsctp_param <= sinfo->sinfo_timetolive)
+   chk->sinfo.sinfo_timetolive <= sinfo->sinfo_timetolive)
continue;
 
list_del_init(&chk->list);
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 8c77b87..46ffecc 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -706,20 +706,6 @@ nodata:
return retval;
 }
 
-static void sctp_set_prsctp_policy(struct sctp_chunk *chunk,
-  const struct sctp_sndrcvinfo *sinfo)
-{
-   if (!chunk->asoc->prsctp_enable)
-   return;
-
-   if (SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags))
-   chunk->prsctp_param =
-   jiffies + msecs_to_jiffies(sinfo->sinfo_timetolive);
-   else if (SCTP_PR_RTX_ENABLED(sinfo->sinfo_flags) ||
-SCTP_PR_PRIO_ENABLED(sinfo->sinfo_flags))
-   chunk->prsctp_param = sinfo->sinfo_timetolive;
-}
-
 /* Make a DATA chunk for the given association from the 

[PATCH] tc: f_u32: Fill in 'linkid' provided by user

2016-09-28 Thread Sushma Sitaram
Currently, 'linkid' input by the user is parsed but 'handle' is appended to the 
netlink message.

# tc filter add dev enp1s0f1 protocol ip parent : prio 99 u32 ht 800: \
order 1 link 1: offset at 0 mask 0f00 shift 6 plus 0 eat match ip \
protocol 6 ff

resulted in:
filter protocol ip pref 99 u32 fh 800::1 order 1 key ht 800 bkt 0
  match 0006/00ff at 8
offset 0f00>>6 at 0  eat

This patch results in:
filter protocol ip pref 99 u32 fh 800::1 order 1 key ht 800 bkt 0 link 1:
  match 0006/00ff at 8
offset 0f00>>6 at 0  eat


Signed-off-by: Sushma Sitaram 
---
 tc/f_u32.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tc/f_u32.c b/tc/f_u32.c
index 0ad7ed2..92c1fcd 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -1071,7 +1071,7 @@ static int u32_parse_opt(struct filter_util *qu, char *handle,
 fprintf(stderr, "\"link\" must be a hash table.\n");
 return -1;
 }
-   addattr_l(n, MAX_MSG, TCA_U32_LINK, &handle, 4);
+   addattr_l(n, MAX_MSG, TCA_U32_LINK, &linkid, 4);
} else if (strcmp(*argv, "ht") == 0) {
unsigned int ht;
 



Re: [PATCH v1] mlx4: remove unused fields

2016-09-28 Thread Eric Dumazet
On Wed, 2016-09-28 at 11:00 -0700, David Decotigny wrote:
> From: David Decotigny 
> 
> This also can address following UBSAN warnings:
> [   36.640343] 
> 
> [   36.648772] UBSAN: Undefined behaviour in 
> drivers/net/ethernet/mellanox/mlx4/fw.c:857:26
> [   36.656853] shift exponent 64 is too large for 32-bit type 'int'
> [   36.663348] 
> 
> [   36.671783] 
> 
> [   36.680213] UBSAN: Undefined behaviour in 
> drivers/net/ethernet/mellanox/mlx4/fw.c:861:27
> [   36.688297] shift exponent 35 is too large for 32-bit type 'int'
> [   36.694702] 
> 
> 
> Tested:
>   reboot with UBSAN, no warning.
> 
> Signed-off-by: David Decotigny 
> ---

CC: Tariq Toukan  (mlx4 maintainer)

Note this patch was cooked/tested using net-next, but can be applied to
net tree, with minor fuzz.

Acked-by: Eric Dumazet 

Thanks David.




Re: [PATCH net-next] net/vxlan: Avoid unaligned access in vxlan_build_skb()

2016-09-28 Thread Alexander Duyck
On Wed, Sep 28, 2016 at 10:03 AM, Sowmini Varadhan
 wrote:
> On (09/23/16 17:43), Alexander Duyck wrote:
>> > On (09/23/16 10:38), Alexander Duyck wrote:
>   ;
>> >> almost think of it as us doing something like the inverse of
>> >> pskb_pull_tail.  The general idea here is we want to actually leave
>> >> the data in skb->data, but just reference it from frag[0] so that we
>> >> don't accidentally pull in the 2 byte padding for alignment when
>> >> transmitting the frame.
>
> Some additional findings..
>
> Just to recap how we got here: for the Rx path, the inner packet has
> been set up as an ethernet frame with the IP header at an aligned address
> when it hits vxlan_build_skb.  But that means the (inner) mac address
> was offset by NET_IP_ALIGN so vxlan_build_skb needs to pad the data
> by NET_IP_ALIGN to make the vxh outer ip header align.
>
> Then we'd need to do something like the suggestion above (keep some
> pointers in frag[0]?  do the reverse of a pskb_expand_head to push out
> the inner ip header to the skb_frag_t?), to have the driver skip over the
> pad..
>
> I tried the following for a hack, and it takes care of the tx side
> unaligned access, though, clearly, the memmove needs to be avoided
>
> @@ -1750,10 +1825,38 @@ static int vxlan_build_skb(struct sk_buff *skb, 
> struct d
> if (err)
> goto out_free;
>
> +
> +#if (NET_IP_ALIGN != 0)
> +   {
> +   unsigned char *data;
> +
> +   /* inner packet is an ethernet frame that was set up
> +* so that the IP header is aligned. But that means the
> +* mac address was offset by NET_IP_ALIGN, so we need
> +* to move things up so that the vxh and outer ip header
> +* are now aligned
> +* XXX The Alexander Duyck idea was to only do the
> +* extra __skb_push() for NET_IP_ALIGN, and avoid the
> +* extram memmove and ->inner* adjustments. Plus keep
> +* additional pointers in frag[0] and have the driver pick
> +* up pointers from frag[0] .. need to investigate
> +* that suggestion further.
> +*/
> +   data = skb->data;
> +   skb->data -= NET_IP_ALIGN;
> +   memmove(skb->data, data, skb_headlen(skb));
> +   skb->inner_network_header -= NET_IP_ALIGN;
> +   skb->inner_mac_header -= NET_IP_ALIGN;
> +   skb->inner_transport_header -= NET_IP_ALIGN;
> +   }
> +#endif
> vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
> vxh->vx_flags = VXLAN_HF_VNI;
> vxh->vx_vni = vxlan_vni_field(vni);
>
> In the general case (when the skb passed to vxlan_build_skb is already
> non-linear), wouldn't we end up having to shift all the frags by 1 and/or
> do some type of memory copy of the inner packet? However, I think
> there are some clever things we can do in general, to avoid the memmove..

Right, basically my idea was to just skip the memmove, pull the data
out and add pointers to this spot in the fraglist.  Doing that you
should be pulling tail back so it is equal to data.  Then you just do
an skb_reserve(skb, -NET_IP_ALIGN) and you should be all set to start
adding outer headers.  The problem is you end up having to disable any
offsets such as GSO or checksum offload since you can't really use the
inner header offsets anymore.
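
A hedged sketch of that idea (the function name is made up; skb_reserve() only adjusts
skb->data and skb->tail, so a negative length simply steps them back):

/* Sketch of the idea, not tested: once the inner frame is referenced
 * from frag[0] and tail has been pulled back so it equals data, step
 * back over the pad so the outer headers start 4-byte aligned. */
static void vxlan_realign_for_outer_headers(struct sk_buff *skb)
{
	skb_reserve(skb, -NET_IP_ALIGN);

	/* now push the outer headers as usual, e.g.
	 *   vxh = (struct vxlanhdr *)__skb_push(skb, sizeof(*vxh));
	 * GSO/checksum offsets based on the inner headers would have to
	 * be dropped or recomputed, as noted above.
	 */
}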

> I also looked at the Rx path. Here the suggestion was:
> "we should only pull the outer headers from the page frag, and then
>  when the time is right we drop the outer headers and pull the inner
>  headers from the page frag.  That way we can keep all the headers
>  aligned."
> I hacked up ixgbe_add_rx_frag to always only create nonlinear skb's,
> i.e., always avoid the
>memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
> and then I end up with
> - ixgbe_clean_rx_irq copies outer header ether/ip/udp headers
>   into linear part as needed,
> - then udp_gro_receive -> vxlan_gro_receive pulls up vxlan header
>   into linear part, and then..

This is the point where we need to stop, drop the existing headers,
call skb_reserve(NET_IP_ALIGN), and then pick back up where we left
off.  We just have to make sure the skbuff isn't shared.

> - eth_gro_receive pulls up another 34 bytes for the eth + ip header.
>   This last pull ends up being unaligned.
> I dont know if we can safely drop the outer ip headers at this point
> (have not tried this yet, and I'm not sure we can do this in all udp
> encaps cases..)

In general the interface behind the tunnel shouldn't need to know
about any of the data in front of the tunnel, so you should be able to
drop the original header offsets and reset things if you want so you
could overwrite the old headers.

> one other possibility is to set up the inner frame as part of the
> ->frag_list (note, this is *not* skb_frag_t). I suspect that is 

[PATCH] Add netdev all_adj_list refcnt propagation to fix panic

2016-09-28 Thread David Ahern
Andrew Collins posted this patch as RFC in March:
http://patchwork.ozlabs.org/patch/603101/

It has apparently fallen through the cracks and was never applied.

It solves a refcnt problem (thanks Nik for pointing out this patch)
with stacked devices that involves macvlan on a bridge, a bond into
the bridge, and the bridge and macvlan are enslaved to a vrf:

++
|  myvrf |
++
  ||
  |  +-+
  |  | macvlan |
  |  +-+
  ||
  +--+
  |  bridge  |
  +--+
  |
  ++
  | bond0  |
  ++
  |
  ++
  |  swp3  |
  ++

Deleting bond0 hangs waiting for bond0 to become free. The splat in dmesg
is:

[  206.485340] [ cut here ]
[  206.486052] WARNING: CPU: 0 PID: 746 at 
/home/dsa/kernel-3.git/net/core/dev.c:6772 rollback_registered_many+0x28a/0x2da
[  206.487563] Modules linked in: macvlan bonding bridge stp llc vrf
[  206.488946] CPU: 0 PID: 746 Comm: ifdown Not tainted 4.8.0-rc7+ #144
[  206.489768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.7.5-20140531_083030-gandalf 04/01/2014
[  206.490973]   88003b3a3ca8 81287ad3 

[  206.491939]   88003b3a3ce8 8104f19d 
1a743b3a3b58
[  206.492914]  88003c93e000 88003b3a3d48 88003c877400 
88003b3a3d48
[  206.493873] Call Trace:
[  206.494184]  [] dump_stack+0x81/0xb6
[  206.494805]  [] __warn+0xc5/0xe0
[  206.495382]  [] warn_slowpath_null+0x18/0x1a
[  206.496094]  [] rollback_registered_many+0x28a/0x2da
[  206.496888]  [] rollback_registered+0x29/0x36
[  206.497596]  [] unregister_netdevice_queue+0x6a/0xa6
[  206.498383]  [] bonding_store_bonds+0x132/0x186 [bonding]
[  206.499209]  [] class_attr_store+0x1e/0x20
[  206.499893]  [] sysfs_kf_write+0x44/0x4b
[  206.500558]  [] kernfs_fop_write+0x113/0x15d
[  206.501269]  [] __vfs_write+0x21/0xa0
[  206.501897]  [] ? percpu_down_read+0x4e/0x7a
[  206.502595]  [] ? __sb_start_write+0x5a/0xab
[  206.503302]  [] ? __sb_start_write+0x5a/0xab
[  206.504015]  [] vfs_write+0xa2/0xc6
[  206.504637]  [] SyS_write+0x4b/0x79
[  206.505251]  [] entry_SYSCALL_64_fastpath+0x1f/0xbd
[  206.506054] ---[ end trace a578aa9ea7e7176a ]---
[  206.512017] PF_BRIDGE: RTM_SETLINK with unknown ifindex
[  216.765567] unregister_netdevice: waiting for bond0 to become free. Usage 
count = 1

The splat is from this line in rollback_registered_many():

WARN_ON(netdev_has_any_upper_dev(dev));
  ---

This is the original commit message from Andrew:

This is an RFC patch to fix a relatively easily reproducible kernel
panic related to the all_adj_list handling for netdevs in recent kernels.

This is more to generate discussion than anything else.  I don't
particularly like this approach, I'm hoping someone has a better idea.

The following sequence of commands will reproduce the issue:

ip link add link eth0 name eth0.100 type vlan id 100
ip link add link eth0 name eth0.200 type vlan id 200
ip link add name testbr type bridge
ip link set eth0.100 master testbr
ip link set eth0.200 master testbr
ip link add link testbr mac0 type macvlan
ip link delete dev testbr

This creates an upper/lower tree of (excuse the poor ASCII art):

/---eth0.100-eth0
mac0-testbr-
\---eth0.200-eth0

When testbr is deleted, the all_adj_lists are walked, and eth0 is deleted twice 
from
the mac0 list. Unfortunately, during setup in __netdev_upper_dev_link, only one
reference to eth0 is added, so this results in a panic.

This change adds reference count propagation so things are handled properly.

Matthias Schiffer reported a similar crash in batman-adv:

https://github.com/freifunk-gluon/gluon/issues/680
https://www.open-mesh.org/issues/247

which this patch also seems to resolve.

Patch is from Andrew Collins , but did not have
a formal sign-off. The patch applies to top of tree, so no change was made
on my part; I just pulled it off the web.

From: Andrew Collins 
Signed-off-by: David Ahern 
---
 net/core/dev.c | 69 --
 1 file changed, 38 insertions(+), 31 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c0c291f721d6..84963c99763a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5589,6 +5589,7 @@ static inline bool netdev_adjacent_is_neigh_list(struct 
net_device *dev,
 
 static int __netdev_adjacent_dev_insert(struct net_device *dev,
struct net_device *adj_dev,
+   u16 ref_nr,
struct list_head *dev_list,
void *private, bool master)
 {
@@ -5598,7 +5599,7 @@ static int __netdev_adjacent_dev_insert(struct net_device 
*dev,

Re: [PATCH 3/3] net: fec: align IP header in hardware

2016-09-28 Thread Eric Nelson
Thanks Russell,

On 09/28/2016 10:25 AM, Russell King - ARM Linux wrote:
> On Wed, Sep 28, 2016 at 10:14:52AM -0700, Eric Nelson wrote:
>> Thanks David,
>>
>> On 09/28/2016 09:42 AM, David Laight wrote:
>>> From reading this it seems that the effect of FEC_RACC_SHIFT16 is to
>>> add two bytes of 'junk' to the start of every receive frame.
>>
>> That's right. Two bytes of junk between the MAC header and the
>> IP header.
> 
> That's wrong.  FEC_RACC_SHIFT16 adds two bytes to the _beginning_ of
> the packet, not in the middle of the packet:
> 
>7  RX FIFO Shift-16
>  SHIFT16
>   When this field is set, the actual frame data starts at bit 16
>   of the first word read from the RX FIFO aligning the Ethernet
>   payload on a 32-bit boundary.
> 
>   NOTE: This function only affects the FIFO storage and has no
> influence on the statistics, which use the actual length
> of the frame received.
> 
>   0Disabled.
>   1Instructs the MAC to write two additional bytes in front
>of each frame received into the RX FIFO.
> 
> *in front* of the frame - that's before the Ethernet header.  Not between
> the ethernet and IP headers.
> 

I obviously mis-read this, and haven't dumped any packets to straighten
myself out.



[PATCH v1] mlx4: remove unused fields

2016-09-28 Thread David Decotigny
From: David Decotigny 

This also can address following UBSAN warnings:
[   36.640343] 

[   36.648772] UBSAN: Undefined behaviour in 
drivers/net/ethernet/mellanox/mlx4/fw.c:857:26
[   36.656853] shift exponent 64 is too large for 32-bit type 'int'
[   36.663348] 

[   36.671783] 

[   36.680213] UBSAN: Undefined behaviour in 
drivers/net/ethernet/mellanox/mlx4/fw.c:861:27
[   36.688297] shift exponent 35 is too large for 32-bit type 'int'
[   36.694702] 


Tested:
  reboot with UBSAN, no warning.

Signed-off-by: David Decotigny 
---
 drivers/net/ethernet/mellanox/mlx4/fw.c | 4 
 drivers/net/ethernet/mellanox/mlx4/fw.h | 2 --
 2 files changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c 
b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 090bf81..f9cbc67 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -853,12 +853,8 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct 
mlx4_dev_cap *dev_cap)
dev_cap->max_eqs = 1 << (field & 0xf);
MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MTT_OFFSET);
dev_cap->reserved_mtts = 1 << (field >> 4);
-   MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MRW_SZ_OFFSET);
-   dev_cap->max_mrw_sz = 1 << field;
MLX4_GET(field, outbox, QUERY_DEV_CAP_RSVD_MRW_OFFSET);
dev_cap->reserved_mrws = 1 << (field & 0xf);
-   MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_MTT_SEG_OFFSET);
-   dev_cap->max_mtt_seg = 1 << (field & 0x3f);
MLX4_GET(size, outbox, QUERY_DEV_CAP_NUM_SYS_EQ_OFFSET);
dev_cap->num_sys_eqs = size & 0xfff;
MLX4_GET(field, outbox, QUERY_DEV_CAP_MAX_REQ_QP_OFFSET);
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.h 
b/drivers/net/ethernet/mellanox/mlx4/fw.h
index f11614f..5343a05 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.h
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.h
@@ -80,9 +80,7 @@ struct mlx4_dev_cap {
int max_eqs;
int num_sys_eqs;
int reserved_mtts;
-   int max_mrw_sz;
int reserved_mrws;
-   int max_mtt_seg;
int max_requester_per_qp;
int max_responder_per_qp;
int max_rdma_global;
-- 
2.8.0.rc3.226.g39d4020



Re: ISDN-Gigaset: Fine-tuning for three function implementations

2016-09-28 Thread Paul Bolle
On Wed, 2016-09-28 at 18:50 +0200, SF Markus Elfring wrote:
> Would you like to look once more into an improved patch series for this
> software module a bit later?

I'm afraid I'm not looking forward to receiving an update of this
series, sorry.

Thanks,


Paul Bolle


Re: [PATCH 1/5] ISDN-Gigaset: Use kmalloc_array() in two functions

2016-09-28 Thread Paul Bolle
On Wed, 2016-09-28 at 18:38 +0200, SF Markus Elfring wrote:
> > I'm not going to change code just because some checker suggests to
> > do so.
> 
> The script "checkpatch.pl" can point information out like the
> following.
> 
> WARNING: Prefer kmalloc_array over kmalloc with multiply

Am I being trolled?


Paul Bolle


Re: [PATCH v2 net] net: skbuff: skb_vlan_push: Fix wrong unwinding of skb->data after __vlan_insert_tag call

2016-09-28 Thread Shmulik Ladkani
On Wed, 28 Sep 2016 16:43:38 +0200 Daniel Borkmann  wrote:
> Couldn't we end up with 1) for the act_vlan case when we'd have the
> offset-adjusted skb_vlan_push() fix from here, where we'd then redirect
> to ingress where skb_vlan_pop() would be called? If I'm not missing
> something, skb_vlan_push() would then point to the data location of 1)
> and with your other proposed direct netif_receive_skb() patch, no
> further skb->data adjustments would be done, right?

Right. Then skb_vlan_pop() should expect either (1) or (2).

> Another potential issue (but unrelated to this fix here) I just noticed
> is, whether act_vlan might have the same problem as we fixed in 8065694e6519
> ("bpf: fix checksum for vlan push/pop helper"). So potentially, we could
> end up fixing CHECKSUM_COMPLETE wrongly on ingress, since these 14 bytes
> are already pulled out of the sum at that point.
> 
> > Should we adjust "offset" back, only if resulting offset is >=14 ?  
> 
> If also the checksum one might end up as an issue, maybe it's just best
> to go through the pain and do the push/pull for data plus csum, so both
> skb_vlan_*() functions see the frame starting from mac header temporarily?

Although not related to this specific fix, I see 2 ways addressing the
rcsum problem:

1. Per your suggestion, skb_vlan_*() to expect 'data' at mac_header
   That would simplify things; for this suggested 'data unwind' fix as well

2. Within skb_vlan_*(), deduct (according to initial offset) whether
   we're already "pulled out" of the rcsum, and not invoke the
   skb_postpull/push_rcsum update.

Will meditate some more.

Thanks
Shmulik


Re: [PATCH net-next 1/2] net: phy: Add Wake-on-LAN driver for Microsemi PHYs.

2016-09-28 Thread Florian Fainelli
On 09/28/2016 05:01 AM, Raju Lakkaraju wrote:
> From: Raju Lakkaraju 
> 
> Wake-on-LAN (WoL) is an Ethernet networking standard that allows
> a computer/device to be turned on or awakened by a network message.
> VSC8531 PHY can support this feature configure by driver set function.
> WoL status get by driver get function.
> 
> Tested on Beaglebone Black with VSC 8531 PHY.
> 
> Signed-off-by: Raju Lakkaraju 
> ---
>  drivers/net/phy/mscc.c | 132 
> +
>  1 file changed, 132 insertions(+)
> 
> diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
> index d350deb..ca6ea23 100644
> --- a/drivers/net/phy/mscc.c
> +++ b/drivers/net/phy/mscc.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  enum rgmii_rx_clock_delay {
>   RGMII_RX_CLK_DELAY_0_2_NS = 0,
> @@ -35,6 +36,7 @@ enum rgmii_rx_clock_delay {
>  
>  #define MII_VSC85XX_INT_MASK   25
>  #define MII_VSC85XX_INT_MASK_MASK  0xa000
> +#define MII_VSC85XX_INT_MASK_WOL   0x0040
>  #define MII_VSC85XX_INT_STATUS 26
>  
>  #define MSCC_EXT_PAGE_ACCESS   31
> @@ -46,6 +48,19 @@ enum rgmii_rx_clock_delay {
>  #define RGMII_RX_CLK_DELAY_MASK0x0070
>  #define RGMII_RX_CLK_DELAY_POS 4
>  
> +#define MSCC_PHY_WOL_LOWER_MAC_ADDR21
> +#define MSCC_PHY_WOL_MID_MAC_ADDR  22
> +#define MSCC_PHY_WOL_UPPER_MAC_ADDR23
> +#define MSCC_PHY_WOL_LOWER_PASSWD  24
> +#define MSCC_PHY_WOL_MID_PASSWD25
> +#define MSCC_PHY_WOL_UPPER_PASSWD  26
> +
> +#define MSCC_PHY_WOL_MAC_CONTROL   27
> +#define EDGE_RATE_CNTL_POS 5
> +#define EDGE_RATE_CNTL_MASK0x00E0
> +#define SECURE_ON_ENABLE   0x8000
> +#define SECURE_ON_PASSWD_LEN_4 0x4000
> +
>  /* Microsemi PHY ID's */
>  #define PHY_ID_VSC8531 0x00070570
>  #define PHY_ID_VSC8541 0x00070770
> @@ -58,6 +73,119 @@ static int vsc85xx_phy_page_set(struct phy_device 
> *phydev, u8 page)
>   return rc;
>  }
>  
> +static int vsc85xx_wol_set(struct phy_device *phydev,
> +struct ethtool_wolinfo *wol)
> +{
> + int rc;
> + u16 reg_val;
> + struct ethtool_wolinfo *wol_conf = wol;
> +
+ mutex_lock(&phydev->lock);

This mutex is used here because you are using an indirect page access,
right? This is not to protect against multiple calls of wol_set from
different executing threads?
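
For context, the usual reason such a lock has to span the whole sequence on paged PHYs is
that the page-select register is shared device state; a rough, generic sketch follows
(MSCC_PHY_PAGE_STANDARD and the function name are assumptions, not taken from this patch):

/* Rough sketch only: page switch, extended-page accesses and the switch
 * back must all happen under one lock, or a concurrent user of the
 * standard page could be redirected to the wrong page. */
static int vsc85xx_wol_access_locked(struct phy_device *phydev)
{
	int rc;

	mutex_lock(&phydev->lock);

	rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED_2);
	if (rc == 0) {
		/* ... read/modify/write the WoL registers here ... */
		rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
	}

	mutex_unlock(&phydev->lock);
	return rc;
}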

> + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED_2);
> + if (rc != 0)
> + goto out_unlock;
> +
> + if (wol->wolopts & WAKE_MAGIC) {
> + /* Store the device address for the magic packet */
> + reg_val = phydev->attached_dev->dev_addr[4] << 8;
> + reg_val |= phydev->attached_dev->dev_addr[5];
> + phy_write(phydev, MSCC_PHY_WOL_LOWER_MAC_ADDR, reg_val);
> + reg_val = phydev->attached_dev->dev_addr[2] << 8;
> + reg_val |= phydev->attached_dev->dev_addr[3];
> + phy_write(phydev, MSCC_PHY_WOL_MID_MAC_ADDR, reg_val);
> + reg_val = phydev->attached_dev->dev_addr[0] << 8;
> + reg_val |= phydev->attached_dev->dev_addr[1];
> + phy_write(phydev, MSCC_PHY_WOL_UPPER_MAC_ADDR, reg_val);
> + } else {
> + phy_write(phydev, MSCC_PHY_WOL_LOWER_MAC_ADDR, 0);
> + phy_write(phydev, MSCC_PHY_WOL_MID_MAC_ADDR, 0);
> + phy_write(phydev, MSCC_PHY_WOL_UPPER_MAC_ADDR, 0);
> + }
> +
> + reg_val = phy_read(phydev, MSCC_PHY_WOL_MAC_CONTROL);
> + if (wol_conf->wolopts & WAKE_MAGICSECURE)
> + reg_val |= SECURE_ON_ENABLE;
> + else
> + reg_val &= ~SECURE_ON_ENABLE;
> + phy_write(phydev, MSCC_PHY_WOL_MAC_CONTROL, reg_val);
> +
> + if (wol_conf->wolopts & WAKE_MAGICSECURE) {
> + reg_val = wol_conf->sopass[4] << 8;
> + reg_val |= wol_conf->sopass[5];
> + phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, reg_val);
> + reg_val = wol_conf->sopass[2] << 8;
> + reg_val |= wol_conf->sopass[3];
> + phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, reg_val);
> + reg_val = wol_conf->sopass[0] << 8;
> + reg_val |= wol_conf->sopass[1];
> + phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, reg_val);
> + } else {
> + phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, 0);
> + phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, 0);
> + phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, 0);
> + }

How about making the code a little simpler in both cases with something
like this the following:

u16 pwd[3] = { };
unsigned int i;

if (wol_conf->wolopts & WAKE_MAGICSECURE)
for (i = 0; i < ARRAY_SIZE(pwd); i++)
pwd[i] = 
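
A fuller sketch of the loop-based simplification hinted at above (hedged; the helper name
and the exact byte ordering are inferred from the patch, not from the truncated snippet):

/* Sketch: the LOWER/MID/UPPER registers are consecutive, so the 6-byte
 * MAC address (or SecureOn password) can be written as three 16-bit
 * values in a loop instead of nine near-identical statements. */
static void vsc85xx_wol_write_6bytes(struct phy_device *phydev,
				     u16 lower_reg, const u8 *b)
{
	unsigned int i;

	for (i = 0; i < 3; i++) {
		u16 val = (b[4 - 2 * i] << 8) | b[5 - 2 * i];

		phy_write(phydev, lower_reg + i, val);
	}
}

/* usage, following the patch's logic:
 *   vsc85xx_wol_write_6bytes(phydev, MSCC_PHY_WOL_LOWER_MAC_ADDR,
 *			      phydev->attached_dev->dev_addr);
 * and the same with MSCC_PHY_WOL_LOWER_PASSWD / wol_conf->sopass,
 * or with a zeroed buffer when the feature is disabled.
 */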

Re: [PATCH 3/3] net: fec: align IP header in hardware

2016-09-28 Thread Russell King - ARM Linux
On Wed, Sep 28, 2016 at 10:14:52AM -0700, Eric Nelson wrote:
> Thanks David,
> 
> On 09/28/2016 09:42 AM, David Laight wrote:
> > From reading this it seems that the effect of FEC_RACC_SHIFT16 is to
> > add two bytes of 'junk' to the start of every receive frame.
> 
> That's right. Two bytes of junk between the MAC header and the
> IP header.

That's wrong.  FEC_RACC_SHIFT16 adds two bytes to the _beginning_ of
the packet, not in the middle of the packet:

   7  RX FIFO Shift-16
 SHIFT16
  When this field is set, the actual frame data starts at bit 16
  of the first word read from the RX FIFO aligning the Ethernet
  payload on a 32-bit boundary.

  NOTE: This function only affects the FIFO storage and has no
influence on the statistics, which use the actual length
of the frame received.

  0Disabled.
  1Instructs the MAC to write two additional bytes in front
   of each frame received into the RX FIFO.

*in front* of the frame - that's before the Ethernet header.  Not between
the ethernet and IP headers.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCH RFC 4/6] net: phy: Use phy name when requesting the interrupt

2016-09-28 Thread Florian Fainelli
On 09/28/2016 01:32 AM, Andrew Lunn wrote:
> Using the fixed name "phy_interrupt" is not very informative in
> /proc/interrupts when there are a lot of phys, e.g. a device with an
> Ethernet switch. So when requesting the interrupt, use the name of the
> phy.
> 
> Signed-off-by: Andrew Lunn 

Acked-by: Florian Fainelli 
-- 
Florian


Re: [PATCH 3/3] net: fec: align IP header in hardware

2016-09-28 Thread Eric Nelson
Thanks David,

On 09/28/2016 09:42 AM, David Laight wrote:
> From: Eric Nelson
>> Sent: 26 September 2016 19:40
>> Hi David,
>>
>> On 09/26/2016 02:26 AM, David Laight wrote:
>>> From: Eric Nelson
>>>> Sent: 24 September 2016 15:42
>>>> The FEC receive accelerator (RACC) supports shifting the data payload of
>>>> received packets by 16-bits, which aligns the payload (IP header) on a
>>>> 4-byte boundary, which is, if not required, at least strongly suggested
>>>> by the Linux networking layer.
>>> ...
>>>> +  /* align IP header */
>>>> +  val |= FEC_RACC_SHIFT16;
>>>
>>> I can't help feeling that there needs to be corresponding
>>> changes to increase the buffer size by 2 (maybe for large mtu)
>>> and to discard two bytes from the frame length.
>>>
>>
>> In the normal case, the fec driver over-allocates all receive packets to
>> be of size FEC_ENET_RX_FRSIZE (2048) minus the value of rx_align,
>> which is either 0x0f (ARM) or 0x03 (PPC).
>>
>> If the frame length is less than rx_copybreak (typically 256), then
>> the frame length from the receive buffer descriptor is used to
>> control the allocation size for a copied buffer, and this will include
>> the two bytes of padding if RACC_SHIFT16 is set.
>>
>>> If probably ought to be predicated on NET_IP_ALIGN as well.
>>>
>> Can you elaborate?
> 
> From reading this it seems that the effect of FEC_RACC_SHIFT16 is to
> add two bytes of 'junk' to the start of every receive frame.
> 

That's right. Two bytes of junk between the MAC header and the
IP header.

> In the 'copybreak' case the new skb would need to be 2 bytes shorter
> than the length reported by the hardware, and the data copied from
> 2 bytes into the dma buffer.
> 

As it stands, the skb allocated by the copybreak routine will include
the two bytes of padding, and the call to skb_pull_inline will ignore
them.

> The extra 2 bytes also mean the that maximum mtu that can be received
> into a buffer is two bytes less.
>

Right, but I think the max is already high enough that this isn't a
problem.

> If someone sets the mtu to (say) 9k for jumbo frames this might matter.
> Even with fixed 2048 byte buffers it reduces the maximum value the mtu
> can be set to by 2.
> 

As far as I can tell, the fec driver doesn't support jumbo frames, and
the max frame length is currently hard-coded at PKT_MAXBUF_SIZE (1522).

This is well within the 2048-byte allocation, even with optional headers
for VLAN etc.

> Now if NET_IP_ALIGN is zero then it is fine for the rx frame to start
> on a 4n boundary, and the skb are likely to be allocated that way.
> In this case you don't want to extra two bytes of 'junk'.
> 
NET_IP_ALIGN is defaulting to 2 by the conditional in skbuff.h
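
For reference, the conditional (paraphrased from include/linux/skbuff.h):

/* architectures that handle unaligned loads cheaply (e.g. x86) override
 * this to 0; everyone else gets the 2-byte pad so the IP header lands
 * on a 4-byte boundary */
#ifndef NET_IP_ALIGN
#define NET_IP_ALIGN	2
#endif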

> OTOH if NET_IP_ALIGN is 2 then you need to 'fiddle' things so that
> the data is dma'd to offset -2 in the skb and then ensure that the
> end of frame is set correctly.
> 

That's what the RACC SHIFT16 bit does.

The FEC hardware isn't capable of DMA'ing to an un-aligned address.
On ARM, it requires 64-bit alignment, but suggests 128-bit alignment.

On other (PPC?) architectures, it requires 32-bit alignment. This is
handled by the rx_align field.

Regards,


Eric


Re: [PATCH RFC 3/6] net: phy: Threaded interrupts allow some simplification

2016-09-28 Thread Florian Fainelli
On 09/28/2016 06:38 AM, Sergei Shtylyov wrote:
> On 09/28/2016 03:28 PM, Andrew Lunn wrote:
> 
> The PHY interrupts are now handled in a threaded interrupt handler,
> which can sleep. The work queue is no longer needed, phy_change() can
> be called directly. Additionally, none of the callers of
> phy_mac_interrupt() did so in interrupt context, so fully remove the

   I did intend to call it from interrupt context (from the ravb
 driver).

> work queue, and document that phy_mac_interrupt() should not be called
> in interrupt context.

   It was intentionally made callable from the interrupt context, I'd
 prefer
 if you wouldn't change that.
>>>
>>>OTOH, it's still not very handy to call because of the 'new_link'
>>> parameter which I'm not sure I can provide...
>>
>> Hi Sergei
>>
>> If there is a need for it, i will leave the work queue and keep this
>> code unchanged.
> 
>Let's hear what Florian says...

The intent is really to have phy_mac_interrupt() callable from hard IRQ
context, not that this matters really too much because link events
already occur in the slow path, but it's nice to have that property
retained IMHO.
-- 
Florian


Re: [PATCH v2 net] net: skbuff: skb_vlan_push: Fix wrong unwinding of skb->data after __vlan_insert_tag call

2016-09-28 Thread Shmulik Ladkani
On Wed, 28 Sep 2016 16:43:38 +0200 Daniel Borkmann  wrote:
> > (1) suppose upon entry we have
> >
> >  DA,SA,0x8100,TCI,0x0800,
> >  ^^
> >  mac_hdr  data
> >
> > initial offset is 18, and after current unwinding code we'll get  
> 
> You mean data points after the 0x0800, right?

Sorry. Yes, exactly as you say. Initially 18 bytes ahead:

DA,SA,0x8100,TCI,0x0800,
^   ^
mac_hdr data


Re: [PATCH net-next] net/vxlan: Avoid unaligned access in vxlan_build_skb()

2016-09-28 Thread Sowmini Varadhan
On (09/23/16 17:43), Alexander Duyck wrote:
> > On (09/23/16 10:38), Alexander Duyck wrote:
  ;
> >> almost think of it as us doing something like the inverse of
> >> pskb_pull_tail.  The general idea here is we want to actually leave
> >> the data in skb->data, but just reference it from frag[0] so that we
> >> don't accidentally pull in the 2 byte padding for alignment when
> >> transmitting the frame.

Some additional findings..

Just to recap how we got here: for the Rx path, the inner packet has
been set up as an ethernet frame with the IP header at an aligned address
when it hits vxlan_build_skb.  But that means the (inner) mac address
was offset by NET_IP_ALIGN so vxlan_build_skb needs to pad the data
by NET_IP_ALIGN to make the vxh outer ip header align. 

Then we'd need to do something like the suggestion above (keep some
pointers in frag[0]?  do the reverse of a pskb_expand_head to push out
the inner ip header to the skb_frag_t?), to have the driver skip over the
pad.. 

I tried the following for a hack, and it takes care of the tx side
unaligned access, though, clearly, the memmove needs to be avoided

@@ -1750,10 +1825,38 @@ static int vxlan_build_skb(struct sk_buff *skb, struct d
if (err)
goto out_free;
 
+
+#if (NET_IP_ALIGN != 0) 
+   {
+   unsigned char *data;
+
+   /* inner packet is an ethernet frame that was set up
+* so that the IP header is aligned. But that means the
+* mac address was offset by NET_IP_ALIGN, so we need
+* to move things up so that the vxh and outer ip header
+* are now aligned
+* XXX The Alexander Duyck idea was to only do the
+* extra __skb_push() for NET_IP_ALIGN, and avoid the
+* extram memmove and ->inner* adjustments. Plus keep
+* additional pointers in frag[0] and have the driver pick
+* up pointers from frag[0] .. need to investigate
+* that suggestion further.
+*/
+   data = skb->data;
+   skb->data -= NET_IP_ALIGN;
+   memmove(skb->data, data, skb_headlen(skb));
+   skb->inner_network_header -= NET_IP_ALIGN;
+   skb->inner_mac_header -= NET_IP_ALIGN;
+   skb->inner_transport_header -= NET_IP_ALIGN;
+   }
+#endif
vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
vxh->vx_flags = VXLAN_HF_VNI;
vxh->vx_vni = vxlan_vni_field(vni);

In the general case (when the skb passed to vxlan_build_skb is already
non-linear), wouldn't we end up having to shift all the frags by 1 and/or
do some type of memory copy of the inner packet? However, I think
there are some clever things we can do in general, to avoid the memmove..

I also looked at the Rx path. Here the suggestion was:
"we should only pull the outer headers from the page frag, and then
 when the time is right we drop the outer headers and pull the inner
 headers from the page frag.  That way we can keep all the headers
 aligned."
I hacked up ixgbe_add_rx_frag to always only create nonlinear skb's, 
i.e., always avoid the
   memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
and then I end up with 
- ixgbe_clean_rx_irq copies outer header ether/ip/udp headers
  into linear part as needed, 
- then udp_gro_receive -> vxlan_gro_receive pulls up vxlan header
  into linear part, and then..
- eth_gro_receive pulls up another 34 bytes for the eth + ip header.
  This last pull ends up being unaligned. 
I dont know if we can safely drop the outer ip headers at this point
(have not tried this yet, and I'm not sure we can do this in all udp
encaps cases..)

one other possibility is to set up the inner frame as part of the 
->frag_list (note, this is *not* skb_frag_t). I suspect that is going
to cause other inefficiencies.

but, as tom has been saying all along, a big part of this problem is that
we are tripping up on the ethernet header in the middle of an
IP packet. Unfortunately I don't think the ietf is going to agree
to never ever do that, so I'm not sure we can win that architectural battle.
 







[PATCH 0/3] [v2] Add basic ACPI support to the Qualcomm Technologies EMAC driver

2016-09-28 Thread Timur Tabi
This patch series adds support to the EMAC driver for extracting addresses,
interrupts, and some _DSDs (properties) from ACPI.  The first two patches
clean up the code, and the third patch adds ACPI-specific functionality.

The first patch fixes a bug with handling the platform_device for the
internal PHY.  This phy is treated as a separate device in both DT and
ACPI, but since the platform is not released automatically when the 
driver unloads, managed functions like devm_ioremap_resource cannot be
used. 

The second patch replaces of_get_mac_address with its platform-independent
equivalent device_get_mac_address.  

The third patch parses the ACPI tables to obtain the platform_device for
the primary EMAC node ("QCOM8070") and the internal phy node ("QCOM8071").

Timur Tabi (3):
  [v2] net: qcom/emac: do not use devm on internal phy pdev
  [v2] net: qcom/emac: use device_get_mac_address
  [v2] net: qcom/emac: initial ACPI support

 drivers/net/ethernet/qualcomm/emac/emac-phy.c   |  37 ++--
 drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 110 ++--
 drivers/net/ethernet/qualcomm/emac/emac.c   |  26 --
 3 files changed, 134 insertions(+), 39 deletions(-)

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.



[PATCH 2/3] [v2] net: qcom/emac: use device_get_mac_address

2016-09-28 Thread Timur Tabi
Replace the DT-specific of_get_mac_address() function with
device_get_mac_address, which works on both DT and ACPI platforms.  This
change makes it easier to add ACPI support.

Signed-off-by: Timur Tabi 
---
 drivers/net/ethernet/qualcomm/emac/emac.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
b/drivers/net/ethernet/qualcomm/emac/emac.c
index 429b4cb..551df1c 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -531,18 +531,16 @@ static void emac_clks_teardown(struct emac_adapter *adpt)
 static int emac_probe_resources(struct platform_device *pdev,
struct emac_adapter *adpt)
 {
-   struct device_node *node = pdev->dev.of_node;
struct net_device *netdev = adpt->netdev;
struct resource *res;
-   const void *maddr;
+   char maddr[ETH_ALEN];
int ret = 0;
 
/* get mac address */
-   maddr = of_get_mac_address(node);
-   if (!maddr)
-   eth_hw_addr_random(netdev);
-   else
+   if (device_get_mac_address(&pdev->dev, maddr, ETH_ALEN))
ether_addr_copy(netdev->dev_addr, maddr);
+   else
+   eth_hw_addr_random(netdev);
 
/* Core 0 interrupt */
ret = platform_get_irq(pdev, 0);
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.



[PATCH 1/3] [v2] net: qcom/emac: do not use devm on internal phy pdev

2016-09-28 Thread Timur Tabi
The platform_device returned by of_find_device_by_node() is not
automatically released when the driver unprobes.  Therefore,
managed calls like devm_ioremap_resource() should not be used.
Instead, we manually allocate the resources and then free them
on driver release.

Signed-off-by: Timur Tabi 
---
 drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 42 +++--
 drivers/net/ethernet/qualcomm/emac/emac.c   |  4 +++
 2 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c 
b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
index 6ab0a3c..ad0e420 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
@@ -681,6 +681,7 @@ int emac_sgmii_config(struct platform_device *pdev, struct 
emac_adapter *adpt)
struct resource *res;
const struct of_device_id *match;
struct device_node *np;
+   int ret;
 
np = of_parse_phandle(pdev->dev.of_node, "internal-phy", 0);
if (!np) {
@@ -697,25 +698,48 @@ int emac_sgmii_config(struct platform_device *pdev, 
struct emac_adapter *adpt)
match = of_match_device(emac_sgmii_dt_match, &sgmii_pdev->dev);
if (!match) {
dev_err(&pdev->dev, "unrecognized internal phy node\n");
-   return -ENODEV;
+   ret = -ENODEV;
+   goto error_put_device;
}
 
phy->initialize = (emac_sgmii_initialize)match->data;
 
/* Base address is the first address */
res = platform_get_resource(sgmii_pdev, IORESOURCE_MEM, 0);
-   phy->base = devm_ioremap_resource(&sgmii_pdev->dev, res);
-   if (IS_ERR(phy->base))
-   return PTR_ERR(phy->base);
+   phy->base = ioremap(res->start, resource_size(res));
+   if (IS_ERR(phy->base)) {
+   ret = PTR_ERR(phy->base);
+   goto error_put_device;
+   }
 
/* v2 SGMII has a per-lane digital digital, so parse it if it exists */
res = platform_get_resource(sgmii_pdev, IORESOURCE_MEM, 1);
if (res) {
-   phy->digital = devm_ioremap_resource(&sgmii_pdev->dev, res);
-   if (IS_ERR(phy->base))
-   return PTR_ERR(phy->base);
-
+   phy->digital = ioremap(res->start, resource_size(res));
+   if (IS_ERR(phy->digital)) {
+   ret = PTR_ERR(phy->digital);
+   goto error_unmap_base;
+   }
}
 
-   return phy->initialize(adpt);
+   ret = phy->initialize(adpt);
+   if (ret)
+   goto error;
+
+   /* We've remapped the addresses, so we don't need the device any
+* more.  of_find_device_by_node() says we should release it.
+*/
+   put_device(&sgmii_pdev->dev);
+
+   return 0;
+
+error:
+   if (phy->digital)
+   iounmap(phy->digital);
+error_unmap_base:
+   iounmap(phy->base);
+error_put_device:
+   put_device(&sgmii_pdev->dev);
+
+   return ret;
 }
diff --git a/drivers/net/ethernet/qualcomm/emac/emac.c 
b/drivers/net/ethernet/qualcomm/emac/emac.c
index e47d387..429b4cb 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac.c
@@ -723,6 +723,10 @@ static int emac_remove(struct platform_device *pdev)
mdiobus_unregister(adpt->mii_bus);
free_netdev(netdev);
 
+   if (adpt->phy.digital)
+   iounmap(adpt->phy.digital);
+   iounmap(adpt->phy.base);
+
return 0;
 }
 
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.



[PATCH 3/3] [v2] net: qcom/emac: initial ACPI support

2016-09-28 Thread Timur Tabi
Add support for reading addresses, interrupts, and _DSD properties
from ACPI tables, just like with device tree.  The HID for the
EMAC device itself is QCOM8070.  The internal PHY is represented
by a child node with a HID of QCOM8071.

The EMAC also has some complex clock initialization requirements
that are not represented by this patch.  This will be addressed
in a future patch.

Signed-off-by: Timur Tabi 
---
 drivers/net/ethernet/qualcomm/emac/emac-phy.c   | 37 ++---
 drivers/net/ethernet/qualcomm/emac/emac-sgmii.c | 72 ++---
 drivers/net/ethernet/qualcomm/emac/emac.c   | 12 +
 3 files changed, 95 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/emac/emac-phy.c 
b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
index c412ba9..da4e90d 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-phy.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-phy.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "emac.h"
 #include "emac-mac.h"
 #include "emac-phy.h"
@@ -167,7 +168,6 @@ static int emac_mdio_write(struct mii_bus *bus, int addr, 
int regnum, u16 val)
 int emac_phy_config(struct platform_device *pdev, struct emac_adapter *adpt)
 {
struct device_node *np = pdev->dev.of_node;
-   struct device_node *phy_np;
struct mii_bus *mii_bus;
int ret;
 
@@ -183,14 +183,37 @@ int emac_phy_config(struct platform_device *pdev, struct 
emac_adapter *adpt)
mii_bus->parent = &pdev->dev;
mii_bus->priv = adpt;
 
-   ret = of_mdiobus_register(mii_bus, np);
-   if (ret) {
-   dev_err(&pdev->dev, "could not register mdio bus\n");
-   return ret;
+   if (has_acpi_companion(&pdev->dev)) {
+   u32 phy_addr;
+
+   ret = mdiobus_register(mii_bus);
+   if (ret) {
+   dev_err(&pdev->dev, "could not register mdio bus\n");
+   return ret;
+   }
+   ret = device_property_read_u32(&pdev->dev, "phy-channel",
+  _addr);
+   if (ret)
+   /* If we can't read a valid phy address, then assume
+* that there is only one phy on this mdio bus.
+*/
+   adpt->phydev = phy_find_first(mii_bus);
+   else
+   adpt->phydev = mdiobus_get_phy(mii_bus, phy_addr);
+
+   } else {
+   struct device_node *phy_np;
+
+   ret = of_mdiobus_register(mii_bus, np);
+   if (ret) {
+   dev_err(&pdev->dev, "could not register mdio bus\n");
+   return ret;
+   }
+
+   phy_np = of_parse_phandle(np, "phy-handle", 0);
+   adpt->phydev = of_phy_find_device(phy_np);
}
 
-   phy_np = of_parse_phandle(np, "phy-handle", 0);
-   adpt->phydev = of_phy_find_device(phy_np);
if (!adpt->phydev) {
dev_err(&pdev->dev, "could not find external phy\n");
mdiobus_unregister(mii_bus);
diff --git a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c 
b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
index ad0e420..3d2c05a 100644
--- a/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
+++ b/drivers/net/ethernet/qualcomm/emac/emac-sgmii.c
@@ -14,6 +14,7 @@
  */
 
 #include 
+#include 
 #include 
 #include "emac.h"
 #include "emac-mac.h"
@@ -662,6 +663,24 @@ void emac_sgmii_reset(struct emac_adapter *adpt)
clk_set_rate(adpt->clk[EMAC_CLK_HIGH_SPEED], 12500);
 }
 
+static int emac_sgmii_acpi_match(struct device *dev, void *data)
+{
+   static const struct acpi_device_id match_table[] = {
+   {
+   .id = "QCOM8071",
+   .driver_data = (kernel_ulong_t)emac_sgmii_init_v2,
+   },
+   {}
+   };
+   const struct acpi_device_id *id = acpi_match_device(match_table, dev);
+   emac_sgmii_initialize *initialize = data;
+
+   if (id)
+   *initialize = (emac_sgmii_initialize)id->driver_data;
+
+   return !!id;
+}
+
 static const struct of_device_id emac_sgmii_dt_match[] = {
{
.compatible = "qcom,fsm9900-emac-sgmii",
@@ -679,30 +698,45 @@ int emac_sgmii_config(struct platform_device *pdev, 
struct emac_adapter *adpt)
struct platform_device *sgmii_pdev = NULL;
struct emac_phy *phy = &adpt->phy;
struct resource *res;
-   const struct of_device_id *match;
-   struct device_node *np;
int ret;
 
-   np = of_parse_phandle(pdev->dev.of_node, "internal-phy", 0);
-   if (!np) {
-   dev_err(&pdev->dev, "missing internal-phy property\n");
-   return -ENODEV;
-   }
+   if (has_acpi_companion(&pdev->dev)) {
+   struct device *dev;
 
-   sgmii_pdev = of_find_device_by_node(np);
-   if (!sgmii_pdev) {
-   dev_err(>dev, 

Re: ISDN-Gigaset: Fine-tuning for three function implementations

2016-09-28 Thread SF Markus Elfring
> Two of the five patches introduced bugs. The rest of the series isn't
> free of various nits either. Of course, I was in no mood to be lenient
> when I looked at those three patches.
> 
> I won't take any of these patches, sorry.

Would you like to look once more into an improved patch series for this
software module a bit later?

Regards,
Markus


Re: [PATCHv2 net 0/5] sctp: some fixes of prsctp policies

2016-09-28 Thread Xin Long
>
> That doesn't matter.  This happens all the time.
>
> The way you handle this is you submit patch #1 targetting 'net' and
> wait for it to get into 'net' and propagate into 'net-next'.  When
> that happens you can submit patches #2-5 as a second patch series
> targetting 'net-next'.

Thanks, will reorganize them.


RE: [PATCH 3/3] net: fec: align IP header in hardware

2016-09-28 Thread David Laight
From: Eric Nelson
> Sent: 26 September 2016 19:40
> Hi David,
> 
> On 09/26/2016 02:26 AM, David Laight wrote:
> > From: Eric Nelson
> >> Sent: 24 September 2016 15:42
> >> The FEC receive accelerator (RACC) supports shifting the data payload of
> >> received packets by 16-bits, which aligns the payload (IP header) on a
> >> 4-byte boundary, which is, if not required, at least strongly suggested
> >> by the Linux networking layer.
> > ...
> >> +  /* align IP header */
> >> +  val |= FEC_RACC_SHIFT16;
> >
> > I can't help feeling that there needs to be corresponding
> > changes to increase the buffer size by 2 (maybe for large mtu)
> > and to discard two bytes from the frame length.
> >
> 
> In the normal case, the fec driver over-allocates all receive packets to
> be of size FEC_ENET_RX_FRSIZE (2048) minus the value of rx_align,
> which is either 0x0f (ARM) or 0x03 (PPC).
> 
> If the frame length is less than rx_copybreak (typically 256), then
> the frame length from the receive buffer descriptor is used to
> control the allocation size for a copied buffer, and this will include
> the two bytes of padding if RACC_SHIFT16 is set.
> 
> > If probably ought to be predicated on NET_IP_ALIGN as well.
> >
> Can you elaborate?

From reading this it seems that the effect of FEC_RACC_SHIFT16 is to
add two bytes of 'junk' to the start of every receive frame.

In the 'copybreak' case the new skb would need to be 2 bytes shorter
than the length reported by the hardware, and the data copied from
2 bytes into the dma buffer.

The extra 2 bytes also mean that the maximum mtu that can be received
into a buffer is two bytes less.
If someone sets the mtu to (say) 9k for jumbo frames this might matter.
Even with fixed 2048 byte buffers it reduces the maximum value the mtu
can be set to by 2.

Now if NET_IP_ALIGN is zero then it is fine for the rx frame to start
on a 4n boundary, and the skb are likely to be allocated that way.
In this case you don't want the extra two bytes of 'junk'.

OTOH if NET_IP_ALIGN is 2 then you need to 'fiddle' things so that
the data is dma'd to offset -2 in the skb and then ensure that the
end of frame is set correctly.
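
As a rough illustration (hypothetical helper, not the actual fec driver code),
the copybreak path would need something along these lines once SHIFT16 is
enabled:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

/* Hypothetical sketch only: copy a received frame out of the DMA buffer,
 * dropping the two pad bytes that FEC_RACC_SHIFT16 puts in front of the
 * Ethernet header.  "hw_len" is the length reported by the RX descriptor,
 * which includes that padding. */
static struct sk_buff *fec_rx_copybreak_sketch(struct net_device *ndev,
					       const u8 *dma_buf, u16 hw_len,
					       bool shift16)
{
	struct sk_buff *skb;
	u16 pad, pkt_len;

	pad = shift16 ? 2 : 0;
	pkt_len = hw_len - pad;

	skb = netdev_alloc_skb(ndev, pkt_len + NET_IP_ALIGN);
	if (!skb)
		return NULL;

	/* Keep the IP header aligned in the copied skb as well. */
	skb_reserve(skb, NET_IP_ALIGN);
	memcpy(skb_put(skb, pkt_len), dma_buf + pad, pkt_len);
	return skb;
}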

David



Re: [PATCH 1/5] ISDN-Gigaset: Use kmalloc_array() in two functions

2016-09-28 Thread SF Markus Elfring
>> * Multiplications for the size determination of memory allocations
>>   indicated that array data structures should be processed.
>>   Thus use the corresponding function "kmalloc_array".
> 
> Was the current code incorrect?

I suggest to use a safer interface for array allocations.


> What makes kmalloc_array() better?

1. How do you think about the safety checks that this function provides?

2. Will you be also affected by further software evolution here?
   2016-07-26
   mm: faster kmalloc_array(), kcalloc()
   
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=91c6a05f72a996bee5133e76374ab3ad7d3b9b72


> I'm not going to change code just because some checker suggests to do so.

The script "checkpatch.pl" can point information out like the following.

WARNING: Prefer kmalloc_array over kmalloc with multiply
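
For context, this is roughly the check that kmalloc_array() performs on top
of a plain kmalloc() with an open-coded multiply (simplified sketch of the
include/linux/slab.h helper, not a verbatim copy):

#include <linux/kernel.h>
#include <linux/slab.h>

/* Simplified: if n * size would overflow, fail the allocation instead of
 * silently wrapping and handing back a buffer smaller than the caller
 * expects. */
static inline void *kmalloc_array_sketch(size_t n, size_t size, gfp_t flags)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return kmalloc(n * size, flags);
}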


>>   This issue was detected by using the Coccinelle software.
> 
> So? And which coccinelle script was actually used?

How do you think about to look into related information sources?
https://github.com/coccinelle/coccinelle/issues/81

Would you like to experiment any further with an excerpt?


@replacement1@
expression count, target;
type T;
@@
 target =
- kmalloc(sizeof(T) * (count)
+ kmalloc_array(count, sizeof(T)
, ...);

@replacement2@
expression count, pointer, target;
@@
 target =
- kmalloc(sizeof(*pointer) * (count)
+ kmalloc_array(count, sizeof(*pointer)
, ...);


> I couldn't spot a coccinelle script doing that in the current tree.

This is true for such a software update opportunity.


>> * Replace the specification of a data structure by a pointer dereference
>>   to make the corresponding size determination a bit safer according to
>>   the Linux coding style convention.
> 
> I'm not happy with you mixing this with the above, less trivial, change.

I find that it is a useful combination. - A parameter is adjusted together
with a special function name.


>> -drv->cs = kmalloc(minors * sizeof *drv->cs, GFP_KERNEL);
>> +drv->cs = kmalloc_array(minors, sizeof(*drv->cs), GFP_KERNEL);
> 
> For "minors" the same holds as for "channels", above.
> 
> And you snuck in a parentheses change. That should have probably been
> merged with 5/5.

Would you prefer to add them in another update step?

Regards,
Markus


RE: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-28 Thread David Laight
From: Vlastimil Babka
> Sent: 27 September 2016 12:51
...
> Process name suggests it's part of db2 database. It seems it has to implement
> its own interface to select() syscall, because glibc itself seems to have a
> FD_SETSIZE limit of 1024, which is probably why this wasn't an issue for all 
> the
> years...

ISTR the canonical way to increase the size being to set FD_SETSIZE
to a larger value before including any of the headers.

Or doesn't that work with linux and glibc ??
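
The recipe being recalled looks like the sketch below; the caveat in the
question above stands, since glibc's fd_set is a fixed 1024-bit array and
overriding FD_SETSIZE this way is not guaranteed to have any effect there:

#define FD_SETSIZE 4096		/* must precede the first header include */
#include <sys/select.h>
#include <stdio.h>

int main(void)
{
	fd_set rfds;

	FD_ZERO(&rfds);
	printf("FD_SETSIZE=%d, sizeof(fd_set)=%zu bytes\n",
	       FD_SETSIZE, sizeof(rfds));
	return 0;
}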

David



Re: [PATCH net-next 1/2] net: phy: Add Wake-on-LAN driver for Microsemi PHYs.

2016-09-28 Thread Andrew Lunn
> +#define MSCC_PHY_WOL_MAC_CONTROL   27
> +#define EDGE_RATE_CNTL_POS 5
> +#define EDGE_RATE_CNTL_MASK0x00E0

This patch does not require these two #defines.

Please indicate in the cover note if the patches depends on other
patches in order to cleanly apply. Or if these patches are going to
conflict with some other patches.

> + reg_val = phy_read(phydev, MSCC_PHY_WOL_MAC_CONTROL);
> + if (wol_conf->wolopts & WAKE_MAGICSECURE)
> + reg_val |= SECURE_ON_ENABLE;
> + else
> + reg_val &= ~SECURE_ON_ENABLE;
> + phy_write(phydev, MSCC_PHY_WOL_MAC_CONTROL, reg_val);
> +
> + if (wol_conf->wolopts & WAKE_MAGICSECURE) {
> + reg_val = wol_conf->sopass[4] << 8;
> + reg_val |= wol_conf->sopass[5];
> + phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, reg_val);
> + reg_val = wol_conf->sopass[2] << 8;
> + reg_val |= wol_conf->sopass[3];
> + phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, reg_val);
> + reg_val = wol_conf->sopass[0] << 8;
> + reg_val |= wol_conf->sopass[1];
> + phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, reg_val);
> + } else {
> + phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, 0);
> + phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, 0);
> + phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, 0);
> + }

Wouldn't it be better to set the password, and then enable the
password feature?
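
For illustration, the reordering being suggested might look like the sketch
below (register macros as named in the patch under review; untested, and
error handling from phy_read() omitted):

#include <linux/ethtool.h>
#include <linux/phy.h>

static void vsc85xx_wol_secure_on_sketch(struct phy_device *phydev,
					 const struct ethtool_wolinfo *wol_conf)
{
	int reg_val;

	if (wol_conf->wolopts & WAKE_MAGICSECURE) {
		/* Program the SecureOn password first... */
		phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD,
			  (wol_conf->sopass[4] << 8) | wol_conf->sopass[5]);
		phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD,
			  (wol_conf->sopass[2] << 8) | wol_conf->sopass[3]);
		phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD,
			  (wol_conf->sopass[0] << 8) | wol_conf->sopass[1]);
	} else {
		phy_write(phydev, MSCC_PHY_WOL_LOWER_PASSWD, 0);
		phy_write(phydev, MSCC_PHY_WOL_MID_PASSWD, 0);
		phy_write(phydev, MSCC_PHY_WOL_UPPER_PASSWD, 0);
	}

	/* ...and only then flip the enable bit in the MAC control register. */
	reg_val = phy_read(phydev, MSCC_PHY_WOL_MAC_CONTROL);
	if (wol_conf->wolopts & WAKE_MAGICSECURE)
		reg_val |= SECURE_ON_ENABLE;
	else
		reg_val &= ~SECURE_ON_ENABLE;
	phy_write(phydev, MSCC_PHY_WOL_MAC_CONTROL, reg_val);
}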

I don't know much about WOL. Hopefully Florian will add further
comments.

 Andrew


Re: [net-next 5/5] PCI: disable FLR for 82579 device

2016-09-28 Thread Bjorn Helgaas
On Wed, Sep 28, 2016 at 03:33:52PM +, Neftin, Sasha wrote:
> 
> Since I worked with Sasha on this I will provide a bit of information from 
> what I understand of this bug as well.
> 
> On Tue, Sep 27, 2016 at 12:13 PM, Alex Williamson 
>  wrote:
> > On Tue, 27 Sep 2016 13:17:02 -0500
> > Bjorn Helgaas  wrote:
> >
> >> On Sun, Sep 25, 2016 at 10:02:43AM +0300, Neftin, Sasha wrote:
> >> > On 9/24/2016 12:05 AM, Jeff Kirsher wrote:
> >> > >On Fri, 2016-09-23 at 09:01 -0500, Bjorn Helgaas wrote:
> >> > >>On Thu, Sep 22, 2016 at 11:39:01PM -0700, Jeff Kirsher wrote:
> >> > >>>From: Sasha Neftin 
> >> > >>>
> >> > >>>82579 has a problem reattaching itself after the device is detached.
> >> > >>>The bug was reported by Redhat. The suggested fix is to disable 
> >> > >>>FLR capability in PCIe configuration space.
> >> > >>>
> >> > >>>Reproduction:
> >> > >>>Attach the device to a VM, then detach and try to attach again.
> >> > >>>
> >> > >>>Fix:
> >> > >>>Disable FLR capability to prevent the 82579 from hanging.
> >> > >>Is there a bugzilla or other reference URL to include here?  
> >> > >>Should this be marked for stable?
> >> > >So the author is in Israel, meaning it is their weekend now.  I do 
> >> > >not believe Sasha monitors email over the weekend, so a response 
> >> > >to your questions won't happen for a few days.
> >> > >
> >> > >I tried searching my archives for more information, but had no 
> >> > >luck finding any additional information.
> >> > >
> 
> I agree that we do probably need to update the patch description since it 
> isn't exactly clear what this is fixing or what was actually broken.
> 
> >> > >>>Signed-off-by: Sasha Neftin 
> >> > >>>Tested-by: Aaron Brown 
> >> > >>>Signed-off-by: Jeff Kirsher 
> >> > >>>---
> >> > >>>  drivers/pci/quirks.c | 21 +
> >> > >>>  1 file changed, 21 insertions(+)
> >> > >>>
> >> > >>>diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 
> >> > >>>44e0ff3..59fba6e 100644
> >> > >>>--- a/drivers/pci/quirks.c
> >> > >>>+++ b/drivers/pci/quirks.c
> >> > >>>@@ -4431,3 +4431,24 @@ static void quirk_intel_qat_vf_cap(struct 
> >> > >>>pci_dev *pdev)
> >> > >>> }
> >> > >>>  }
> >> > >>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, 
> >> > >>>quirk_intel_qat_vf_cap);
> >> > >>>+/*
> >> > >>>+ * Workaround FLR issues for 82579
> >> > >>>+ * This code disables the FLR (Function Level Reset) via PCIe, 
> >> > >>>+in
> >> > >>>order
> >> > >>>+ * to workaround a bug found while using device passthrough, 
> >> > >>>+ where the
> >> > >>>+ * interface would become non-responsive.
> >> > >>>+ * NOTE: the FLR bit is Read/Write Once (RWO) in config space, 
> >> > >>>+ so if
> >> > >>>+ * the BIOS or kernel writes this register * then this 
> >> > >>>+ workaround will
> >> > >>>+ * not work.
> >> > >>This doesn't sound like a root cause.  Is the issue a hardware 
> >> > >>erratum?  Linux PCI core bug?  VFIO bug?  Device firmware bug?
> >> > >>
> >> > >>The changelog suggests that the problem only affects passthrough, 
> >> > >>which suggests some sort of kernel bug related to how passthrough 
> >> > >>is implemented.
> >>
> >> If this bug affects all scenarios, not just passthrough, the 
> >> changelog should not mention passthrough.
> >>
> >> > >>>+ */
> >> > >>>+static void quirk_intel_flr_cap_dis(struct pci_dev *dev) {
> >> > >>>+int pos = pci_find_capability(dev, PCI_CAP_ID_AF);
> >> > >>>+if (pos) {
> >> > >>>+u8 cap;
>> > >>>+pci_read_config_byte(dev, pos + PCI_AF_CAP, &cap);
> >> > >>>+cap = cap & (~PCI_AF_CAP_FLR);
> >> > >>>+pci_write_config_byte(dev, pos + PCI_AF_CAP, cap);
> >> > >>>+}
> >> > >>>+}
> >> > >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502,
> >> > >>>quirk_intel_flr_cap_dis);
> >> > >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503,
> >> > >>>quirk_intel_flr_cap_dis);
> >> > >>>--
> >> > >>>2.7.4
> >> > >>>
> >> > >>>--
> >> > >>>To unsubscribe from this list: send the line "unsubscribe 
> >> > >>>linux-pci" in the body of a message to majord...@vger.kernel.org 
> >> > >>>More majordomo info at  
> >> > >>>http://vger.kernel.org/majordomo-info.html
> >> >
> >> > Hello,
> >> >
> >> > Original bugzilla thread could be found here:
> >> > https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=966840
> >>
> >> That bugzilla is private and I can't read it.
> >
> > Hmm, I can, but I don't see anything in it that supports this.  Is 
> > that really the right bz?  It's the right hardware, but has all sorts 
> > of FUD about the version of various other components in the stack.
> 
> It looks like we had a local copy of the bugzilla saved here, though it only 
> goes up to comment 13 which is where I think we started working this on our 
> side.  I believe what this 

Re: [PATCH v4 net-next] net: phy: Add Edge-rate driver for Microsemi PHYs.

2016-09-28 Thread Andrew Lunn
> +Optional properties:
> +- vsc8531,vddmac : The vddmac in mV.
> +- vsc8531,edge-slowdown  : % the edge should be slowed down relative to
> +   the fastest possible edge time. Native sign
> +   need not enter.
> +   Edge rate sets the drive strength of the MAC
> +   interface output signals.  Changing the drive
> +   strength will affect the edge rate of the output
> +   signal.  The goal of this setting is to help
> +   reduce electrical emission (EMI) by being able
> +   to reprogram drive strength and in effect slow
> +   down the edge rate if desired.  Table 5 shows the

Hi Raju

There is no table five here? Is that a reference to a data sheet
table?

> +Example:
> +
> +vsc8531_0: ethernet-phy@0 {
> +compatible = "ethernet-phy-id0007.0570";
> +vsc8531,vddmac   = /bits/ 16 ;
> +vsc8531,edge-slowdown= /bits/ 8  <17>;

No, real values please:

vsc8531,vddmac  = <2000>;
vsc8531,edge-slowdown   = <21>;

The driver should then do the maths to figure out the nearest magic
value to write to the register, or complain the settings are out of
range with an -EINVAL.

FYI: No floating point maths are allowed in the kernel.
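
A hypothetical sketch of that kind of integer-only lookup (the table values
below are invented for illustration; the real percentages would come from the
VSC8531 data sheet):

#include <linux/errno.h>
#include <linux/kernel.h>

/* Invented example table: supported edge-slowdown percentages, indexed by
 * the register value that selects them. */
static const u8 vsc8531_slowdown_pct[] = { 0, 2, 4, 7, 10, 17, 29, 53 };

/* Return the register value whose slowdown percentage is nearest to the
 * value requested in the device tree, or -EINVAL if it is out of range. */
static int vsc8531_edge_rate_magic_get(u32 want_pct)
{
	int i, best = 0;

	if (want_pct > vsc8531_slowdown_pct[ARRAY_SIZE(vsc8531_slowdown_pct) - 1])
		return -EINVAL;

	for (i = 1; i < ARRAY_SIZE(vsc8531_slowdown_pct); i++)
		if (abs((int)want_pct - vsc8531_slowdown_pct[i]) <
		    abs((int)want_pct - vsc8531_slowdown_pct[best]))
			best = i;

	return best;
}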

   Andrew


Local variable ordering (was Re: [PATCH v5 0/7] Reduce cache miss for snmp_fold_field)

2016-09-28 Thread Edward Cree
On 28/09/16 14:45, hejianet wrote:
>
>
> On 9/28/16 5:08 PM, David Miller wrote:
>> From: Jia He 
>> Date: Wed, 28 Sep 2016 14:22:21 +0800
>>
>>> v5:
>>> - order local variables from longest to shortest line
>> I still see many cases where this problem still exists.  Please
>> do not resubmit this patch series until you fix all of them.
>>
>> Patch #2:
>>
>> -static int snmp_seq_show(struct seq_file *seq, void *v)
>> +static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
>>   {
>>   int i;
>> +u64 buff64[IPSTATS_MIB_MAX];
>>   struct net *net = seq->private;
>>
>> The order should be "net" then "buff64" then "i".
> Sorry for my bad eyesight and quick hand :(
> B.R.
> Jia
A few months back I wrote a script to help find these:
https://github.com/ecree-solarflare/xmastree
It can check a source file or a patch.
Posting in the hope that people will find it useful.
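
For reference, the ordering David is asking for in the quoted hunk would look
like this (sketch only, body elided):

#include <linux/seq_file.h>
#include <net/net_namespace.h>
#include <net/snmp.h>

static int snmp_seq_show_ipstats(struct seq_file *seq, void *v)
{
	/* Longest declaration line first, shortest last. */
	struct net *net = seq->private;
	u64 buff64[IPSTATS_MIB_MAX];
	int i;

	/* ... fold and print the per-cpu counters as in the patch ... */
	return 0;
}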

-Ed


Re: [PATCHv2 net 0/5] sctp: some fixes of prsctp polices

2016-09-28 Thread marcelo . leitner
On Wed, Sep 28, 2016 at 11:17:16AM -0400, David Miller wrote:
> From: Xin Long 
> Date: Wed, 28 Sep 2016 20:35:40 +0800
> 
> >>
> >>> This patchset is to improve some codes about prsctp polices, and also
> >>> to fix some issues.
> >>>
> >>> v1->v2:
> >>>   - wrap the check of chunk->sent_count in a macro:
> >>> sctp_chunk_retransmitted in patch 2/5.
> >>
> >> This series is a mix of bug fixes (patch #1) which should be targetting
> >> 'net' and simplifications/cleanups (patch #2-5) which should be targetting
> >> 'net-next'.
> >>
> >> Please do not mix things up like this, and submit patches targetting
> >> the appropriate tree.
> >>
> > understand, I wanted to divide them.
> > but this patchset is special, the following patches
> > depend on patch 1/5.
> 
> That doesn't matter.  This happens all the time.
> 
> The way you handle this is you submit patch #1 targetting 'net' and
> wait for it to get into 'net' and propagate into 'net-next'.  When
> that happens you can submit patches #2-5 as a second patch series
> targetting 'net-next'.

Thanks for your patience for explaining this, David, appreciated.

  Marcelo



[PATCH net-next] net: do not export sk_stream_write_space

2016-09-28 Thread Eric Dumazet
From: Eric Dumazet 

Since commit 900f65d361d3 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()") we no longer need
to export sk_stream_write_space()

From: Eric Dumazet 
Cc: Neal Cardwell 
---
 net/core/stream.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/net/core/stream.c b/net/core/stream.c
index 
159516a11b7e84f7e64e0acfa2dd55dab82307bd..1086c8b280a868101df7b44691eccac40bd7b3e1
 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -43,7 +43,6 @@ void sk_stream_write_space(struct sock *sk)
rcu_read_unlock();
}
 }
-EXPORT_SYMBOL(sk_stream_write_space);
 
 /**
  * sk_stream_wait_connect - Wait for a socket to get into the connected state




[PATCH nf-next v4 0/2] fixes for recent nf_compact hooks

2016-09-28 Thread Aaron Conole
Two possible error conditions were caught during an extended testing
session, and by a build robot.  These patches fix the two issues (a
missing handler when config is changed, and a potential NULL
dereference).

Aaron Conole (2):
  netfilter: Fix potential null pointer dereference
  nf_set_hooks_head: accommodate different kconfig

 net/netfilter/core.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

-- 
2.5.5



[PATCH nf-next v4 1/2] netfilter: Fix potential null pointer dereference

2016-09-28 Thread Aaron Conole
It's possible for nf_hook_entry_head to return NULL.  If two
nf_unregister_net_hook calls happen simultaneously with a single hook
entry in the list, both will enter the nf_hook_mutex critical section.
The first will successfully delete the head, but the second will see
this NULL pointer and attempt to dereference.

This fix ensures that no null pointer dereference could occur when such
a condition happens.

Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole 
---
 net/netfilter/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 360c63d..e58e420 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
 
mutex_lock(&nf_hook_mutex);
hooks_entry = nf_hook_entry_head(net, reg);
-   if (hooks_entry->orig_ops == reg) {
+   if (hooks_entry && hooks_entry->orig_ops == reg) {
nf_set_hooks_head(net, reg,
  nf_entry_dereference(hooks_entry->next));
goto unlock;
-- 
2.7.4



Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference

2016-09-28 Thread Aaron Conole
Eric Dumazet  writes:

> On Wed, 2016-09-28 at 10:56 -0400, Aaron Conole wrote:
>> Eric Dumazet  writes:
>> 
>> > On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote:
>> >> It's possible for nf_hook_entry_head to return NULL.  If two
>> >> nf_unregister_net_hook calls happen simultaneously with a single hook
>> >> entry in the list, both will enter the nf_hook_mutex critical section.
>> >> The first will successfully delete the head, but the second will see
>> >> this NULL pointer and attempt to dereference.
>> >> 
>> >> This fix ensures that no null pointer dereference could occur when such
>> >> a condition happens.
>> >> 
>> >> Signed-off-by: Aaron Conole 
>> >> ---
>> >>  net/netfilter/core.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> 
>> >> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
>> >> index 360c63d..e58e420 100644
>> >> --- a/net/netfilter/core.c
>> >> +++ b/net/netfilter/core.c
>> >> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const 
>> >> struct nf_hook_ops *reg)
>> >>  
>> >>   mutex_lock(&nf_hook_mutex);
>> >>   hooks_entry = nf_hook_entry_head(net, reg);
>> >> - if (hooks_entry->orig_ops == reg) {
>> >> + if (hooks_entry && hooks_entry->orig_ops == reg) {
>> >>   nf_set_hooks_head(net, reg,
>> >> nf_entry_dereference(hooks_entry->next));
>> >>   goto unlock;
>> >
>> > When was the bug added exactly ?
>> 
>> Sunday, on the nf-next tree.
>> 
>> > For all bug fixes, you need to add a Fixes: tag.
>> >
>> > Like :
>> >
>> > Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked 
>> > list")
>> 
>> I would but it's in nf-next tree, and I'm not sure how pulls go.  If
>> they are done via patch imports, then the sha sums will be wrong and the
>> commit message will be misleading.  If the sums are preserved, then I
>> can resubmit with this information.
>> 
>
> I gave the (12 digits) sha-1 as present in David Miller net-next tree.
>
> https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d
>
>
> This won't change, because David never rebases his tree under normal
> operations.
>
> Thanks.

Thank you very much, Eric.  I've reposted.

-Aaron


[PATCH nf-next v4 2/2] nf_set_hooks_head: accommodate different kconfig

2016-09-28 Thread Aaron Conole
When CONFIG_NETFILTER_INGRESS is unset (or no), we need to handle
the request for registration properly by dropping the hook.  This
releases the entry during the set.

Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")
Signed-off-by: Aaron Conole 
---
 net/netfilter/core.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index e58e420..61e8a9d 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -90,10 +90,12 @@ static void nf_set_hooks_head(struct net *net, const struct 
nf_hook_ops *reg,
 {
switch (reg->pf) {
case NFPROTO_NETDEV:
+#ifdef CONFIG_NETFILTER_INGRESS
/* We already checked in nf_register_net_hook() that this is
 * used from ingress.
 */
rcu_assign_pointer(reg->dev->nf_hooks_ingress, entry);
+#endif
break;
default:
rcu_assign_pointer(net->nf.hooks[reg->pf][reg->hooknum],
@@ -107,10 +109,15 @@ int nf_register_net_hook(struct net *net, const struct 
nf_hook_ops *reg)
struct nf_hook_entry *hooks_entry;
struct nf_hook_entry *entry;
 
-   if (reg->pf == NFPROTO_NETDEV &&
-   (reg->hooknum != NF_NETDEV_INGRESS ||
-!reg->dev || dev_net(reg->dev) != net))
-   return -EINVAL;
+   if (reg->pf == NFPROTO_NETDEV) {
+#ifndef CONFIG_NETFILTER_INGRESS
+   if (reg->hooknum == NF_NETDEV_INGRESS)
+   return -EOPNOTSUPP;
+#endif
+   if (reg->hooknum != NF_NETDEV_INGRESS ||
+   !reg->dev || dev_net(reg->dev) != net)
+   return -EINVAL;
+   }
 
entry = kmalloc(sizeof(*entry), GFP_KERNEL);
if (!entry)
-- 
2.7.4



RE: [net-next 5/5] PCI: disable FLR for 82579 device

2016-09-28 Thread Neftin, Sasha

Since I worked with Sasha on this I will provide a bit of information from what 
I understand of this bug as well.

On Tue, Sep 27, 2016 at 12:13 PM, Alex Williamson  
wrote:
> On Tue, 27 Sep 2016 13:17:02 -0500
> Bjorn Helgaas  wrote:
>
>> On Sun, Sep 25, 2016 at 10:02:43AM +0300, Neftin, Sasha wrote:
>> > On 9/24/2016 12:05 AM, Jeff Kirsher wrote:
>> > >On Fri, 2016-09-23 at 09:01 -0500, Bjorn Helgaas wrote:
>> > >>On Thu, Sep 22, 2016 at 11:39:01PM -0700, Jeff Kirsher wrote:
>> > >>>From: Sasha Neftin 
>> > >>>
>> > >>>82579 has a problem reattaching itself after the device is detached.
>> > >>>The bug was reported by Redhat. The suggested fix is to disable 
>> > >>>FLR capability in PCIe configuration space.
>> > >>>
>> > >>>Reproduction:
>> > >>>Attach the device to a VM, then detach and try to attach again.
>> > >>>
>> > >>>Fix:
>> > >>>Disable FLR capability to prevent the 82579 from hanging.
>> > >>Is there a bugzilla or other reference URL to include here?  
>> > >>Should this be marked for stable?
>> > >So the author is in Israel, meaning it is their weekend now.  I do 
>> > >not believe Sasha monitors email over the weekend, so a response 
>> > >to your questions won't happen for a few days.
>> > >
>> > >I tried searching my archives for more information, but had no 
>> > >luck finding any additional information.
>> > >

I agree that we do probably need to update the patch description since it isn't 
exactly clear what this is fixing or what was actually broken.

>> > >>>Signed-off-by: Sasha Neftin 
>> > >>>Tested-by: Aaron Brown 
>> > >>>Signed-off-by: Jeff Kirsher 
>> > >>>---
>> > >>>  drivers/pci/quirks.c | 21 +
>> > >>>  1 file changed, 21 insertions(+)
>> > >>>
>> > >>>diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 
>> > >>>44e0ff3..59fba6e 100644
>> > >>>--- a/drivers/pci/quirks.c
>> > >>>+++ b/drivers/pci/quirks.c
>> > >>>@@ -4431,3 +4431,24 @@ static void quirk_intel_qat_vf_cap(struct 
>> > >>>pci_dev *pdev)
>> > >>> }
>> > >>>  }
>> > >>>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x443, 
>> > >>>quirk_intel_qat_vf_cap);
>> > >>>+/*
>> > >>>+ * Workaround FLR issues for 82579
>> > >>>+ * This code disables the FLR (Function Level Reset) via PCIe, 
>> > >>>+in
>> > >>>order
>> > >>>+ * to workaround a bug found while using device passthrough, 
>> > >>>+ where the
>> > >>>+ * interface would become non-responsive.
>> > >>>+ * NOTE: the FLR bit is Read/Write Once (RWO) in config space, 
>> > >>>+ so if
>> > >>>+ * the BIOS or kernel writes this register * then this 
>> > >>>+ workaround will
>> > >>>+ * not work.
>> > >>This doesn't sound like a root cause.  Is the issue a hardware 
>> > >>erratum?  Linux PCI core bug?  VFIO bug?  Device firmware bug?
>> > >>
>> > >>The changelog suggests that the problem only affects passthrough, 
>> > >>which suggests some sort of kernel bug related to how passthrough 
>> > >>is implemented.
>>
>> If this bug affects all scenarios, not just passthrough, the 
>> changelog should not mention passthrough.
>>
>> > >>>+ */
>> > >>>+static void quirk_intel_flr_cap_dis(struct pci_dev *dev) {
>> > >>>+int pos = pci_find_capability(dev, PCI_CAP_ID_AF);
>> > >>>+if (pos) {
>> > >>>+u8 cap;
>> > >>>+pci_read_config_byte(dev, pos + PCI_AF_CAP, &cap);
>> > >>>+cap = cap & (~PCI_AF_CAP_FLR);
>> > >>>+pci_write_config_byte(dev, pos + PCI_AF_CAP, cap);
>> > >>>+}
>> > >>>+}
>> > >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1502,
>> > >>>quirk_intel_flr_cap_dis);
>> > >>>+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x1503,
>> > >>>quirk_intel_flr_cap_dis);
>> > >>>--
>> > >>>2.7.4
>> > >>>
>> > >>>--
>> > >>>To unsubscribe from this list: send the line "unsubscribe 
>> > >>>linux-pci" in the body of a message to majord...@vger.kernel.org 
>> > >>>More majordomo info at  
>> > >>>http://vger.kernel.org/majordomo-info.html
>> >
>> > Hello,
>> >
>> > Original bugzilla thread could be found here:
>> > https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=966840
>>
>> That bugzilla is private and I can't read it.
>
> Hmm, I can, but I don't see anything in it that supports this.  Is 
> that really the right bz?  It's the right hardware, but has all sorts 
> of FUD about the version of various other components in the stack.

It looks like we had a local copy of the bugzilla saved here, though it only 
goes up to comment 13 which is where I think we started working this on our 
side.  I believe what this patch is attempting to resolve is related to comment 
8 where the driver returned "probe of
0000:00:19.0 failed with error -2" instead of correctly probing the interface.

So the bug as reported was that e1000e had a problem reattaching itself to the 
PHY after it was attached to a VM.  

Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference

2016-09-28 Thread Eric Dumazet
On Wed, 2016-09-28 at 10:56 -0400, Aaron Conole wrote:
> Eric Dumazet  writes:
> 
> > On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote:
> >> It's possible for nf_hook_entry_head to return NULL.  If two
> >> nf_unregister_net_hook calls happen simultaneously with a single hook
> >> entry in the list, both will enter the nf_hook_mutex critical section.
> >> The first will successfully delete the head, but the second will see
> >> this NULL pointer and attempt to dereference.
> >> 
> >> This fix ensures that no null pointer dereference could occur when such
> >> a condition happens.
> >> 
> >> Signed-off-by: Aaron Conole 
> >> ---
> >>  net/netfilter/core.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> 
> >> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> >> index 360c63d..e58e420 100644
> >> --- a/net/netfilter/core.c
> >> +++ b/net/netfilter/core.c
> >> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const 
> >> struct nf_hook_ops *reg)
> >>  
> >>  mutex_lock(&nf_hook_mutex);
> >>hooks_entry = nf_hook_entry_head(net, reg);
> >> -  if (hooks_entry->orig_ops == reg) {
> >> +  if (hooks_entry && hooks_entry->orig_ops == reg) {
> >>nf_set_hooks_head(net, reg,
> >>  nf_entry_dereference(hooks_entry->next));
> >>goto unlock;
> >
> > When was the bug added exactly ?
> 
> Sunday, on the nf-next tree.
> 
> > For all bug fixes, you need to add a Fixes: tag.
> >
> > Like :
> >
> > Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")
> 
> I would but it's in nf-next tree, and I'm not sure how pulls go.  If
> they are done via patch imports, then the sha sums will be wrong and the
> commit message will be misleading.  If the sums are preserved, then I
> can resubmit with this information.
> 

I gave the (12 digits) sha-1 as present in David Miller net-next tree.

https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=e3b37f11e6e4e6b6f02cc762f182ce233d2c1c9d


This won't change, because David never rebases his tree under normal
operations.

Thanks.





Re: [PATCH] ipv6 addrconf: enable use of proc_dointvec_minmax in addrconf_sysctl

2016-09-28 Thread David Miller
From: Maciej Żenczykowski 
Date: Wed, 28 Sep 2016 22:23:10 +0900

> Anyway, enough, I give up, this isn't worth my time, and it's also not
> worth your time.
> I removed the dependency from the other patches and squashed them all into
> 1 to make reviewing easier.

Splitting things up is sometimes warranted, but you have to keep
together the things that strongly need each other.

And new facilities require examples of their use.  They really do.

I simply can't look at your patch and empty commit message and figure
out why you needed to do what you were doing.

Any developer should be able to look at a patch being proposed and be
able to understand it standing upon on it's own.  This means they
shouldn't have to know what happened in this discussion thread or that
one in order to evaluate and audit the patch properly.


Re: [PATCHv2 net 0/5] sctp: some fixes of prsctp polices

2016-09-28 Thread David Miller
From: Xin Long 
Date: Wed, 28 Sep 2016 20:35:40 +0800

>>
>>> This patchset is to improve some codes about prsctp polices, and also
>>> to fix some issues.
>>>
>>> v1->v2:
>>>   - wrap the check of chunk->sent_count in a macro:
>>> sctp_chunk_retransmitted in patch 2/5.
>>
>> This series is a mix of bug fixes (patch #1) which should be targetting
>> 'net' and simplifications/cleanups (patch #2-5) which should be targetting
>> 'net-next'.
>>
>> Please do not mix things up like this, and submit patches targetting
>> the appropriate tree.
>>
> understand, I wanted to divide them.
> but this patchset is special, the following patches
> depend on patch 1/5.

That doesn't matter.  This happens all the time.

The way you handle this is you submit patch #1 targetting 'net' and
wait for it to get into 'net' and propagate into 'net-next'.  When
that happens you can submit patches #2-5 as a second patch series
targetting 'net-next'.


Re: [PATCH RFC 0/4] xfs: Transmit flow steering

2016-09-28 Thread Rick Jones


Here is a quick look at performance tests for the result of trying the
prototype fix for the packet reordering problem with VMs sending over
an XPS-configured NIC.  In particular, the Emulex/Avago/Broadcom
Skyhawk.  The fix was applied to a 4.4 kernel.

Before: 3884 Mbit/s
After: 8897 Mbit/s

That was from a VM on a node with a Skyhawk and 2 E5-2640 processors
to baremetal E5-2640 with a BE3.  Physical MTU was 1500, the VM's
vNIC's MTU was 1400.  Systems were HPE ProLiants in OS Control Mode
for power management, with the "performance" frequency governor
loaded. An OpenStack Mitaka setup with Distributed Virtual Router.

We had some other NIC types in the setup as well.  XPS was also
enabled on the ConnectX3-Pro.  It was not enabled on the 82599ES (a
function of the kernel being used, which had it disabled from the
first reports of XPS negatively affecting VM traffic at the beginning
of the year)

Average Mbit/s From NIC type To Bare Metal BE3:
NIC Type, CPU on VM Host        Before    After

ConnectX-3 Pro, E5-2670v3        9224      9271
BE3, E5-2640                     9016      9022
82599, E5-2640                   9192      9003
BCM57840, E5-2640                9213      9153
Skyhawk, E5-2640                 3884      8897

For completeness:
Average Mbit/s To NIC type from Bare Metal BE3:
NIC Type, CPU on VM Host        Before    After

ConnectX-3 Pro, E5-2670v3        9322      9144
BE3, E5-2640                     9074      9017
82599, E5-2640                   8670      8564
BCM57840, E5-2640                2468 *    7979
Skyhawk, E5-2640                 8897      9269

* This is the Busted bnx2x NIC FW GRO implementation issue.  It was
  not visible in the "After" because the system was setup to disable
  the NIC FW GRO by the time it booted on the fix kernel.

Average Transactions/s Between NIC type and Bare Metal BE3:
NIC Type, CPU on VM Host        Before    After

ConnectX-3 Pro, E5-2670v3       12421     12612
BE3, E5-2640                     8178      8484
82599, E5-2640                   8499      8549
BCM57840, E5-2640                8544      8560
Skyhawk, E5-2640                 8537      8701

happy benchmarking,

Drew Balliet
Jeurg Haefliger
rick jones

The semi-cooked results with additional statistics:

554M  - BE3
544+M - ConnectX-3 Pro
560M - 82599ES
630M - BCM57840
650M - Skyhawk

(substitute is simply replacing a system name with the model of NIC and CPU)
Bulk To (South) and From (North) VM, Before:
$ ../substitute.sh 
vxlan_554m_control_performance_gvnr_dvr_northsouth_stream.log | 
~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 
4 -f 7 -f 8

Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,8148.090,9048.830,9235.400,9192.868,9315.980,9338.845,9339.500,113
North,630M,E5-2640,554FLB,E5-2640,8909.980,9113.238,9234.750,9213.140,9299.442,9336.206,9337.830,47
North,544+M,E5-2670v3,554FLB,E5-2640,9013.740,9182.546,9229.620,9224.025,9264.036,9299.206,9301.970,99
North,650M,E5-2640,554FLB,E5-2640,3187.680,3393.724,3796.160,3884.765,4405.096,4941.391,4956.300,129
North,554M,E5-2640,554FLB,E5-2640,8700.930,8855.768,9026.030,9016.061,9158.846,9213.687,9226.150,135
South,554FLB,E5-2640,560M,E5-2640,7754.350,8193.114,8718.540,8670.612,9026.436,9262.355,9285.010,113
South,554FLB,E5-2640,630M,E5-2640,1897.660,2068.290,2514.430,2468.323,2787.162,2942.934,2957.250,53
South,554FLB,E5-2640,544+M,E5-2670v3,9298.260,9314.432,9323.220,9322.207,9328.324,9330.704,9331.080,100
South,554FLB,E5-2640,650M,E5-2640,8407.050,8907.136,9304.390,9206.776,9321.320,9325.347,9326.410,103
South,554FLB,E5-2640,554M,E5-2640,7844.900,8632.530,9199.385,9074.535,9308.070,9319.224,9322.360,132
0 too-short lines ignored.

Bulk To (South) and From (North) VM, After:

$ ../substitute.sh 
vxlan_554m_control_performance_gvnr_xpsfix_dvr_northsouth_stream.log | 
~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 
4 -f 7 -f 8

Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,7576.790,8213.890,9182.870,9003.190,9295.975,9315.878,9318.160,36
North,630M,E5-2640,554FLB,E5-2640,8811.800,8924.000,9206.660,9153.076,9306.287,9315.152,9315.790,12
North,544+M,E5-2670v3,554FLB,E5-2640,9135.990,9228.520,9277.465,9271.875,9324.545,9339.604,9339.780,46
North,650M,E5-2640,554FLB,E5-2640,8133.420,8483.340,8995.040,8897.779,9129.056,9165.230,9165.860,43
North,554M,E5-2640,554FLB,E5-2640,8438.390,8879.150,9048.590,9022.813,9181.540,9248.650,9297.660,101
South,554FLB,E5-2640,630M,E5-2640,7347.120,7592.565,7951.325,7979.951,8365.400,8575.837,8579.890,16
South,554FLB,E5-2640,560M,E5-2640,7719.510,8044.496,8602.750,8564.741,9172.824,9248.686,9259.070,45

Re: [RFC PATCH net-next 2/2] sfc: report 4-tuple UDP hashing to ethtool, if it's enabled

2016-09-28 Thread Edward Cree
On 28/09/16 10:12, David Laight wrote:
> If you invert the above and add a goto...
>   if (!efx->rx_hash_udp_4tuple)
>   goto set_ip;
I don't mind gotos...
>>  case SCTP_V4_FLOW:
>>  case AH_ESP_V4_FLOW:
>>  case IPV4_FLOW:
> set_ip:
...but this adds a label where we effectively already have one.
I wish C allowed goto case labels.
> It might look better.
>   David

It just bugs me that it would have this unnecessary goto and label.

Alternate ways to maybe make it look better, or not:

* Remove the /* else fall further */ comment, does this make the
  indentation more or less confusing?
* Include braces on the if, even though there's only one statement
  inside (a sketch of that variant follows below).
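
A sketch of that braced variant, reusing the case labels from the quote above
(the RXH_* flag combination is assumed from the usual get_rxnfc convention,
not copied from the actual patch):

#include <linux/ethtool.h>

static void efx_udp4_hash_flags_sketch(struct ethtool_rxnfc *info,
				       bool rx_hash_udp_4tuple)
{
	switch (info->flow_type) {
	case UDP_V4_FLOW:
		if (rx_hash_udp_4tuple) {
			info->data |= RXH_L4_B_0_1 | RXH_L4_B_2_3;
		}
		/* fall through: IP src/dst is hashed in either case */
	case SCTP_V4_FLOW:
	case AH_ESP_V4_FLOW:
	case IPV4_FLOW:
		info->data |= RXH_IP_SRC | RXH_IP_DST;
		break;
	}
}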

Also, how strong are people's reaction to this?  If it's just "I
personally wouldn't do it that way", then I'm tempted to go ahead
anyway.  But if it's "NAK NAK NAK burn the heretic", that's
another matter.

-Ed


Re: [PATCH nf-next v3 1/2] netfilter: Fix potential null pointer dereference

2016-09-28 Thread Aaron Conole
Eric Dumazet  writes:

> On Wed, 2016-09-28 at 09:12 -0400, Aaron Conole wrote:
>> It's possible for nf_hook_entry_head to return NULL.  If two
>> nf_unregister_net_hook calls happen simultaneously with a single hook
>> entry in the list, both will enter the nf_hook_mutex critical section.
>> The first will successfully delete the head, but the second will see
>> this NULL pointer and attempt to dereference.
>> 
>> This fix ensures that no null pointer dereference could occur when such
>> a condition happens.
>> 
>> Signed-off-by: Aaron Conole 
>> ---
>>  net/netfilter/core.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
>> index 360c63d..e58e420 100644
>> --- a/net/netfilter/core.c
>> +++ b/net/netfilter/core.c
>> @@ -160,7 +160,7 @@ void nf_unregister_net_hook(struct net *net, const 
>> struct nf_hook_ops *reg)
>>  
>>  mutex_lock(&nf_hook_mutex);
>>  hooks_entry = nf_hook_entry_head(net, reg);
>> -if (hooks_entry->orig_ops == reg) {
>> +if (hooks_entry && hooks_entry->orig_ops == reg) {
>>  nf_set_hooks_head(net, reg,
>>nf_entry_dereference(hooks_entry->next));
>>  goto unlock;
>
> When was the bug added exactly ?

Sunday, on the nf-next tree.

> For all bug fixes, you need to add a Fixes: tag.
>
> Like :
>
> Fixes: e3b37f11e6e4 ("netfilter: replace list_head with single linked list")

I would but it's in nf-next tree, and I'm not sure how pulls go.  If
they are done via patch imports, then the sha sums will be wrong and the
commit message will be misleading.  If the sums are preserved, then I
can resubmit with this information.

> So that 100 different people in stable teams do not have to do the
> archeology themselves ...
>
> Thanks.

Thanks for the review, Eric.


Re: [PATCH v5] net: ip, diag -- Add diag interface for raw sockets

2016-09-28 Thread Cyrill Gorcunov
On Wed, Sep 28, 2016 at 06:49:51AM -0700, Eric Dumazet wrote:
> > 
> > 4.7 is pretty widespread, so I've to think...
> 
> Sorry, 4.4.7 it was
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg128714.html

Ah, thanks for info!


[PATCH net-next v5] bpf: allow access into map value arrays

2016-09-28 Thread Josef Bacik
Suppose you have a map array value that is something like this

struct foo {
unsigned iter;
int array[SOME_CONSTANT];
};

You can easily insert this into an array, but you cannot modify the contents of
foo->array[] after the fact.  This is because we have no way to verify we won't
go off the end of the array at verification time.  This patch provides a start
for this work.  We accomplish this by keeping track of a minimum and maximum
value a register could be while we're checking the code.  Then at the time we
try to do an access into a MAP_VALUE we verify that the maximum offset into that
region is a valid access into that memory region.  So in practice, code such as
this

unsigned index = 0;

if (foo->iter >= SOME_CONSTANT)
foo->iter = index;
else
index = foo->iter++;
foo->array[index] = bar;

would be allowed, as we can verify that index will always be between 0 and
SOME_CONSTANT-1.  If you wish to use signed values you'll have to have an extra
check to make sure the index isn't less than 0, or do something like index %=
SOME_CONSTANT.
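
As a concrete sketch of that signed-index case (SOME_CONSTANT given an
arbitrary value here), the extra lower-bound test is what lets the verifier
keep min_value >= 0 before the store:

#define SOME_CONSTANT 64

struct foo {
	unsigned iter;
	int array[SOME_CONSTANT];
};

static void store_signed(struct foo *foo, int index, int bar)
{
	/* Both bounds are checked, so the verifier can prove
	 * 0 <= index <= SOME_CONSTANT - 1 at the array access below. */
	if (index < 0 || index >= SOME_CONSTANT)
		return;
	foo->array[index] = bar;
}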

Signed-off-by: Josef Bacik 
Acked-by: Alexei Starovoitov 
---
V4->V5:
-rebased onto net-next.
-dealt with adding map values to other map values, made it so we only adjust
 min/max if we are messing with CONST or INV variables, otherwise we set it to
 RANGE_MIN/RANGE_MAX.
-added a new testcase for the adding map values to map values case.
-moved a reset of the register values until after the register validation checks
 for BPF_LD

V3->V4:
-fixed the alignment check to deal with MAP_VALUE_ADJ properly.
-updated the error messages for invalid ranges to be more clear.
-fixed test_verifier.c to work with the updated error messages.
-fixed a couple of typos.

V2->V3:
-added testcases to test_verifier.c
-fixed a problem where we wouldn't adjust min/max values if the val == *_RANGE
  values, which would cause us to think any access to the map was a-ok.
-Handle the case where the CONST_IMM register is the dst_reg instead of the
  src_reg.
-Deal with overflows from the jmp handling.
-Fixed the states_equal logic to make more sense.

V1->V2:
-Instead of using (u64)-1 as the limit which could overflow in some cases
 and give us weird behavior, make the limit 1gib so that any math doesn't
 overflow and it's still outside of the realm of reasonable offsets.

 include/linux/bpf.h  |   7 +
 include/linux/bpf_verifier.h |  12 ++
 kernel/bpf/verifier.c| 329 ---
 samples/bpf/libbpf.h |   8 ++
 samples/bpf/test_verifier.c  | 243 +++-
 5 files changed, 577 insertions(+), 22 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5691fdc..c201017 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -139,6 +139,13 @@ enum bpf_reg_type {
 */
PTR_TO_PACKET,
PTR_TO_PACKET_END,   /* skb->data + headlen */
+
+   /* PTR_TO_MAP_VALUE_ADJ is used for doing pointer math inside of a map
+* elem value.  We only allow this if we can statically verify that
+* access from this register are going to fall within the size of the
+* map element.
+*/
+   PTR_TO_MAP_VALUE_ADJ,
 };
 
 struct bpf_prog;
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c5cb661..7035b99 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -10,8 +10,19 @@
 #include  /* for enum bpf_reg_type */
 #include  /* for MAX_BPF_STACK */
 
+ /* Just some arbitrary values so we can safely do math without overflowing and
+  * are obviously wrong for any sort of memory access.
+  */
+#define BPF_REGISTER_MAX_RANGE (1024 * 1024 * 1024)
+#define BPF_REGISTER_MIN_RANGE -(1024 * 1024 * 1024)
+
 struct bpf_reg_state {
enum bpf_reg_type type;
+   /*
+* Used to determine if any memory access using this register will
+* result in a bad access.
+*/
+   u64 min_value, max_value;
union {
/* valid when type == CONST_IMM | PTR_TO_STACK | UNKNOWN_VALUE 
*/
s64 imm;
@@ -81,6 +92,7 @@ struct bpf_verifier_env {
u32 id_gen; /* used to generate unique reg IDs */
bool allow_ptr_leaks;
bool seen_direct_write;
+   bool varlen_map_value_access;
struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7ada315..99a7e5b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -182,6 +182,7 @@ static const char * const reg_type_str[] = {
[CONST_PTR_TO_MAP]  = "map_ptr",
[PTR_TO_MAP_VALUE]  = "map_value",
[PTR_TO_MAP_VALUE_OR_NULL] = "map_value_or_null",
+   [PTR_TO_MAP_VALUE_ADJ]  = "map_value_adj",
[FRAME_PTR] = "fp",
[PTR_TO_STACK]  = "fp",
[CONST_IMM] 

[PATCH v8 1/8] thunderbolt: Macro rename

2016-09-28 Thread Amir Levy
This first patch updates the NHI Thunderbolt controller registers file to
reflect that it is not only for Cactus Ridge.
No functional change intended.

Signed-off-by: Amir Levy 
Signed-off-by: Andreas Noever 
---
 drivers/thunderbolt/nhi_regs.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/thunderbolt/nhi_regs.h b/drivers/thunderbolt/nhi_regs.h
index 86b996c..75cf069 100644
--- a/drivers/thunderbolt/nhi_regs.h
+++ b/drivers/thunderbolt/nhi_regs.h
@@ -1,11 +1,11 @@
 /*
- * Thunderbolt Cactus Ridge driver - NHI registers
+ * Thunderbolt driver - NHI registers
  *
  * Copyright (c) 2014 Andreas Noever 
  */
 
-#ifndef DSL3510_REGS_H_
-#define DSL3510_REGS_H_
+#ifndef NHI_REGS_H_
+#define NHI_REGS_H_
 
 #include 
 
-- 
2.7.4



[PATCH v8 4/8] thunderbolt: Networking state machine

2016-09-28 Thread Amir Levy
This patch builds the peer to peer communication path.
Communication is established by a negotiation process whereby messages are
sent back and forth between the peers until a connection is established.
This includes the Thunderbolt Network driver communication with the second
peer via Intel Connection Manager(ICM) firmware.
  ++++
  |Host 1  ||Host 2  |
  ||||
  | +---+  || +---+  |
  | |Thunderbolt|  || |Thunderbolt|  |
  | |Networking |  || |Networking |  |
  | |Driver |  || |Driver |  |
  | +---+  || +---+  |
  |  ^ ||  ^ |
  |  | ||  | |
  | ++---+ || ++---+ |
  | |Thunderbolt |   | || |Thunderbolt |   | |
  | |Controller  v   | || |Controller  v   | |
  | | +---+  | || | +---+  | |
  | | |ICM|<-+-++-+>|ICM|  | |
  | | +---+  | || | +---+  | |
  | ++ || ++ |
  ++++
Note that this patch only establishes the link between the two hosts and
not Network Packet handling - this is dealt with in the next patch.

Signed-off-by: Amir Levy 
---
 drivers/thunderbolt/icm/Makefile  |   2 +-
 drivers/thunderbolt/icm/icm_nhi.c | 262 -
 drivers/thunderbolt/icm/net.c | 783 ++
 drivers/thunderbolt/icm/net.h |  70 
 4 files changed, 1109 insertions(+), 8 deletions(-)
 create mode 100644 drivers/thunderbolt/icm/net.c

diff --git a/drivers/thunderbolt/icm/Makefile b/drivers/thunderbolt/icm/Makefile
index f0d0fbb..94a2797 100644
--- a/drivers/thunderbolt/icm/Makefile
+++ b/drivers/thunderbolt/icm/Makefile
@@ -1,2 +1,2 @@
 obj-${CONFIG_THUNDERBOLT_ICM} += thunderbolt-icm.o
-thunderbolt-icm-objs := icm_nhi.o
+thunderbolt-icm-objs := icm_nhi.o net.o
diff --git a/drivers/thunderbolt/icm/icm_nhi.c 
b/drivers/thunderbolt/icm/icm_nhi.c
index 23047d3..a849698 100644
--- a/drivers/thunderbolt/icm/icm_nhi.c
+++ b/drivers/thunderbolt/icm/icm_nhi.c
@@ -64,6 +64,13 @@ static const struct nla_policy nhi_genl_policy[NHI_ATTR_MAX 
+ 1] = {
.len = TBT_ICM_RING_MAX_FRAME_SIZE },
[NHI_ATTR_MSG_FROM_ICM] = { .type = NLA_BINARY,
.len = TBT_ICM_RING_MAX_FRAME_SIZE },
+   [NHI_ATTR_LOCAL_ROUTE_STRING]   = {
+   .len = sizeof(struct route_string) },
+   [NHI_ATTR_LOCAL_UUID]   = { .len = sizeof(uuid_be) },
+   [NHI_ATTR_REMOTE_UUID]  = { .len = sizeof(uuid_be) },
+   [NHI_ATTR_LOCAL_DEPTH]  = { .type = NLA_U8, },
+   [NHI_ATTR_ENABLE_FULL_E2E]  = { .type = NLA_FLAG, },
+   [NHI_ATTR_MATCH_FRAME_ID]   = { .type = NLA_FLAG, },
 };
 
 /* NHI genetlink family */
@@ -480,6 +487,29 @@ int nhi_mailbox(struct tbt_nhi_ctxt *nhi_ctxt, u32 cmd, 
u32 data, bool deinit)
return 0;
 }
 
+static inline bool nhi_is_path_disconnected(u32 cmd, u8 num_ports)
+{
+   return (cmd >= DISCONNECT_PORT_A_INTER_DOMAIN_PATH &&
+   cmd < (DISCONNECT_PORT_A_INTER_DOMAIN_PATH + num_ports));
+}
+
+static int nhi_mailbox_disconn_path(struct tbt_nhi_ctxt *nhi_ctxt, u32 cmd)
+   __releases(&controllers_list_mutex)
+{
+   struct port_net_dev *port;
+   u32 port_num = cmd - DISCONNECT_PORT_A_INTER_DOMAIN_PATH;
+
+   port = &(nhi_ctxt->net_devices[port_num]);
+   mutex_lock(&port->state_mutex);
+
+   mutex_unlock(&controllers_list_mutex);
+   port->medium_sts = MEDIUM_READY_FOR_APPROVAL;
+   if (port->net_dev)
+   negotiation_events(port->net_dev, MEDIUM_DISCONNECTED);
+   mutex_unlock(&port->state_mutex);
+   return  0;
+}
+
 static int nhi_mailbox_generic(struct tbt_nhi_ctxt *nhi_ctxt, u32 mb_cmd)
__releases(&controllers_list_mutex)
 {
@@ -526,13 +556,90 @@ static int nhi_genl_mailbox(__always_unused struct 
sk_buff *u_skb,
return -ERESTART;
 
nhi_ctxt = nhi_search_ctxt(*(u32 *)info->userhdr);
-   if (nhi_ctxt && !nhi_ctxt->d0_exit)
-   return nhi_mailbox_generic(nhi_ctxt, mb_cmd);
+   if (nhi_ctxt && !nhi_ctxt->d0_exit) {
+
+   /* rwsem is released later by the below functions */
+   if (nhi_is_path_disconnected(cmd, nhi_ctxt->num_ports))
+   return nhi_mailbox_disconn_path(nhi_ctxt, cmd);
+   else
+   return nhi_mailbox_generic(nhi_ctxt, mb_cmd);
+
+   }
 
mutex_unlock(&controllers_list_mutex);
return -ENODEV;
 }
 
+static int 

[PATCH v8 6/8] thunderbolt: Kconfig for Thunderbolt Networking

2016-09-28 Thread Amir Levy
Update to the Kconfig Thunderbolt description to add
Thunderbolt networking as an option.
The menu item "Thunderbolt support" now offers:
"Apple Hardware Support" (existing)
and/or
"Thunderbolt Networking" (new)

You can choose the driver for your platform or build both drivers -
each driver will detect if it can run on the specific platform.
If the Thunderbolt Networking option is chosen, Thunderbolt Networking
will be enabled between Linux non-Apple systems, macOS and
Windows based systems.
Thunderbolt Networking will not affect any other Thunderbolt feature that
was previous available to Linux users on either Apple or
non-Apple platforms.

Signed-off-by: Amir Levy 
---
 drivers/thunderbolt/Kconfig  | 27 +++
 drivers/thunderbolt/Makefile |  3 ++-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/thunderbolt/Kconfig b/drivers/thunderbolt/Kconfig
index c121acc..376e5bb 100644
--- a/drivers/thunderbolt/Kconfig
+++ b/drivers/thunderbolt/Kconfig
@@ -1,13 +1,32 @@
-menuconfig THUNDERBOLT
-   tristate "Thunderbolt support for Apple devices"
+config THUNDERBOLT
+   tristate "Thunderbolt support"
depends on PCI
select CRC32
help
- Cactus Ridge Thunderbolt Controller driver
+ Thunderbolt Controller driver
+
+if THUNDERBOLT
+
+config THUNDERBOLT_APPLE
+   tristate "Apple hardware support"
+   help
  This driver is required if you want to hotplug Thunderbolt devices on
  Apple hardware.
 
  Device chaining is currently not supported.
 
- To compile this driver a module, choose M here. The module will be
+ To compile this driver as a module, choose M here. The module will be
  called thunderbolt.
+
+config THUNDERBOLT_ICM
+   tristate "Thunderbolt Networking"
+   help
+ This driver is required if you want Thunderbolt Networking on
+ non-Apple hardware.
+ It creates a virtual Ethernet device that enables computer to
+ computer communication over a Thunderbolt cable.
+
+ To compile this driver as a module, choose M here. The module will be
+ called thunderbolt_icm.
+
+endif
diff --git a/drivers/thunderbolt/Makefile b/drivers/thunderbolt/Makefile
index 5d1053c..b6aa6a3 100644
--- a/drivers/thunderbolt/Makefile
+++ b/drivers/thunderbolt/Makefile
@@ -1,3 +1,4 @@
-obj-${CONFIG_THUNDERBOLT} := thunderbolt.o
+obj-${CONFIG_THUNDERBOLT_APPLE} := thunderbolt.o
 thunderbolt-objs := nhi.o ctl.o tb.o switch.o cap.o path.o tunnel_pci.o 
eeprom.o
 
+obj-${CONFIG_THUNDERBOLT_ICM} += icm/
-- 
2.7.4



[PATCH v8 7/8] thunderbolt: Networking doc

2016-09-28 Thread Amir Levy
Adding Thunderbolt(TM) networking documentation.

Signed-off-by: Amir Levy 
---
 Documentation/00-INDEX   |   2 +
 Documentation/thunderbolt/networking.txt | 132 +++
 2 files changed, 134 insertions(+)
 create mode 100644 Documentation/thunderbolt/networking.txt

diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index cb9a6c6..a448ba1 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -439,6 +439,8 @@ this_cpu_ops.txt
- List rationale behind and the way to use this_cpu operations.
 thermal/
- directory with information on managing thermal issues (CPU/temp)
+thunderbolt/
+   - directory with info regarding Thunderbolt.
 trace/
- directory with info on tracing technologies within linux
 unaligned-memory-access.txt
diff --git a/Documentation/thunderbolt/networking.txt 
b/Documentation/thunderbolt/networking.txt
new file mode 100644
index 000..88d1c12
--- /dev/null
+++ b/Documentation/thunderbolt/networking.txt
@@ -0,0 +1,132 @@
+Intel Thunderbolt(TM) Networking driver
+===
+
+Copyright(c) 2013 - 2016 Intel Corporation.
+
+Contact Information:
+Intel Thunderbolt mailing list 
+Edited by Amir Levy 
+
+Overview
+
+
+* The Thunderbolt Networking driver enables peer to peer networking on 
non-Apple
+  platforms running Linux.
+
+* The driver creates a virtual Ethernet device that enables computer to 
computer
+  communication over the Thunderbolt cable.
+
+* Using Thunderbolt Networking you can perform high speed file transfers 
between
+  computers, perform PC migrations and/or set up small workgroups with shared
+  storage without compromising any other Thunderbolt functionality.
+
+* The driver is located in drivers/thunderbolt/icm.
+
+* This driver will function only on non-Apple platforms with firmware based
+  Thunderbolt controllers that support Thunderbolt Networking.
+
+  ++++
+  |Host 1  ||Host 2  |
+  ||||
+  |   +---+||+---+   |
+  |   |Network||||Network|   |
+  |   |Stack  ||||Stack  |   |
+  |   +---+||+---+   |
+  |   ^||^   |
+  |   ||||   |
+  |   v||v   |
+  | +---+  ||  +---+ |
+  | |Thunderbolt|  ||  |Thunderbolt| |
+  | |Networking |  ||  |Networking | |
+  | |Driver |  ||  |Driver | |
+  | +---+  ||  +---+ |
+  |   ^||^   |
+  |   ||||   |
+  |   v||v   |
+  | +---+  ||  +---+ |
+  | |Thunderbolt|  ||  |Thunderbolt| |
+  | |Controller |<-++->|Controller | |
+  | |with ICM   |  ||  |with ICM   | |
+  | |enabled|  ||  |enabled| |
+  | +---+  ||  +---+ |
+  ++++
+
+Files
+=
+
+The following files are located in the drivers/thunderbolt/icm directory:
+
+- icm_nhi.c/h: These files allow communication with the firmware (Intel
+  Connection Manager) based controller. They also create an interface for
+  netlink communication with a user space daemon.
+
+- net.c/net.h: These files implement the 'eth' interface for the
+  Thunderbolt(TM) Networking.
+
+Interface to User Space
+===
+
+The interface to the user space module is implemented through a Generic 
Netlink.
+This is the communications protocol between the Thunderbolt driver and the user
+space application.
+
+Note that this interface mediates user space communication with ICM.
+(Existing Linux tools can be used to configure the network interface.)
+
+The Thunderbolt Daemon utilizes this interface to communicate with the driver.
+To be accessed by the user space module, both kernel and user space modules
+have to register with the same GENL_NAME.
+For the purpose of the Thunderbolt Network driver, "thunderbolt" is used.
+The registration is done at driver initialization time for all instances
+of the Thunderbolt controllers. The communication is carried through 
pre-defined
+Thunderbolt messages. Each specific message has a callback function that is
+called when the related message is received.
+
+Message Definitions:
+* NHI_CMD_UNSPEC: Not used.
+* NHI_CMD_SUBSCRIBE: Subscription request from daemon to driver to open the
+  communication channel.
+* NHI_CMD_UNSUBSCRIBE: Request from daemon to driver to unsubscribe and
+  to close communication channel.
+* NHI_CMD_QUERY_INFORMATION: Request information from the driver 

[PATCH v8 3/8] thunderbolt: Communication with the ICM (firmware)

2016-09-28 Thread Amir Levy
This patch provides the communication protocol between the
Intel Connection Manager(ICM) firmware that is operational in the
Thunderbolt controller in non-Apple hardware.
The ICM firmware-based controller is used for establishing and maintaining
the Thunderbolt Networking connection - we need to be able to communicate
with it.

Signed-off-by: Amir Levy 
---
 drivers/thunderbolt/icm/Makefile  |2 +
 drivers/thunderbolt/icm/icm_nhi.c | 1251 +
 drivers/thunderbolt/icm/icm_nhi.h |   82 +++
 drivers/thunderbolt/icm/net.h |  217 +++
 4 files changed, 1552 insertions(+)
 create mode 100644 drivers/thunderbolt/icm/Makefile
 create mode 100644 drivers/thunderbolt/icm/icm_nhi.c
 create mode 100644 drivers/thunderbolt/icm/icm_nhi.h
 create mode 100644 drivers/thunderbolt/icm/net.h

diff --git a/drivers/thunderbolt/icm/Makefile b/drivers/thunderbolt/icm/Makefile
new file mode 100644
index 000..f0d0fbb
--- /dev/null
+++ b/drivers/thunderbolt/icm/Makefile
@@ -0,0 +1,2 @@
+obj-${CONFIG_THUNDERBOLT_ICM} += thunderbolt-icm.o
+thunderbolt-icm-objs := icm_nhi.o
diff --git a/drivers/thunderbolt/icm/icm_nhi.c 
b/drivers/thunderbolt/icm/icm_nhi.c
new file mode 100644
index 000..23047d3
--- /dev/null
+++ b/drivers/thunderbolt/icm/icm_nhi.c
@@ -0,0 +1,1251 @@
+/***
+ *
+ * Intel Thunderbolt(TM) driver
+ * Copyright(c) 2014 - 2016 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ 
**/
+
+#include 
+#include 
+#include 
+#include 
+#include "icm_nhi.h"
+#include "net.h"
+
+#define NHI_GENL_VERSION 1
+#define NHI_GENL_NAME "thunderbolt"
+
+#define DEVICE_DATA(num_ports, dma_port, nvm_ver_offset, nvm_auth_on_boot,\
+   support_full_e2e) \
+   ((num_ports) | ((dma_port) << 4) | ((nvm_ver_offset) << 10) | \
+((nvm_auth_on_boot) << 22) | ((support_full_e2e) << 23))
+#define DEVICE_DATA_NUM_PORTS(device_data) ((device_data) & 0xf)
+#define DEVICE_DATA_DMA_PORT(device_data) (((device_data) >> 4) & 0x3f)
+#define DEVICE_DATA_NVM_VER_OFFSET(device_data) (((device_data) >> 10) & 0xfff)
+#define DEVICE_DATA_NVM_AUTH_ON_BOOT(device_data) (((device_data) >> 22) & 0x1)
+#define DEVICE_DATA_SUPPORT_FULL_E2E(device_data) (((device_data) >> 23) & 0x1)
+
+#define USEC_TO_256_NSECS(usec) DIV_ROUND_UP((usec) * NSEC_PER_USEC, 256)
+
+/* NHI genetlink commands */
+enum {
+   NHI_CMD_UNSPEC,
+   NHI_CMD_SUBSCRIBE,
+   NHI_CMD_UNSUBSCRIBE,
+   NHI_CMD_QUERY_INFORMATION,
+   NHI_CMD_MSG_TO_ICM,
+   NHI_CMD_MSG_FROM_ICM,
+   NHI_CMD_MAILBOX,
+   NHI_CMD_APPROVE_TBT_NETWORKING,
+   NHI_CMD_ICM_IN_SAFE_MODE,
+   __NHI_CMD_MAX,
+};
+#define NHI_CMD_MAX (__NHI_CMD_MAX - 1)
+
+/* NHI genetlink policy */
+static const struct nla_policy nhi_genl_policy[NHI_ATTR_MAX + 1] = {
+   [NHI_ATTR_DRV_VERSION]  = { .type = NLA_NUL_STRING, },
+   [NHI_ATTR_NVM_VER_OFFSET]   = { .type = NLA_U16, },
+   [NHI_ATTR_NUM_PORTS]= { .type = NLA_U8, },
+   [NHI_ATTR_DMA_PORT] = { .type = NLA_U8, },
+   [NHI_ATTR_SUPPORT_FULL_E2E] = { .type = NLA_FLAG, },
+   [NHI_ATTR_MAILBOX_CMD]  = { .type = NLA_U32, },
+   [NHI_ATTR_PDF]  = { .type = NLA_U32, },
+   [NHI_ATTR_MSG_TO_ICM]   = { .type = NLA_BINARY,
+   .len = TBT_ICM_RING_MAX_FRAME_SIZE },
+   [NHI_ATTR_MSG_FROM_ICM] = { .type = NLA_BINARY,
+   .len = TBT_ICM_RING_MAX_FRAME_SIZE },
+};
+
+/* NHI genetlink family */
+static struct genl_family nhi_genl_family = {
+   .id = GENL_ID_GENERATE,
+   .hdrsize= FIELD_SIZEOF(struct tbt_nhi_ctxt, id),
+   .name   = NHI_GENL_NAME,
+   .version= NHI_GENL_VERSION,
+   .maxattr= NHI_ATTR_MAX,
+};
+
+static LIST_HEAD(controllers_list);
+static DEFINE_MUTEX(controllers_list_mutex);
+static atomic_t subscribers = ATOMIC_INIT(0);
+/*
+ * Some of the received generic netlink messages are replied in a different
+ * context. The reply has to include the netlink portid of sender, therefore
+ * saving it in global variable (current assuption is one sender).
+ */
+static u32 portid;
+
+static bool nhi_nvm_authenticated(struct tbt_nhi_ctxt *nhi_ctxt)
+{
+   enum icm_operation_mode op_mode;
+   u32 *msg_head, port_id, reg;
+   

[PATCH v8 8/8] thunderbolt: Adding maintainer entry

2016-09-28 Thread Amir Levy
Add Amir Levy as maintainer for Thunderbolt(TM) ICM driver

Signed-off-by: Amir Levy 
---
 MAINTAINERS | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 01bff8e..a4a4614 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10358,7 +10358,13 @@ F: include/uapi/linux/stm.h
 THUNDERBOLT DRIVER
 M: Andreas Noever 
 S: Maintained
-F: drivers/thunderbolt/
+F: drivers/thunderbolt/*
+
+THUNDERBOLT ICM DRIVER
+M: Amir Levy 
+S: Maintained
+F: drivers/thunderbolt/icm/
+F: Documentation/thunderbolt/networking.txt
 
 TI BQ27XXX POWER SUPPLY DRIVER
 R: Andrew F. Davis 
-- 
2.7.4


